First, I'd like to say that I'm working my way up from Python to more complicated code. I'm now on to Java and I'm extremely new to it. I understand that Java is really good at multithreading, which is good because I'm using it to process terabytes of data.
The input data is simply fed into an iterator, and I have a class that encapsulates a run function that takes one line from the iterator, does some analysis, and then writes the analysis to a file. The only bit of info the threads have to share with each other is the name of the object they are writing to. Simple, right? I just want each thread executing the run function simultaneously so we can iterate through the input data quickly. In Python it would be simple:
from multiprocessing import Pool
f = open('someoutput.csv', 'w')
def run(x):
    f.write(analyze(x))
p = Pool(8)
p.map(run, iterator_of_input_data)
So in Java, I have my 10K lines of analysis code and can very easily iterate through my input passing it my run function which in turn calls on all my analysis code sending it to an output object.
public class cool {
    ...
    public static void run(Input input, File output) {
        Analysis an = new Analysis(input, output);
    }
    public static void main(String[] args) throws Exception {
        Iterator<Input> iterator = new Parser(new File(input_file)).iterator();
        File output = new File(output_object);
        while (iterator.hasNext()) {
            cool.run(iterator.next(), output);
        }
    }
}
All I want to do is get multiple threads taking the iterator objects and executing the run statement. Everything is independent. I keep looking at Java multithreading material, but it's all about talking over networks, sharing data, etc. Is this as simple as I think it is? If someone can just point me in the right direction, I would be happy to do the legwork.
thanks
An ExecutorService (ThreadPoolExecutor) would be the Java equivalent.
ExecutorService executorService =
    new ThreadPoolExecutor(
        maxThreads, // core thread pool size
        maxThreads, // maximum thread pool size
        1, // keep-alive time for idle threads above the core size
        TimeUnit.MINUTES,
        new ArrayBlockingQueue<Runnable>(maxThreads, true),
        new ThreadPoolExecutor.CallerRunsPolicy());
ConcurrentLinkedQueue<ResultObject> resultQueue = new ConcurrentLinkedQueue<ResultObject>();
while (iterator.hasNext()) {
    executorService.execute(new MyJob(iterator.next(), resultQueue));
}
Implement your job as a Runnable.
class MyJob implements Runnable {
/* collect useful parameters in the constructor */
public MyJob(...) {
/* omitted */
}
public void run() {
/* job here, submit result to resultQueue */
}
}
The resultQueue is present to collect the result of your jobs.
See the java api documentation for detailed information.
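As a rough sketch of how these pieces can fit together: the class name PoolSketch and the analyze method below are made up for illustration, and a plain fixed-size pool stands in for the hand-built ThreadPoolExecutor. One job is submitted per input line, results go into a thread-safe queue, and the caller waits for the pool to drain before reading them.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PoolSketch {
    // Hypothetical stand-in for the real per-line analysis.
    static String analyze(String line) {
        return line.toUpperCase();
    }

    // Submits one job per input line and waits for all of them to finish.
    static Queue<String> processAll(Iterator<String> lines, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Queue<String> results = new ConcurrentLinkedQueue<>();
        while (lines.hasNext()) {
            String line = lines.next(); // the iterator is only touched by this thread
            pool.execute(() -> results.add(analyze(line)));
        }
        pool.shutdown(); // stop accepting new jobs; already queued jobs still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("jobs did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c", "d");
        Queue<String> out = processAll(input.iterator(), 2);
        // The order of the results depends on scheduling and is not guaranteed.
        System.out.println(out.size() + " results");
    }
}
```

Executors.newFixedThreadPool uses an unbounded work queue, so for very large inputs the bounded queue with CallerRunsPolicy shown above is probably the safer choice, since it slows the producer down instead of buffering every pending job in memory.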
I think I'm having race conditions when running my multithreaded Java program.
It's a permutation algorithm, which I want to speed up by running multiple instances with different values. So I start the threads in Main class with:
Runnable[] mcl = new MCL[n1];
for (int thread_id = 0; thread_id < n1; thread_id ++)
{
mcl[thread_id] = new MCL(thread_id);
new Thread(mcl[thread_id]).start();
Thread.sleep(100);
}
And it runs those MCL classes instances.
Again, I think threads are accessing the same memory space of the MCL class variables, am I right? If so, how can I solve this?
I'm trying to make all variables arrays, where one of the dimensions is related to an Id of the thread, so that each thread writes on a different index. Is this a good solution?:
int[] foo = new int[thread_id];
You can't just bolt on thread safety as an afterthought; it needs to be an integral part of your data flow design.
To start, research and learn the following topics:
1) Synchronized blocks, mutexes, and final variables. A good place to start: Tutorial. I also love Josh Bloch's Effective Java, which although a few years old has golden nuggets for writing correct Java programs.
2) Oracle's Concurrency Tutorial
3) Learn about Executors. You shouldn't have to manage threads directly except in the most extreme cases. See this tutorial
If you pass non thread safe objects between threads you're going to see unpredictable results. Unpredictable means assignments may never show up between different threads, or objects may be left in invalid states (especially if you've got multiple member fields that have data dependent on each other).
Without seeing the MCL class we can't give you any specific details on what's dangerous, but given the code sample you've posted I think you should take a step back and do some research. In the long run it will save you time to learn it the right way rather than troubleshoot an incorrect concurrency scheme.
If you want to keep the thread data separate store it as instance variables in the Runnables (initializing each Runnable before starting its thread). Don't keep a reference to it in an array, that's just inviting trouble.
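A minimal sketch of that idea, with made-up names (IsolatedStateSketch, Counter): each task keeps its working data in its own instance fields, which are initialized before the task is started. It uses a Callable on an executor instead of a raw Runnable and Thread, so the result is handed back through a Future and no shared array is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class IsolatedStateSketch {
    // Each task owns its fields; no other thread ever touches them.
    static class Counter implements Callable<Integer> {
        private final int threadId; // set once, before the task is started
        private int sum;            // mutated only by the thread running call()

        Counter(int threadId) {
            this.threadId = threadId;
        }

        @Override
        public Integer call() {
            for (int i = 0; i < 1000; i++) {
                sum += threadId;
            }
            return sum; // handed back through the Future
        }
    }

    // Runs one Counter per id and returns the results in submission order.
    static List<Integer> runAll(int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int id = 0; id < tasks; id++) {
                futures.add(pool.submit(new Counter(id)));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : futures) {
                results.add(f.get()); // blocks until that task is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(4));
    }
}
```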
You can use a CompletionService to get a computed value back for each task wrapped in a Future, so you don't wait for it to be calculated until you actually need the value. The difference between a CompletionService and an Executor, which the commenters are recommending, is that the CompletionService uses an Executor to execute tasks but makes it easier to get your data back out; see this answer.
Here's an example of using a CompletionService. I'm using Callable instead of Runnable because I want to get a result back:
public class CompletionServiceExample {
public static void main(String[] args) throws Exception {
ExecutorService executorService = Executors.newCachedThreadPool();
ExecutorCompletionService<BigInteger> service =
new ExecutorCompletionService<BigInteger>(executorService);
MyCallable task1 = new MyCallable(new BigInteger("3"));
MyCallable task2 = new MyCallable(new BigInteger("5"));
Future<BigInteger> future1 = service.submit(task1);
Future<BigInteger> future2 = service.submit(task2);
System.out.println("submitted tasks");
System.out.println("result1=" + future1.get() );
System.out.println("result2=" + future2.get());
executorService.shutdown();
}
}
class MyCallable implements Callable<BigInteger> {
private BigInteger b;
public MyCallable(BigInteger b) {
this.b = b;
}
public BigInteger call() throws Exception {
// do some number-crunching thing
Thread.sleep(b.multiply(new BigInteger("100")).longValue());
return b;
}
}
Alternatively you can use the take method to retrieve results as they get completed:
public class TakeExample {
public static void main(String[] args) throws Exception {
ExecutorService executorService = Executors.newCachedThreadPool();
ExecutorCompletionService<BigInteger> service = new
ExecutorCompletionService<BigInteger>(executorService);
MyCallable task1 = new MyCallable(new BigInteger("10"));
MyCallable task2 = new MyCallable(new BigInteger("5"));
MyCallable task3 = new MyCallable(new BigInteger("8"));
service.submit(task1);
service.submit(task2);
service.submit(task3);
Future<BigInteger> futureFirst = service.take();
System.out.println(futureFirst.get());
Future<BigInteger> futureSecond = service.take();
System.out.println(futureSecond.get());
Future<BigInteger> futureThird = service.take();
System.out.println(futureThird.get());
executorService.shutdown();
}
}
I have been trying to parallelize a portion of a method within my code (as shown in the Example class's function_to_parallelize(...) method). I have examined the executor framework and found that Futures & Callables can be used to create several worker threads that will ultimately return values. However, the online examples often shown with the executor framework are very simple and none of them appear to suffer my particular case of requiring methods in the class that contains that bit of code I'm trying to parallelize. As per one Stackoverflow thread, I've managed to write an external class that implements Callable called Solver that implements that method call() and set up the executor framework as shown in the method function_to_parallelize(...). Some of the computation that would occur in each worker thread requires methods *subroutine_A(...)* that operate on the data members of the Example class (and further, some of these subroutines make use of random numbers for various sampling functions).
My issue is while my program executes and produces results (sometimes accurate, sometimes not), every time I run it the results of the combined computation of the various worker threads is different. I figured it must be a shared memory problem, so I input into the Solver constructor copies of every data member of the Example class, including the utility that contained the Random rng. Further, I copied the subroutines that I require even directly into the Solver class (even though it's able to call those methods from Example without this). Why would I be getting different values each time? Is there something I need to implement, such as locking mechanisms or synchronization?
Alternatively, is there a simpler way to inject some parallelization into that method? Rewriting the "Example" class or drastically changing my class structuring is not an option as I need it in its current form for a variety of other aspects of my software/system.
Below is my code vignette (well, it's an incredibly abstracted/reduced form so as to show you basic structure and the target area, even if it's a bit longer than usual vignettes):
public class Tools{
Random rng;
public Tools(Random rng){
this.rng = rng;
}...
}
public class Solver implements Callable<Tuple>{
public Tools toolkit;
public Item W;
public Item v;
Item input;
double param;
public Solver(Item input, double param, Item W, Item v, Tools toolkit){
this.input = input;
this.param = param;
//...so on & so forth for rest of arguments
}
public Item call() throws Exception {
//does computation that utilizes the data members W, v
//and calls some methods housed in the "toolkit" object
}
public Item subroutine_A(Item in){....}
public Item subroutine_B(Item in){....}
}
public class Example{
private static final int NTHREDS = 4;
public Tools toolkit;
public Item W;
public Item v;
public Example(...,Tools toolkit...){
this.toolkit = toolkit; ...
}
public Item subroutine_A(Item in){
// some of its internal computation involves sampling & random # generation using
// a call to toolkit, which houses functions that use the initialize Random rng
...
}
public Item subroutine_B(Item in){....}
public void function_to_parallelize(Item input, double param,...){
ExecutorService executor = Executors.newFixedThreadPool(NTHREDS);
List<Future<Tuple>> list = new ArrayList<Future<Tuple>>();
while(some_stopping_condition){
// extract subset of input and feed into Solver constructor below
Callable<Tuple> worker = new Solver(input, param, W, v, toolkit);
Future<Tuple> submit = executor.submit(worker);
list.add(submit);
}
for(Future<Tuple> future : list){
try {
Item out = future.get();
// update W via some operation using "out" (like multiplying matrices for example)
}catch(InterruptedException e) {
e.printStackTrace();
}catch(ExecutionException e) {
e.printStackTrace();
}
}
executor.shutdown(); // properly terminate the threadpool
}
}
ADDENDUM: While flob's answer below did address a problem with my vignette/code (you should make sure that you are setting your code up to wait for all threads to catch up with .await()), the issue did not go away after I made this correction. It turns out that the problem lies in how Random works with threads. In essence, the threads are scheduled in various orders (via the OS/scheduler) and hence will not repeat the order in which they are executed every run of the program to ensure that a purely deterministic result is obtained. I examined the thread-safe version of Random (and used it to gain a bit more efficiency) but alas it does not allow you to set the seed. However, I highly recommend those who are looking to incorporate random computations within their thread workers to use this as the RNG for multi-threaded work.
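One possible way to get repeatable results despite nondeterministic scheduling is to give every task its own Random, seeded from a base seed plus the task index, and to combine the results in submission order. This is only a sketch under that assumption: the class name, the baseSeed + i derivation, and the toy summation are illustrative and not taken from the code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SeededTasksSketch {
    // Runs `tasks` sampling jobs on `threads` threads; results are in submission order.
    static List<Double> sample(long baseSeed, int tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final long seed = baseSeed + i; // assumption: a simple per-task seed
                Callable<Double> job = () -> {
                    Random rng = new Random(seed); // private to this task
                    double total = 0;
                    for (int k = 0; k < 1000; k++) {
                        total += rng.nextDouble();
                    }
                    return total;
                };
                futures.add(pool.submit(job));
            }
            List<Double> results = new ArrayList<>();
            for (Future<Double> f : futures) {
                results.add(f.get()); // read in submission order, not completion order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Same seed, different thread counts: the two lists should be equal.
        System.out.println(sample(42L, 8, 4).equals(sample(42L, 8, 2)));
    }
}
```

Because no generator is shared, the sequence each task sees no longer depends on which thread happens to run first.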
The problem I see is that you don't wait for all the tasks to finish before updating W, and because of that some of the Callable instances will get the updated W instead of the one you were expecting.
At this point W is updated even if not all tasks have finished:
// update W via some operation using "out" (like multiplying matrices for example)
The tasks that are not finished will take the W updated above instead of the one you expect.
A quick solution (if you know how many Solver tasks you'll have) would be to use a CountDownLatch in order to see when all the tasks have finished:
public void function_to_parallelize(Item input, double param,...){
ExecutorService executor = Executors.newFixedThreadPool(NTHREDS);
List<Future<Tuple>> list = new ArrayList<Future<Tuple>>();
CountDownLatch latch = new CountDownLatch(<number_of_tasks_created_in_next_loop>);
while(some_stopping_condition){
// extract subset of input and feed into Solver constructor below
Callable<Tuple> worker = new Solver(input, param, W, v, toolkit,latch);
Future<Tuple> submit = executor.submit(worker);
list.add(submit);
}
latch.await();
for(Future<Tuple> future : list){
try {
Item out = future.get();
// update W via some operation using "out" (like multiplying matrices for example)
}catch(InterruptedException e) {
e.printStackTrace();
}catch(ExecutionException e) {
e.printStackTrace();
}
}
executor.shutdown(); // properly terminate the threadpool
}
Then, in the Solver class, count the latch down when the call method finishes:
public Item call() throws Exception {
    try {
        //does computation that utilizes the data members W, v
        //and calls some methods housed in the "toolkit" object
    } finally {
        latch.countDown(); // always runs, so latch.await() cannot hang if the computation throws
    }
}
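If the number of tasks is not known up front, or threading a latch through Solver is awkward, ExecutorService.invokeAll may be an alternative worth considering: it blocks until every task in the collection has completed. A minimal sketch, with a made-up squaring task standing in for Solver:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class InvokeAllSketch {
    // Computes 1^2 + 2^2 + ... + n^2 with one task per term.
    static long sumOfSquares(int n, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 1; i <= n; i++) {
                final long value = i;
                tasks.add(() -> value * value); // hypothetical stand-in for a Solver
            }
            // invokeAll returns only after every task has completed.
            List<Future<Long>> done = pool.invokeAll(tasks);
            long total = 0;
            for (Future<Long> f : done) {
                total += f.get(); // does not block: the task is already finished
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(10, 4));
    }
}
```

The shared value (W in the question) is then only updated after all workers are finished, so no worker can observe a half-updated version.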
So I have a method that starts five threads. I want to write a unit test just to check that the five threads have been started. How do I do that? Sample code would be much appreciated.
Instead of writing your own method to start threads, why not use an Executor, which can be injected into your class? Then you can easily test it by passing in a dummy Executor.
Edit: Here's a simple example of how your code could be structured:
public class ResultCalculator {
private final ExecutorService pool;
private final List<Future<Integer>> pendingResults;
public ResultCalculator(ExecutorService pool) {
this.pool = pool;
this.pendingResults = new ArrayList<Future<Integer>>();
}
public void startComputation() {
for (int i = 0; i < 5; i++) {
Future<Integer> future = pool.submit(new Robot(i));
pendingResults.add(future);
}
}
public int getFinalResult() throws ExecutionException, InterruptedException {
int total = 0;
for (Future<Integer> robotResult : pendingResults) {
total += robotResult.get();
}
return total;
}
}
public class Robot implements Callable<Integer> {
private final int input;
public Robot(int input) {
this.input = input;
}
@Override
public Integer call() throws Exception {
// Some very long calculation
Thread.sleep(10000);
return input * input;
}
}
And here's how you'd call it from your main():
public static void main(String[] args) throws Exception {
// Note that the number of threads is now specified here
ExecutorService pool = Executors.newFixedThreadPool(5);
ResultCalculator calc = new ResultCalculator(pool);
try {
calc.startComputation();
// Maybe do something while we're waiting
System.out.printf("Result is: %d\n", calc.getFinalResult());
} finally {
pool.shutdownNow();
}
}
And here's how you'd test it (assuming JUnit 4 and Mockito):
@Test
@SuppressWarnings("unchecked")
public void testStartComputationAddsRobotsToQueue() {
ExecutorService pool = mock(ExecutorService.class);
Future<Integer> future = mock(Future.class);
when(pool.submit(any(Callable.class))).thenReturn(future);
ResultCalculator calc = new ResultCalculator(pool);
calc.startComputation();
verify(pool, times(5)).submit(any(Callable.class));
}
Note that all this code is just a sketch which I have not tested or even tried to compile yet. But it should give you an idea of how the code can be structured.
Rather than saying you are going to "test the five threads have been started", it would be better to step back and think about what the five threads are actually supposed to do. Then test to make sure that that "something" is actually being done.
If you really just want to test that the threads have been started, there are a few things you could do. Are you keeping references to the threads somewhere? If so, you could retrieve the references, count them, and call isAlive() on each one (checking that it returns true).
I believe there is some method on some Java platform class which you can call to find how many threads are running, or to find all the threads which are running in a ThreadGroup, but you would have to search to find out what it is.
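One candidate is Thread.getAllStackTraces(), whose key set contains the live threads in the JVM. The sketch below assumes the threads under test are given a recognizable name prefix ("sketch-worker-" is made up here) and uses latches to keep them alive while they are counted.

```java
import java.util.concurrent.CountDownLatch;

class LiveThreadsSketch {
    // Counts live threads whose name starts with the given prefix.
    static int countAlive(String prefix) {
        int alive = 0;
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && t.getName().startsWith(prefix)) {
                alive++;
            }
        }
        return alive;
    }

    // Starts n named threads, counts them while they are parked, then releases them.
    static int startAndCount(int n) {
        CountDownLatch started = new CountDownLatch(n);
        CountDownLatch release = new CountDownLatch(1);
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            threads[i] = new Thread(() -> {
                started.countDown(); // signal that this thread is running
                try {
                    release.await(); // stay alive until the count is taken
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "sketch-worker-" + i);
            threads[i].start();
        }
        try {
            started.await();
            int alive = countAlive("sketch-worker-");
            release.countDown();
            for (Thread t : threads) {
                t.join();
            }
            return alive;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(startAndCount(5));
    }
}
```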
More thoughts in response to your comment
If your code is as simple as new Thread(runnable).start(), I wouldn't bother to test that the threads are actually starting. If you do so, you're basically just testing that the Java platform works (it does). If your code for initializing and starting the threads is more complicated, I would stub out the thread.start() part and make sure that the stub is called the desired number of times, with the correct arguments, etc.
Regardless of what you do about that, I would definitely test that the task is completed correctly when running in multithreaded mode. From personal experience, I can tell you that as soon as you start doing anything remotely complicated with threads, it is devilishly easy to get subtle bugs which only show up under certain conditions, and perhaps only occasionally. Dealing with the complexity of multithreaded code is a very slippery slope.
Because of that, if you can do it, I would highly recommend you do more than just simple unit testing. Do stress tests where you run your task with many threads, on a multicore machine, on very large data sets, and make sure all the answers are exactly as expected.
Also, although you are expecting a performance increase from using threads, I highly recommend that you benchmark your program with varying numbers of threads, to make sure that the desired performance increase is actually achieved. Depending on how your system is designed, it's possible to wind up with concurrency bottlenecks which may make your program hardly faster with threads than without. In some cases, it can even be slower!
I am stuck with this following problem.
Say, I have a request which has 1000 items, and I would like to utilize Java Executor to resolve this.
Here is the main method
public static void main(String[] args) {
//Assume that I have request object that contain arrayList of names
//and VectorList is container for each request result
ExecutorService threadExecutor = Executors.newFixedThreadPool(3);
Vector<Result> vectorList = new Vector<Result>();
for (int i=0;i<request.size();i++) {
threadExecutor.execute(new QueryTask(request.get(i).getNames(), vectorList));
}
threadExecutor.shutdown();
response.setResult(vectorList);
}
And here is the QueryTask class
public class QueryTask implements Runnable {
private String names;
private Vector<Result> vectorList;
public QueryTask(String names, Vector<Result> vectorList) {
this.names = names;
this.vectorList = vectorList;
}
public void run() {
// do something with names, for example, query database
Result result = process(names); // placeholder for the real work
//add result to vectorList
vectorList.add(result);
}
}
So, based on the example above, I want to make thread pool for each data I have in the request, run it simultaneously, and add result to VectorList.
And at the end of the process, I want to have all the result already in the Vector list.
I keep getting inconsistent result in the response.
For example, if I pass request with 10 names, I am getting back only 3 or 4, or sometimes nothing in the response.
I was expecting if I pass 10, then I will get 10 back.
Does anyone know what's causing the problem?
Any help will be appreciated.
Thanks
The easy solution is to add a call to ExecutorService.awaitTermination()
public static void main(String[] args) throws InterruptedException {
    //Assume that I have request object that contain arrayList of names
    //and VectorList is container for each request result
    ExecutorService threadExecutor = Executors.newFixedThreadPool(3);
    Vector<Result> vectorList = new Vector<Result>();
    for (int i = 0; i < request.size(); i++) {
        threadExecutor.execute(new QueryTask(request.get(i).getNames(), vectorList));
    }
    threadExecutor.shutdown();
    // blocks until all submitted tasks have finished or the timeout elapses
    threadExecutor.awaitTermination(aReallyLongTime, TimeUnit.SECONDS);
    response.setResult(vectorList);
}
After calling threadExecutor.shutdown(), you also need to call threadExecutor.awaitTermination(). The former is a nonblocking call that merely initiates a shutdown, whereas the latter is a blocking call that actually waits for all tasks to finish (or for its timeout to elapse). Since you are only calling the former, you are probably returning before all tasks have finished, which is why you don't always get back all of your results. The Java API isn't too clear on this, so someone filed a bug about it.
There are at least 2 issues here.
1. In your main, you shut down the ExecutorService, then try to get the results out right away. The executor service executes your jobs asynchronously, so there is a very good chance that not all of your jobs are done yet. When you call response.setResult(vectorList), vectorList is not fully populated.
2. You are concurrently accessing the same Vector object from within all of your runnables. Vector synchronizes each individual method, so concurrent add() calls are safe, but anything that iterates over the vector while tasks are still adding to it can throw a ConcurrentModificationException. If you switch to an unsynchronized list such as ArrayList, you need to either manually synchronize on it inside QueryTask or pass in a thread-safe container instead, like Collections.synchronizedList(new ArrayList<Result>()).
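Both points can be sketched together as follows. The names AwaitSketch and query are made up, and strings stand in for the Result type: results go into a synchronized list, and the caller waits for the pool to terminate before reading it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class AwaitSketch {
    // Hypothetical stand-in for the database query in QueryTask.
    static String query(String name) {
        return "result:" + name;
    }

    // Resolves every name on a pool of 3 threads and returns all results.
    static List<String> resolve(List<String> names) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        List<String> results = Collections.synchronizedList(new ArrayList<>());
        for (String name : names) {
            pool.execute(() -> results.add(query(name)));
        }
        pool.shutdown(); // nonblocking: only stops new submissions
        try {
            // blocking: waits until the queued tasks have actually run
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("timed out waiting for tasks");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(resolve(names).size() + " results for " + names.size() + " names");
    }
}
```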
I am fairly naive when it comes to the world of Java Threading and Concurrency. I am currently trying to learn. I made a simple example to try to figure out how concurrency works.
Here is my code:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class ThreadedService {
private ExecutorService exec;
/**
* @param delegate
* @param poolSize
*/
public ThreadedService(int poolSize) {
if (poolSize < 1) {
this.exec = Executors.newCachedThreadPool();
} else {
this.exec = Executors.newFixedThreadPool(poolSize);
}
}
public void add(final String str) {
exec.execute(new Runnable() {
public void run() {
System.out.println(str);
}
});
}
public static void main(String args[]) {
ThreadedService t = new ThreadedService(25);
for (int i = 0; i < 100; i++) {
t.add("ADD: " + i);
}
}
}
What do I need to do to make the code print out the numbers 0-99 in sequential order?
Thread pools are usually used for operations which do not need synchronization or are highly parallel.
Printing the numbers 0-99 sequentially is not a concurrent problem and requires threads to be synchronized to avoid printing out of order.
I recommend taking a look at the Java concurrency lesson to get an idea of concurrency in Java.
The idea of threads is not to do things sequentially.
You will need some shared state to coordinate. In this example, adding instance fields to your outer class will work. Remove the parameter from add. Add a lock object and a counter. Grab the lock, print the number, increment the number, release the lock.
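A sketch of that lock-and-counter approach, under the assumption that it does not matter which task prints which number. The class name and the printed list (kept only so the order can be inspected afterwards) are additions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class OrderedCounterSketch {
    private final Object lock = new Object();
    private final List<String> printed = new ArrayList<>(); // guarded by lock
    private int next = 0;                                   // guarded by lock

    // No parameter: the value comes from the shared counter, not from the caller.
    void add(ExecutorService exec) {
        exec.execute(() -> {
            synchronized (lock) {
                String line = "ADD: " + next;
                System.out.println(line);
                printed.add(line);
                next++;
            }
        });
    }

    // Submits `count` tasks and returns the lines in the order they were printed.
    static List<String> run(int count, int poolSize) {
        OrderedCounterSketch service = new OrderedCounterSketch();
        ExecutorService exec = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < count; i++) {
            service.add(exec);
        }
        exec.shutdown();
        try {
            if (!exec.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        synchronized (service.lock) {
            return new ArrayList<>(service.printed);
        }
    }

    public static void main(String[] args) {
        run(100, 25);
    }
}
```

The tasks still run on different threads in an unpredictable order; the output is sequential only because each task reads and increments the counter while holding the lock.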
The simplest solution to your problem is to use a ThreadPool size of 1. However, this isn't really the kind of problem one would use threads to solve.
To expand, if you create your executor with:
this.exec = Executors.newSingleThreadExecutor();
then your tasks will all be scheduled and executed in the order they were submitted. There are a few scenarios where this is a logical thing to do, but in most cases threads are the wrong tool for solving this problem.
This kind of thing makes sense to do when you need to execute the task in a different thread -- perhaps it takes a long time to execute and you don't want to block a GUI thread -- but you don't need or don't want the submitted tasks to run at the same time.
The problem is by definition not suited to threads. Threads are run independently and there isn't really a way to predict which thread is run first.
If you want to change your code to run sequentially, change add to:
public void add(final String str) {
System.out.println(str);
}
You are not using threads (not your own at least) and everything happens sequentially.