The optimal number of threads in a pool is case specific, though there is a rule of thumb which says #threads = #CPU + 1.
However, how does this work with threads spawning other threads and waiting (i.e. blocked until thread.join() returns) for these 'subthreads'?
Assume that I have code that requires the execution of a list of tasks (2), each of which has subtasks (2), each of which has subsubtasks (3), and so on. The total number of leaf tasks is 2*2*3 = 12, though 18 threads will be created in total (because a thread will 'spawn' more subtasks (threads), and a thread spawning more threads is blocked until they are all finished). See below for pseudocode.
I am assuming that for a CPU with N cores there is a rule of thumb that everything can be parallelized if the highest number of simultaneously active threads (12) does not exceed #CPU + 1. Is this correct?
Pseudocode

In task.java:

    outputOfTask = []
    for subtask in subTaskList
        outputOfTask.append(subtask.doCompute())   // each doCompute() runs in its own thread
    // wait until all output is finished

In subtask.java (each subtask implements the same interface, but can be different):

    outputOfSubtask = []
    for subsubtask in subsubTaskList
        // do some magic depending on the type of subtask
        outputOfSubtask.append(subsubtask.doCompute())
    return outputOfSubtask

In subsubtask.java:

    outputOfSubsubtask = []
    for subsubsubtask in subsubsubTaskList
        // do some magic depending on the type of subsubtask
        outputOfSubsubtask.append(subsubsubtask.doCompute())
    return outputOfSubsubtask
EDIT:
Dummy Java code. I used this in my original question to check how many threads were active, but I assume the pseudocode above is clearer. Please note: I used Eclipse Collections, which provides the asParallel method and allows for a shorter notation of the code.
@Test
public void testasParallelthreads() {
    // ExecutorService executor = Executors.newWorkStealingPool();
    ExecutorService executor = Executors.newCachedThreadPool();
    MutableList<Double> myMainTask = Lists.mutable.with(1.0, 2.0);
    MutableList<Double> mySubTask = Lists.mutable.with(1.0, 2.0);
    MutableList<Double> mySubSubTask = Lists.mutable.with(1.0, 2.0);
    MutableList<Double> mySubSubSubTask = Lists.mutable.with(1.0, 2.0, 2.0);
    MutableList<Double> a = myMainTask.asParallel(executor, 1)
            .flatCollect(task -> mySubTask.asParallel(executor, 1)
                    .flatCollect(subTask -> mySubSubTask.asParallel(executor, 1)
                            .flatCollect(subsubTask -> mySubSubSubTask.asParallel(executor, 1)
                                    .flatCollect(subsubsubTask ->
                                            dummyFunction(task, subTask, subsubTask, subsubsubTask, executor))
                                    .toList())
                            .toList())
                    .toList())
            .toList();
    System.out.println("pool size: " + ((ThreadPoolExecutor) executor).getPoolSize());
    executor.shutdownNow();
}

private MutableList<Double> dummyFunction(double a, double b, double c, double d, ExecutorService ex) {
    System.out.println("ThreadId: " + Thread.currentThread().getId());
    System.out.println("Active threads size: " + ((ThreadPoolExecutor) ex).getActiveCount());
    return Lists.mutable.with(a, b, c, d);
}
I am assuming that for a CPU with N cores there is a rule of thumb that everything can be parallelized if the highest number of simultaneously active threads (12) does not exceed #CPU + 1. Is this correct?
This topic is extremely hard to generalize about. Even with the actual code, the performance of your application is going to be very difficult to predict, and even if you could come up with an estimate, the actual performance may vary wildly between runs – especially when the threads interact with each other. The only time the #CPU + 1 rule holds is when the jobs submitted to the thread pool are independent and completely CPU bound.
I'd recommend trying a number of different thread-pool sizes under simulated load to find the optimal value for your application. Examining the overall throughput numbers or system load stats should give you the feedback you need.
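For example, a minimal sketch of such a sweep (my illustration; simulateLoad() is a hypothetical stand-in that you would replace with representative work from your application):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolSizeSweep {
    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        for (int size = 1; size <= cores * 2; size++) {
            ExecutorService pool = Executors.newFixedThreadPool(size);
            long start = System.nanoTime();
            for (int i = 0; i < 1_000; i++) {
                pool.submit(PoolSizeSweep::simulateLoad); // submit identical dummy jobs
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            System.out.printf("pool size %d: %d ms%n",
                    size, (System.nanoTime() - start) / 1_000_000);
        }
    }

    private static void simulateLoad() {
        double x = 0;
        for (int i = 0; i < 1_000_000; i++) x += Math.sqrt(i); // burn some CPU
    }
}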
However, how does this work with threads spawning other threads and waiting (i.e. blocked until thread.join() returns) for these 'subthreads'?
Threads will block, and it is up to the OS/JVM to schedule another one if possible. If you have a single-thread executor and call join from one of your tasks, the other task won't even get started. With executors that use more threads, the blocking task blocks a single thread, and the OS/JVM is free to schedule other threads.
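To see why nesting matters, here is a minimal sketch (my illustration, not from the question) of the worst case: with a pool of size 1, a task that waits on its own subtask deadlocks, because the subtask can never be scheduled while the parent occupies the only thread. With a pool of 2 or more threads, or a caching pool, it completes:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class NestedJoinDeadlock {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        Future<String> parent = pool.submit(() -> {
            Future<String> child = pool.submit(() -> "done");
            return child.get(); // blocks forever: no free thread for the child
        });
        System.out.println(parent.get()); // never reached with pool size 1
    }
}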
These blocked threads should not consume CPU time, because they are blocked. So I am assuming that for a CPU with N cores there is a rule of thumb that everything can be parallelized if the highest number of simultaneously active threads (24) does not exceed #CPU + 1. Is this correct?
Active threads can still block. I think you're mixing up terms here: #CPU, the number of physical cores, and the number of virtual cores. If you have N physical cores, then you can run N CPU-bound tasks in parallel. When you have other types of blocking or very short-lived tasks, then you can have more parallel tasks.
Related
Here is my code,
class Shared {
    private static int index = 0;

    public synchronized void printThread() {
        try {
            while (true) {
                Thread.sleep(1000);
                System.out.println(Thread.currentThread().getName() + ": " + index++);
                notifyAll();
                // notify();
                wait();
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

class Example13 implements Runnable {
    private Shared shared = new Shared();

    @Override
    public void run() {
        shared.printThread();
    }
}

public class tetest {
    public static void main(String[] args) {
        Example13 r = new Example13();
        Thread t1 = new Thread(r, "Thread 1");
        Thread t2 = new Thread(r, "Thread 2");
        Thread t3 = new Thread(r, "Thread 3");
        Thread t4 = new Thread(r, "Thread 4");
        Thread t5 = new Thread(r, "Thread 5");
        t1.start();
        t2.start();
        t3.start();
        t4.start();
        t5.start();
    }
}
and the result is here
Thread 1: 0
Thread 5: 1
Thread 4: 2
Thread 3: 3
Thread 2: 4
Thread 3: 5
Thread 2: 6
Thread 3: 7
Thread 2: 8
Thread 3: 9
The question is: why are only two of the threads working? I'm so confused; I thought notify() randomly wakes up one of the waiting threads, but apparently it doesn't.
Is this starvation? If so, what causes the starvation? I tried both notify() and notifyAll(), but got the same results for both.
Can anyone help my toasted brain?
This isn't 'starvation'. Your 5 threads are all doing nothing. They all want to 'wake up' - notify() will wake up an arbitrary one. It is neither reliably random nor reliably ordered: the JMM does not ascribe an order to this, so one of them will wake up, but you can't rely on it being random (do not use this to generate random numbers), nor can you rely on any specific ordering behaviour.
It's not starvation (it's not: Oh no! Threads 2 and 3 are doing all the work and 4 and 5 are just hanging out doing nothing! That's bad - the system could be more efficient!) - because it doesn't matter which thread 'does the work'. A CPU core is a CPU core; it cares not which thread it ends up running.
Starvation is a different principle. Imagine that instead of Thread.sleep (which means the threads aren't waiting for anything specific, other than some time to elapse), the threads want to print the result of some expensive-ish math operation. If you just let 2 threads each say 'Hello!' without locking, it would be acceptable for the JVM to produce:
HelHellloo!!
So to prevent this, you use locks to create a 'talking stick' of sorts: a thread only gets to print if it holds the talking stick. Each of the 5 threads performs, in a loop, the following operation (a minimal sketch follows the list):
Do an expensive math operation.
Acquire the talking stick.
Print the result of the operation.
Release the talking stick.
Loop back to the top.
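Here is that loop as a minimal sketch (my illustration); computeExpensive() and slowPrint() are hypothetical stand-ins, and the synchronized block plays the role of the (unfair) talking stick, so this version can exhibit the starvation described below:

public class TalkingStick {
    private static final Object STICK = new Object();

    public static void main(String[] args) {
        for (int i = 1; i <= 5; i++) {
            new Thread(TalkingStick::worker, "worker-" + i).start();
        }
    }

    static void worker() {
        while (true) {
            double result = computeExpensive(); // 1. expensive math, no lock held
            synchronized (STICK) {              // 2. acquire the talking stick
                slowPrint(result);              // 3. print while holding it
            }                                   // 4. released on block exit
        }                                       // 5. loop back to the top
    }

    static double computeExpensive() {
        double x = 0;
        for (int i = 1; i < 5_000_000; i++) x += Math.log(i); // stand-in math
        return x;
    }

    static void slowPrint(double result) {
        System.out.println(Thread.currentThread().getName() + ": " + result);
        try {
            Thread.sleep(200); // simulate the excruciatingly slow terminal
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}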
Now imagine that, despite the fact that the math operation is quite expensive, for whatever reason you have an excruciatingly slow terminal, and the 'print the result of the operation' job takes a really long time to finish.
Now you can run into starvation. Imagine this scenario:
Threads 1-5 all do their expensive math simultaneously.
Arbitrarily, thread 4 ends up nabbing the talking stick.
The other 4 threads soon want the talking stick too, but they have to wait; t4 has it. They do nothing now, twiddling their thumbs (they could be calculating, but they are not!).
After the excruciatingly long time, 4 is done and releases the stick. 1, 2, 3, and 5 dogpile on like it's the All Blacks, and 2 happens to win the scrum and crawls out of the pile with the stick. 1, 3, and 5 gnash their teeth and go back yet again to waiting for the stick, still not doing any work. Whilst 2 is busy spending a really long time printing results, 4 goes back to the top of the loop and calculates another result. It does this faster than 2 manages to print, so 4 ends up wanting the talking stick again before 2 is done.
2 is finally done, and 1, 3, 4, and 5 all scrum into a pile again. 4 happens to get the stick - Java makes absolutely no guarantees about fairness; any one of them can get it, and there is no guarantee of randomness or lack thereof either. A JVM is not broken if 4 is destined to win this fight.
Repeat ad nauseam. 2 and 4 keep trading the stick back and forth. 1, 3, and 5 never get to talk.
The above is, as per the JMM, valid - a JVM is not broken if it behaves like this (though it would be a tad weird). Any bug filed about this behaviour would be rejected: the lock isn't so-called "fair". Java has fair locks if you want them - in the java.util.concurrent package. Fair locks incur some slight extra bookkeeping cost; the assumption made by the synchronized and wait/notify system is that you don't want to pay this extra cost.
A better solution to the above scenario is possibly a 6th thread that JUST prints, with the 5 threads JUST filling a buffer; at least then the 'print' part is left to a single thread, which might be faster. But mostly, the bottleneck in this setup is simply the printing - the code gains zero benefit from being multicore. Just having ONE thread do ONE math calculation, print it, do another, and so forth would be better (or possibly 2 threads: whilst one is printing, the other calculates a number - but there's no point in having more than one calculator; even a single thread can calculate faster than results can be printed). So in some ways this is just what the situation honestly requires: this hypothetical scenario still prints as fast as it can.
Whether the printing even needs to be 'fair' is a separate question - nothing in the problem description intrinsically requires fairness. Maybe all the various calculations are equally useful, so it doesn't matter that one thread gets to print more than the others. Say the threads are bitcoin miners, generating a random number and checking if it results in a hash with the requisite 7 zeroes at the end, or whatever bitcoin is up to now - who cares that one thread gets more time than another? A 'fair' system is no more likely to successfully mine a block.
Thus, 'fairness' is something you need to explicitly determine you actually need. If you do, AND starvation is an issue, use a fair lock: new ReentrantLock(true) is all you need (that boolean parameter is the fairness parameter - true means you want fairness).
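As a sketch, the fair variant of the worker loop above (reusing the hypothetical computeExpensive() and slowPrint() helpers from the previous sketch) would look like this; waiting threads then acquire the stick in roughly FIFO order, so none of them starves:

import java.util.concurrent.locks.ReentrantLock;

public class FairTalkingStick {
    private static final ReentrantLock STICK = new ReentrantLock(true); // true = fair

    static void worker() {
        while (true) {
            double result = TalkingStick.computeExpensive(); // helper from the sketch above
            STICK.lock();                                    // FIFO hand-off under contention
            try {
                TalkingStick.slowPrint(result);
            } finally {
                STICK.unlock();                              // always release, even on exception
            }
        }
    }
}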
I don't understand why I have 4 threads RUNNING but only 50% of the processor's capacity used: that means only 2 of my 4 cores are being used.
EDIT: I think this is due to a limit: my mistake, 5 threads are RUNNING at the same time, so by default the system limits the CPU to 2 cores (50%). I am going to check with 4 threads.
This very much depends on what your threads are doing.
If the work they are doing is heavily focused on IO operations, then your CPUs can run many, many such threads without ever reaching significant CPU load.
In other words: most likely, your threads are not doing CPU intensive work.
But we can't know for sure, as you are not giving any hints about the nature of your application.
First, it depends on how many CPU cores you have - if you have more CPU cores than running threads, then there aren't enough threads to keep all the cores busy at 100%.
Another thing is that a thread can be in a waiting state, e.g. waiting on a monitor, and in such a case it doesn't consume CPU cycles.
In your screenshot, one of the threads from your pool is in the MONITOR state - it is not executing at that moment; it is waiting for something, and as such not consuming CPU cycles.
I think all the threads in your pool are similar and share the characteristic of potentially waiting on a monitor - and this limits their ability to load all the CPU cores at 100%.
For example, this simple program should load all of your cores at 100% simply because it never waits, but if you uncomment the line that tells it to sleep 1 nanosecond, Thread.sleep(0, 1);, you will hardly notice any CPU load.
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class consumeMaxOfCPU {

    public static void main(String[] args) {
        int availableProcessors = Runtime.getRuntime().availableProcessors();
        // availableProcessors = availableProcessors / 2; // uncomment this line to see around half of the load - because there will be fewer threads than CPU cores.
        ExecutorService pool = Executors.newFixedThreadPool(availableProcessors);
        for (int n = 0; n < availableProcessors; n++) {
            pool.submit(new HeavyTask(n));
        }
    }

    private static class HeavyTask implements Callable<Long> {
        private long n;

        public HeavyTask(long n) {
            this.n = n;
        }

        @Override
        public Long call() throws Exception {
            // there is very little chance that this will finish quickly :)
            while (n != -10) {
                n = n * n;
                // Thread.sleep(0, 1); // uncomment this line to see almost no load because of this sleep.
            }
            return n;
        }
    }
}
From time to time I have to implement the classic concurrent producer-consumer solution in the project I'm involved in. Pretty much, the problem reduces to having some collection which gets populated from multiple threads and which is consumed by several consumers.
In a nutshell, the collection is bounded to, say, 10k entities.
Once the buffer size is hit, a worker task is submitted that consumes these 10k entities. There is a limit on such workers, say 10, which in the worst-case scenario means I can have up to 10 workers, each consuming 10k entities.
I do have to play with some locking here and there, plus some checks around buffer overflow (the case when producers generate too much data while all workers are busy processing their chunks); in that case I have to discard new events to avoid OOM (not the best solution, but stability is p1 ;)).
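For reference, a minimal hand-rolled sketch of that design (my illustration; ChunkedConsumer and process() are made-up names): a bounded buffer flushed in 10k chunks to a pool capped at 10 workers, with chunks discarded when all workers are busy. Note this drops whole chunks rather than individual events:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ChunkedConsumer<T> {
    private static final int CHUNK = 10_000;
    private final List<T> buffer = new ArrayList<>(CHUNK);
    // Capped at 10 workers; a failed hand-off means "all workers busy" -> discard.
    private final ThreadPoolExecutor workers = new ThreadPoolExecutor(
            10, 10, 0L, TimeUnit.MILLISECONDS,
            new SynchronousQueue<>(),
            new ThreadPoolExecutor.DiscardPolicy()); // drop instead of queueing to OOM

    public synchronized void publish(T event) {
        buffer.add(event);
        if (buffer.size() >= CHUNK) {
            List<T> chunk = new ArrayList<>(buffer);
            buffer.clear();
            workers.execute(() -> process(chunk)); // silently discarded if no worker is free
        }
    }

    private void process(List<T> chunk) {
        // hypothetical stand-in for the real per-chunk work
        System.out.println("processing " + chunk.size() + " entities");
    }
}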
These days I was looking at Reactor and at a way to use it instead of going low-level and doing all the things described above. So the dumb question is: "Can Reactor be used for this use case?"
For now, forget about overflow/discarding: how can I achieve N consumers for a broadcaster?
I was looking particularly at a Broadcaster with a buffer plus a thread-pooled dispatcher:
void test() {
    final Broadcaster<String> sink = Broadcaster.create(Environment.initialize());
    Dispatcher dispatcher = Environment.newDispatcher(2048, 20, DispatcherType.WORK_QUEUE);
    sink
        .buffer(100)
        .consumeOn(dispatcher, this::log);
    for (int i = 0; i < 100000; i++) {
        sink.onNext("element " + i);
        if (i % 1000 == 0) {
            System.out.println("added elements " + i);
        }
    }
}

void log(List<String> values) {
    System.out.print("simulating slow processing....");
    System.out.println("processing: " + Arrays.toString(values.toArray()));
    try {
        Thread.sleep(1000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}
My intention here is to have the broadcaster execute log(...) asynchronously when the buffer size is reached; however, it looks like it always executes log(...) in blocking mode: processing 100, and once those are done, the next 100, and so on. How can I make it asynchronous?
A possible pattern is to use flatMap with publishOn:
Flux.range(1, 1_000_000)
    .buffer(100)
    .flatMap(b -> Flux.just(b)
        .publishOn(SchedulerGroup.io())
        .doOnNext(this::log))
    .consume(...);
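For what it's worth, that snippet uses the pre-release Reactor 2.5 API (SchedulerGroup, consume). In current Reactor 3 the same pattern would look roughly like this (my translation, not part of the original answer); the flatMap concurrency argument caps how many batches are in flight at once:

// Rough Reactor 3 equivalent. Uses reactor.core.publisher.Flux and
// reactor.core.scheduler.Schedulers; log(List<String>) is the method from the question.
Flux.range(1, 1_000_000)
    .map(i -> "element " + i)
    .buffer(100)
    .flatMap(batch -> Flux.just(batch)
            .publishOn(Schedulers.boundedElastic()) // hop off the producer thread
            .doOnNext(this::log),
        10)                                         // at most 10 batches processed concurrently
    .blockLast();                                   // for a demo; use subscribe() in real code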
I have just implemented a threaded version of merge sort. ThreadedMerge.java: http://pastebin.com/5ZEvU6BV
Since merge sort is a divide-and-conquer algorithm, I create a thread for each half of the array. But the number of available threads in the Java VM is limited, so I check that before creating threads:
if (num <= nrOfProcessors) {
    num += 2;
    // create more threads
} else {
    // continue without threading
}
However, the threaded sorting takes about 6000 ms, while the non-threaded version is much faster at about 2500 ms.
Non-Threaded: http://pastebin.com/7FdhZ4Fw
Why is the threaded version slower and how do I solve that problem?
Update: I now use an AtomicInteger for thread counting and declared a static field for Runtime.getRuntime().availableProcessors(). The sorting takes about 1400 ms now.
However, creating just one thread in the mergeSort method and letting the current thread do the rest gives no significant performance increase. Why?
Besides, when I call join on a thread and afterwards decrement the number of used threads with
num.set(num.intValue() - 1);
the sorting takes about 200 ms longer. Here is the update of my algorithm: http://pastebin.com/NTZq5zQp Why does this line of code make it even worse?
First off, your accesses to num are not thread-safe (see http://download.oracle.com/javase/6/docs/api/java/util/concurrent/atomic/AtomicInteger.html): num.set(num.intValue() - 1) is a separate read followed by a separate write, not an atomic decrement, so concurrent updates can be lost.
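For example, a minimal sketch of the atomic alternative (ThreadCounter is my name, not from the pastebin):

import java.util.concurrent.atomic.AtomicInteger;

class ThreadCounter {
    static final AtomicInteger num = new AtomicInteger(0);

    static void onThreadStarted()  { num.incrementAndGet(); } // atomic replacement for num.set(num.intValue() + 1)
    static void onThreadFinished() { num.decrementAndGet(); } // atomic replacement for num.set(num.intValue() - 1)
}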
Secondly, you create a number of threads equal to the number of cores, but you block half of them with the join call:
num += 1;
ThreadedMerge tm1 = new ThreadedMerge(array, startIndex, startIndex + halfLength);
tm1.start();
sortedRightPart = mergeSort(array, startIndex + halfLength, endIndex);
try {
    tm1.join();
    num -= 1;
    sortedLeftPart = tm1.list;
} catch (InterruptedException e) {
}
This doesn't block the calling thread but uses it to sort the right part, letting the created thread do the other half. When that thread returns, the slot it took up can be used by another thread.
Hmm, you should not create a thread for every single step (threads are expensive, and there are lightweight alternatives).
Ideally, you should only create 4 threads if there are 4 CPUs.
So let's say you have 4 CPUs: then you create one thread at the first level (now you have 2), and at the second level each branch also creates a new thread. This gives you 4.
The reason why you only create one and not two is that you can use the thread you are currently running like:
Thread t = new Thread(...);
t.start();
// Do half of the job here
t.join(); // Wait for the other half to complete.
If you have, let's say, 5 CPUs (not a power of two), then just create 8 threads.
One simple way to do this in practice is to fall back to the un-threaded version you already made once you reach the appropriate recursion depth. This way you avoid cluttering the merge method with if statements etc.
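A minimal sketch of that scheme (my illustration, not the pastebin code): spawn one new thread per recursion level until 2^depth covers the core count, then sort sequentially in the current thread. Arrays.sort stands in for the sequential merge sort:

import java.util.Arrays;

public class DepthLimitedMergeSort {
    // levels needed so that 2^MAX_DEPTH >= number of cores (e.g. 5 cores -> depth 3 -> 8 workers)
    static final int MAX_DEPTH = 32 - Integer.numberOfLeadingZeros(
            Runtime.getRuntime().availableProcessors() - 1);

    public static void sort(int[] a, int lo, int hi, int depth) throws InterruptedException {
        if (hi - lo <= 1) return;
        if (depth >= MAX_DEPTH) {          // enough threads in flight: go sequential
            Arrays.sort(a, lo, hi);        // stand-in for the un-threaded merge sort
            return;
        }
        int mid = (lo + hi) >>> 1;
        Thread left = new Thread(() -> {
            try { sort(a, lo, mid, depth + 1); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        left.start();
        sort(a, mid, hi, depth + 1);       // current thread sorts the right half itself
        left.join();                       // wait for the spawned half
        merge(a, lo, mid, hi);             // textbook merge of the two sorted halves
    }

    static void merge(int[] a, int lo, int mid, int hi) {
        int[] tmp = new int[hi - lo];
        int i = lo, j = mid, k = 0;
        while (i < mid && j < hi) tmp[k++] = a[i] <= a[j] ? a[i++] : a[j++];
        while (i < mid) tmp[k++] = a[i++];
        while (j < hi) tmp[k++] = a[j++];
        System.arraycopy(tmp, 0, a, lo, tmp.length);
    }
}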
The call to Runtime.getRuntime().availableProcessors() appears to be taking up a fair amount of extra time. You only need to call it once, so just move it out of the method and define it as a static field, e.g.:
static int nrOfProcessors = Runtime.getRuntime().availableProcessors();
I'm dealing with multithreading in Java and, as someone pointed out to me, I noticed that threads warm up, that is, they get faster as they are repeatedly executed. I would like to understand why this happens and whether it is related to Java itself or whether it is common behavior of every multithreaded program.
The code (by Peter Lawrey) that exemplifies it is the following:
for (int i = 0; i < 20; i++) {
    ExecutorService es = Executors.newFixedThreadPool(1);
    final double[] d = new double[4 * 1024];
    Arrays.fill(d, 1);
    final double[] d2 = new double[4 * 1024];
    es.submit(new Runnable() {
        @Override
        public void run() {
            // nothing.
        }
    }).get();
    long start = System.nanoTime();
    es.submit(new Runnable() {
        @Override
        public void run() {
            synchronized (d) {
                System.arraycopy(d, 0, d2, 0, d.length);
            }
        }
    });
    es.shutdown();
    es.awaitTermination(10, TimeUnit.SECONDS);
    // read the values in d2.
    for (double x : d2) ;
    long time = System.nanoTime() - start;
    System.out.printf("Time to pass %,d doubles to another thread and back was %,d ns.%n", d.length, time);
}
Results:
Time to pass 4,096 doubles to another thread and back was 1,098,045 ns.
Time to pass 4,096 doubles to another thread and back was 171,949 ns.
... deleted ...
Time to pass 4,096 doubles to another thread and back was 50,566 ns.
Time to pass 4,096 doubles to another thread and back was 49,937 ns.
I.e. it gets faster and stabilises at around 50,000 ns. Why is that?
If I run this code (20 repetitions), then execute something else (let's say post-processing of the previous results and preparation for another multithreading round), and later execute the same Runnable on the same ThreadPool for another 20 repetitions, will it already be warmed up in any case?
In my program, I execute the Runnable in just one thread (actually one per processing core I have; it's a CPU-intensive program), then some other serial processing, alternating many times. It doesn't seem to get faster as the program runs. Maybe I could find a way to warm it up…
It isn't the threads that are warming up so much as the JVM.
The JVM has what's called JIT (Just In Time) compiling. As the program is running, it analyzes what's happening in the program and optimizes it on the fly. It does this by taking the byte code that the JVM runs and converting it to native code that runs faster. It can do this in a way that is optimal for your current situation, as it does this by analyzing the actual runtime behavior. This can (not always) result in great optimization. Even more so than some programs that are compiled to native code without such knowledge.
You can read a bit more at http://en.wikipedia.org/wiki/Just-in-time_compilation
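One practical consequence (my suggestion, not from the article above): if you depend on steady-state speed, you can warm the hot path up by running it a few thousand times before the timed section, since HotSpot's compile thresholds are in the low thousands of invocations. A sketch, where hotPath() is a hypothetical stand-in for the code you measure:

// Hand-rolled warm-up sketch; thresholds are JVM-dependent.
for (int i = 0; i < 10_000; i++) {
    hotPath(); // let the JIT profile and compile this
}
long start = System.nanoTime();
hotPath(); // now measured in its compiled, steady state
System.out.printf("steady-state: %,d ns%n", System.nanoTime() - start);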
You could get a similar effect on any program as code is loaded into the CPU caches, but I believe this will be a smaller difference.
The only reasons I see that a thread execution can end up being faster are:
The memory manager can reuse already allocated object space (e.g., to let heap allocations fill up the available memory until the max memory is reached - the Xmx property)
The working set is available in the hardware cache
Repeated operations may give the compiler opportunities to reorder them and optimize execution