The following code snippet starts two threads: one is a simple timer that logs every second, the other runs an infinite loop that performs a remainder operation:
public class TestBlockingThread {
    private static final Logger LOGGER = LoggerFactory.getLogger(TestBlockingThread.class);

    public static final void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            int i = 0;
            while (true) {
                i++;
                if (i != 0) {
                    boolean b = 1 % i == 0;
                }
            }
        };

        new Thread(new LogTimer()).start();
        Thread.sleep(2000);
        new Thread(task).start();
    }

    public static class LogTimer implements Runnable {
        @Override
        public void run() {
            while (true) {
                long start = System.currentTimeMillis();
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    // do nothing
                }
                LOGGER.info("timeElapsed={}", System.currentTimeMillis() - start);
            }
        }
    }
}
This gives the following result:
[Thread-0] INFO c.m.c.concurrent.TestBlockingThread - timeElapsed=1004
[Thread-0] INFO c.m.c.concurrent.TestBlockingThread - timeElapsed=1003
[Thread-0] INFO c.m.c.concurrent.TestBlockingThread - timeElapsed=13331
[Thread-0] INFO c.m.c.concurrent.TestBlockingThread - timeElapsed=1006
[Thread-0] INFO c.m.c.concurrent.TestBlockingThread - timeElapsed=1003
[Thread-0] INFO c.m.c.concurrent.TestBlockingThread - timeElapsed=1004
[Thread-0] INFO c.m.c.concurrent.TestBlockingThread - timeElapsed=1004
I don't understand why the infinite task blocks all other threads for 13.3 seconds. I tried changing thread priorities and other settings; nothing worked.
If you have any suggestions on how to fix this (including tweaking OS context-switching settings), please let me know.
After all the explanations here (thanks to Peter Lawrey), we found that the main source of this pause is that the safepoint inside the loop is reached rather rarely, so it takes a long time to stop all threads for JIT-compiled code replacement.
But I decided to go deeper and find out why the safepoint is reached so rarely. I found it a bit confusing that the back jump of the while loop is not "safe" in this case.
So I summoned -XX:+PrintAssembly in all its glory to help:
-XX:+UnlockDiagnosticVMOptions \
-XX:+TraceClassLoading \
-XX:+DebugNonSafepoints \
-XX:+PrintCompilation \
-XX:+PrintGCDetails \
-XX:+PrintStubCode \
-XX:+PrintAssembly \
-XX:PrintAssemblyOptions=-Mintel
After some investigation, I found that after the third recompilation of the lambda, the C2 compiler threw away the safepoint polls inside the loop completely.
UPDATE
During the profiling stage, the variable i was never seen to equal 0. That's why C2 speculatively optimized that branch away, transforming the loop into something like:
for (int i = OSR_value; i != 0; i++) {
    if (1 % i == 0) {
        uncommon_trap();
    }
}
uncommon_trap();
Note that the originally infinite loop was reshaped into a regular finite loop with a counter! Because of the JIT optimization that eliminates safepoint polls in finite counted loops, there was no safepoint poll in this loop either.
After some time, i wrapped back to 0 and the uncommon trap was taken. The method was deoptimized and continued execution in the interpreter. During recompilation with this new knowledge, C2 recognized the infinite loop and gave up on compilation. The rest of the execution proceeded in the interpreter with proper safepoints.
There is a great must-read blog post, "Safepoints: Meaning, Side Effects and Overheads" by Nitsan Wakart, covering safepoints and this particular issue.
Safepoint elimination in very long counted loops is a known problem. The bug JDK-5014723 (thanks to Vladimir Ivanov) addresses it. The following workarounds are available until the bug is finally fixed.
You can try using -XX:+UseCountedLoopSafepoints (it causes an overall performance penalty and may lead to a JVM crash, see JDK-8161147). With this flag, the C2 compiler keeps the safepoints at the back jumps, and the original pause disappears completely.
You can explicitly disable compilation of the problematic method by using
-XX:CompileCommand='exclude,binary/class/Name,methodName'
Or you can rewrite your code to add a safepoint manually. For example, a Thread.yield() call at the end of the loop, or even changing int i to long i (thanks, Nitsan Wakart), will also fix the pause.
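As an illustration, here is a minimal sketch of those last two fixes applied to the task from the question (my own rendering of the suggestions above, not code from the thread):

// Fix 1: use a long counter. A loop over a long is not a "counted loop"
// for C2, so the safepoint poll at the back jump is not eliminated.
Runnable taskLong = () -> {
    long i = 0;
    while (true) {
        i++;
        if (i != 0) {
            boolean b = 1 % i == 0;
        }
    }
};

// Fix 2: add an explicit safepoint by calling Thread.yield() on each pass.
Runnable taskYield = () -> {
    int i = 0;
    while (true) {
        i++;
        if (i != 0) {
            boolean b = 1 % i == 0;
        }
        Thread.yield(); // provides a safepoint at the end of every iteration
    }
};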
In short, the loop you have has no safe point inside it except when i == 0 is reached. When the method is compiled, and this triggers the code to be replaced, the JVM needs to bring all threads to a safe point, but that takes a very long time, locking up not just the thread running the code but all threads in the JVM.
I added the following command line options.
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintCompilation
I also modified the code to use floating point, which appears to take longer:
boolean b = 1.0 / i == 0;
And what I see in the output is
timeElapsed=100
Application time: 0.9560686 seconds
41423 280 % 4 TestBlockingThread::lambda$main$0 @ -2 (27 bytes) made not entrant
Total time for which application threads were stopped: 40.3971116 seconds, Stopping threads took: 40.3967755 seconds
Application time: 0.0000219 seconds
Total time for which application threads were stopped: 0.0005840 seconds, Stopping threads took: 0.0000383 seconds
41424 281 % 3 TestBlockingThread::lambda$main$0 @ 2 (27 bytes)
timeElapsed=40473
41425 282 % 4 TestBlockingThread::lambda$main$0 @ 2 (27 bytes)
41426 281 % 3 TestBlockingThread::lambda$main$0 @ -2 (27 bytes) made not entrant
timeElapsed=100
Note: for code to be replaced, threads have to be stopped at a safe point. However, it appears here that such a safe point is reached very rarely (possibly only when i == 0). Changing the task to
Runnable task = () -> {
    for (int i = 1; i != 0; i++) {
        boolean b = 1.0 / i == 0;
    }
};
I see a similar delay.
timeElapsed=100
Application time: 0.9587419 seconds
39044 280 % 4 TestBlockingThread::lambda$main$0 @ -2 (28 bytes) made not entrant
Total time for which application threads were stopped: 38.0227039 seconds, Stopping threads took: 38.0225761 seconds
Application time: 0.0000087 seconds
Total time for which application threads were stopped: 0.0003102 seconds, Stopping threads took: 0.0000105 seconds
timeElapsed=38100
timeElapsed=100
By carefully adding code to the loop, you get a longer delay.
for (int i = 1; i != 0; i++) {
    boolean b = 1.0 / i / i == 0;
}
gets
Total time for which application threads were stopped: 59.6034546 seconds, Stopping threads took: 59.6030773 seconds
However, change the code to use a native method which always has a safe point (if it is not an intrinsic)
for (int i = 1; i != 0; i++) {
    boolean b = Math.cos(1.0 / i) == 0;
}
prints
Total time for which application threads were stopped: 0.0001444 seconds, Stopping threads took: 0.0000615 seconds
Note: adding if (Thread.currentThread().isInterrupted()) { ... } to a loop adds a safe point.
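For example, a minimal sketch of that interrupt check applied to the task from the question (my own rendering):

Runnable task = () -> {
    int i = 0;
    while (true) {
        i++;
        if (i != 0) {
            boolean b = 1 % i == 0;
        }
        // Per the note above, this check adds a safe point to the loop;
        // as a bonus, the task also becomes cancellable via interrupt().
        if (Thread.currentThread().isInterrupted()) {
            break;
        }
    }
};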
Note: This happened on a 16-core machine, so there is no lack of CPU resources.
I found the answer to the "why". They are called safepoints, and they are best known as the Stop-The-World pauses that happen because of GC.
See this article: Logging stop-the-world pauses in JVM
Different events can cause the JVM to pause all the application threads. Such pauses are called Stop-The-World (STW) pauses. The most common cause for an STW pause to be triggered is garbage collection (example in github), but different JIT actions (example), biased lock revocation (example), certain JVMTI operations, and many more also require the application to be stopped.
The points at which the application threads may be safely stopped are called, surprise, safepoints. This term is also often used to refer to all the STW pauses.
It is more or less common that GC logs are enabled. However, this does not capture information on all the safepoints. To get it all, use these JVM options:
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime
If you are wondering about the naming explicitly referring to GC, don’t be alarmed – turning on these options logs all of the safepoints, not just garbage collection pauses.
Reading the HotSpot Glossary of Terms, we find this definition:
safepoint
A point during program execution at which all GC roots are known and all heap object contents are consistent. From a global point of view, all threads must block at a safepoint before the GC can run. (As a special case, threads running JNI code can continue to run, because they use only handles. During a safepoint they must block instead of loading the contents of the handle.) From a local point of view, a safepoint is a distinguished point in a block of code where the executing thread may block for the GC. Most call sites qualify as safepoints. There are strong invariants which hold true at every safepoint, which may be disregarded at non-safepoints. Both compiled Java code and C/C++ code can be optimized between safepoints, but less so across safepoints. The JIT compiler emits a GC map at each safepoint. C/C++ code in the VM uses stylized macro-based conventions (e.g., TRAPS) to mark potential safepoints.
Running with the above-mentioned flags, I get this output:
Application time: 0.9668750 seconds
Total time for which application threads were stopped: 0.0000747 seconds, Stopping threads took: 0.0000291 seconds
timeElapsed=1015
Application time: 1.0148568 seconds
Total time for which application threads were stopped: 0.0000556 seconds, Stopping threads took: 0.0000168 seconds
timeElapsed=1015
timeElapsed=1014
Application time: 2.0453971 seconds
Total time for which application threads were stopped: 10.7951187 seconds, Stopping threads took: 10.7950774 seconds
timeElapsed=11732
Application time: 1.0149263 seconds
Total time for which application threads were stopped: 0.0000644 seconds, Stopping threads took: 0.0000368 seconds
timeElapsed=1015
Notice the third STW event:
Total time stopped: 10.7951187 seconds
Stopping threads took: 10.7950774 seconds
The JIT itself took virtually no time, but once the JVM had decided to perform a JIT compilation, it entered STW mode. However, since the code to be compiled (the infinite loop) doesn't have a call site, no safepoint was ever reached.
The STW pause ends when the JIT eventually gives up waiting and concludes that the code is in an infinite loop.
After following the comment threads and some testing on my own, I believe that the pause is caused by the JIT compiler. Why the JIT compiler is taking such a long time is beyond my ability to debug.
However, since you only asked for how to prevent this, I have a solution:
Pull your infinite loop into a method where it can be excluded from the JIT compiler:
public class TestBlockingThread {
    private static final Logger LOGGER = Logger.getLogger(TestBlockingThread.class.getName());

    public static final void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            infLoop();
        };
        new Thread(new LogTimer()).start();
        Thread.sleep(2000);
        new Thread(task).start();
    }

    private static void infLoop() {
        int i = 0;
        while (true) {
            i++;
            if (i != 0) {
                boolean b = 1 % i == 0;
            }
        }
    }
}
Run your program with this VM argument:
-XX:CompileCommand=exclude,PACKAGE.TestBlockingThread::infLoop (replace PACKAGE with your package information)
You should get a message like this to indicate when the method would have been JIT-compiled: ### Excluding compile: static blocking.TestBlockingThread::infLoop
(You may notice that I put the class into a package called blocking.)
Related
I am running the simple program below. I know this is not the best way to measure performance, but the results are surprising to me, hence I wanted to post the question here.
public class findFirstTest {
    public static void main(String[] args) {
        for (int q = 0; q < 10; q++) {
            long start2 = System.currentTimeMillis();
            int k = 0;
            for (int j = 0; j < 5000000; j++) {
                if (j > 4500000) {
                    k = j;
                    break;
                }
            }
            System.out.println("for value " + k + " with time " + (System.currentTimeMillis() - start2));
        }
    }
}
The results look like this after running the code multiple times:
for value 4500001 with time 3
for value 4500001 with time 25 (surprised, as it took 25 ms in the 2nd iteration)
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
for value 4500001 with time 0
So I don't understand why the 2nd iteration took 25 ms while the 1st took 3 ms and the later ones 0 ms, and also why it is always the 2nd iteration whenever I run the code.
If I move the start and end time printing outside of the outer for loop, then the result I get is:
for value 4500001 with time 10
In the first iteration, the code runs interpreted.
In the second iteration, the JIT kicks in, slowing it down a bit while it compiles the loop to native code.
In the remaining iterations, the native code runs very fast.
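You can watch this happen by rerunning the program with -XX:+PrintCompilation (the same flag used in the safepoint answers above); the on-stack-replacement compilation of the loop (marked with %) should show up in the log around the slow iteration.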
Because your winamp needed to decode another few frames of your mp3 to queue it into the sound output buffers. Or because the phase of the moon changed a bit and your dynamic background needed changing, or because someone in east Croydon farted and your computer is subscribed to the 'smells from London' twitter feed. Who knows?
This isn't how you performance test. Your CPU is not such a simple machine after all; it has many cores, and each core has pipelines and multiple hierarchies of caches. Any given core can only interact with one of its caches, and because of this, if a core runs an instruction that operates on memory which is not currently in cache, then the core will stall for a while: it sends the memory controller a request to load the page of memory you need to access into a given cache page, and then waits until it is there; this can take many, many cycles.
On the other end you have an OS that is juggling hundreds of thousands of processes and threads, many of them internal to the kernel, pre-empting like there is no tomorrow, and trying to give extra precedence to processes that are time-sensitive, such as the aforementioned winamp, which must get a chance to decode some more mp3 frames before the sound buffer is fully exhausted, or you'd notice skipping. This is non-trivial: on ye olde Windows you just couldn't get this done, which is why ye olde winamp was a magical marvel of engineering, more or less hacking into Windows to ensure it got the priority it needed. Those days are long gone, but if you remember them, well, draw the conclusion that this isn't trivial, and thus, OSes do pre-empt with prejudice all the time these days.
A third significant factor is the JVM itself, which is doing all sorts of borderline voodoo magic: it has both a hotspot engine (which keeps bookkeeping on your code so that it can eventually conclude that it is worth spending considerable CPU resources to analyse the heck out of a method and rewrite it in optimized machine code, because that method seems to be taking a lot of CPU time) and a garbage collector.
The solution is to forget entirely about trying to measure time using such mere banalities as measuring currentTimeMillis or nanoTime and writing a few loops. It's just way too complicated for that to actually work.
No. Use JMH.
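For this benchmark, a minimal JMH sketch might look like the following (the class and method names are illustrative; JMH generates the harness that takes care of warm-up and measurement iterations for you):

import org.openjdk.jmh.annotations.Benchmark;

public class FindFirstBenchmark {
    @Benchmark
    public int findFirst() {
        // Same work as the original loop; returning the result keeps
        // the JIT from eliminating the loop as dead code.
        for (int j = 0; j < 5_000_000; j++) {
            if (j > 4_500_000) {
                return j;
            }
        }
        return -1;
    }
}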
I am trying to learn concurrency in Java, but whatever I do, my 2 threads run serially, not in parallel, so I am not able to reproduce the common concurrency issues explained in tutorials (like thread interference and memory consistency errors). Sample code:
public class Synchronization {
    static int v;

    public static void main(String[] args) {
        Runnable r0 = () -> {
            for (int i = 0; i < 10; i++) {
                Synchronization.v++;
                System.out.println(v);
            }
        };
        Runnable r1 = () -> {
            for (int i = 0; i < 10; i++) {
                Synchronization.v--;
                System.out.println(v);
            }
        };
        Thread t0 = new Thread(r0);
        Thread t1 = new Thread(r1);
        t0.start();
        t1.start();
    }
}
This always gives me a result starting from 1 and ending with 0 (whatever the loop length is). For example, the code above gives me this every time:
1
2
3
4
5
6
7
8
9
10
9
8
7
6
5
4
3
2
1
0
Sometimes the second thread starts first, and the results are the same but negative, so it is still running serially.
I tried in both IntelliJ and Eclipse with identical results. The CPU has 2 cores, if it matters.
UPDATE: it finally became reproducible with huge loops (starting from 1,000,000 iterations), though still not every time and only with a small final discrepancy. It also seems that making the operations in the loops "heavier", like printing the thread name, makes it more reproducible. Manually adding a sleep to a thread also works, but it makes the experiment less clean, so to say. The reason doesn't seem to be that the first loop finishes before the second starts, because I can see both loops printing to the console while they keep running, and I still get 0 at the end. The reason seems more like a race between the threads for the same variable. I will dig deeper into that, thanks.
It seems like the first started thread never gives the second a chance in the thread race to grab the variable, or the second one never has time to even start (I couldn't say for sure), so the second almost* always waits until the first loop has finished.
Some heavy operation inside the loop will mix the results:
TimeUnit.MILLISECONDS.sleep(100);
*it is not always true, but you were lucky in your tests
Starting a thread is a heavyweight operation, meaning that it takes some time to perform. Due to that fact, by the time you start the second thread, the first one is already finished.
The reason why it is sometimes in "reverse order" is how the thread scheduler works. By the specs there are no guarantees about thread execution order; with that in mind, we know it is possible for the second thread to run first (and finish).
Increase the iteration count to something meaningful, like 10,000, and see what happens then.
This is called "lucky timing", as per Brian Goetz (author of Java Concurrency in Practice). Since there is no synchronization on the static variable v, it is clear that this class is not thread-safe.
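A minimal sketch of one common fix (my addition, not from the original answer) is to replace the unsynchronized static int with an AtomicInteger, so each update is a single atomic read-modify-write:

import java.util.concurrent.atomic.AtomicInteger;

public class Synchronization {
    static final AtomicInteger v = new AtomicInteger();

    public static void main(String[] args) {
        Runnable r0 = () -> {
            for (int i = 0; i < 10; i++) {
                // incrementAndGet() is atomic, unlike v++ on a plain int field
                System.out.println(v.incrementAndGet());
            }
        };
        Runnable r1 = () -> {
            for (int i = 0; i < 10; i++) {
                System.out.println(v.decrementAndGet());
            }
        };
        new Thread(r0).start();
        new Thread(r1).start();
    }
}

This removes the lost-update race; the interleaving of the printed values is still up to the scheduler.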
I have a Java program which has 13 threads, though only one of them is at 99% CPU usage and has been running for ~24 hours. The others are at 0.0% CPU usage and show a TIME+ of anywhere from 0:00.0 to 0:12.82, and one has 3:51.48. The program is intended to be single-threaded, so I'm wondering why the other threads are there.
What are they doing, and why do they show so little CPU usage and TIME+?
UPDATE: I have an old Java program I wrote (my first program - don't judge me!) which is single-threaded and shows the same type of thread usage ...
import java.io.*;

class xdriver {
    static int N = 100;
    static double pi = 3.141592653589793;
    static double one = 1.0;
    static double two = 2.0;

    public static void main(String[] args) {
        //System.out.println("Program has started successfully\n");

        if (args.length == 1) {
            // assume that args[0] is an integer
            N = Integer.parseInt(args[0]);
        }

        // maybe we can get user input later on this ...
        int nr = N;
        int nt = N;
        int np = 2 * N;

        double dr = 1.0 / (double) (nr - 1);
        double dt = pi / (double) (nt - 1);
        double dp = (two * pi) / (double) (np - 1);

        System.out.format("nn --> %d\n", nr * nt * np);

        if (nr * nt * np < 0) {
            System.out.format("ERROR: nr*nt*np = %d(long) which is %d(int)\n",
                    (long) ((long) nr * (long) nt * (long) np), nr * nt * np);
            System.exit(1);
        }

        // inserted to artificially blow up RAM
        double[][] dels = new double[nr * nt * np][3];

        double[] rs = new double[nr];
        double[] ts = new double[nt];
        double[] ps = new double[np];

        for (int ir = 0; ir < nr; ir++) {
            rs[ir] = dr * (double) (ir);
        }
        for (int it = 0; it < nt; it++) {
            ts[it] = dt * (double) (it);
        }
        for (int ip = 0; ip < np; ip++) {
            ps[ip] = dp * (double) (ip);
        }

        double C = (4.0 / 3.0) * pi;
        C = one / C;

        double fint = 0.0;
        int ii = 0;
        for (int ir = 0; ir < nr; ir++) {
            double r = rs[ir];
            double r2dr = r * r * dr;
            for (int it = 0; it < nt; it++) {
                double t = ts[it];
                double sint = Math.sin(t);
                for (int ip = 0; ip < np; ip++) {
                    fint += C * r2dr * sint * dt * dp;
                    dels[ii][0] = dr;
                    dels[ii][1] = dt;
                    dels[ii][2] = dp;
                }
            }
        }

        System.out.format("N ........ %d\n", N);
        System.out.format("fint ..... %15.10f\n", fint);
        System.out.format("err ...... %15.10f\n", Math.abs(1.0 - fint));
    }
}
Starting a Java program means starting a JVM and telling it which main class to run (one that usually has a static main method).
This JVM spawns several background threads in addition to the above-mentioned main thread.
Among them are
The VM thread: An observing thread that waits for tasks that require the VM to be at a safe point. For example, there is a garbage collection task that fully "stops the world". But there are others.
GC threads: Several threads that are maintained to run the garbage collection.
Compile threads: These threads are used to compile byte code into native machine code.
There might be many more.
Additionally, if you are using AWT or Swing, you will get some more threads from those frameworks. One of them is the so-called event dispatch thread (EDT). And - of course - there might be threads that you created and ran yourself: timers, executors, or simply arbitrary threads. Even for a simple hello world application there might be a dozen threads running.
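If you want to see these threads for yourself from within Java (an illustrative snippet of my own, not from the original answer):

public class ListThreads {
    public static void main(String[] args) {
        // Prints every live thread the JVM knows about,
        // including its own background threads.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            System.out.printf("%-20s state=%-13s daemon=%b%n",
                    t.getName(), t.getState(), t.isDaemon());
        }
    }
}

Even this hello-world-sized program typically prints several threads besides main, such as Reference Handler, Finalizer, and Signal Dispatcher.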
But most of these threads spend more time waiting than doing something. So chances are high that only one thread is really working, thus utilizing some CPU.
Although ... 100% CPU utilization might be an indicator of some problem: a never-ending loop, for example. You have to use a profiler to find out what really happens. But it could simply be a program that legitimately has such a CPU utilization. Your judgement.
Quoting the discussion here and other research:
A few of the core JVM threads:
Attach Listener: This is a thread that always listens for other JVM processes or tools to send a request. A practical example is profiling, or (I am not sure about this, though) production-level application monitoring tools like DynaTrace.
Signal Dispatcher: When the OS raises a signal to the JVM, the signal dispatcher thread will pass the signal to the appropriate handler.
Reference Handler: High-priority thread to enqueue pending References. The GC creates a simple linked list of references which need to be processed and this thread quickly adds them to a proper queue and notifies ReferenceQueue listeners.
Finalizer: The Finalizer thread calls finalizer methods.
DestroyJavaVM: This thread unloads the Java VM on program exit. Most of the time it should be waiting.
Garbage Collector: Thread responsible for Java's garbage collection mechanism, depending upon if GC is enabled or not.
main: The main thread, running your program's main method.
One important point to note is that how many and which core threads are started depends on the JVM implementation, but even if a Java program is written to be single-threaded, there will be more than one thread in the JVM.
A Java program can be single-threaded, but the JVM (which runs the user-defined Java program) is multi-threaded and will have (at least in recent JVMs) more than one thread right from the start.
Below is a snapshot of my Java HotSpot(TM) Client VM version 24.55-b03, running a single-threaded Java program:
To answer your query "What are they doing and why do they show so little cpu usage and TIME+?":
The "what" part: they are started for a purpose by the JVM, as explained above; for example, the JVM wants to listen in case a profiling or monitoring program asks it for some details.
The "why" part: because they are not really active or running; they are in a waiting or parked state (see the yellow threads in my attached snapshot; if you have a GUI monitoring app you should also see yellow, or, on the command line, threads in the WAIT state), so they occupy no or very few CPU cycles, hence the low CPU usage.
Again, TIME+ shows the time for which they have been active, and since they are mostly not active, this value is also low.
Hope this helps!
Most likely threads have been created somewhere and are never used.
For instance:
ExecutorService es = Executors.newFixedThreadPool(12);

// burn cpu, using only one thread (main)
int i = 0;
while (true) {
    i++;
}
TIME+ from top is the amount of CPU time spent. One explanation is that if a process is constantly blocking, or just always blocked, then it will have both a low CPU usage and a low TIME+.
I don't understand why I have 4 threads RUNNING but only 50% of the processor's capacity used: indeed, that means only 2 of 4 processors are used.
EDIT: I think this is due to a limit: my mistake is that 5 threads are RUNNING at the same time, so by default the system limits the %CPU to 2 cores (50%).
I am going to check with 4 threads.
This very much depends on what your threads are doing.
If the work they are doing is heavily focused on IO operations, then your CPUs can run many, many such threads - without ever getting to any significant CPU load.
In other words: most likely, your threads are not doing CPU-intensive work.
But we can't know for sure, as you are not giving any hints about the nature of your application.
First, it depends on how many CPU cores you have - if you have more CPU cores than running threads, then there aren't enough threads to keep all the cores of the processor busy at 100%.
Another thing is that a thread can be in a waiting state, e.g. waiting on a monitor, and in such a case it doesn't consume CPU cycles.
In your screenshot, one of the threads from your pool is in the MONITOR state - it is not executing at that moment - it is waiting for something, and as such is not consuming CPU cycles.
I think that all the threads in your pool are similar and all share the characteristic of having some potential wait on a monitor - and this limits the possibility of consuming all the CPU cores at 100%.
For example, this simple program should consume all of your cores at 100% simply because it doesn't have any waiting, but if you uncomment the line that orders it to sleep 1 nanosecond, Thread.sleep(0, 1), then you will hardly notice any CPU load.
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class consumeMaxOfCPU {

    public static void main(String[] args) {
        int availableProcessors = Runtime.getRuntime().availableProcessors();
        // availableProcessors = availableProcessors / 2; // uncomment this line to see around half of the load - because there will be fewer threads than CPU cores.
        ExecutorService pool = Executors.newFixedThreadPool(availableProcessors);
        for (int n = 0; n < availableProcessors; n++) {
            pool.submit(new HeavyTask(n));
        }
    }

    private static class HeavyTask implements Callable<Long> {
        private long n;

        public HeavyTask(long n) {
            this.n = n;
        }

        @Override
        public Long call() throws Exception {
            // there are very little chances that this will finish quickly :)
            while (n != -10) {
                n = n * n;
                // Thread.sleep(0, 1); // uncomment this line to see almost no load because of this sleep.
            }
            return n;
        }
    }
}
I’m dealing with multithreading in Java and, as someone pointed out to me, I noticed that threads warm up, that is, they get faster as they are repeatedly executed. I would like to understand why this happens and whether it is related to Java itself or is a common behavior of every multithreaded program.
The code (by Peter Lawrey) that exemplifies it is the following:
for (int i = 0; i < 20; i++) {
    ExecutorService es = Executors.newFixedThreadPool(1);
    final double[] d = new double[4 * 1024];
    Arrays.fill(d, 1);
    final double[] d2 = new double[4 * 1024];
    es.submit(new Runnable() {
        @Override
        public void run() {
            // nothing.
        }
    }).get();
    long start = System.nanoTime();
    es.submit(new Runnable() {
        @Override
        public void run() {
            synchronized (d) {
                System.arraycopy(d, 0, d2, 0, d.length);
            }
        }
    });
    es.shutdown();
    es.awaitTermination(10, TimeUnit.SECONDS);
    // read the values in d2.
    for (double x : d2) ;
    long time = System.nanoTime() - start;
    System.out.printf("Time to pass %,d doubles to another thread and back was %,d ns.%n", d.length, time);
}
Results:
Time to pass 4,096 doubles to another thread and back was 1,098,045 ns.
Time to pass 4,096 doubles to another thread and back was 171,949 ns.
... deleted ...
Time to pass 4,096 doubles to another thread and back was 50,566 ns.
Time to pass 4,096 doubles to another thread and back was 49,937 ns.
I.e. it gets faster and stabilises around 50,000 ns (50 μs). Why is that?
If I run this code (20 repetitions), then execute something else (let's say post-processing of the previous results and preparation for another multithreading round), and later execute the same Runnable on the same thread pool for another 20 repetitions, will it already be warmed up, in any case?
In my program, I execute the Runnable in just one thread (actually one per processing core I have; it's a CPU-intensive program), then some other serial processing, alternating like this many times. It doesn't seem to get faster as the program goes. Maybe I could find a way to warm it up…
It isn't the threads that are warming up so much as the JVM.
The JVM has what's called JIT (Just-In-Time) compiling. As the program is running, it analyzes what's happening in the program and optimizes it on the fly. It does this by taking the byte code that the JVM runs and converting it to native code that runs faster. It can do this in a way that is optimal for your current situation, since it analyzes the actual runtime behavior. This can (though not always) result in great optimization, even more so than for some programs that are compiled to native code without such knowledge.
You can read a bit more at http://en.wikipedia.org/wiki/Just-in-time_compilation
You could get a similar effect on any program as code is loaded into the CPU caches, but I believe this will be a smaller difference.
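If you want to warm up a specific piece of work by hand, one simple approach (a sketch of my own, under the assumption that HotSpot's default compilation threshold is in the region of 10,000 invocations) is to run it untimed until the JIT has kicked in:

Runnable work = () -> {
    // ... the actual task ...
};

// Untimed warm-up iterations, well past the JIT compilation threshold.
for (int i = 0; i < 20_000; i++) {
    work.run();
}

// Now the measurement runs against compiled code.
long start = System.nanoTime();
work.run();
long time = System.nanoTime() - start;
System.out.printf("warmed-up run took %,d ns%n", time);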
The only reasons I can see that a thread's execution can end up being faster are:
The memory manager can reuse already allocated object space (e.g., letting heap allocations fill up the available memory until the max memory is reached - the Xmx property)
The working set is available in the hardware cache
Repeating operations might create patterns of operations that the compiler can more easily reorder to optimize execution