I don't understand why I have 4 threads RUNNING but only 50% of the processor's capacity used: that means only 2 of my 4 cores are busy.
EDIT: I think this is due to a limit. My mistake: 5 threads are RUNNING at the same time, so by default the system limits the CPU usage to 2 cores (50%).
I am going to check with 4 threads.
This very much depends on what your threads are doing.
If the work they are doing is heavily focused on IO operations, then your CPUs can run many, many such threads without ever reaching any significant CPU load.
In other words: most likely, your threads are not doing CPU intensive work.
But we can't know for sure, as you are not giving any hints about the nature of your application.
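To illustrate (a minimal sketch, not the asker's code): the threads below are alive and look busy from the pool's point of view, but each spends almost all of its time blocked in a sleep standing in for blocking IO, so the overall CPU load stays near zero.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IoBoundThreads {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(32);
        for (int i = 0; i < 32; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        // Stand-in for a blocking IO call: the thread is parked
                        // here and consumes no CPU cycles while waiting.
                        Thread.sleep(100);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }
}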
First, it depends on how many CPU cores you have - if there are more CPU cores than running threads, then there aren't enough threads to keep all the cores of the processor busy at 100%.
Another thing is that a thread can be in a waiting state, e.g. waiting on a monitor; in that case it doesn't consume CPU cycles.
In your screenshot, one of the threads from your pool is in the MONITOR state - it is not executing at that moment; it is waiting for something, and as such not consuming CPU cycles.
I think that all the threads in your pool are similar and all share the characteristic of having some potential wait on a monitor - and this limits the possibility of consuming all the CPU cores at 100%.
For example, this simple program should consume all of your cores at 100% simply because it never waits, but if you uncomment the line that tells it to sleep for 1 nanosecond (Thread.sleep(0, 1);), you will hardly notice any CPU load.
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class consumeMaxOfCPU {

    public static void main(String[] args) {
        int availableProcessors = Runtime.getRuntime().availableProcessors();
        // availableProcessors = availableProcessors / 2; // uncomment this line to see around half of the load - because there will be less threads than CPU cores.
        ExecutorService pool = Executors.newFixedThreadPool(availableProcessors);
        for (int n = 0; n < availableProcessors; n++) {
            pool.submit(new HeavyTask(n));
        }
    }

    private static class HeavyTask implements Callable<Long> {
        private long n;

        public HeavyTask(long n) {
            this.n = n;
        }

        @Override
        public Long call() throws Exception {
            // there are very little chances that this will finish quickly :)
            while (n != -10) {
                n = n * n;
                // Thread.sleep(0, 1); // uncomment this line to see almost no load because of this sleep.
            }
            return n;
        }
    }
}
Related
I was playing around with loops in Java when I saw that the iteration speed keeps increasing.
Kind of seemed interesting.
Any ideas why?
Code:
import org.junit.jupiter.api.Test;

public class RandomStuffTest {

    public static long iterationsPerSecond = 0;

    @Test
    void testIterationSpeed() {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    System.out.println("Iterations per second: " + iterationsPerSecond);
                    iterationsPerSecond = 0;
                    Thread.sleep(1000);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        t.setDaemon(true);
        t.start();

        while (true) {
            for (long i = 0; i < Long.MAX_VALUE; i++) {
                iterationsPerSecond++;
            }
        }
    }
}
Output:
Iterations per second: 6111
Iterations per second: 2199824206
Iterations per second: 4539572003
Iterations per second: 6919540856
Iterations per second: 9442209284
Iterations per second: 11899448226
Iterations per second: 14313220638
Iterations per second: 16827637088
Iterations per second: 19322118707
Iterations per second: 21807781722
Iterations per second: 24256315314
Iterations per second: 26641505580
Another thing that I noticed:
The CPU usage was around 20% all the time and not really increasing...
Maybe because I was running the code as a test using JUnit?
The problem is the Java Memory Model (JMM).
Every thread is allowed (but not required) to keep a local copy of each field. Whenever it writes or reads that field, it is free to just use its local copy and sync it up with other threads' local copies much, much later.
Said differently, the JVM is free to re-order instructions, do things in parallel, and otherwise apply whatever weird stuff it wants to optimize your code, as long as certain guarantees are never broken.
One guarantee that is easy to understand: The JVM is free to reorder or parallelize 2 sequential instructions, but it must never be possible to write code that can observe this except through timing.
In other words, int x = 0; x = 5; System.out.println(x); must necessarily print 5 and never 0.
You can establish such relationships between 2 threads as well but this involves the use of volatile and/or synchronized and/or something that does this internally (most things in the java.util.concurrent package).
You didn't, so this result is meaningless. Most likely, the instruction iterationsPerSecond = 0 is having no effect; the code iterationsPerSecond++ reads 9442209284, increments by one, and writes it back - and that field got written to 0 someplace in the middle of all that, which thus accomplished nothing whatsoever.
If you want to test this properly, try a volatile variable, or better yet an AtomicLong.
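For instance, here is a sketch of the same test reworked around AtomicLong (class and variable names are mine, not from the original post); getAndSet makes the read-and-reset atomic, so no increments are silently lost:

import java.util.concurrent.atomic.AtomicLong;

public class IterationSpeedFixed {
    static final AtomicLong iterationsPerSecond = new AtomicLong();

    public static void main(String[] args) {
        Thread reporter = new Thread(() -> {
            try {
                while (true) {
                    // Atomically read the counter and reset it to 0 in one step.
                    System.out.println("Iterations per second: " + iterationsPerSecond.getAndSet(0));
                    Thread.sleep(1000);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reporter.setDaemon(true);
        reporter.start();

        while (true) {
            iterationsPerSecond.incrementAndGet();
        }
    }
}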
Like already indicated, the code is broken due to a data race.
The JIT can do some funny stuff with your code because of the data race:
while (true) {
    for (long i = 0; i < Long.MAX_VALUE; i++) {
        iterationsPerSecond++;
    }
}
Since it doesn't know that another thread is also messing with the iterationsPerSecond, the compiler could fold the for loop because it can calculate the outcome of the loop:
while (true) {
    iterationsPerSecond = Long.MAX_VALUE;
}
And it could even decide to pull the write out of the loop, since the same value is written every time (loop-invariant code motion):
iterationsPerSecond = Long.MAX_VALUE;
while (true) {
}
It could even decide to throw away the store, because it doesn't know there are any readers. So effectively it is a dead store, and hence it can apply dead code elimination.
while (true){
}
An AtomicLong or a volatile field would solve the problem because a happens-before edge is established. Using a volatile field or AtomicLong.get/set is equally expensive: both come with the same compiler restrictions and fences at the hardware level.
If you want to run microbenchmarks, I would suggest checking out JMH. It will protect you against a lot of trivial mistakes.
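A minimal JMH sketch of such a counter benchmark (assuming JMH is set up as described in its documentation; the class and method names here are made up for illustration):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class CounterBenchmark {
    long counter;

    @Benchmark
    public long increment() {
        // Returning the value prevents JMH from treating the increment
        // as a dead store and eliminating it.
        return ++counter;
    }
}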
The optimal number of threads in a pool is case specific, though there is a rule of thumb which says #threads = #CPU + 1.
However, how does this work with threads spawning other threads and waiting (i.e. blocked until thread.join() succeeds) for these 'subthreads'?
Assume that I have code that requires the execution of a list of tasks (2), each with 2 subtasks, each of those with 3 subsubtasks, and so on. The total number of innermost tasks is 2*2*3 = 12, though 18 threads will be created (because a thread will 'spawn' more subtasks (threads), and a thread spawning more threads is blocked until they are all done). See below for pseudo code.
I am assuming that for a CPU with N cores there is a rule of thumb that everything can be parallelized if the highest number of active threads (12) is #CPU + 1. Is this correct?
PseudoCode

outputOfTask = []
for subtask in subTaskList
    outputOfTask -> append(subtask.doCompute())
// wait until all output is finished.

in subtask.java (each subtask, for example, implements the same interface, but can be different):

outputOfSubtask = []
for task in subsubTaskList
    // do some magic depending on the type of subtask
    outputOfSubtask -> append(task.doCompute())
return outputOfSubtask

in subsubtask.java:

outputOfSubsubtask = []
for task in subsubsubTaskList
    // do some magic depending on the type of subsubtask
    outputOfSubsubtask -> append(task.doCompute())
return outputOfSubsubtask
EDIT:
Dummy Java code. I used this in my original question to check how many threads were active, but I assume that the pseudocode above is clearer. Please note: I used Eclipse Collections, which introduces the asParallel function and allows for a shorter notation of the code.
@Test
public void testasParallelthreads() {
    // ExecutorService executor = Executors.newWorkStealingPool();
    ExecutorService executor = Executors.newCachedThreadPool();
    MutableList<Double> myMainTask = Lists.mutable.with(1.0, 2.0);
    MutableList<Double> mySubTask = Lists.mutable.with(1.0, 2.0);
    MutableList<Double> mySubSubTask = Lists.mutable.with(1.0, 2.0);
    MutableList<Double> mySubSubSubTask = Lists.mutable.with(1.0, 2.0, 2.0);

    MutableList<Double> a = myMainTask.asParallel(executor, 1)
            .flatCollect(task -> mySubTask.asParallel(executor, 1)
                    .flatCollect(subTask -> mySubSubTask.asParallel(executor, 1)
                            .flatCollect(subSubTask -> mySubSubSubTask.asParallel(executor, 1)
                                    .flatCollect(subSubSubTask -> dummyFunction(task, subTask, subSubTask, subSubSubTask, executor))
                                    .toList()).toList()).toList()).toList();

    System.out.println("pool size: " + ((ThreadPoolExecutor) executor).getPoolSize());
    executor.shutdownNow();
}

private MutableList<Double> dummyFunction(double a, double b, double c, double d, ExecutorService ex) {
    System.out.println("ThreadId: " + Thread.currentThread().getId());
    System.out.println("Active threads size: " + ((ThreadPoolExecutor) ex).getActiveCount());
    return Lists.mutable.with(a, b, c, d);
}
I am assuming that for a CPU with N cores there is a rule of thumb that everything can be parallelized if the highest number of active threads (12) is #CPU + 1. Is this correct?
This topic is extremely hard to generalize about. Even with the actual code, the performance of your application is going to be very difficult to determine. Even if you could come up with an estimate, the actual performance may vary wildly between runs – especially considering that the threads interact with each other. The only time we can take the #CPU + 1 number is if the jobs that are submitted into the thread pool are independent and completely CPU bound.
I'd recommend trying a number of different thread-pool size values under simulated load to find the optimal values for your application. Examining the overall throughput numbers or system load stats should give you the feedback you need.
However, how does this work with threads spawning other threads and waiting (i.e. blocked until thread.join() succeeds) for these 'subthreads'?
Threads will block, and it is up to the OS/JVM to schedule another one if possible. If you have a single-thread pool executor and call join from one of your tasks, the other task won't even get started. With executors that use more threads, the blocking task blocks only a single thread, and the OS/JVM is free to schedule other threads.
These blocked threads should not consume CPU time, because they are blocked. So I am assuming that for a CPU with N cores there is a rule of thumb that everything can be parallelized if the highest number of active threads (24) is #CPU + 1. Is this correct?
Active threads can be blocked. I think you're mixing terms here: #CPU, the number of cores, and the number of virtual cores. If you have N physical cores, then you can run N CPU-bound tasks in parallel. When you have other types of blocking or very short-lived tasks, then you can have more parallel tasks.
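One common rule of thumb beyond #CPU + 1 (from Java Concurrency in Practice; the ratio below is an assumed example value, not measured from your tasks) sizes the pool by how long tasks wait relative to how long they compute:

// threads ≈ cores * (1 + waitTime / computeTime)
int cores = Runtime.getRuntime().availableProcessors();
double waitComputeRatio = 0.5; // hypothetical: tasks wait half as long as they compute
int poolSize = (int) (cores * (1 + waitComputeRatio));
ExecutorService pool = Executors.newFixedThreadPool(poolSize);

For purely CPU-bound tasks the ratio is 0 and this degenerates to one thread per core.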
I have a Java program which has 13 threads, though only one of them is at 99% CPU usage and has been running for ~24 hours. The others are at 0.0% CPU usage and show a TIME+ of anywhere from 0:00.0 to 0:12.82, and one has 3:51.48. The program is intended to be single threaded, so I'm wondering why the other threads are there.
What are they doing, and why do they show so little CPU usage and TIME+?
UPDATE: I have an old Java program I wrote (my first program - don't judge me!) which is single threaded and shows the same type of thread usage ...
import java.io.*;

class xdriver {
    static int N = 100;
    static double pi = 3.141592653589793;
    static double one = 1.0;
    static double two = 2.0;

    public static void main(String[] args) {
        //System.out.println("Program has started successfully\n");

        if (args.length == 1) {
            // assume that args[0] is an integer
            N = Integer.parseInt(args[0]);
        }

        // maybe we can get user input later on this ...
        int nr = N;
        int nt = N;
        int np = 2 * N;

        double dr = 1.0 / (double) (nr - 1);
        double dt = pi / (double) (nt - 1);
        double dp = (two * pi) / (double) (np - 1);

        System.out.format("nn --> %d\n", nr * nt * np);

        if (nr * nt * np < 0) {
            System.out.format("ERROR: nr*nt*np = %d(long) which is %d(int)\n", (long) ((long) nr * (long) nt * (long) np), nr * nt * np);
            System.exit(1);
        }

        // inserted to artificially blow up RAM
        double[][] dels = new double[nr * nt * np][3];

        double[] rs = new double[nr];
        double[] ts = new double[nt];
        double[] ps = new double[np];

        for (int ir = 0; ir < nr; ir++) {
            rs[ir] = dr * (double) (ir);
        }
        for (int it = 0; it < nt; it++) {
            ts[it] = dt * (double) (it);
        }
        for (int ip = 0; ip < np; ip++) {
            ps[ip] = dp * (double) (ip);
        }

        double C = (4.0 / 3.0) * pi;
        C = one / C;

        double fint = 0.0;
        int ii = 0;
        for (int ir = 0; ir < nr; ir++) {
            double r = rs[ir];
            double r2dr = r * r * dr;
            for (int it = 0; it < nt; it++) {
                double t = ts[it];
                double sint = Math.sin(t);
                for (int ip = 0; ip < np; ip++) {
                    fint += C * r2dr * sint * dt * dp;
                    dels[ii][0] = dr;
                    dels[ii][1] = dt;
                    dels[ii][2] = dp;
                    ii++; // advance the flat index so each cell of dels is written
                }
            }
        }

        System.out.format("N ........ %d\n", N);
        System.out.format("fint ..... %15.10f\n", fint);
        System.out.format("err ...... %15.10f\n", Math.abs(1.0 - fint));
    }
}
Starting a Java program means starting a JVM and telling it which main class to run (that usually has a static main method).
This JVM spawns several background threads in addition to the above mentioned main thread.
Among them are
The VM thread: An observing thread that waits for tasks that require the VM to be at a safe point. For example, there is a garbage collection task that fully "stops the world". But there are others.
GC threads: Several threads that are maintained to run the garbage collection.
Compile threads: These threads are used to compile byte code into native machine code.
There might be many more.
Additionally, if you are using AWT or Swing, you will get some more threads from those frameworks. One of them is the so-called event dispatch thread (EDT). And - of course - there might be threads that you created and ran: timers, executors, or simply some arbitrary threads. Even for a simple hello world application there might be a dozen threads running.
But most of these threads are more waiting than doing something. So chances are high that only one thread is really working, thus utilizing some CPU.
Although ... 100% CPU utilization might be an indicator of some problem - a never-ending loop, for example. You have to use a profiler to find out what really happens. But it could simply be a program that has such a CPU utilization. Your judgement.
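If you want to see these background threads for yourself, here is a small sketch using the standard Thread.getAllStackTraces() API that prints every live thread with its state:

public class ListJvmThreads {
    public static void main(String[] args) {
        // Even in a "single threaded" program, this prints the Reference
        // Handler, Finalizer, Signal Dispatcher, and other JVM threads.
        Thread.getAllStackTraces().keySet().forEach(t ->
                System.out.printf("%-30s daemon=%-5b state=%s%n",
                        t.getName(), t.isDaemon(), t.getState()));
    }
}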
Quoting the discussion here and other research:
A few of the core JVM threads:
Attach Listener: This thread always listens for other JVM threads to send it a request. A practical example is profiling, or (I am not sure about this, though) production-level application monitoring tools like DynaTrace.
Signal Dispatcher: When the OS raises a signal to the JVM, the signal dispatcher thread will pass the signal to the appropriate handler.
Reference Handler: High-priority thread to enqueue pending References. The GC creates a simple linked list of references which need to be processed and this thread quickly adds them to a proper queue and notifies ReferenceQueue listeners.
Finalizer: The Finalizer thread calls finalizer methods.
DestroyJavaVM: This thread unloads the Java VM on program exit. Most of the time it should be waiting.
Garbage Collector: Thread responsible for Java's garbage collection mechanism, depending upon whether GC is enabled.
main: Main thread running your program containing main method.
One important point to note is that how many and which core threads are started depends on the JVM implementation, but even if a Java program is written to be single threaded, there will be more than one thread in the JVM.
A Java program can be single threaded, but the JVM (which runs the user-defined Java program) is multi-threaded and will have (at least in recent JVMs) more than one thread right from the start.
Below is a snapshot of my Java HotSpot(TM) Client VM version 24.55-b03, running a single threaded Java program:
To answer your query
What are they doing and why do they show so little cpu usage and TIME+?
What part: they are started for a purpose by the JVM, as explained above - e.g. the JVM wants to listen in case a profiling or monitoring program asks it for some details.
Why part: because they are not really active or running; they are in a wait or parked state (see the yellow threads in my attached snapshot - if you have a GUI monitoring app you should also see them as yellow, or on the command line you will see the threads in the WAITING state), so they occupy few or no CPU cycles, hence the low CPU usage.
Again, TIME+ shows the time for which they have been active, and since they mostly are not, this value is also low.
Hope this helps!
Most likely threads have been created somewhere and are never used.
For instance:
ExecutorService es = Executors.newFixedThreadPool(12); // 12 worker threads are created but never given tasks

// burn CPU, using only one thread (main)
int i = 0;
while (true) {
    i++;
}
TIME+ from top is the amount of CPU time spent. One explanation is that if a process is constantly blocking, or is just always blocked, then it will have both a low CPU usage and a low TIME+.
The following code snippet executes two threads, one is a simple timer logging every second, the second is an infinite loop that executes a remainder operation:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TestBlockingThread {
    private static final Logger LOGGER = LoggerFactory.getLogger(TestBlockingThread.class);

    public static final void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            int i = 0;
            while (true) {
                i++;
                if (i != 0) {
                    boolean b = 1 % i == 0;
                }
            }
        };

        new Thread(new LogTimer()).start();
        Thread.sleep(2000);
        new Thread(task).start();
    }

    public static class LogTimer implements Runnable {
        @Override
        public void run() {
            while (true) {
                long start = System.currentTimeMillis();
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    // do nothing
                }
                LOGGER.info("timeElapsed={}", System.currentTimeMillis() - start);
            }
        }
    }
}
This gives the following result:
[Thread-0] INFO c.m.c.concurrent.TestBlockingThread - timeElapsed=1004
[Thread-0] INFO c.m.c.concurrent.TestBlockingThread - timeElapsed=1003
[Thread-0] INFO c.m.c.concurrent.TestBlockingThread - timeElapsed=13331
[Thread-0] INFO c.m.c.concurrent.TestBlockingThread - timeElapsed=1006
[Thread-0] INFO c.m.c.concurrent.TestBlockingThread - timeElapsed=1003
[Thread-0] INFO c.m.c.concurrent.TestBlockingThread - timeElapsed=1004
[Thread-0] INFO c.m.c.concurrent.TestBlockingThread - timeElapsed=1004
I don't understand why the infinite task blocks all other threads for 13.3 seconds. I tried changing thread priorities and other settings; nothing worked.
If you have any suggestions to fix this (including tweaking OS context switching settings) please let me know.
After all the explanations here (thanks to Peter Lawrey), we found that the main source of this pause is that the safepoint inside the loop is reached rather rarely, so it takes a long time to stop all threads for JIT-compiled code replacement.
But I decided to go deeper and find out why the safepoint is reached rarely. I found it a bit confusing why the back jump of the while loop is not "safe" in this case.
So I summon -XX:+PrintAssembly in all its glory to help
-XX:+UnlockDiagnosticVMOptions \
-XX:+TraceClassLoading \
-XX:+DebugNonSafepoints \
-XX:+PrintCompilation \
-XX:+PrintGCDetails \
-XX:+PrintStubCode \
-XX:+PrintAssembly \
-XX:PrintAssemblyOptions=-Mintel
After some investigation I found that after the third recompilation of the lambda, the C2 compiler threw away the safepoint polls inside the loop completely.
UPDATE
During the profiling stage, the variable i was never observed to equal 0. That's why C2 speculatively optimized this branch away, so that the loop was transformed into something like
for (int i = OSR_value; i != 0; i++) {
    if (1 % i == 0) {
        uncommon_trap();
    }
}
uncommon_trap();
Note that the originally infinite loop was reshaped into a regular finite loop with a counter! Due to the JIT optimization that eliminates safepoint polls in finite counted loops, there was no safepoint poll in this loop either.
After some time, i wrapped around back to 0, and the uncommon trap was taken. The method was deoptimized and continued execution in the interpreter. During recompilation with this new knowledge, C2 recognized the infinite loop and gave up compiling it. The rest of the method proceeded in the interpreter with proper safepoints.
There is a great must-read blog post "Safepoints: Meaning, Side Effects and Overheads" by Nitsan Wakart covering safepoints and this particular issue.
Safepoint elimination in very long counted loops is known to be a problem. The bug JDK-5014723 (thanks to Vladimir Ivanov) addresses this problem.
Workarounds are available until the bug is finally fixed.
You can try using -XX:+UseCountedLoopSafepoints (it causes an overall performance penalty and may lead to a JVM crash, JDK-8161147). With it, the C2 compiler keeps safepoints at the back jumps, and the original pause disappears completely.
You can explicitly disable compilation of the problematic method by using
-XX:CompileCommand='exclude,binary/class/Name,methodName'
Or you can rewrite your code by adding a safepoint manually. For example, a Thread.yield() call at the end of the cycle, or even changing int i to long i (thanks, Nitsan Wakart), will also fix the pause.
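For illustration, here is the long-counter variant of the task from the question (a sketch of the workaround described above): counted-loop safepoint elimination applies to int loops, so with a long counter C2 keeps the safepoint poll and the pause disappears.

Runnable task = () -> {
    long i = 0; // long counter: the loop keeps its safepoint poll
    while (true) {
        i++;
        if (i != 0) {
            boolean b = 1 % i == 0;
        }
    }
};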
In short, the loop you have has no safe point inside it except when i == 0 is reached. When this method is compiled and triggers the code to be replaced it needs to bring all the threads to a safe point, but this takes a very long time, locking up not just the thread running the code but all threads in the JVM.
I added the following command line options.
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintCompilation
I also modified the code to use floating point which appears to take longer.
boolean b = 1.0 / i == 0;
And what I see in the output is
timeElapsed=100
Application time: 0.9560686 seconds
41423 280 % 4 TestBlockingThread::lambda$main$0 # -2 (27 bytes) made not entrant
Total time for which application threads were stopped: 40.3971116 seconds, Stopping threads took: 40.3967755 seconds
Application time: 0.0000219 seconds
Total time for which application threads were stopped: 0.0005840 seconds, Stopping threads took: 0.0000383 seconds
41424 281 % 3 TestBlockingThread::lambda$main$0 # 2 (27 bytes)
timeElapsed=40473
41425 282 % 4 TestBlockingThread::lambda$main$0 # 2 (27 bytes)
41426 281 % 3 TestBlockingThread::lambda$main$0 # -2 (27 bytes) made not entrant
timeElapsed=100
Note: for code to be replaced, threads have to be stopped at a safe point. However, it appears here that such a safe point is reached very rarely (possibly only when i == 0). Changing the task to
Runnable task = () -> {
    for (int i = 1; i != 0; i++) {
        boolean b = 1.0 / i == 0;
    }
};
I see a similar delay.
timeElapsed=100
Application time: 0.9587419 seconds
39044 280 % 4 TestBlockingThread::lambda$main$0 # -2 (28 bytes) made not entrant
Total time for which application threads were stopped: 38.0227039 seconds, Stopping threads took: 38.0225761 seconds
Application time: 0.0000087 seconds
Total time for which application threads were stopped: 0.0003102 seconds, Stopping threads took: 0.0000105 seconds
timeElapsed=38100
timeElapsed=100
By carefully adding code to the loop you get a longer delay.
for (int i = 1; i != 0; i++) {
    boolean b = 1.0 / i / i == 0;
}
gets
Total time for which application threads were stopped: 59.6034546 seconds, Stopping threads took: 59.6030773 seconds
However, change the code to use a native method which always has a safe point (if it is not an intrinsic)
for (int i = 1; i != 0; i++) {
    boolean b = Math.cos(1.0 / i) == 0;
}
prints
Total time for which application threads were stopped: 0.0001444 seconds, Stopping threads took: 0.0000615 seconds
Note: adding if (Thread.currentThread().isInterrupted()) { ... } to a loop adds a safe point.
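For example, this variant of the loop (a sketch based on the note above) can be stopped promptly:

for (int i = 1; i != 0; i++) {
    // The interrupt check keeps a safe point in the loop, so the JVM
    // can stop this thread without a long time-to-safepoint pause.
    if (Thread.currentThread().isInterrupted()) {
        break;
    }
    boolean b = 1.0 / i == 0;
}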
Note: This happened on a 16 core machine so there is no lack of CPU resources.
Found the answer as to why. They are called safepoints, and are best known for the Stop-The-World pauses that happen because of GC.
See this article: Logging stop-the-world pauses in JVM
Different events can cause the JVM to pause all the application threads. Such pauses are called Stop-The-World (STW) pauses. The most common cause for an STW pause to be triggered is garbage collection (example in github) , but different JIT actions (example), biased lock revocation (example), certain JVMTI operations , and many more also require the application to be stopped.
The points at which the application threads may be safely stopped are called, surprise, safepoints. This term is also often used to refer to all the STW pauses.
It is more or less common that GC logs are enabled. However, this does not capture information on all the safepoints. To get it all, use these JVM options:
-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime
If you are wondering about the naming explicitly referring to GC, don't be alarmed – turning on these options logs all of the safepoints, not just garbage collection pauses. Running an example (source in github) with the flags specified above demonstrates this.
Reading the HotSpot Glossary of Terms, it defines this:
safepoint
A point during program execution at which all GC roots are known and all heap object contents are consistent. From a global point of view, all threads must block at a safepoint before the GC can run. (As a special case, threads running JNI code can continue to run, because they use only handles. During a safepoint they must block instead of loading the contents of the handle.) From a local point of view, a safepoint is a distinguished point in a block of code where the executing thread may block for the GC. Most call sites qualify as safepoints. There are strong invariants which hold true at every safepoint, which may be disregarded at non-safepoints. Both compiled Java code and C/C++ code can be optimized between safepoints, but less so across safepoints. The JIT compiler emits a GC map at each safepoint. C/C++ code in the VM uses stylized macro-based conventions (e.g., TRAPS) to mark potential safepoints.
Running with the above mentioned flags, I get this output:
Application time: 0.9668750 seconds
Total time for which application threads were stopped: 0.0000747 seconds, Stopping threads took: 0.0000291 seconds
timeElapsed=1015
Application time: 1.0148568 seconds
Total time for which application threads were stopped: 0.0000556 seconds, Stopping threads took: 0.0000168 seconds
timeElapsed=1015
timeElapsed=1014
Application time: 2.0453971 seconds
Total time for which application threads were stopped: 10.7951187 seconds, Stopping threads took: 10.7950774 seconds
timeElapsed=11732
Application time: 1.0149263 seconds
Total time for which application threads were stopped: 0.0000644 seconds, Stopping threads took: 0.0000368 seconds
timeElapsed=1015
Notice the third STW event:
Total time stopped: 10.7951187 seconds
Stopping threads took: 10.7950774 seconds
JIT compilation itself took virtually no time, but once the JVM had decided to perform a JIT compilation, it entered STW mode. However, since the code to be compiled (the infinite loop) doesn't contain a call site, no safepoint was ever reached.
The STW pause ends when the JIT eventually gives up waiting and concludes that the code is in an infinite loop.
After following the comment threads and some testing on my own, I believe that the pause is caused by the JIT compiler. Why the JIT compiler is taking such a long time is beyond my ability to debug.
However, since you only asked for how to prevent this, I have a solution:
Pull your infinite loop into a method so that it can be excluded from the JIT compiler:
import java.util.logging.Logger;

public class TestBlockingThread {
    private static final Logger LOGGER = Logger.getLogger(TestBlockingThread.class.getName());

    public static final void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            infLoop();
        };
        new Thread(new LogTimer()).start();
        Thread.sleep(2000);
        new Thread(task).start();
    }

    private static void infLoop() {
        int i = 0;
        while (true) {
            i++;
            if (i != 0) {
                boolean b = 1 % i == 0;
            }
        }
    }
}
Run your program with this VM argument:
-XX:CompileCommand=exclude,PACKAGE.TestBlockingThread::infLoop (replace PACKAGE with your package information)
You should get a message like this to indicate when the method would have been JIT-compiled: ### Excluding compile: static blocking.TestBlockingThread::infLoop
(You may notice that I put the class into a package called blocking.)
I'm dealing with multithreading in Java and, as someone pointed out to me, I noticed that threads warm up, that is, they get faster as they are repeatedly executed. I would like to understand why this happens and whether it is related to Java itself or is common behavior of every multithreaded program.
The code (by Peter Lawrey) that exemplifies it is the following:
for (int i = 0; i < 20; i++) {
    ExecutorService es = Executors.newFixedThreadPool(1);
    final double[] d = new double[4 * 1024];
    Arrays.fill(d, 1);
    final double[] d2 = new double[4 * 1024];
    es.submit(new Runnable() {
        @Override
        public void run() {
            // nothing.
        }
    }).get();
    long start = System.nanoTime();
    es.submit(new Runnable() {
        @Override
        public void run() {
            synchronized (d) {
                System.arraycopy(d, 0, d2, 0, d.length);
            }
        }
    });
    es.shutdown();
    es.awaitTermination(10, TimeUnit.SECONDS);
    // read the values in d2.
    for (double x : d2) ;
    long time = System.nanoTime() - start;
    System.out.printf("Time to pass %,d doubles to another thread and back was %,d ns.%n", d.length, time);
}
Results:
Time to pass 4,096 doubles to another thread and back was 1,098,045 ns.
Time to pass 4,096 doubles to another thread and back was 171,949 ns.
... deleted ...
Time to pass 4,096 doubles to another thread and back was 50,566 ns.
Time to pass 4,096 doubles to another thread and back was 49,937 ns.
I.e. it gets faster and stabilises around 50 µs (50,000 ns). Why is that?
If I run this code (20 repetitions), then execute something else (let's say post-processing of the previous results and preparation for another multithreading round), and later execute the same Runnable on the same ThreadPool for another 20 repetitions, will it already be warmed up, in any case?
In my program, I execute the Runnable in just one thread (actually one per processing core I have; it's a CPU-intensive program), then some other serial processing, alternately, many times. It doesn't seem to get faster as the program goes. Maybe I could find a way to warm it up…
It isn't the threads that are warming up so much as the JVM.
The JVM has what's called JIT (Just In Time) compiling. As the program is running, it analyzes what's happening and optimizes it on the fly. It does this by taking the byte code that the JVM runs and converting it to native code that runs faster. It can do this in a way that is optimal for your current situation, since it analyzes the actual runtime behavior. This can (though not always) result in great optimization - sometimes more than in programs that are compiled to native code without such knowledge.
You can read a bit more at http://en.wikipedia.org/wiki/Just-in-time_compilation
You could get a similar effect on any program as code is loaded into the CPU caches, but I believe this will be a smaller difference.
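A minimal sketch of that effect (my own demo, not from the original post): timing the same work repeatedly shows the first runs executing interpreted byte code and later runs executing JIT-compiled native code.

public class WarmupDemo {
    static double work() {
        double sum = 0;
        for (int i = 1; i <= 100_000; i++) {
            sum += Math.sqrt(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        for (int run = 0; run < 10; run++) {
            long start = System.nanoTime();
            work();
            // Expect the reported time to drop sharply after the first few runs.
            System.out.printf("run %d took %,d ns%n", run, System.nanoTime() - start);
        }
    }
}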
The only reasons I see that a thread execution can end up being faster are:
The memory manager can reuse already allocated object space (e.g. letting heap allocations fill up the available memory until the max memory - the Xmx property - is reached)
The working set is available in the hardware cache
Repeated operations might create patterns that the compiler can more easily reorder to optimize execution