Why does the iteration speed increase over time? [JAVA]

Why does the iteration speed increase over time? [JAVA] - java

I was playing around with loops in java, when I saw that the iteration speed keeps increasing.
Kind of seemed interesting.
Any ideas why?
Code:
import org.junit.jupiter.api.Test;
public class RandomStuffTest {
public static long iterationsPerSecond = 0;
#Test
void testIterationSpeed() {
Thread t = new Thread(()->{
try{
while (true){
System.out.println("Iterations per second: "+iterationsPerSecond);
iterationsPerSecond = 0;
Thread.sleep(1000);
}
} catch (Exception e) {
e.printStackTrace();
}
});
t.setDaemon(true);
t.start();
while (true){
for (long i = 0; i < Long.MAX_VALUE; i++) {
iterationsPerSecond++;
}
}
}
}
Output:
Iterations per second: 6111
Iterations per second: 2199824206
Iterations per second: 4539572003
Iterations per second: 6919540856
Iterations per second: 9442209284
Iterations per second: 11899448226
Iterations per second: 14313220638
Iterations per second: 16827637088
Iterations per second: 19322118707
Iterations per second: 21807781722
Iterations per second: 24256315314
Iterations per second: 26641505580
Another thing that I noticed:
The CPU usage was around 20% all the time and not really increasing...
Maybe because I was running the code as a test using Junit?

The problem is the Java Memory Model (JMM).
Every thread is allowed to have (does not have to do this) a local copy of each field. Whenever it writes or reads this field it is free to just set its local copy and sync it up with other threads' local copies much, much later.
Said differently, the JVM is free to re-order instructions, do things in parallel, and otherwise apply whatever weird stuff it wants to optimize your code, as long as certain guarantees are never broken.
One guarantee that is easy to understand: The JVM is free to reorder or parallelize 2 sequential instructions, but it must never be possible to write code that can observe this except through timing.
In other words, int x = 0; x = 5; System.out.println(x); must necessarily print 5 and never 0.
You can establish such relationships between 2 threads as well but this involves the use of volatile and/or synchronized and/or something that does this internally (most things in the java.util.concurrent package).
You didn't, so this result is meaningless. Most likely, the instruction iterationsPerSecond = 0 is having no effect; the code iterationsPerSecond++ reads 9442209284, increments by one, and writes it back - and that field got written to 0 someplace in the middle of all that, which thus accomplished nothing whatsoever.
If you want to test this properly, try a volatile variable, or better yet an AtomicLong.

Like already indicated, the code is broken due to a data race.
The JIT can do some funny stuff with your code because of the data race:
while (true){
for (long i = 0; i < Long.MAX_VALUE; i++) {
iterationsPerSecond++;
}
}
Since it doesn't know that another thread is also messing with the iterationsPerSecond, the compiler could fold the for loop because it can calculate the outcome of the loop:
while (true){
iterationsPerSecond=Long.MAX_VALUE
}
And it could even decide to pull out the write of the loop since the same value is written (loop invariant code motion):
iterationsPerSecond=Long.MAX_VALUE
while (true){
}
It could even decide the throw away the store, because it doesn't know there are any readers. So effectively it is a dead store and hence it can apply dead code elimination.
while (true){
}
An atomic or volatile would solve the problem because a happens before edge is established. Using a volatile or an atomiclong.get/set is equally expensive. It has the same compiler restrictions and fences on hardware level.
If you want to run microbenchmarks, I would suggest checking out JMH. It will protect you against a lot of trivial mistakes.

Related

Threads run in serial not parallel

I am trying to learn concurrency in Java, but whatever I do, 2 threads run in serial, not parallel, so I am not able to replicate common concurrency issues explained in tutorials (like thread interference and memory consistency errors). Sample code:
public class Synchronization {
static int v;
public static void main(String[] args) {
Runnable r0 = () -> {
for (int i = 0; i < 10; i++) {
Synchronization.v++;
System.out.println(v);
}
};
Runnable r1 = () -> {
for (int i = 0; i < 10; i++) {
Synchronization.v--;
System.out.println(v);
}
};
Thread t0 = new Thread(r0);
Thread t1 = new Thread(r1);
t0.start();
t1.start();
}
}
This always give me a result starting from 1 and ending with 0 (whatever the loop length is). For example, the code above gives me every time:
1
2
3
4
5
6
7
8
9
10
9
8
7
6
5
4
3
2
1
0
Sometimes, the second thread starts first and the results are the same but negative, so it is still running in serial.
Tried in both Intellij and Eclipse with identical results. CPU has 2 cores if it matters.
UPDATE: it finally became reproducible with huge loops (starting from 1_000_000), though still not every time and just with small amount of final discrepancy. Also seems like making operations in loops "heavier", like printing thread name makes it more reproducible as well. Manually adding sleep to thread also works, but it makes experiment less cleaner, so to say. The reason doesn't seems to be that first loop finishes before the second starts, because I see both loops printing to console while continuing operating and still giving me 0 at the end. The reasons seems more like a thread race for same variable. I will dig deeper into that, thanks.

Seems like first started thread just never give a chance to second in Thread Race to take a variable/second one just never have a time to even start (couldn't say for sure), so the second almost* always will be waiting until first loop will be finished.
Some heavy operation will mix the result:
TimeUnit.MILLISECONDS.sleep(100);
*it is not always true, but you are was lucky in your tests

Starting a thread is heavyweight operation, meaning that it will take some time to perform. Due that fact, by the time you start second thread, first is finished.
The reasoning why sometimes it is in "revert order" is due how thread scheduler works. By the specs there are not guarantees about thread execution order - having that in mind, we know that it is possible for second thread to run first (and finish)
Increase iteration count to something meaningful like 10000 and see what will happen then.

This is called lucky timing as per Brian Goetz (Author of Java Concurrency In Practice). Since there is no synchronization to the static variable v it is clear that this class is not thread-safe.

Making computation-threads cancellable in a smart way

I am wondering how to reach a compromise between fast-cancel-responsiveness and performance with my threads which body look similar to this loop:
for(int i=0; i<HUGE_NUMBER; ++i) {
//some easy computation like adding numbers
//which are result of previous iteration of this loop
}
If a computation in loop body is quite easy then adding simple check-reaction to each iteration:
if (Thread.currentThread().isInterrupted()) {
throw new InterruptedException("Cancelled");
}
may slow down execution of the code.
Even if I change the above condition to:
if (i % 100 && Thread.currentThread().isInterrupted()) {
throw new InterruptedException("Cancelled");
}
Then compilator cannot just precompute values of i and check condition only in some specific situations since HUGE_NUMBER is variable and can have different values.
So I'd like to ask if there's any smart way of adding such check to a presented code knowing that:
HUGE_NUMBER is variable and can have different values
loop body consists of some easy-to-compute, but relying on prevoius computations code.
What I want to say is that one iteration of a loop is quite fast, but HUGE_NUMBER of iterations can take a little more time and this is what I want to avoid.

First of all, use Thread.interrupted() instead of Thread.currentThread().isInterrupted() in that case.
You should think about if checking the interruption flag really slows down your calculation too much! One the one hand, if the loop body is VERY simple, even a huge number of iterations (the upper limit is Integer.MAX_VALUE) will run in a few seconds. Even when checking the interruption flag will result in an overhead of 20 or 30%, this will not add very much to the total runtime of your algorithm.
On the other hand, if the loop body is not that simple and so it will run longer, testing the interruption flag will not be a remarkable overhead I think.
Don't do tricks like if (i % 10000 == 0), as this will slow down calculation much more than a 'short' Thread.interrupted().
There is one small trick that you could use - but think twice because it makes your code more complex and less readable:
Whenever you have a loop like that:
for (int i = 0; i < max; i++) {
// loop-body using i
}
you can split up the total range of i into several intervals of size INTERVAL_SIZE:
int start = 0;
while (start < max) {
final int next = Math.min(start + INTERVAL_SIZE, max);
for(int i = start; i < next; i++) {
// loop-body using i
}
start = next;
}
Now you can add your interruption check right before or after the inner loop!
I've done some tests on my system (JDK 7) using the following loop-body
if (i % 2 == 0) x++;
and Integer.MAX_VALUE / 2 iterations. The results are as follows (after warm-up):
Simple loop without any interruption checks: 1,949 ms
Simple loop with check per iteration: 2,219 ms (+14%)
Simple loop with check per 1 million-th iteration using modulo: 3,166 ms (+62%)
Simple loop with check per 1 million-th iteration using bit-mask: 2,653 ms (+36%)
Interval-loop as described above with check in outer loop: 1,972 ms (+1.1%)
So even if the loop-body is as simple as above, the overhead for a per-iteration check is only 14%! So it's recommended to not do any tricks but simply check the interruption flag via Thread.interrupted() in every iteration!

Make your calculation an Iterator.
Although this does not sound terribly useful the benefit here is that you can then quite easily write filter iterators that can be surprisingly flexible. They can be added and removed simply - even through configuration if you wish. There are a number of benefits - try it.
You can then add a filtering Iterator that watches the time and checks for interrupt on a regular basis - or something even more flexible.
You can even add further filtering without compromising the original calculation by interspersing it with brittle status checks.

Starvation in non-blocking approaches

I've been reading about non-blocking approaches for some time. Here is a piece of code for so called lock-free counter.
public class CasCounter {
private SimulatedCAS value;
public int getValue() {
return value.get();
}
public int increment() {
int v;
do {
v = value.get();
}
while (v != value.compareAndSwap(v, v + 1));
return v + 1;
}
}
I was just wondering about this loop:
do {
v = value.get();
}
while (v != value.compareAndSwap(v, v + 1));
People say:
So it tries again, and again, until all other threads trying to change the value have done so. This is lock free as no lock is used, but not blocking free as it may have to try again (which is rare) more than once (very rare).
My question is:
How can they be so sure about that? As for me I can't see any reason why this loop can't be infinite, unless JVM has some special mechanisms to solve this.

The loop can be infinite (since it can generate starvation for your thread), but the likelihood for that happening is very small. In order for you to get starvation you need some other thread succeeding in changing the value that you want to update between your read and your store and for that to happen repeatedly.
It would be possible to write code to trigger starvation but for real programs it would be unlikely to happen.
The compare and swap is usually used when you don't think you will have write conflicts very often. Say there is a 50% chance of "miss" when you update, then there is a 25% chance that you will miss in two loops and less than 0.1% chance that no update would succeed in 10 loops. For real world examples, a 50% miss rate is very high (basically not doing anything than updating), and as the miss rate is reduces, to say 1% then the risk of not succeeding in two tries is only 0.01% and in 3 tries 0.0001%.
The usage is similar to the following problem
Set a variable a to 0 and have two threads updating it with a = a+1 a million times each concurrently. At the end a could have any answer between 1000000 (every other update was lost due to overwrite) and 2000000 (no update was overwritten).
The closer to 2000000 you get the more likely the CAS usage is to work since that mean that quite often the CAS would see the expected value and be able to set with the new value.

Edit: I think I have a satisfactory answer now. The bit that confused me was the 'v != compareAndSwap'. In the actual code, CAS returns true if the value is equal to the compared expression. Thus, even if the first thread is interrupted between get and CAS, the second thread will succeed the swap and exit the method, so the first thread will be able to do the CAS.
Of course, it is possible that if two threads call this method an infinite number of times, one of them will not get the chance to run the CAS at all, especially if it has a lower priority, but this is one of the risks of unfair locking (the probability is very low however). As I've said, a queue mechanism would be able to solve this problem.
Sorry for the initial wrong assumptions.

What really is to “warm up” threads on multithreading processing?

I’m dealing with multithreading in Java and, as someone pointed out to me, I noticed that threads warm up, it is, they get faster as they are repeatedly executed. I would like to understand why this happens and if it is related to Java itself or whether it is a common behavior of every multithreaded program.
The code (by Peter Lawrey) that exemplifies it is the following:
for (int i = 0; i < 20; i++) {
ExecutorService es = Executors.newFixedThreadPool(1);
final double[] d = new double[4 * 1024];
Arrays.fill(d, 1);
final double[] d2 = new double[4 * 1024];
es.submit(new Runnable() {
#Override
public void run() {
// nothing.
}
}).get();
long start = System.nanoTime();
es.submit(new Runnable() {
#Override
public void run() {
synchronized (d) {
System.arraycopy(d, 0, d2, 0, d.length);
}
}
});
es.shutdown();
es.awaitTermination(10, TimeUnit.SECONDS);
// get a the values in d2.
for (double x : d2) ;
long time = System.nanoTime() - start;
System.out.printf("Time to pass %,d doubles to another thread and back was %,d ns.%n", d.length, time);
}
Results:
Time to pass 4,096 doubles to another thread and back was 1,098,045 ns.
Time to pass 4,096 doubles to another thread and back was 171,949 ns.
... deleted ...
Time to pass 4,096 doubles to another thread and back was 50,566 ns.
Time to pass 4,096 doubles to another thread and back was 49,937 ns.
I.e. it gets faster and stabilises around 50 ns. Why is that?
If I run this code (20 repetitions), then execute something else (lets say postprocessing of the previous results and preparation for another mulithreading round) and later execute the same Runnable on the same ThreadPool for another 20 repetitions, it will be warmed up already, in any case?
On my program, I execute the Runnable in just one thread (actually one per processing core I have, its a CPU-intensive program), then some other serial processing alternately for many times. It doesn’t seem to get faster as the program goes. Maybe I could find a way to warm it up…

It isn't the threads that are warming up so much as the JVM.
The JVM has what's called JIT (Just In Time) compiling. As the program is running, it analyzes what's happening in the program and optimizes it on the fly. It does this by taking the byte code that the JVM runs and converting it to native code that runs faster. It can do this in a way that is optimal for your current situation, as it does this by analyzing the actual runtime behavior. This can (not always) result in great optimization. Even more so than some programs that are compiled to native code without such knowledge.
You can read a bit more at http://en.wikipedia.org/wiki/Just-in-time_compilation
You could get a similar effect on any program as code is loaded into the CPU caches, but I believe this will be a smaller difference.

The only reasons I see that a thread execution can end up being faster are:
The memory manager can reuse already allocated object space (e.g., to let heap allocations fill up the available memory until the max memory is reached - the Xmx property)
The working set is available in the hardware cache
Repeating operations might create operations the compiler can easier reorder to optimize execution

Java ConcurrentHashMap

In an application where 1 thread is responsible for updating a map continuously and the main thread periodically reads the map, is it sufficient to use a ConcurrentHashmap? Or should I explicitly lock operations in synchronize blocks? Any explanation would be great.
Update
I have a getter and a setter for the map (encapsulated in a custom type) which can be used simultaneously by both threads, is a ConcurrentHashMap still a good solution? Or maybe I should synchronize the getter/setter (or perhaps declare the instance variable to be volatile)? Just want to make sure that this extra detail doesn't change the solution.

As long as you perform all operation in one method call to the concurrent hash map, you don't need to use additional locking. Unfortunately if you need to perform a number of methods atomically, you have to use locking, in which case using concurrent hash map doesn't help and you may as well use a plain HashMap.
#James' suggestion got me thinking as to whether tuning un-need concurrency makes a ConcurrentHashMap faster. It should reduce memory, but you would need to have thousands of these to make much difference. So I wrote this test and it doesn't appear obvious that you would always need to tune the concurrency level.
warmup: Average access time 36 ns.
warmup2: Average access time 28 ns.
1 concurrency: Average access time 25 ns.
2 concurrency: Average access time 25 ns.
4 concurrency: Average access time 25 ns.
8 concurrency: Average access time 25 ns.
16 concurrency: Average access time 24 ns.
32 concurrency: Average access time 25 ns.
64 concurrency: Average access time 26 ns.
128 concurrency: Average access time 26 ns.
256 concurrency: Average access time 26 ns.
512 concurrency: Average access time 27 ns.
1024 concurrency: Average access time 28 ns.
Code
public static void main(String[] args) {
test("warmup", new ConcurrentHashMap());
test("warmup2", new ConcurrentHashMap());
for(int i=1;i<=1024;i+=i)
test(i+" concurrency", new ConcurrentHashMap(16, 0.75f, i));
}
private static void test(String description, ConcurrentHashMap map) {
Integer[] ints = new Integer[2000];
for(int i=0;i<ints.length;i++)
ints[i] = i;
long start = System.nanoTime();
for(int i=0;i<20*1000*1000;i+=ints.length) {
for (Integer j : ints) {
map.put(j,1);
map.get(j);
}
}
long time = System.nanoTime() - start;
System.out.println(description+": Average access time "+(time/20/1000/1000/2)+" ns.");
}
As #bestss points out, a larger concurrency level can be slower as it has poorer caching characteristics.
EDIT: Further to #betsss concern about whether loops get optimised if there is no method calls. Here is three loops, all the same but iterate a different number of times. They print
10M: Time per loop 661 ps.
100K: Time per loop 26490 ps.
1M: Time per loop 19718 ps.
10M: Time per loop 4 ps.
100K: Time per loop 17 ps.
1M: Time per loop 0 ps.
.
{
int loops = 10*1000 * 1000;
long product = 1;
long start = System.nanoTime();
for(int i=0;i< loops;i++)
product *= i;
long time = System.nanoTime() - start;
System.out.println("10M: Time per loop "+1000*time/loops+" ps.");
}
{
int loops = 100 * 1000;
long product = 1;
long start = System.nanoTime();
for(int i=0;i< loops;i++)
product *= i;
long time = System.nanoTime() - start;
System.out.println("100K: Time per loop "+1000*time/loops+" ps.");
}
{
int loops = 1000 * 1000;
long product = 1;
long start = System.nanoTime();
for(int i=0;i< loops;i++)
product *= i;
long time = System.nanoTime() - start;
System.out.println("1M: Time per loop "+1000*time/loops+" ps.");
}
// code for three loops repeated

That is sufficient, as the purpose of ConcurrentHashMap is to allow lockless get / put operations, but make sure you are using it with the correct concurrency level. From the docs:
Ideally, you should choose a value to accommodate as many threads as will ever concurrently modify the table. Using a significantly higher value than you need can waste space and time, and a significantly lower value can lead to thread contention. But overestimates and underestimates within an order of magnitude do not usually have much noticeable impact. A value of one is appropriate when it is known that only one thread will modify and all others will only read. Also, resizing this or any other kind of hash table is a relatively slow operation, so, when possible, it is a good idea to provide estimates of expected table sizes in constructors.
See http://download.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentHashMap.html.
EDIT:
The wrappered getter/setter make no difference so long as it is still being read/written to by multiple threads. You could concurrently lock the whole map, but that defeats the purpose of using a ConcurrentHashMap.

A ConcurrentHashMap is a good solution for a situation involving lots of write operations and fewer read operations. The downside is that it is not guaranteed what writes a reader will see at any particular moment. So if you require the reader to see the most up-to-date version of the map, it is not a good solution.
From the Java 6 API documentation:
Retrieval operations (including get) generally do not block, so may overlap with update operations (including put and remove). Retrievals reflect the results of the most recently completed update operations holding upon their onset. For aggregate operations such as putAll and clear, concurrent retrievals may reflect insertion or removal of only some entries.
If that is not acceptable for your project, your best solution is really a fully synchronous lock. Solutions for many write operations with few read operations, as far as I know, compromise up-to-date reads in order to achieve faster, non-blocked writing. If you do go with this solution, the Collections.synchronizedMap(...) method creates a fully-synchronized, single reader/writer wrapper for any map object. Easier than writing your own.

You'd be better off using ConcurrentHashMap, as it's implementation doesn't normally block reads. If you synchronize externally, you'll end up blocking most reads, as you don't have access to the internal knowledge of the impl. needed to not do so.

If there is only one writer it should be safe to just use ConcurrentHashMap. If you feel the need to synchronize there are other HashMaps that do the synchronization for you and will be faster than manually writing the synchronization.

Yes... and to optimize it better, you should set the concurrency level to 1.
From Javadoc:
The allowed concurrency among update operations is guided by the optional concurrencyLevel constructor argument (default 16), which is used as a hint for internal sizing. .... A value of one is appropriate when it is known that only one thread will modify and all others will only read.

The solution works because of memory consistency effects for ConcurrentMaps: As with other concurrent collections, actions in a thread prior to placing an object into a ConcurrentMap as a key or value happen-before actions subsequent to the access or removal of that object from the ConcurrentMap in another thread.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Why does the iteration speed increase over time? [JAVA] - java

Related

Threads run in serial not parallel

Making computation-threads cancellable in a smart way

Starvation in non-blocking approaches

What really is to “warm up” threads on multithreading processing?

Java ConcurrentHashMap

Categories

Resources