Starvation in non-blocking approaches


I've been reading about non-blocking approaches for some time. Here is a piece of code for a so-called lock-free counter.
public class CasCounter {
    private SimulatedCAS value;

    public int getValue() {
        return value.get();
    }

    public int increment() {
        int v;
        do {
            v = value.get();
        }
        while (v != value.compareAndSwap(v, v + 1));
        return v + 1;
    }
}
I was just wondering about this loop:
do {
    v = value.get();
}
while (v != value.compareAndSwap(v, v + 1));
People say:
So it tries again, and again, until all other threads trying to change the value have done so. This is lock free as no lock is used, but not blocking free as it may have to try again (which is rare) more than once (very rare).
My question is:
How can they be so sure about that? As for me I can't see any reason why this loop can't be infinite, unless JVM has some special mechanisms to solve this.

The loop can be infinite (since it can generate starvation for your thread), but the likelihood for that happening is very small. In order for you to get starvation you need some other thread succeeding in changing the value that you want to update between your read and your store and for that to happen repeatedly.
It would be possible to write code to trigger starvation but for real programs it would be unlikely to happen.
Compare-and-swap is usually used when you don't expect write conflicts very often. Say there is a 50% chance of a "miss" on each update attempt: then there is a 25% chance that you will miss twice in a row, and less than a 0.1% chance that no attempt succeeds in 10 loops. For real-world programs, a 50% miss rate is very high (the threads would be doing basically nothing other than updating), and as the miss rate is reduced, to say 1%, the risk of not succeeding in two tries is only 0.01%, and in three tries 0.0001%.
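To get a feeling for how often the retry actually happens, here is a minimal sketch of the same loop written against java.util.concurrent.atomic.AtomicInteger, with a counter for failed attempts. The class name CasRetryDemo, the retries field and the run helper are made up for this illustration; they are not from the question.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

class CasRetryDemo {
    static final AtomicInteger value = new AtomicInteger();
    static final AtomicLong retries = new AtomicLong(); // failed CAS attempts (illustration only)

    // Same shape as the loop in the question: read, try to swap, retry on failure.
    static int increment() {
        while (true) {
            int v = value.get();
            if (value.compareAndSet(v, v + 1)) {
                return v + 1;
            }
            retries.incrementAndGet(); // another thread changed the value first; try again
        }
    }

    // Starts `threads` workers that each call increment() `perThread` times,
    // waits for them, and returns the final counter value.
    static int run(int threads, int perThread) {
        value.set(0);
        retries.set(0);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    increment();
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return value.get();
    }

    public static void main(String[] args) {
        int total = run(4, 100_000);
        System.out.println("final value = " + total + ", failed CAS attempts = " + retries.get());
    }
}
```

No increment is lost, so the final value is always threads * perThread. The number of failed attempts varies from run to run and depends on how heavily the threads contend.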
The usage is similar to the following problem
Set a variable a to 0 and have two threads updating it with a = a+1 a million times each concurrently. At the end a could have any answer between 1000000 (every other update was lost due to overwrite) and 2000000 (no update was overwritten).
The closer to 2000000 you get, the more likely the CAS usage is to work well, since that means that quite often the CAS would see the expected value and be able to set the new value.
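The experiment described above can be sketched roughly as follows. LostUpdateDemo and run are hypothetical names, and the field is deliberately neither volatile nor atomic, so the program has a data race and the result differs between runs (the JIT is also free to transform the loop).

```java
class LostUpdateDemo {
    static int a; // deliberately unsynchronized: this is the broken variant

    // Two threads each execute a = a + 1 `perThread` times; returns the final value of a.
    static int run(int perThread) {
        a = 0;
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                a = a + 1; // read, add, write back: another thread can overwrite in between
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println("a = " + run(1_000_000) + " (2000000 only if no update was lost)");
    }
}
```

The further the printed value falls below 2000000, the more often the two threads interfered with each other, which is exactly the situation in which a CAS loop would have to retry.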

Edit: I think I have a satisfactory answer now. The bit that confused me was the 'v != compareAndSwap'. In the actual code, CAS returns true if the value is equal to the compared expression. Thus, even if the first thread is interrupted between the get and the CAS, the second thread will succeed in the swap and exit the method, so the first thread will then be able to do its CAS.
Of course, it is possible that if two threads call this method an infinite number of times, one of them will not get the chance to run the CAS at all, especially if it has a lower priority, but this is one of the risks of unfair locking (the probability is very low however). As I've said, a queue mechanism would be able to solve this problem.
Sorry for the initial wrong assumptions.

Related

Why does the iteration speed increase over time? [JAVA]

I was playing around with loops in java, when I saw that the iteration speed keeps increasing.
Kind of seemed interesting.
Any ideas why?
Code:
import org.junit.jupiter.api.Test;

public class RandomStuffTest {
    public static long iterationsPerSecond = 0;

    @Test
    void testIterationSpeed() {
        Thread t = new Thread(()->{
            try{
                while (true){
                    System.out.println("Iterations per second: "+iterationsPerSecond);
                    iterationsPerSecond = 0;
                    Thread.sleep(1000);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        t.setDaemon(true);
        t.start();
        while (true){
            for (long i = 0; i < Long.MAX_VALUE; i++) {
                iterationsPerSecond++;
            }
        }
    }
}
Output:
Iterations per second: 6111
Iterations per second: 2199824206
Iterations per second: 4539572003
Iterations per second: 6919540856
Iterations per second: 9442209284
Iterations per second: 11899448226
Iterations per second: 14313220638
Iterations per second: 16827637088
Iterations per second: 19322118707
Iterations per second: 21807781722
Iterations per second: 24256315314
Iterations per second: 26641505580
Another thing that I noticed:
The CPU usage was around 20% all the time and not really increasing...
Maybe because I was running the code as a test using Junit?

The problem is the Java Memory Model (JMM).
Every thread is allowed to have (does not have to do this) a local copy of each field. Whenever it writes or reads this field it is free to just set its local copy and sync it up with other threads' local copies much, much later.
Said differently, the JVM is free to re-order instructions, do things in parallel, and otherwise apply whatever weird stuff it wants to optimize your code, as long as certain guarantees are never broken.
One guarantee that is easy to understand: The JVM is free to reorder or parallelize 2 sequential instructions, but it must never be possible to write code that can observe this except through timing.
In other words, int x = 0; x = 5; System.out.println(x); must necessarily print 5 and never 0.
You can establish such relationships between 2 threads as well but this involves the use of volatile and/or synchronized and/or something that does this internally (most things in the java.util.concurrent package).
You didn't, so this result is meaningless. Most likely, the instruction iterationsPerSecond = 0 is having no effect; the code iterationsPerSecond++ reads 9442209284, increments by one, and writes it back - and that field got written to 0 someplace in the middle of all that, which thus accomplished nothing whatsoever.
If you want to test this properly, try a volatile variable, or better yet an AtomicLong.
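A minimal sketch of the AtomicLong variant, assuming the goal is simply that no increment gets lost when the reader resets the counter. IterationCounter and countWithDrainingReader are hypothetical names; the key call is getAndSet(0), which reads and resets in one atomic step.

```java
import java.util.concurrent.atomic.AtomicLong;

class IterationCounter {
    static final AtomicLong iterations = new AtomicLong();

    // A writer thread increments the shared counter `total` times while the
    // calling thread repeatedly drains it with getAndSet(0).
    // Returns the sum of everything that was drained.
    static long countWithDrainingReader(long total) {
        iterations.set(0);
        Thread writer = new Thread(() -> {
            for (long i = 0; i < total; i++) {
                iterations.incrementAndGet();
            }
        });
        writer.start();
        long drained = 0;
        try {
            while (writer.isAlive()) {
                drained += iterations.getAndSet(0); // atomic read-and-reset
                Thread.sleep(1);
            }
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        drained += iterations.getAndSet(0); // collect whatever is left after the writer finished
        return drained;
    }

    public static void main(String[] args) {
        System.out.println("drained = " + countWithDrainingReader(10_000_000L));
    }
}
```

Because every increment and every reset is atomic, the drained total always equals the number of increments, which is not the case for the plain long field in the question.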

Like already indicated, the code is broken due to a data race.
The JIT can do some funny stuff with your code because of the data race:
while (true) {
    for (long i = 0; i < Long.MAX_VALUE; i++) {
        iterationsPerSecond++;
    }
}
Since it doesn't know that another thread is also messing with iterationsPerSecond, the compiler could fold the for loop, because it can calculate the outcome of the loop:
while (true) {
    iterationsPerSecond = Long.MAX_VALUE;
}
And it could even decide to pull the write out of the loop, since the same value is written every time (loop-invariant code motion):
iterationsPerSecond = Long.MAX_VALUE;
while (true) {
}
It could even decide to throw away the store, because it doesn't know there are any readers. Effectively it is a dead store, and hence dead code elimination can be applied:
while (true) {
}
An atomic or a volatile would solve the problem because a happens-before edge is established. A volatile read/write and AtomicLong.get/set are equally expensive: they are subject to the same compiler restrictions and the same fences at the hardware level.
If you want to run microbenchmarks, I would suggest checking out JMH. It will protect you against a lot of trivial mistakes.

Java Micro-optimization: To cache or not to cache a System.currentTimeMillis() return value?

Simple question, which I've been wondering. Of the following two versions of code, which is better optimized? Assume that the time value resulting from the System.currentTimeMillis() call only needs to be pretty accurate, so caching should only be considered from a performance point of view.
This (with value caching):
long time = System.currentTimeMillis();
for (long timestamp : times) {
if (time - timestamp > 600000L) {
// Do something
}
}
Or this (no caching):
for (long timestamp : times) {
if (System.currentTimeMillis() - timestamp > 600000L) {
// Do something
}
}
I'm assuming System.currentTimeMillis() is already a very optimized and lightweight method call, but let's assume I'll be calling it many, many times in a short period.
How many values must the "times" collection/array contain to justify caching the return value of System.currentTimeMillis() in its own variable?
Is this better to do from a CPU or memory optimization point of view?

A long is basically free. A JVM with a JIT compiler can keep it in a register, and since it's loop-invariant, the JIT can even rewrite your condition as -timestamp > 600000L - time or timestamp < time - 600000L, i.e. the condition becomes a trivial compare between the current element and a loop-invariant constant in a register.
So yes it's obviously more efficient to hoist a function call out of a loop and keep the result in a variable, especially when the optimizer can't do that for you, and especially when the result is a primitive type, not an Object.
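Written out by hand, the hoisted version might look like the sketch below. ExpiryCheck, countExpired and MAX_AGE_MS are names invented for this example, and the "do something" body is replaced by a simple count so that the method is self-contained.

```java
class ExpiryCheck {
    static final long MAX_AGE_MS = 600_000L; // the 600000L from the question

    // Counts the timestamps that are more than MAX_AGE_MS older than `now`.
    // `now` is read once by the caller instead of once per element.
    static int countExpired(long[] times, long now) {
        long cutoff = now - MAX_AGE_MS; // loop-invariant, computed a single time
        int expired = 0;
        for (long timestamp : times) {
            // Same test as (now - timestamp > MAX_AGE_MS), rearranged
            if (timestamp < cutoff) {
                expired++;
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long[] times = { now - 700_000L, now - 100_000L, now };
        System.out.println(countExpired(times, now)); // only the first entry is older than 10 minutes
    }
}
```

Passing `now` in as a parameter also makes the check easy to test with fixed values.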
Assuming your code is running on a JVM that JITs x86 machine code, System.currentTimeMillis() will probably include at least an rdtsc instruction and some scaling of that result (see footnote 1). So the cheapest it can possibly be (on Skylake for example) is a micro-coded 20-uop instruction with a throughput of one per 25 clock cycles (http://agner.org/optimize/).
If your // Do something is simple, like just a few memory accesses that usually hit in cache, or some simpler calculation, or anything else that out-of-order execution can do a good job with, that could be most of the cost of your loop. Unless each loop iteration typically takes multiple microseconds (i.e. time for thousands of instructions on a 4GHz superscalar CPU), hoisting System.currentTimeMillis() out of the loop can probably make a measurable difference. Small vs. huge will depend on how simple your loop body is.
If you can prove that hoisting it out of your loop won't cause correctness problems, then go for it.
Even with it inside your loop, your thread could still sleep for an unbounded length of time between calling it and doing the work for that iteration. But hoisting it out of the loop makes it more likely that you could actually observe this kind of effect in practice; running more iterations "too late".
Footnote 1: On modern x86, the time-stamp counter runs at a fixed rate, so it's useful as a low-overhead timesource, and less useful for cycle-accurate micro-benchmarking. (Use performance counters for that, or disable turbo / power saving so core clock = reference clock.)
IDK if a JVM would actually go to the trouble of implementing its own time function, though. It might just use an OS-provided time function. On Linux, gettimeofday and clock_gettime are implemented in user-space (with code + scale factor data exported by the kernel into user-space memory, in the VDSO region). So glibc's wrapper just calls that, instead of making an actual syscall.
So clock_gettime can be very cheap compared to an actual system call that switches to kernel mode and back. That can take at least 1800 clock cycles on Skylake, on a kernel with Spectre + Meltdown mitigation enabled.
So yes, it's hopefully safe to assume System.currentTimeMillis() is "very optimized and lightweight", but even rdtsc itself is expensive compared to some loop bodies.

In your case, method calls should always be hoisted out of loops.
System.currentTimeMillis() simply reads a value from OS memory, so it is very cheap (a few nanoseconds), as opposed to System.nanoTime(), which involves a call to hardware, and therefore can be orders of magnitude slower.

Is this algorithm starvation free?

I've stumbled upon a modified version of the Bakery algorithm (an incomplete one, of course, with flaws).
I've been asked in class whether the following algorithm can have a starvation issue:
while(true){
    number[me] = max(number[0],...,number[n]) + 1
    for (other from 0 to n) {
        while(number[other] != 0 && number[other] < number[me]) {
            // Wait
        }
    }
    /*CS*/
    number[me] = 0
}
I understand that a deadlock is possible; what I'm asking is whether this algorithm is starvation-free.
I think that it is, because I can guarantee that once thread A has chosen a number, every thread that chooses afterwards will get a bigger number than thread A, and therefore A will eventually be allowed to enter the CS.
My friend thinks that the algorithm is not starvation-free, since a thread can be stuck in the process of taking a number (calculating the max) and possibly have its CPU time taken away. Meanwhile other threads will start and finish, and perhaps start again (because of the while(true)), while thread A is supposedly being starved.
My question can be simplified to this:
Does the choosing array in the original Bakery algorithm solve starvation?

Starvation-freedom can be defined as: any process trying to enter the critical section will eventually enter it.
The line that calculates the max is not part of the critical section, so the process will eventually receive CPU time to make that assignment.
Once process A has received its number, it waits for every other process whose number is lower than its own (a lower number means higher priority). Eventually those processes leave the critical section, and if they try again they take a new number. That number is greater than A's, so at that point process A enters the critical section.
Therefore, the algorithm is starvation-free.
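For comparison, here is a rough Java sketch of the original Bakery algorithm including the choosing array that the question mentions. The names BakeryLock and demo are invented for this example. Using AtomicIntegerArray (for visibility between threads) and Thread.yield() in the wait loops are assumptions of this sketch, not part of the pseudocode above, and ticket overflow is ignored.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

class BakeryLock {
    private final int n;
    private final AtomicIntegerArray choosing; // 1 while thread i is picking its number
    private final AtomicIntegerArray number;   // 0 means "not interested"

    BakeryLock(int n) {
        this.n = n;
        this.choosing = new AtomicIntegerArray(n);
        this.number = new AtomicIntegerArray(n);
    }

    void lock(int me) {
        choosing.set(me, 1);
        int max = 0;
        for (int j = 0; j < n; j++) {
            max = Math.max(max, number.get(j));
        }
        number.set(me, max + 1);
        choosing.set(me, 0);
        for (int other = 0; other < n; other++) {
            if (other == me) continue;
            // Do not compare against a number that is still being chosen
            while (choosing.get(other) != 0) Thread.yield();
            // Wait while `other` holds a smaller ticket (ties broken by thread index)
            while (true) {
                int theirs = number.get(other);
                int mine = number.get(me);
                if (theirs == 0 || theirs > mine || (theirs == mine && other > me)) break;
                Thread.yield();
            }
        }
    }

    void unlock(int me) {
        number.set(me, 0);
    }

    // Each thread increments a plain, unsynchronized counter inside the critical section.
    static int demo(int threads, int perThread) {
        BakeryLock lock = new BakeryLock(threads);
        int[] counter = new int[1];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    lock.lock(id);
                    counter[0]++; // critical section
                    lock.unlock(id);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter[0];
    }
}
```

A call such as BakeryLock.demo(3, 2000) should return 6000 if mutual exclusion holds, since every increment happens inside the critical section.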

StackOverflowError in Math.random in a randomly recursive method

This is the context of my program.
A function has 50% chance to do nothing, 50% to call itself twice.
What is the probability that the program will finish?
I wrote this piece of code, and it apparently works great. The answer, which may not be obvious to everyone, is that this program has a 100% chance of finishing. But there is a StackOverflowError (how convenient ;) ) when I run this program, occurring in Math.random(). Could someone point out where it comes from, and tell me whether my code is wrong?
static int bestDepth = 0;
static int numberOfPrograms = 0;

@Test
public void testProba(){
    for(int i = 0; i <1000; i++){
        long time = System.currentTimeMillis();
        bestDepth = 0;
        numberOfPrograms = 0;
        loop(0);
        LOGGER.info("Best depth:"+ bestDepth +" in "+(System.currentTimeMillis()-time)+"ms");
    }
}

public boolean loop(int depth){
    numberOfPrograms++;
    if(depth> bestDepth){
        bestDepth = depth;
    }
    if(proba()){
        return true;
    }
    else{
        return loop(depth + 1) && loop(depth + 1);
    }
}

public boolean proba(){
    return Math.random()>0.5;
}
.
java.lang.StackOverflowError
at java.util.Random.nextDouble(Random.java:394)
at java.lang.Math.random(Math.java:695)
.
I suspect the stack and the amount of function in it is limited, but I don't really see the problem here.
Any advice or clue are obviously welcome.
Fabien
EDIT: Thanks for your answers, I ran it with java -Xss4m and it worked great.

Every method call pushes a frame onto the calling thread's stack. The frame holds the method's arguments and local variables, plus the bookkeeping needed to return to the caller.
Your loop function calls itself recursively, so each level of recursion adds another frame, and deep recursion puts a lot of data on the stack.
However, each thread's stack has a limited size. When a call needs a new frame and there is no room left, the JVM throws a StackOverflowError in that thread.
The error surfaces in Math.random() only because that happened to be the call that needed one more frame when the stack was already full; the frames that filled the stack belong to loop().
In order to handle your problem, I suggest you to increase the stack size using the -Xss option of Java.

As you said, the loop function calls itself recursively. Tail-recursive calls can be rewritten as loops by the compiler so that they occupy no extra stack space (this is called tail call optimization, TCO). Unfortunately, the Java compiler does not do that, and your loop is not tail-recursive anyway. Your options here are:
1. Increase the stack size, as suggested by the other answers. Note that this will just defer the problem further in time: no matter how large your stack is, its size is still finite. You just need a longer chain of recursive calls to break out of the space limit.
2. Rewrite the function in terms of loops.
3. Use a language that has a compiler that performs TCO.
   3.1. You will still need to rewrite the function to be tail-recursive.
   3.2. Or rewrite it with trampolines (only minor changes are needed). A good paper explaining trampolines and generalizing them further is called "Stackless Scala with Free Monads".
To illustrate the point in 3.2, here's what the rewritten function would look like:
def loop(depth: Int): Trampoline[Boolean] = {
  numberOfPrograms = numberOfPrograms + 1
  if(depth > bestDepth) {
    bestDepth = depth
  }
  if(proba()) done(true)
  else for {
    r1 <- loop(depth + 1)
    r2 <- loop(depth + 1)
  } yield r1 && r2
}
And initial call would be loop(0).run.
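The "rewrite it in terms of loops" option can also be sketched in plain Java. Because loop() only ever returns true, the && never short-circuits, so every non-terminating call makes exactly two further calls, and it is enough to count how many calls are still pending. PendingCalls, countCalls, stopProbability and maxCalls are names invented for this sketch, and unlike the original it does not track bestDepth.

```java
import java.util.Random;

class PendingCalls {
    // Simulates the recursion without using the call stack.
    // Each simulated call stops with probability `stopProbability`;
    // otherwise it schedules two more calls.
    // Returns the total number of simulated calls, or -1 if `maxCalls` was reached first.
    static long countCalls(Random rng, double stopProbability, long maxCalls) {
        long pending = 1; // the initial loop(0)
        long calls = 0;
        while (pending > 0) {
            if (calls == maxCalls) {
                return -1; // give up: the run is taking too long
            }
            pending--;
            calls++;
            if (rng.nextDouble() >= stopProbability) {
                pending += 2; // corresponds to loop(depth + 1) && loop(depth + 1)
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        // 0.5 matches proba() in the question; the cap guards against very long runs
        long calls = countCalls(new Random(), 0.5, 100_000_000L);
        System.out.println(calls < 0 ? "gave up" : "finished after " + calls + " calls");
    }
}
```

This version needs only two long variables no matter how deep the simulated recursion gets, so it cannot overflow the stack; it can still run for a very long time, which is why the sketch takes a cap.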

Increasing the stack-size is a nice temporary fix. However, as proved by this post, though the loop() function is guaranteed to return eventually, the average stack-depth required by loop() is infinite. Thus, no matter how much you increase the stack by, your program will eventually run out of memory and crash.
There is nothing we can do to prevent this for certain; we always need to encode the stack in memory somehow, and we'll never have infinite memory. However, there is a way to reduce the amount of memory you're using by about 2 orders of magnitude. This should give your program a significantly higher chance of returning, rather than crashing.
We can do this by noticing that, at each layer in the stack, there's really only one piece of information we need to run your program: the piece that tells us if we need to call loop() again or not after returning. Thus, we can emulate the recursion using a stack of bits. Each emulated stack-frame will require only one bit of memory (right now it requires 64-96 times that, depending on whether you're running in 32- or 64-bit).
The code would look something like this (though I don't have a Java compiler right now so I can't test it):
// Needs: import java.util.BitSet;
static int bestDepth = 0;
static int numLoopCalls = 0;

public void emulateLoop() {
    // Our fake stack. We push a 1 when this point on the stack still needs a second call to loop(), a 0 if it doesn't
    BitSet fakeStack = new BitSet();
    int currentDepth = 0; // int, because BitSet indices and bestDepth are ints
    numLoopCalls = 0;
    while (currentDepth >= 0) {
        numLoopCalls++;
        if (proba()) {
            // "return" from the current function, going up the call stack until we hit a point where we need to "call loop()" a second time
            fakeStack.clear(currentDepth);
            while (!fakeStack.get(currentDepth)) {
                currentDepth--;
                if (currentDepth < 0) {
                    return;
                }
            }
            // At this point, we've hit a point where loop() needs to be called a second time.
            // Mark it as called, and call it
            fakeStack.clear(currentDepth);
            currentDepth++;
        }
        else {
            // Need to call loop() twice, so we push a 1 and continue the while-loop
            fakeStack.set(currentDepth);
            currentDepth++;
            if (currentDepth > bestDepth) {
                bestDepth = currentDepth;
            }
        }
    }
}
This will probably be slightly slower, but it will use about 1/100th the memory. Note that the BitSet is stored on the heap, so there is no longer any need to increase the stack-size to run this. If anything, you'll want to increase the heap-size.

The downside of recursion is that it starts filling up your stack, which will eventually cause a stack overflow if your recursion is too deep. If you want to ensure that the test ends, you can increase your stack size using the answers given in the following Stack Overflow thread:
How to increase the Java stack size?

checking a value for reset value before resetting it - performance impact?

I have a variable that gets read and updated thousands of times a second. It needs to be reset regularly. But "half" the time, the value is already the reset value. Is it a good idea to check the value first (to see if it needs resetting) before resetting it (a write operation), or should I just reset it regardless? The main goal is to optimize the code for performance.
To illustrate:
Random r = new Random();
int val = Integer.MAX_VALUE;
for (int i=0; i<100000000; i++) {
    if (i % 2 == 0)
        val = Integer.MAX_VALUE;
    else
        val = r.nextInt();
    if (val != Integer.MAX_VALUE) //skip check?
        val = Integer.MAX_VALUE;
}
I tried to use the above program to test the 2 scenarios (by un/commenting the 2nd "if" line), but any difference is masked by the natural variance of the run duration time.
Thanks.

Don't check it.
It's more execution steps = more cycles = more time.
As an aside, you are breaking one of the basic software golden rules: "Don't optimise early". Unless you have hard evidence that this piece of code is a performance problem, you shouldn't be looking at it. (Note that this doesn't mean you code without performance in mind; you still follow normal best practice, but you don't add any special code whose only purpose is "performance related".)

The check has no actual performance impact. We'd be talking about a single clock cycle or something, which is usually not relevant in a Java program (as hard-core number crunching usually isn't done in Java).
Instead, base the decision on readability. Think of the maintainer who's going to change this piece of code five years on.
In the case of your example, using my rationale, I would skip the check.

Most likely the JIT will optimise the code away because it doesn't do anything.
Rather than worrying about performance, it is usually better to worry about what is simpler to understand and cleaner to implement.
On both counts, you might remove the check, as it doesn't do anything useful, and removing it could make the code faster as well.
Even if it did make the code a little slower it would be very small compared to the cost of calling r.nextInt() which is not cheap.
