Java LongAdder sumThenReset Concurrency

I am considering switching from AtomicLong to LongAdder. I'm using it to count requests reaching a server, and once a minute I write the result to a DB and start counting again. For that purpose I used AtomicLong's getAndSet method, which I intend to replace with sumThenReset of LongAdder.
The documentation of sumThenReset states the following:
the returned value is not guaranteed to be the final value occurring before the reset
So what have we done here? Does it mean that some increments can by definition be lost and not counted anywhere?

I have found that if you write your own replacement for sumThenReset that does:
long sum = longAdder.sum();
longAdder.add(-sum);
return sum;
it works very nicely.
The reason is that the implementation behind the scenes splits the storage into a number of cells, so contention between atomic adds is significantly reduced; the class is optimized for adds. The sum, reset, and sumThenReset methods iterate over the cells and sum them, reset them, or both, and as such are not atomic. Using a negative add instead works well because you remove exactly what you summed and return only that much, so the overall count is preserved.
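As a minimal sketch of the idea above (the class name DrainingCounter and the demo method are made up for illustration, and a single draining thread is assumed), the sum-then-subtract approach could be wrapped like this:

```java
import java.util.concurrent.atomic.LongAdder;

final class DrainingCounter {
    private final LongAdder adder = new LongAdder();

    void increment() {
        adder.increment();
    }

    // Returns the amount seen by sum() and subtracts exactly that amount.
    // Increments that race with this call stay in the adder for the next drain.
    long drain() {
        long seen = adder.sum();
        adder.add(-seen);
        return seen;
    }

    // Hypothetical demo: worker threads increment while the caller drains repeatedly.
    // Returns everything drained, which should equal threads * perThread.
    static long demo(int threads, int perThread) {
        DrainingCounter counter = new DrainingCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        long total = 0;
        for (int i = 0; i < 1000; i++) {
            total += counter.drain();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // All writers are done, so this last drain picks up whatever is left.
        return total + counter.drain();
    }
}
```

Each individual drain() result is still not an atomic snapshot; what is preserved is the grand total across drains.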

Read the whole method doc:
This method may apply for example during quiescent points between
multithreaded computations. If there are updates concurrent with this
method, the returned value is not guaranteed to be the final value
occurring before the reset.
Quiescent means quiet: that is, no threads are updating the LongAdder. So to answer your question: yes, some increments may be lost if updates run concurrently with sumThenReset.
From the Class doc:
This class is usually preferable to AtomicLong when multiple threads
update a common sum that is used for purposes such as collecting
statistics, not for fine-grained synchronization control.
Probably what you need is AtomicLong.
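For reference, a minimal sketch of the AtomicLong variant the question describes (the class and method names here are invented for illustration):

```java
import java.util.concurrent.atomic.AtomicLong;

final class RequestCounter {
    private final AtomicLong count = new AtomicLong();

    // Called once per incoming request.
    void onRequest() {
        count.incrementAndGet();
    }

    // Atomically swaps in zero and returns the value that was there,
    // so no increment is lost between the read and the reset.
    long takeAndReset() {
        return count.getAndSet(0L);
    }
}
```

The periodic job would call takeAndReset() once a minute and write the returned value to the DB.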

Related

Does Java LongAdder's increment() & sum() prevent getting the same value twice?

Currently I am using AtomicLong as a synchronized counter in my application, but I have found that with high concurrency/contention, e.g. with 8 threads, my throughput is much lower (75% lower) than single-threaded for obvious reasons (e.g. concurrent CAS).
Use case:
A counter variable which
is updated by multiple threads concurrently
has high write contention, basically every usage in a thread will consist of a write with an immediate read afterwards
Requirement is that each read from the counter (immediately after the writing) gets a unique incremented value.
It is not required that each retrieved counter value is increasing in the same order as the different threads(writers) increment the value.
So I tried to replace AtomicLong with a LongAdder, and indeed it looks from my measurements that my throughput with 8 threads is much better - (only) about 20% lower than single-threaded (compared to 75%).
However I'm not sure I correctly understand the way LongAdder works.
The JavaDoc says:
This class is usually preferable to AtomicLong when multiple threads
update a common sum that is used for purposes such as collecting
statistics, not for fine-grained synchronization control.
and for sum()
Returns the current sum. The returned value is NOT an atomic snapshot;
invocation in the absence of concurrent updates returns an accurate
result, but concurrent updates that occur while the sum is being
calculated might not be incorporated.
What is meant by fine-grained synchronization control ...
From looking at this SO question and the source of AtomicLong and Striped64, I think I understand that if the update on an AtomicLong is blocked because of a CAS instruction issued by another thread, the update is stored thread-local and accumulated later to get some eventual consistency. So without further synchronization and because the incrementAndGet() in LongAdder is not atomic but two instructions, I fear the following is possible:
private static final LongAdder counter = new LongAdder(); // == 0
// no further synchronisation happening in java code
Thread#1 : counter.increment();
Thread#2 : counter.increment(); // CAS T#1 still ongoing, storing +1 thread-locally
Thread#2 : counter.sum(); // == 1
Thread#3 : counter.increment(); // CAS T#1 still ongoing, storing +1 thread-locally
Thread#3 : counter.sum(); // == 1
Thread#1 : counter.sum(); // == 3 (after merging everything)
If this is possible, LongAdder is not really suitable for my use case, which probably then counts as "fine-grained synchronization control".
And then with my write/read^n pattern I probably can't do better than AtomicLong?

LongAdder is definitely not suitable for your use case of unique integer generation, but you don't need to understand the implementation or dig into the intricacies of the java memory model to determine that. Just look at the API: it has no compound "increment and get" type methods that would allow you to increment the value and get the old/new value back, atomically.
In terms of adding values, it only offers void add(long x) and void increment() methods, but these don't return a value. You mention:
the incrementAndGet in LongAdder is not atomic
... but I don't see incrementAndGet at all in LongAdder. Where are you looking?
Your idea of:
usage in a thread will consist of a write with an immediate read afterwards
Requirement is that each read from the counter (immediately after the writing) gets a unique incremented value.
It is not required that each retrieved counter value is increasing in the same order as the different threads(writers) increment the value.
Doesn't work even for AtomicLong, unless by "write followed by a read" you mean calling the incrementAndGet method. I think it goes without saying that two separate calls on an AtomicLong or LongAdder (or any other object really) can never be atomic without some external locking.
So the Java doc, in my opinion, is a bit confusing. Yes, you should not use sum() for synchronization control, and yes "concurrent updates that occur while the sum is being calculated might not be incorporated"; however, the same is true of AtomicLong and its get() method. Increments that occur while calling get() similarly may or may not be reflected in the value returned by get().
Now there are some guarantees that are weaker with LongAdder compared to AtomicLong. One guarantee you get with AtomicLong is that a series of operations transitions the object through a specific series of values, and while there is no guarantee on what specific value a thread will see, all the values should come from the true set of transition values.
For example, consider starting with an AtomicLong with value zero, and two threads incrementing it concurrently, by 1 and 3 respectively. The final value will always be 4, and only two transition paths are possible: 0 -> 1 -> 4 or 0 -> 3 -> 4. For a given execution, only one of those can have occurred and all concurrent reads will be consistent with that execution. That is, if any thread reads a 1, then no thread may read a 3 and vice versa (of course, there is no guarantee that any thread will see a 1 or 3 at all; they may all see 0 or 4).
LongAdder doesn't provide that guarantee. Since the write process is not locked, and the read process adds together several values in a non-atomic fashion, it is possible for one thread to see a 1 and another to see a 3 in the same execution. Of course, it still doesn't synthesize "fake" values - you should never read a "2" for example.
Now that's a bit of a subtle concept and the Javadoc doesn't get it across well. They go with a pretty weak and not particularly formal statement instead. Finally, I don't think you can observe the behavior above with pure increments (rather than additions) since there is only one path then: 0 -> 1 -> 2 -> 3, etc. So for increments, I think AtomicLong.get() and LongAdder.sum() have pretty much the same guarantees.
Something Useful
OK, so I'll give you something that might be useful. You can still implement what you want efficiently, as long as you don't have strict requirements on the exact relationship between the counter value each thread gets and the order in which the values were read.
Re-purpose the LongAdder Idea
You could make the LongAdder idea work fine for unique counter generation. The underlying idea of LongAdder is to spread the counter into N distinct counters (which live on separate cache lines). Any given call updates one of those counters based on the current thread ID [2], and a read needs to sum the values from all counters. This means that writes have low contention, at the cost of a bit more complexity, and at a large cost to reads.
Now, the way the write works by design doesn't let you read the full LongAdder value, but since you just want a unique value you could use the same code except with the top or bottom N bits [3] set uniquely per counter.
Now the write can return the prior value, like getAndIncrement and it will be unique because the fixed bits keep it unique among all counters in that object.
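A rough sketch of that idea, under some assumptions: the class StripedIdGenerator is invented here, it uses plain AtomicLong stripes rather than LongAdder's padded cells (so stripes may share cache lines), and it picks the stripe from the thread ID.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class StripedIdGenerator {
    private final int bits;              // low bits reserved for the stripe index
    private final AtomicLong[] stripes;  // one independent counter per stripe

    StripedIdGenerator(int bits) {
        this.bits = bits;
        this.stripes = new AtomicLong[1 << bits];
        for (int i = 0; i < stripes.length; i++) {
            stripes[i] = new AtomicLong();
        }
    }

    // Returns a value that is unique across all stripes: the stripe's own
    // counter in the high bits, the stripe index in the low bits.
    long next() {
        int stripe = (int) (Thread.currentThread().getId() & (stripes.length - 1));
        long local = stripes[stripe].getAndIncrement();
        return (local << bits) | stripe;
    }

    // Hypothetical demo: returns how many distinct IDs were generated.
    static int demo(int threads, int perThread) {
        StripedIdGenerator gen = new StripedIdGenerator(3);
        Set<Long> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    seen.add(gen.next());
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.size();
    }
}
```

Two threads that hash to the same stripe still contend on that stripe's AtomicLong, but the values stay unique because getAndIncrement is atomic per stripe.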
Thread-local Counters
A very fast and simple way is to use a unique value per thread, and a thread-local counter. When the thread local is initialized, it gets a unique ID from a shared counter (only once per thread), and then you combine that ID with a thread-local counter - for example, the bottom 24 bits for the ID, and the top 40 bits for the local counter [1]. This should be very fast, and more importantly has essentially zero contention.
The downside is that the values of the counters won't have any specific relationship among threads (although they may still be strictly increasing within a thread). For example, a thread which has recently requested a counter value may get a much smaller one than a long existing value. You haven't described how you'll use these so I don't know if it is a problem.
Also, you don't have a single place to read the "total" number of counters allocated - you have to examine all the local counters to do that. This is doable if your application requires it (and has some of the same caveats as the LongAdder.sum() function).
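A minimal sketch of the thread-local scheme, using the 24/40 bit split from the example above. The class name ThreadLocalIds is made up, and overflow of either field is deliberately not handled here:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class ThreadLocalIds {
    private static final int ID_BITS = 24;  // bottom bits hold the per-thread ID
    private static final AtomicLong NEXT_THREAD_ID = new AtomicLong();

    // One mutable slot per thread: [0] = the thread's unique ID, [1] = its local counter.
    // The shared counter is touched only once per thread, on first use.
    private static final ThreadLocal<long[]> STATE =
        ThreadLocal.withInitial(() -> new long[] { NEXT_THREAD_ID.getAndIncrement(), 0L });

    // No shared state is touched after initialization, so there is no contention.
    // Assumption: fewer than 2^24 threads and fewer than 2^40 values per thread.
    static long next() {
        long[] s = STATE.get();
        long local = s[1]++;
        return (local << ID_BITS) | s[0];
    }

    // Hypothetical demo: returns how many distinct IDs were generated.
    static int demo(int threads, int perThread) {
        Set<Long> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    seen.add(next());
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.size();
    }
}
```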
A different solution, if you want the numbers to be "generally increasing with time" across threads, and know that every thread requests counter values reasonably frequently, is to use a single global counter from which threads request a local "allocation" of a number of IDs, and then hand out individual IDs from that allocation in a thread-local manner. For example, threads may request 10 IDs, so that three threads will be allocated the ranges 0-9, 10-19, and 20-29, etc. They then allocate out of that range until it is exhausted, at which point they go back to the global counter. This is similar to how memory allocators carve out chunks of a common pool which can then be allocated thread-locally.
The example above will keep the IDs roughly in increasing order over time, and each thread's IDs will be strictly increasing as well. It doesn't offer any strict guarantees though: a thread that is allocated the range 0-9 could very well sleep for hours after using 0, and then use "1" when the counters on other threads are much higher. It would reduce contention by a factor of 10.
There are a variety of other approaches you could use and most of them trade off contention reduction against the "accuracy" of the counter assignment relative to real time. If you had access to the hardware, you could probably use a quickly incrementing clock like the cycle counter (e.g., rdtscp) and the core ID to get a unique value that is very closely tied to real time (assuming the OS is synchronizing the counters).
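The block-allocation scheme could look roughly like the following sketch. The class name BlockIdAllocator and the block size constant are chosen here only to mirror the 0-9, 10-19 example:

```java
import java.util.concurrent.atomic.AtomicLong;

final class BlockIdAllocator {
    private static final int BLOCK = 10;  // IDs reserved per trip to the global counter

    private final AtomicLong nextBlockStart = new AtomicLong();

    // Per thread: [0] = next ID to hand out, [1] = end of the current block (exclusive).
    // Starting with an empty range forces a reservation on first use.
    private final ThreadLocal<long[]> range =
        ThreadLocal.withInitial(() -> new long[] { 0L, 0L });

    long next() {
        long[] r = range.get();
        if (r[0] == r[1]) {
            // Block exhausted: reserve the next BLOCK IDs from the shared counter.
            r[0] = nextBlockStart.getAndAdd(BLOCK);
            r[1] = r[0] + BLOCK;
        }
        return r[0]++;
    }
}
```

Only one call in every BLOCK touches the shared AtomicLong, which is where the factor-of-10 reduction in contention comes from.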
[1] The bit-field sizes should be chosen carefully based on the expected number of threads and per-thread increments in your application. In general, if you are constantly creating new threads and your application is long-lived, you may want to err on the side of more bits to the thread ID, since you can always detect a wrap of the local counter and get a new thread ID, so bits allocated to the thread ID can be efficiently shared with the local counters (but not the other way around).
[2] The optimal is to use the 'CPU ID', but that's not directly accessible in Java (and even at the assembly level there is no fast and portable way to get it, AFAIK) - so the thread ID is used as a proxy.
[3] Where N is log2(number of counters).

There's a subtle difference between the two implementations.
An AtomicLong holds a single number which every thread will attempt to update. Because of this, as you have already found, only one thread can update this value at a time. The advantage, though, is that the value will always be up-to-date when a get is called, as there will be no adds in progress at that time.
A LongAdder, on the other hand, is made up of multiple values, and each value will be updated by a subset of the threads. This results in less contention when updating the value, however it is possible for sum to have an incomplete value if done while an add is in progress, similar to the scenario you described.
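LongAdder is recommended for those cases where you will be doing a bunch of adds in parallel followed by a sum at the end. For your use case, I wrote the following, which confirmed that around 1 in 10 sums were repeated (which renders LongAdder unusable for your use case).
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class LongAdderDuplicates
{
    public static void main (String[] args) throws Exception
    {
        LongAdder adder = new LongAdder();
        ExecutorService executor = Executors.newFixedThreadPool(10);
        // Maps each value returned by the adder to the number of times it was observed.
        Map<Long, Integer> count = new ConcurrentHashMap<>();
        for (int i = 0; i < 10; i++)
        {
            executor.execute(() -> {
                for (int j = 0; j < 1000000; j++)
                {
                    adder.add(1);
                    count.merge(adder.longValue(), 1, Integer::sum);
                }
            });
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.HOURS);
        // Print every value that was observed more than once.
        count.entrySet().stream().filter(e -> e.getValue() > 1).forEach(System.out::println);
    }
}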

How atomicity is achieved in the classes defined in java.util.concurrent.atomic package?

I was going through the source code of java.util.concurrent.atomic.AtomicInteger to find out how atomicity is achieved by the atomic operations provided by the class. For instance AtomicInteger.getAndIncrement() method source is as follows
public final int getAndIncrement() {
    for (;;) {
        int current = get();
        int next = current + 1;
        if (compareAndSet(current, next))
            return current;
    }
}
I am not able to understand the purpose of writing the sequence of operations inside an infinite for loop. Does it serve any special purpose in Java Memory Model (JMM). Please help me find a descriptive understanding. Thanks in advance.

I am not able to understand the purpose of writing the sequence of operations inside an infinite for loop.
The purpose of this code is to ensure that the volatile field gets updated appropriately without the overhead of a synchronized lock. Unless there are a large number of threads all competing to update this same field, this will most likely spin a very few times to accomplish this.
The volatile keyword provides visibility and memory synchronization guarantees but does not in itself ensure atomic operations with multiple operations (test and set). If you are testing and then setting a volatile field there are race-conditions if multiple threads are trying to perform the same operation at the same time. In this case, if multiple threads are trying to increment the AtomicInteger at the same time, you might miss one of the increments. The concurrent code here uses the spin loop and the compareAndSet underlying methods to make sure that the volatile int is only updated to 4 (for example) if it still is equal to 3.
t1 gets the atomic-int and it is 0.
t2 gets the atomic-int and it is 0.
t1 adds 1 to it
t1 atomically tests to make sure it is 0, it is, and stores 1.
t2 adds 1 to it
t2 atomically tests to make sure it is 0, it is not, so it has to spin and try again.
t2 gets the atomic-int and it is 1.
t2 adds 1 to it
t2 atomically tests to make sure it is 1, it is, and stores 2.
Does it serve any special purpose in Java Memory Model (JMM).
No, it serves the purpose of the class and method definitions and uses the JMM and the language definitions around volatile to achieve its purpose. The JMM defines what the language does with the synchronized, volatile, and other keywords and how multiple threads interact with cached and central memory. This is mostly about native code interactions with operating system and hardware and is rarely, if ever, about Java code.
It is the compareAndSet(...) method which gets closer to the JMM by calling into the Unsafe class which is mostly native methods with some wrappers:
public final boolean compareAndSet(int expect, int update) {
    return unsafe.compareAndSwapInt(this, valueOffset, expect, update);
}

I am not able to understand the purpose of writing the sequence of operations inside an infinite for loop.
To understand why it is in an infinite loop I find it helpful to understand what the compareAndSet does and how it may return false.
Atomically sets the value to the given updated value if the current
value == the expected value.
Parameters:
expect - the expected value
update - the new value
Returns:
true if successful. False return indicates that the actual value was not
equal to the expected value
So you read the Returns message and ask how is that possible?
Suppose two threads invoke incrementAndGet at close to the same time, and both enter and see current == 1. Both threads will compute a local next == 2 and try to set it via compareAndSet. Only one thread will win, as documented, and the thread that loses must try again.
This is how CAS works: you attempt to change the value; if you fail, you try again; if you succeed, you continue on.
Simply declaring the field as volatile will not work, because incrementing is not atomic. So something like this is not safe in the scenario I described:
volatile int count = 0;
public int incrementAndGet(){
    return ++count; //may return the same number more than once.
}

Java's compareAndSet is based on CPU compare-and-swap (CAS) instructions; see http://en.wikipedia.org/wiki/Compare-and-swap. It compares the contents of a memory location to a given value and, only if they are the same, modifies the contents of that memory location to a given new value.
In the case of incrementAndGet we read the current value and call compareAndSet(current, current + 1). If it returns false, it means that another thread interfered and changed the current value, so our attempt failed and we need to repeat the whole cycle until it succeeds.
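The same read, compute, compareAndSet, retry cycle works for operations the atomic classes don't provide directly. As an illustrative sketch (the class BoundedCounter is invented here, not part of the JDK), this is a counter that refuses to go past a maximum:

```java
import java.util.concurrent.atomic.AtomicInteger;

final class BoundedCounter {
    private final AtomicInteger value = new AtomicInteger();
    private final int max;

    BoundedCounter(int max) {
        this.max = max;
    }

    // Increments unless the counter is already at max; returns true on success.
    boolean tryIncrement() {
        for (;;) {
            int current = value.get();
            if (current >= max) {
                return false;
            }
            if (value.compareAndSet(current, current + 1)) {
                return true;
            }
            // CAS failed: another thread changed the value in between,
            // so loop around, re-read, and try again.
        }
    }

    int get() {
        return value.get();
    }
}
```

The check against max and the increment behave as one atomic step because the CAS only succeeds if the value is still the one that was checked.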

Is it a sensible optimization to check whether a variable holds a specific value before writing that value?

if (var != X)
    var = X;
Is it sensible or not? Will the compiler always optimize-out the if statement? Are there any use cases that would benefit from the if statement?
What if var is a volatile variable?
I'm interested in both C++ and Java answers, as volatile variables have different semantics in the two languages. Java's JIT compilation can also make a difference.
The if statement introduces branching and an additional read that wouldn't happen if we always overwrote var with X, so it's bad. On the other hand, if var == X then with this optimization we perform only a read and no write, which could have some effect on the cache. Clearly, there are some trade-offs here. I'd like to know what it looks like in practice. Has anyone done any testing on this?
EDIT:
I'm mostly interested in what it looks like in a multi-processor environment. In a trivial situation there doesn't seem to be much sense in checking the variable first. But when cache coherency has to be kept between processors/cores the extra check might actually be beneficial. I just wonder how big an impact it can have. Also, shouldn't the processor do such an optimization itself? If var == X, assigning the value X once more should not 'dirty' the cache. But can we rely on this?

Yes, there are definitely cases where this is sensible, and as you suggest, volatile variables are one of those cases - even for single threaded access!
Volatile writes are expensive, both from a hardware and a compiler/JIT perspective. At the hardware level, these writes might be 10x-100x more expensive than a normal write, since write buffers have to be flushed (on x86, the details will vary by platform). At the compiler/JIT level, volatile writes inhibit many common optimizations.
Speculation, however, can only get you so far - the proof is always in the benchmarking. Here's a microbenchmark that tries your two strategies. The basic idea is to copy values from one array to another (pretty much System.arraycopy), with two variants - one which copies unconditionally, and one that checks to see if the values are different first.
Here are the copy routines for the simple, non-volatile case (full source here):
// no check
for (int i=0; i < ARRAY_LENGTH; i++) {
    target[i] = source[i];
}
// check, then set if unequal
for (int i=0; i < ARRAY_LENGTH; i++) {
    int x = source[i];
    if (target[i] != x) {
        target[i] = x;
    }
}
The results using the above code to copy an array length of 1000, using Caliper as my microbenchmark harness, are:
benchmark    arrayType    ns  linear runtime
CopyNoCheck  SAME        470  =
CopyNoCheck  DIFFERENT   460  =
CopyCheck    SAME       1378  ===
CopyCheck    DIFFERENT  1856  ====
This also includes about 150ns of overhead per run to reset the target array each time. Skipping the check is much faster - about 0.47 ns per element (or around 0.32 ns per element after we remove the setup overhead, so pretty much exactly 1 cycle on my box).
Checking is about 3x slower when the arrays are the same, and 4x slower when they are different. I'm surprised at how bad the check is, given that it is perfectly predicted. I suspect that the culprit is largely the JIT - with a much more complex loop body, it may be unrolled fewer times, and other optimizations may not apply.
Let's switch to the volatile case. Here, I've used AtomicIntegerArray as my arrays of volatile elements, since Java doesn't have any native array types with volatile elements. Internally, this class is just writing straight through to the array using sun.misc.Unsafe, which allows volatile writes. The assembly generated is substantially similar to normal array access, other than the volatile aspect (and possibly range check elimination, which may not be effective in the AIA case).
Here's the code:
// no check
for (int i=0; i < ARRAY_LENGTH; i++) {
target.set(i, source[i]);
}
// check, then set if unequal
for (int i=0; i < ARRAY_LENGTH; i++) {
int x = source[i];
if (target.get(i) != x) {
target.set(i, x);
}
}
And here are the results:
arrayType  benchmark         us  linear runtime
SAME       CopyCheckAI     2.85  =======
SAME       CopyNoCheckAI  10.21  ===========================
DIFFERENT  CopyCheckAI    11.33  ==============================
DIFFERENT  CopyNoCheckAI  11.19  =============================
The tables have turned. Checking first is ~3.5x faster than the usual method. Everything is much slower overall - in the check case, we are paying ~3 ns per loop, and in the worst cases ~10 ns (the times above are in us, and cover the copy of the whole 1000 element array). Volatile writes really are more expensive. There is about 1 ns of overhead included in the DIFFERENT case to reset the array on each iteration (which is why even the simple method is slightly slower for DIFFERENT). I suspect a lot of the overhead in the "check" case is actually bounds checking.
This is all single threaded. If you actually had cross-core contention over a volatile, the results would be much, much worse for the simple method, and just about as good as the above for the check case (the cache line would just sit in the shared state - no coherency traffic needed).
I've also only tested the extremes of "every element equal" vs "every element different". This means the branch in the "check" algorithm is always perfectly predicted. If you had a mix of equal and different, you wouldn't get just a weighted combination of the times for the SAME and DIFFERENT cases - you do worse, due to misprediction (both at the hardware level, and perhaps also at the JIT level, which can no longer optimize for the always-taken branch).
So whether it is sensible, even for volatile, depends on the specific context - the mix of equal and unequal values, the surrounding code and so on. I'd usually not do it for volatile alone in a single-threaded scenario, unless I suspected a large number of sets are redundant. In heavily multi-threaded structures, however, reading first and only then doing a volatile write (or other expensive operation, like a CAS) is a best practice, and you'll see it in quality code such as the java.util.concurrent structures.
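To illustrate the read-before-CAS pattern mentioned above, here is a small sketch of a one-shot flag. The class OnceFlag is a made-up example, not code taken from java.util.concurrent:

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class OnceFlag {
    private final AtomicBoolean done = new AtomicBoolean();

    // Returns true only for the single caller that flips the flag.
    boolean tryClaim() {
        // Cheap volatile read first: once the flag is set, callers return here
        // and never attempt the more expensive CAS.
        return !done.get() && done.compareAndSet(false, true);
    }
}
```

After the first successful claim, every later call only reads, so the cache line can stay shared between cores.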

Is it a sensible optimization to check whether a variable holds a specific value before writing that value?
Are there any use cases that would benefit from the if statement?
It is when assignment is significantly more costly than an inequality comparison returning false.
An example would be a large* std::set, which may require many heap allocations to duplicate.
* for some definition of "large"
Will the compiler always optimize-out the if statement?
That's a fairly safe "no", as are most questions that contain both "optimize" and "always".
The C++ standard makes rare mention of optimizations, but never demands one.
What if var is a volatile variable?
Then it may perform the if, although volatile doesn't achieve what most people assume.

In general the answer is no. For a simple data type, the compiler is able to perform any necessary optimizations. For types with a heavy operator=, it is the responsibility of operator= to choose the optimal way to assign the new value.

There are situations where even a trivial assignment of, say, a pointer-sized variable can be more expensive than a read and branch (especially if predictable).
Why? Multithreading. If several threads are only reading the same value, they can all share that value in their caches. But as soon as you write to it, you have to invalidate the cacheline and get the new value the next time you want to read it or you have to get the updated value to keep your cache coherent. Both situations lead to more traffic between the cores and add latency to the reads.
If the branch is pretty unpredictable though it's probably still slower.

In C++, assigning a SIMPLE variable (that is, a normal integer or float variable) is definitely and always faster than checking if it already has that value and then setting it if it didn't have the value. I would be very surprised if this wasn't true in Java too, but I don't know how complicated or simple things are in Java - I've written a few hundred lines, and not actually studied how byte code and JITed bytecode actually works.
Clearly, if the variable is very easy to check, but complicated to set, which could be the case for classes and other such things, then there may be a value. The typical case where you'd find this would be in some code where the "value" is some sort of index or hash, but if it's not a match, a whole lot of work is required. One example would be in a task-switch:
if (current_process != new_process_to_run)
    current_process = new_process_to_run;
Because here, a "process" is a complex object to alter, but the != can be done on the ID of the process.
Whether the object is simple or complex, the compiler will almost certainly not understand what you are trying to do here, so it will probably not optimize it away - but compilers are more clever than you think SOMETIMES, and more stupid at other times, so I wouldn't bet either way.
volatile should always force the compiler to read and write values to the variable, whether it "thinks" it is necessary or not, so it will definitely READ the variable and WRITE the variable. Of course, if the variable is volatile it probably means that it can change or represents some hardware, so you should be EXTRA careful with how you treat it yourself too... An extra read of a PCI-X card could incur several bus cycles (bus cycles being an order of magnitude slower than the processor speed!), which is likely to affect the performance much more. But then writing to a hardware register may (for example) cause the hardware to do something unexpected, and checking that we have that value first MAY make it faster, because "some operation starts over", or something like that.

It would be sensible if you had read-write locking semantics involved, whenever reading is usually less disruptive than writing.

In Objective-C you have the situation where assigning an object address to a pointer variable may require that the object be "retained" (reference count incremented). In such a case it makes sense to see if the value being assigned is the same as the value currently in the pointer variable, to avoid having to do the relatively expensive increment/decrement operations.
Other languages that use reference counting likely have similar scenarios.
But when assigning, say, an int or a boolean to a simple variable (outside of the multiprocessor cache scenario mentioned elsewhere) the test is rarely merited. A store on most processors is at least as fast as the load/test/branch.

In Java the answer is always no. All assignments you can do in Java are primitive. In C++, the answer is still pretty much always no - if copying is so much more expensive than an equality check, the class in question should do that equality check itself.

Is this incrementAndGet thread-safe? It seems to pull an object from Ehcache

This servlet seems to fetch an object from ehCache, from an Element which has this object: http://code.google.com/p/adwhirl/source/browse/src/obj/HitObject.java?repo=servers-mobile
It then goes on to increment the counter which is an atomic long:
http://code.google.com/p/adwhirl/source/browse/src/servlet/MetricsServlet.java?repo=servers-mobile#174
//Atomically record the hit
if(i_hitType == AdWhirlUtil.HITTYPE.IMPRESSION.ordinal()) {
    ho.impressions.incrementAndGet();
}
else {
    ho.clicks.incrementAndGet();
}
This doesn't seem thread-safe to me, as multiple threads could be fetching from the cache, and if both increment at the same time you might lose a click/impression count.
Do you agree that this is not thread-safe?

AtomicLong and AtomicInteger use a CAS internally -- compare and set (or compare-and-swap). The idea is that you tell the CAS two things: the value you expect the long/int to have, and the value you want to update it to. If the long/int has the value you say it should have, the CAS will atomically make the update and return true; otherwise, it won't make the update, and it'll return false. Many modern chips support CAS very efficiently at the machine-code level; if the JVM is running in an environment that doesn't have a CAS, it can use mutexes (what Java calls synchronization) to implement the CAS. Regardless, once you have a CAS, you can safely implement an atomic increment via this logic (in pseudocode):
long incrementAndGet(atomicLong, byIncrement)
    do
        oldValue = atomicLong.get() // 1
        newValue = oldValue + byIncrement
    while ! atomicLong.cas(oldValue, newValue) // 2
    return newValue
If another thread has come in and does its own increment between lines // 1 and // 2, the CAS will fail and the loop will try again. Otherwise, the CAS will succeed.
There's a gamble in this kind of approach: if there's low contention, a CAS is faster than a synchronized block and isn't as likely to cause a thread context switch. But if there's a lot of contention, some threads are going to have to go through multiple loop iterations per increment, which obviously amounts to wasted work. Generally speaking, the incrementAndGet is going to be faster under most common loads.

The increment is thread-safe since AtomicInteger and family guarantee that. But there is a problem with the insertion into and fetching from the cache, where two (or more) HitObject instances could be created and inserted. That could lose some hits the first time this HitObject is accessed. As @denis.solonenko has pointed out, there is already a TODO in the code to fix this.
However I'd like to point out that this code only suffers from concurrency on first accessing a given HitObject. Once you have the HitObject in the cache (and there are no more threads creating or inserting the HitObject) then this code is perfectly thread-safe. So this is only a very limited concurrency problem, and probably that's the reason they have not yet fixed it.
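One way to close that first-access window is to make "create the counter if it is missing" a single atomic step. The sketch below assumes the cache can be modelled as an in-process ConcurrentMap; the class and method names (HitCounters, recordHit, count) are hypothetical and not taken from the code in the question. With an external cache you would need whatever atomic put-if-absent operation that cache offers instead.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the cache of hit counters.
class HitCounters {
    private final ConcurrentMap<String, AtomicLong> hits = new ConcurrentHashMap<>();

    // computeIfAbsent installs at most one AtomicLong per key, even if
    // several threads see the key as missing at the same moment.
    long recordHit(String id) {
        return hits.computeIfAbsent(id, k -> new AtomicLong()).incrementAndGet();
    }

    long count(String id) {
        AtomicLong counter = hits.get(id);
        return counter == null ? 0 : counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        HitCounters counters = new HitCounters();
        Runnable work = () -> {
            for (int i = 0; i < 50_000; i++) {
                counters.recordHit("banner-1");
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println(counters.count("banner-1")); // prints 100000
    }
}
```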

What is the name of this locking technique?

I've got a gigantic Trove map and a method that I need to call very often from multiple threads. Most of the time this method will return true. The threads are doing heavy number crunching, and I noticed that there was some contention due to the following method (it's just an example; my actual code is a bit different):
synchronized boolean containsSpecial() {
return troveMap.contains(key);
}
Note that it's an "append only" map: once a key is added, it stays in there forever (which is important for what comes next, I think).
I noticed that by changing the above to:
boolean containsSpecial() {
    if ( troveMap.contains(key) ) {
        // most of the time (>90%) we shall pass here, dodging lock-acquisition
        return true;
    }
    synchronized (this) {
        return troveMap.contains(key);
    }
}
I get a 20% speedup on my number crunching (verified over many runs, over long running times, etc.).
Does this optimization look correct (knowing that once a key is there it shall stay there forever)?
What is the name for this technique?
EDIT
The code that updates the map is called way less often than the containsSpecial() method and looks like this (I've synchronized the entire method):
synchronized void addSpecialKeyValue( key, value ) {
....
}

This code is not correct.
Trove doesn't handle concurrent use itself; it's like java.util.HashMap in that regard. So, like HashMap, even seemingly innocent, read-only methods like containsKey() could throw a runtime exception or, worse, enter an infinite loop if another thread modifies the map concurrently. I don't know the internals of Trove, but with HashMap, rehashing when the load factor is exceeded, or removing entries can cause failures in other threads that are only reading.
If the operation takes a significant amount of time compared to lock management, using a read-write lock to eliminate the serialization bottleneck will improve performance greatly. In the class documentation for ReentrantReadWriteLock, there are "Sample usages"; you can use the second example, for RWDictionary, as a guide.
In this case, the map operations may be so fast that the locking overhead dominates. If that's the case, you'll need to profile on the target system to see whether a synchronized block or a read-write lock is faster.
Either way, the important point is that you can't safely remove all synchronization, or you'll have consistency and visibility problems.

It's called wrong locking ;-) Actually, it is some variant of the double-checked locking approach. And the original version of that approach is just plain wrong in Java.
Java threads are allowed to keep private copies of variables in their local memory (think: core-local cache of a multi-core machine). Any Java implementation is allowed to never write changes back into the global memory unless some synchronization happens.
So, it is very well possible that one of your threads has a local memory in which troveMap.contains(key) evaluates to true. Therefore, it never synchronizes and it never gets the updated memory.
Additionally, what happens when contains() sees an inconsistent view of the troveMap data structure in memory?
Look up the Java memory model for the details. Or have a look at this book: Java Concurrency in Practice.

This looks unsafe to me. Specifically, the unsynchronized calls will be able to see partial updates, either due to memory visibility (a previous put not getting fully published, since you haven't told the JMM it needs to be) or due to a plain old race. Imagine if TroveMap.contains has some internal variable that it assumes won't change during the course of contains. This code lets that invariant break.
Regarding the memory visibility, the problem with that isn't false negatives (you use the synchronized double-check for that), but that trove's invariants may be violated. For instance, if they have a counter, and they require that counter == someInternalArray.length at all times, the lack of synchronization may be violating that.
My first thought was to make troveMap's reference volatile, and to re-write the reference every time you add to the map:
synchronized (this) {
troveMap.put(key, value);
troveMap = troveMap;
}
That way, you're setting up a memory barrier such that anyone who reads the troveMap will be guaranteed to see everything that had happened to it before its most recent assignment -- that is, its latest state. This solves the memory issues, but it doesn't solve the race conditions.
Depending on how quickly your data changes, maybe a Bloom filter could help? Or some other structure that's more optimized for certain fast paths?

Under the conditions you describe, it's easy to imagine a map implementation for which you can get false negatives by failing to synchronize. The only way I can imagine obtaining false positives is an implementation in which key insertions are non-atomic and a partial key insertion happens to look like another key you are testing for.
You don't say what kind of map you have implemented, but the stock map implementations store keys by assigning references. According to the Java Language Specification:
Writes to and reads of references are always atomic, regardless of whether they are implemented as 32 or 64 bit values.
If your map implementation uses object references as keys, then I don't see how you can get in trouble.
EDIT
The above was written in ignorance of Trove itself. After a little research, I found the following post by Rob Eden (one of the developers of Trove) on whether Trove maps are concurrent:
Trove does not modify the internal structure on retrievals. However, this is an implementation detail not a guarantee so I can't say that it won't change in future versions.
So it seems like this approach will work for now but may not be safe at all in a future version. It may be best to use one of Trove's synchronized map classes, despite the penalty.

I think you would be better off with a ConcurrentHashMap, which doesn't need explicit locking and allows concurrent reads:
boolean containsSpecial() {
    return troveMap.containsKey(key);
}
void addSpecialKeyValue( key, value ) {
troveMap.putIfAbsent(key,value);
}
Another option is using a ReadWriteLock, which allows concurrent reads but no concurrent writes:
ReadWriteLock rwlock = new ReentrantReadWriteLock();
boolean containsSpecial() {
    rwlock.readLock().lock();
    try {
        return troveMap.contains(key);
    } finally {
        // Lock has unlock(), not release()
        rwlock.readLock().unlock();
    }
}
void addSpecialKeyValue( key, value ) {
    rwlock.writeLock().lock();
    try {
        //...
        troveMap.put(key, value);
    } finally {
        rwlock.writeLock().unlock();
    }
}

Why reinvent the wheel?
Simply use ConcurrentHashMap.putIfAbsent
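As a minimal sketch of that suggestion, assuming String keys and int values (the question does not say what types the Trove map holds, and the class name SpecialKeys is hypothetical): both methods can then be called from any thread without a synchronized block. Note that ConcurrentHashMap boxes its keys and values, so for a truly gigantic map of primitives the memory cost compared with Trove is worth measuring.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical wrapper around an append-only concurrent map.
class SpecialKeys {
    private final ConcurrentMap<String, Integer> map = new ConcurrentHashMap<>();

    // Lock-free read; safe to call while other threads are inserting.
    boolean containsSpecial(String key) {
        return map.containsKey(key);
    }

    // Inserts only if the key is not present yet.
    // Returns true if this call added the key, false if it was already there.
    boolean addSpecialKeyValue(String key, int value) {
        return map.putIfAbsent(key, value) == null;
    }

    public static void main(String[] args) {
        SpecialKeys keys = new SpecialKeys();
        System.out.println(keys.containsSpecial("x"));        // false
        System.out.println(keys.addSpecialKeyValue("x", 1));  // true
        System.out.println(keys.addSpecialKeyValue("x", 2));  // false, the first value is kept
        System.out.println(keys.containsSpecial("x"));        // true
    }
}
```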
