Kotlin concurrency for ConcurrentHashMap

I am trying to support concurrency on a hashmap that gets periodically cleared. I have a cache that stores data for a period of time. Every 5 minutes, the data in this cache is sent to the server. Once I flush, I want to clear the cache. The problem is that while I am flushing, data could potentially be written to this map under an existing key. How would I go about making this process thread-safe?
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

data class A(val a: AtomicLong, val b: AtomicLong) {
    fun changeA() {
        a.incrementAndGet()
    }
}

class Flusher {
    private val cache: MutableMap<String, A> = ConcurrentHashMap()
    private val lock = Any()

    fun retrieveA(key: String): A {
        synchronized(lock) {
            return cache.getOrPut(key) { A(AtomicLong(1), AtomicLong(0)) }
        }
    }

    fun flush() {
        synchronized(lock) {
            // send data to network request
            cache.clear()
        }
    }
}

// Existence of multiple classes like CacheChanger
class CacheChanger(private val flusher: Flusher) {
    fun incrementData() {
        flusher.retrieveA("x").changeA()
    }
}
I am worried that the above cache is not properly synchronized. Are there better/correct ways to lock this cache so that I don't lose data? Should I create a deep copy of the cache and clear it?
Since the above data could be being changed by another changer at the same time, couldn't that lead to problems?

You can get rid of the lock.
In the flush method, instead of reading the entire map (e.g. through an iterator) and then clearing it, remove each element one by one.
I'm not sure if you can use the iterator's remove method (I'll check that in a moment), but you can take the key set, iterate over it, and for each key invoke cache.remove() - this will give you the stored value and remove it from the cache atomically.
The tricky part is how to make sure that the object of class A won't be modified just prior to sending it over the network. You can do it as follows:
When you get some object x through retrieveA and modify it, you need to make sure it is still in the cache. Simply invoke retrieveA one more time. If you get exactly the same object, it's fine. If it's different, then it means the object was removed and sent over the network, but you don't know whether the modification was also sent, or whether the state prior to the modification was sent. Still, I think in your case you can simply repeat the whole process (apply the change and check if the objects are the same). But it depends on the specifics of your application.
If you don't want to increment twice, then when sending the data over the network you'll have to read the content of the counter a, store it in a local variable, and decrease a by that amount (usually it will reach zero). Then in CacheChanger, when you get a different object from the second retrieve, you can check if the value is zero (your modification was taken into account) or non-zero, which means your modification came just a fraction of a second too late, and you'll have to repeat the process.
You could also replace incrementAndGet with compareAndSet, but this could yield slightly worse performance. In this approach, instead of incrementing, you try to swap in a value that is greater by one. And before sending over the network you try to swap the value to -1 to mark it invalid. If the second swap fails, it means that someone has changed the value concurrently; you need to check it one more time in order to send the freshest value over the network, and you repeat the process in a loop (breaking the loop only once the swap to -1 succeeds). In the case of swapping to a value greater by one, you also repeat the process in a loop until the swap succeeds. If it fails, it either means that somebody else swapped in some greater value, or that the Flusher swapped to -1. In the latter case you know that you have to call retrieveA one more time to get a new object.
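A minimal Java sketch of this drain-one-by-one flush (the class and method names here are illustrative, not from the question):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class DrainingFlusher {
    private final ConcurrentHashMap<String, AtomicLong> cache = new ConcurrentHashMap<>();

    AtomicLong retrieve(String key) {
        // computeIfAbsent is atomic: two threads racing on the same key
        // always end up sharing one counter
        return cache.computeIfAbsent(key, k -> new AtomicLong());
    }

    // Drain one entry at a time: remove() is atomic, so each counter is
    // either flushed here or still visible to writers - never both.
    void flush() {
        for (String key : cache.keySet()) {
            AtomicLong counter = cache.remove(key);
            if (counter != null) {
                send(key, counter.get()); // the network call from the question
            }
        }
    }

    void send(String key, long value) { /* stub */ }
}
```

Note there is no lock at all here; the re-check dance described above is still needed if a writer must not lose an increment that races with the flush.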

The easiest solution (but with a worse performance) is to rely completely on locks.
You can change ConcurrentHashMap to a regular HashMap.
Then you have to apply all your changes directly inside the retrieveA function:
fun retrieveA(key: String, mod: (A) -> Unit): A {
    synchronized(lock) {
        // getOrPut already stores the new object, so no extra put is needed
        val obj: A = cache.getOrPut(key) { A(AtomicLong(1), AtomicLong(0)) }
        mod(obj)
        return obj
    }
}
I hope it compiles (I'm not an expert on Kotlin).
Then you use it like:
class CacheChanger {
    fun incrementData() {
        flusher.retrieveA("x") { it.changeA() }
    }
}
Ok, I admit I'm a bit rusty with Kotlin ;) it's been some time since I played with it. If someone could fix any remaining syntax issues I'd be very grateful.

Related

How to prevent threads from reading inconsistent data from ConcurrentHashMap?

I'm using a ConcurrentHashMap<String, String> that works as a cache; read operations check whether an element is already in the cache, and write operations add elements to it.
So, my question is: what are the best practices for always reading the most recent ConcurrentHashMap values?
I want to ensure data consistency and not have cases like:
With the map.get("key") method, the first thread checks that this key does not yet exist in the map, then it does map.put("key", "value").
The second thread reads the data before the first thread has put the element in the map, leading to inconsistent data.
Code example:
Optional<String> cacheValue = Optional.ofNullable(cachedMap.get("key"));
if (cacheValue.isPresent()) {
    // Perform actions
} else {
    cachedMap.putIfAbsent("key", "value");
    // Perform actions
}
How can I ensure that my ConcurrentHashMap is synchronized and doesn't retrieve inconsistent data?
Should I perform these map operations inside a synchronized block?
You probably need to do it this way:
if (cachedMap.putIfAbsent("key", "value") == null) {
    // Perform actions "IS NOT PRESENT"
} else {
    // Perform actions "IS PRESENT"
}
Doing it in two checks is obviously not atomic, so if you're having problems with the wrong values getting put in the cache, then that's likely your problem.
what are the best practices to always read the most recent ConcurrentHashMap values?
Oracle's Javadoc for ConcurrentHashMap says, "Retrievals reflect the results of the most recently completed update operations holding upon their onset." In other words, any time you call map.get(...) or any other method on the map, you are always working with the "most recent" content.
*BUT*
Is that enough? Maybe not. If your program threads expect any kind of consistency between two or more keys in the map, or if your threads expect any kind of consistency between something that is stored in the map and something that is stored elsewhere, then you are going to need to provide some explicit higher-level synchronization between the threads.
I can't provide an example that would be specific to the problem that's puzzling you because your question doesn't really say what that problem is.
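As a hypothetical illustration of that higher-level synchronization: each map call below is atomic on its own, but keeping two keys consistent with each other still needs an explicit lock around the compound update.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class Accounts {
    private final Map<String, Integer> balances = new ConcurrentHashMap<>();
    private final Object lock = new Object();

    // Without the lock, a reader could observe the withdrawal
    // without the matching deposit, even on a ConcurrentHashMap.
    void transfer(String from, String to, int amount) {
        synchronized (lock) {
            balances.merge(from, -amount, Integer::sum);
            balances.merge(to, amount, Integer::sum);
        }
    }

    int balanceOf(String account) {
        return balances.getOrDefault(account, 0);
    }
}
```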

Concurrent byte array access in Java with as few locks as possible

I'm trying to reduce the memory usage for the lock objects of segmented data. See my questions here and here. Or just assume you have a byte array and every 16 bytes can (de)serialize into an object. Let us call this a "row" with row length of 16 bytes. Now if you modify such a row from a writer thread and read from multiple threads you need locking. And if you have a byte array size of 1MB (1024*1024) this means 65536 rows and the same number of locks.
This is a bit too much, especially since I need much larger byte arrays, and I would like to reduce it to something roughly proportional to the number of threads. My idea was to create a
ConcurrentHashMap<Integer, LockHelper> concurrentMap;
where Integer is the row index and before a thread 'enters' a row it puts a lock object in this map (got this idea from this answer). But no matter what I think through I cannot find an approach that is really thread-safe:
// somewhere else where we need to write or read the row
LockHelper lock1 = new LockHelper();
LockHelper lock = concurrentMap.putIfAbsent(rowIndex, lock1);
if (lock == null)
    lock = lock1; // putIfAbsent returns null if our lock went in
lock.addWaitingThread(); // is too late
synchronized (lock) {
    try {
        // read or write row at rowIndex e.g. writing like
        bytes[rowIndex * 16] = 1;
        bytes[rowIndex * 16 + 1] = 2;
        // ...
    } finally {
        if (lock.noThreadsWaiting())
            concurrentMap.remove(rowIndex);
    }
}
Do you see a possibility to make this thread-safe?
I have the feeling that this will look very similar to the concurrentMap.compute construct (e.g. see this answer) - or could I even utilize this method?
map.compute(rowIndex, (key, value) -> {
    if (value == null)
        value = new Object();
    synchronized (value) {
        // access row
        return value;
    }
});
map.remove(rowIndex);
Are the value and the synchronized necessary at all, given that we already know the compute operation is performed atomically?
// null is forbidden so use the key also as the value to avoid creating additional objects
ConcurrentHashMap<Integer, Integer> map = ...;
// now the row access looks really simple:
map.compute(rowIndex, (key, value) -> {
    // access row
    return key;
});
map.remove(rowIndex);
BTW: since when do we have this compute in Java? Since 1.8? I cannot find it in the JavaDocs.
Update: I found a very similar question here with userIds instead of rowIndices; note that the question contains an example with several problems, like a missing final, calling lock inside the try-finally clause, and no shrinking of the map. There also seems to be a library JKeyLockManager for this purpose, but I don't think it is thread-safe.
Update 2: The solution seems to be really simple, as Nicolas Filotto pointed out how to avoid the removal:
map.compute(rowIndex, (key, value) -> {
    // access row
    return null;
});
So this is really less memory-intense, BUT the simple segment locking with synchronized is at least 50% faster in my scenario.
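For comparison, the segment locking mentioned here - a fixed pool of locks shared by all rows - can be sketched like this (the stripe count is an arbitrary choice):

```java
class StripedRowLocks {
    private final Object[] stripes;

    StripedRowLocks(int stripeCount) {
        stripes = new Object[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = new Object();
        }
    }

    // Rows that map to the same stripe contend with each other, but
    // memory stays proportional to the stripe count, not the row count.
    Object lockFor(int rowIndex) {
        return stripes[Math.floorMod(rowIndex, stripes.length)];
    }
}
```

A row access then becomes `synchronized (locks.lockFor(rowIndex)) { /* read or write row */ }`, with no per-access allocation or map maintenance.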
Are the value and the synchronized necessary at all, given that we already know the compute operation is performed atomically?
I confirm that it is not needed to add a synchronized block in this case, as the compute method is performed atomically, as stated in the Javadoc of ConcurrentHashMap#compute(K key, BiFunction<? super K,? super V,? extends V> remappingFunction), which was added in Java 8. I quote:
Attempts to compute a mapping for the specified key and its current
mapped value (or null if there is no current mapping). The entire
method invocation is performed atomically. Some attempted update
operations on this map by other threads may be blocked while
computation is in progress, so the computation should be short and
simple, and must not attempt to update any other mappings of this Map.
What you try to achieve with the compute method can be made totally atomic if you make your BiFunction always return null, so that the key is removed atomically too and everything is done atomically.
map.compute(
rowIndex,
(key, value) -> {
// access row here
return null;
}
);
This way you will then fully rely on the locking mechanism of a ConcurrentHashMap to synchronize your accesses to your rows.

Does double-checked locking work with a final Map in Java?

I'm trying to implement a thread-safe Map cache, and I want the cached Strings to be lazily initialized. Here's my first pass at an implementation:
public class ExampleClass {

    private static final Map<String, String> CACHED_STRINGS = new HashMap<String, String>();

    public String getText(String key) {
        String string = CACHED_STRINGS.get(key);

        if (string == null) {
            synchronized (CACHED_STRINGS) {
                string = CACHED_STRINGS.get(key);

                if (string == null) {
                    string = createString();
                    CACHED_STRINGS.put(key, string);
                }
            }
        }

        return string;
    }
}
After writing this code, Netbeans warned me about "double-checked locking," so I started researching it. I found The "Double-Checked Locking is Broken" Declaration and read it, but I'm unsure whether my implementation falls prey to the issues it mentions. It seems like all the issues in the article are related to object instantiation with the new operator inside the synchronized block. I'm not using the new operator, and Strings are immutable, so I'm not sure whether the article is relevant to this situation. Is this a thread-safe way to cache strings in a HashMap? Does the thread-safety depend on what happens in the createString() method?
No, it's not correct, because the first access is done outside of a synchronized block.
It comes down to how get and put might be implemented. You must bear in mind that they are not atomic operations.
For example, what if they were implemented like this:
public String get(String key) {
    Entry e = findEntry(key);
    return e.value;
}

public void put(String key, String value) {
    Entry e = addNewEntry(key);
    // danger for get while in-between these lines
    e.value = value;
}

private Entry addNewEntry(String key) {
    Entry entry = new Entry(key, ""); // a new entry starts with an empty string, not null!
    addToBuckets(entry); // now it's findable by get
    return entry;
}
Now the get might not return null when the put operation is still in progress, and the whole getText method could return the wrong value.
The example is a bit convoluted, but you can see that correct behaviour of your code relies on the inner workings of the map class. That's not good.
And while you can look that code up, you cannot account for compiler, JIT, and processor optimisations and inlining, which can effectively change the order of operations just like the wacky but correct way I chose to write that map implementation.
Consider use of a concurrent hashmap and the method Map.computeIfAbsent() which takes a function to call to compute a default value if key is absent from the map.
Map<String, String> cache = new ConcurrentHashMap<>();
cache.computeIfAbsent("key", key -> "ComputedDefaultValue");
Javadoc: If the specified key is not already associated with a value, attempts to compute its value using the given mapping function and enters it into this map unless null. The entire method invocation is performed atomically, so the function is applied at most once per key. Some attempted update operations on this map by other threads may be blocked while computation is in progress, so the computation should be short and simple, and must not attempt to update any other mappings of this map.
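Applied to the ExampleClass from the question, the double-checked locking disappears entirely (a sketch; createString() is stubbed here, since the original doesn't show its body):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class ExampleClass {

    private static final ConcurrentMap<String, String> CACHED_STRINGS =
            new ConcurrentHashMap<>();

    public String getText(String key) {
        // computeIfAbsent runs createString at most once per key, atomically
        return CACHED_STRINGS.computeIfAbsent(key, k -> createString());
    }

    private static String createString() {
        return "computed"; // placeholder for the real (expensive) work
    }
}
```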
Non-trivial problem domains:
Concurrency is easy to do and hard to do correctly.
Caching is easy to do and hard to do correctly.
Both are right up there with Encryption in the category of hard to get right without an intimate understanding of the problem domain and its many subtle side effects and behaviors.
Combine them and you get a problem an order of magnitude harder than either one.
This is a non-trivial problem that your naive implementation will not solve in a bug-free manner. The HashMap you are using is not going to be thread-safe if accesses are not checked and serialized; it will not be performant, and it will cause lots of contention, blocking, and latency depending on the use.
The proper way to implement a lazy-loading cache is to use something like Guava Cache with a CacheLoader; it takes care of all the concurrency and cache race conditions for you transparently. A cursory glance through the source code shows how they do it.
No, and ConcurrentHashMap would not help.
Recap: the double-checked idiom is typically about assigning a new instance to a variable/field; it is broken because the compiler can reorder instructions, meaning the field can be assigned a partially constructed object.
For your setup, you have a distinct issue: the map.get() is not safe from a put() which may be occurring and possibly rehashing the table. Using a ConcurrentHashMap fixes ONLY that, but not the risk of a false positive (where you think the map has no entry but it is actually being created). The issue is not so much a partially constructed object as the duplication of work.
As for the avoidable Guava CacheLoader: this is just a lazy-init callback that you give to the map so it can create the object if missing. This is essentially the same as putting all the 'if null' code inside the lock, which is certainly NOT going to be faster than good old direct synchronization. (The only time it makes sense to use a CacheLoader is for plugging in a factory of such missing objects while you are passing the map to classes which don't know how to make the missing objects and don't want to be told how.)

Hashtable: why is get method synchronized?

I know that a Hashtable is synchronized, but why is its get() method synchronized?
Isn't it only a read method?
If the read was not synchronized, then the Hashtable could be modified during the execution of read. New elements could be added, the underlying array could become too small and could be replaced by a bigger one, etc. Without sequential execution, it is difficult to deal with these situations.
However, even if get would not crash when the Hashtable is modified by another thread, there is another important aspect of the synchronized keyword, namely cache synchronization. Let's use a simplified example:
class Flag {
    boolean value;

    boolean get() { return value; } // WARNING: not synchronized

    synchronized void set(boolean value) { this.value = value; }
}
set is synchronized, but get isn't. What happens if two threads A and B simultaneously read and write to this class?
1. A calls get
2. B calls set
3. A calls get
Is it guaranteed at step 3 that A sees the modification made by thread B?
No, it isn't, as A could be running on a different core, which uses a separate cache where the old value is still present. Thus, we have to force B to communicate the memory to other core, and force A to fetch the new data.
How can we enforce this? Every time a thread enters and leaves a synchronized block, an implicit memory barrier is executed. A memory barrier forces the cache to be updated. However, both the writer and the reader have to execute a memory barrier; otherwise, the information is not properly communicated.
In our example, thread B already uses the synchronized method set, so its data modification is communicated at the end of the method. However, A does not see the modified data. The solution is to make get synchronized, so it is forced to get the updated data.
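For this simplified example specifically, marking the field volatile is an alternative way to get the same visibility guarantee without locking the getter (a sketch; this only works because get is a plain read, not a compound operation):

```java
class Flag {
    // volatile establishes the same happens-before visibility for readers
    // that a synchronized getter would, without taking a lock
    private volatile boolean value;

    boolean get() {
        return value;
    }

    synchronized void set(boolean value) {
        this.value = value;
    }
}
```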
Have a look at the Hashtable source code and you can think of lots of race conditions that can cause problems in an unsynchronized get().
(I am reading the JDK6 source code)
For example, rehash() will create an empty array, assign it to the instance variable table, and then put the entries from the old table into the new one. Therefore, if your get occurs after the empty-array assignment but before the entries are actually put in, you cannot find your key even though it is in the table.
Another example: there is a loop that iterates through the linked list at the table index; if a rehash happens in the middle of your iteration, you may also fail to find the entry even though it exists in the hashtable.
Hashtable is synchronized, meaning the whole class is thread-safe.
Inside Hashtable, not only the get() method is synchronized but also many other methods, and in particular the put() method, like Tom said.
A read method must be synchronized just like a write method, because this ensures the visibility and consistency of the variable.

HashMap synchronization for Map with value-incrementation

I have a questions regarding synchronization of HashMap. The background is that I am trying to implement a simple way of Brute-Force-Detection. I will use a map which has username as key and is used to save the amount of failed login attempts of the user. If a login fails, I want to do something like this:
Integer failedAmount = myMap.get("username");

if (failedAmount == null) {
    myMap.put("username", 1);
} else {
    failedAmount++;
    if (failedAmount >= THRESHOLD) {
        // possible brute force detected! alert admin / slow down login
        // / or whatever
    }
    myMap.put("username", failedAmount);
}
The mechanism I have in mind at the moment is pretty simple: I would just track this for the whole day and clear() the HashMap at midnight or something like that.
so my question is:
what is the best/fastest Map implementation I can use for this? Do I need a fully synchronized Map (Collections.synchronizedMap()) or is a ConcurrentHashMap sufficient? Or maybe even just a normal HashMap? I guess it's not that much of a problem if a few increments slip through?
I would use a combination of ConcurrentHashMap and AtomicInteger http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/atomic/AtomicInteger.html.
Using AtomicInteger will not help you with the comparison, but it will help you keep the numbers accurate - no need to do the ++ and the put in two steps.
On the ConcurrentHashMap, I would use the putIfAbsent method, which will eliminate your first if condition.
AtomicInteger newCounter = new AtomicInteger(0);
AtomicInteger failedAmount = myMap.putIfAbsent("username", newCounter);
if (failedAmount == null) {
    failedAmount = newCounter; // the key was absent, so our counter went in
}
if (failedAmount.incrementAndGet() >= THRESHOLD) {
    // possible brute force detected! alert admin / slow down login
    // / or whatever
}
Unless you synchronize the whole block of code that does the update, it won't work as you expect anyway.
A synchronized map just makes sure that nothing nasty happens if you call, say, put several times simultaneously. It doesn't make sure that
myMap.put("username", myMap.get("username") + 1);
is executed atomically.
You should really synchronize the whole block that performs the update. Either by using some Semaphore or by using the synchronized keyword. For example:
final Object lock = new Object();
...
synchronized (lock) {
    if (!myMap.containsKey(username))
        myMap.put(username, 0);
    myMap.put(username, myMap.get(username) + 1);
    if (myMap.get(username) >= THRESHOLD) {
        // possible brute force detected! alert admin / slow down login
    }
}
The best way I see would be to store the fail counter with the user object, not in some kind of global map. That way the synchronization issue does not even turn up.
If you still want to go with the map, you can get away with a partially synchronized approach if you use a counter object whose increment is itself atomic:
static class FailCount {
    public final AtomicInteger count = new AtomicInteger();
}

// increment counter for user
FailCount count;
synchronized (lock) {
    count = theMap.get(user);
    if (count == null) {
        count = new FailCount();
        theMap.put(user, count);
    }
}
count.count.incrementAndGet(); // atomic, so safe outside the lock
But most likely any optimization attempt here is a waste of time. It's not like your system will process millions of login failures a second, so your original code should do just fine.
The simplest solution I see here is to extract this code into a separate function and make it synchronized (or put all your code into a synchronized block). Everything else stays unchanged. The map variable should be made final.
Using a synchronized HashMap or a ConcurrentHashMap is only necessary if your monitoring application is multi-threaded. If that is the case, ConcurrentHashMap has significantly better performance under cases of high load/contention.
I would not dare use an unsynchronized structure with multiple threads if even one writer/updater thread exists. It's not just a matter of losing a few increments - the internal structure of the HashMap itself could be corrupted.
That said, if you want to ensure that no increment is lost, then even a synchronized Map is not enough:
UserX attempts to login
Thread A gets count N for "UserX"
UserX attempts to login again
Thread B gets count N for "UserX"
A puts N + 1 to the map
B puts N + 1 to the map
The map now contains N + 1 instead of N + 2
To avoid this, either use a synchronized block for the whole get/set operation, or use something along the lines of an AtomicInteger instead of a plain Integer for your counter.
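The lost-update scenario above can also be avoided without an explicit lock by using ConcurrentHashMap.merge, which performs the read-modify-write atomically per key (a sketch with hypothetical names):

```java
import java.util.concurrent.ConcurrentHashMap;

class FailedLoginTracker {
    private static final int THRESHOLD = 5;
    private final ConcurrentHashMap<String, Integer> attempts = new ConcurrentHashMap<>();

    // merge applies the remapping function atomically, so two concurrent
    // failures for the same user can never overwrite each other's increment
    boolean recordFailure(String user) {
        int count = attempts.merge(user, 1, Integer::sum);
        return count >= THRESHOLD; // true -> possible brute force
    }

    void reset() {
        attempts.clear(); // e.g. the midnight reset from the question
    }
}
```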
