I found out that put and get-with-CacheLoader operations use a ReentrantLock under the hood, but why is this not implemented for the getIfPresent operation?
Here is get, which is used by getIfPresent:
@Nullable
V get(Object key, int hash) {
    try {
        if (this.count != 0) {
            long now = this.map.ticker.read();
            ReferenceEntry<K, V> e = this.getLiveEntry(key, hash, now);
            Object value;
            if (e == null) {
                value = null;
                return value;
            }
            value = e.getValueReference().get();
            if (value != null) {
                this.recordRead(e, now);
                Object var7 = this.scheduleRefresh(e, e.getKey(), hash, value, now, this.map.defaultLoader);
                return var7;
            }
            this.tryDrainReferenceQueues();
        }
        Object var11 = null;
        return var11;
    } finally {
        this.postReadCleanup();
    }
}
And here is put:
@Nullable
V put(K key, int hash, V value, boolean onlyIfAbsent) {
    this.lock();
    .....
Is the only way to achieve thread safety for basic get/put operations to use synchronization on the client?
Even if getIfPresent did use locks, that wouldn't help. It's more fundamental than that.
Let me put that differently: Define 'threadsafe'.
Here's an example of what can happen in a non-threadsafe implementation:
You invoke .put on a plain jane j.u.HashMap, not holding any locks.
Simultaneously, a different thread also does that.
The map is now in a broken state. If you iterate through the elements, the first put doesn't show up at all, the second put shows up in your iteration, and a completely unrelated key has disappeared. But calling .get(k) on that map with the second thread's key doesn't find it, even though it is returned in the .entrySet(). This makes no sense and breaks all the rules of j.u.HashMap. The spec of HashMap does not explain any of this, other than saying 'I am not threadsafe' and leaving it at that.
That's an example of NOT thread safe.
Here is an example of perfectly fine:
2 threads begin.
Some external event (e.g. a log) shows that thread 1 is very very very slightly ahead of thread 2. But the notion of 'ahead', if it is relevant to your code's correctness, means your code is broken. That's just not how multicore works.
Thread 1 adds a thing to a concurrency-capable map, and logs that it has done so.
Thread 2 logs that it starts an operation (from the few things you have observed, it seems to be running slightly 'later', so I guess we're "after" the point where T1 added the thing), now queries for the thing, and does not get a result. [1]
That's fine. That's still thread safe. Thread safe doesn't mean every interaction with an instance of that data type can be understood in terms of 'first this thing happened, then that thing happened'. Wanting that is very problematic, because the only way the computer can really give you that kind of guarantee is to disable all but a single core and run everything very very slowly. The point of a cache is to speed things up, not slow things down!
The problem with the lack of guarantees here is that if you run multiple separate operations on the same object, you run into trouble. Here's some pseudocode for a bank ATM that will go epically wrong in the long run:
Ask user how much money they want (say, €50,-).
Retrieve account balance from a 'threadsafe' Map<Account, Integer> (maps account ID to cents in account).
Check that the balance is at least €50,-. If not, show an error. If it is...
Spit out €50,-, and update the threadsafe map with .put(acct, balance - 5000).
Everything perfectly threadsafe. And yet this is going to go very very wrong - if the user uses their card at the same time they are in the bank withdrawing money via the teller, either the bank or the user is going to get very lucky here. I'd hope it's obvious to see how and why.
The upshot is: If you have dependencies between operations there is nothing you can do with 'threadsafe' concepts that can possibly fix it; the only way is to actually write code that explicitly marks off these dependencies.
The only way to write that bank code is to use some form of locking. Basic locking or optimistic locking, either way is fine, but locking of some sort. It has to look like this (a Java sketch follows these steps) [2]:
start some sort of transaction;
fetch account balance;
deal with insufficient funds;
spit out cash;
update account balance;
end transaction;
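In Java terms, here is a minimal sketch of those steps using basic locking (the balances map, the account key, and the dispenseCash helper are all hypothetical stand-ins for real bank infrastructure):

import java.util.HashMap;
import java.util.Map;

public class Atm {
    private final Object txLock = new Object(); // marks the transaction boundaries
    private final Map<String, Integer> balances = new HashMap<>(); // account -> cents

    // Returns true if the cash was dispensed.
    public boolean withdraw(String account, int cents) {
        synchronized (txLock) {                          // start some sort of transaction
            int balance = balances.getOrDefault(account, 0); // fetch account balance
            if (balance < cents) {                       // deal with insufficient funds
                return false;
            }
            dispenseCash(cents);                         // spit out cash (see footnote [2])
            balances.put(account, balance - cents);      // update account balance
            return true;
        }                                                // end transaction
    }

    private void dispenseCash(int cents) { /* hardware call */ }
}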
Now guava's code makes perfect sense:
There is no such thing as 'earlier' and 'later'. You need to stop thinking about multicore in that way, unless you explicitly write primitives that establish these things. The cache interface does have these. Use the right operation! getIfPresent will get you the cached value if it is possible for your current thread to get at that data. If it is not, it returns null; that's what that call does.
If instead you want this common operation: "Get me the cached value. However, if it is not available, then run this code to calculate the value, cache the result, and return it to me. In addition, ensure that if 2 threads simultaneously end up running this exact operation, only one thread runs the calculation, and the other will wait for that one (don't say the 'first' one, that's not how you should think about threads) to finish and use its result instead"... then use the right call for that: cache.get(key, () -> calculateValueForKey(key)). As the docs explicitly call out, this will wait for another thread that is also 'loading' the value (that's what guava cache calls the calculation process).
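For instance, a minimal sketch (the cache configuration and the calculateValueForKey helper are hypothetical):

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.ExecutionException;

Cache<String, String> cache = CacheBuilder.newBuilder()
        .maximumSize(1000)
        .build();

String value;
try {
    // Atomically: return the cached value if present; otherwise run the
    // loader, cache its result, and return it. A concurrent caller asking
    // for the same key waits for the in-flight load instead of recomputing.
    value = cache.get(key, () -> calculateValueForKey(key));
} catch (ExecutionException e) {
    throw new RuntimeException(e.getCause());
}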
No matter what you invoke from the Cache API, you can't 'break it', in the sense that I broke that HashMap. The cache API does this partly by using locks (such as ReentrantLock for mutating operations on it), and partly by using a ConcurrentHashMap under the hood.
[1] Often log frameworks end up injecting an actual explicit lock into the proceedings, and thus you do often get guarantees in this case, but only 'by accident' because of the log framework. This isn't a guarantee (maybe you're logging to separate log files, for example!) and often what you 'witness' may be a lie. For example, maybe you have 2 log statements that both log to separate files (and don't lock each other out at all), and they log the timestamp as part of the log. The fact that one log line says '12:00:05' and the other says '12:00:06' means nothing - the log thread fetches the current time, creates a string describing the message, and tells the OS to write it to the file. You obviously get absolutely no guarantee that the 2 log threads run at identical speed. Maybe one thread fetches the time (12:00:05), creates the string, wants to write to the disk, but the OS switches to the other thread before the write goes through; the other thread is the other logger, it reads the time (12:00:06), makes the string, writes it out, finishes up, and then the first logger continues and writes its content. Tada: 2 threads where you 'observe' that one thread is 'earlier', but that is incorrect. Perhaps this example further highlights why thinking about threads in terms of which one is 'first' steers you wrong.
[2] This code has the additional complication that you're interacting with systems that cannot be transactional. The point of a transaction is that you can abort it; you cannot abort the user grabbing a bill from the ATM. You solve that by logging that you're about to spit out the money, then spit out the money, then log that you have spit out the money. And finally write to this log that it has been processed in the user's account balance. Other code needs to check this log and act accordingly. For example, on startup the bank's DB machine needs to flag 'dangling' ATM transactions and will have to get a human to check the video feed. This solves the problem where someone trips over the power cable of the bank DB machine juuust as the user is about to grab the banknote from the machine.
It seems guava cache implements the ConcurrentMap API:
class LocalCache<K, V> extends AbstractMap<K, V> implements ConcurrentMap<K, V>
so the base get and put operations should be thread safe by nature
Related
Suppose I have a method that checks for an id in the db and, if the id doesn't exist, inserts a value with that id. How do I know if this is thread safe, and how do I ensure that it is? Are there any general rules I can use to make sure it doesn't contain race conditions and is generally thread safe?
public TestEntity save(TestEntity entity) {
    if (entity.getId() == null) {
        entity.setId(UUID.randomUUID().toString());
    }
    Map<String, TestEntity> map = dbConnection.getMap(DB_NAME);
    map.put(entity.getId(), entity);
    return map.get(entity.getId());
}
This is a how long is a piece of string question...
A method will be thread safe with respect to itself if it uses the synchronized keyword in its declaration.
However, even if your setId and getId methods used the synchronized keyword, your process of setting the id (if it has not been previously initialized) above is not... but even then there is an "it depends" aspect to the question. If it is impossible for two threads to ever get the same object with an uninitialised id, then you are thread safe, because you would never be attempting to concurrently modify the id.
It is entirely possible, given the code in your question, that there could be two calls to the thread safe getId at the same time for the same object. One by one they get the return value (null) and are immediately pre-empted to let the other thread run. This means both will then run the thread safe setId method - again one by one.
You could declare the whole save method as synchronized, but if you do that the entire method will be single threaded which defeats the purpose of using threads in the first place. You tend to want to minimize the synchronized code to the bare minimum to maximize concurrency.
You could also put a synchronized block around the critical if statement and minimise the single threaded part of the processing, but then you would also need to be careful if there were other parts of the code that might also set the Id if it wasn't previously initialized.
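For illustration, a sketch of that narrower synchronized block (locking on the entity itself is just one possible choice of monitor; any lock used consistently by all id-initializing code would do):

public TestEntity save(TestEntity entity) {
    synchronized (entity) {
        // Only the check-then-act on the id is single threaded now.
        if (entity.getId() == null) {
            entity.setId(UUID.randomUUID().toString());
        }
    }
    Map<String, TestEntity> map = dbConnection.getMap(DB_NAME);
    map.put(entity.getId(), entity);
    return map.get(entity.getId());
}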
Another possibility which has various pros and cons is to put the initialization of the Id into the get method and make that method synchronized, or simply assign the Id when the object is created in the constructor.
I hope this helps...
Edit...
The above talks about Java language features. A few people mentioned facilities in the Java class libraries (e.g. java.util.concurrent) which also provide support for concurrency. So that is a good add-on, but there are also whole packages which address concurrency and other related parallel programming paradigms (e.g. parallelism) in various ways.
To complete the list I would add tools such as Akka and Cats-effect (concurrency) and more.
Not to mention the books and courses devoted to the subject.
I just reread your question and noted that you are asking about databases. Again, the answer is it depends. RDBMSs usually let you do this type of operation with record locks, usually in a transaction. Some (like Teradata) use special clauses such as LOCKING ROW FOR WRITE SELECT * FROM some_table WHERE pi_cols = 'somevalues', which locks the rowhash to you until you update it or certain other conditions occur. This is known as pessimistic locking.
Others (notably NoSQL stores) have optimistic locking. This is when you read the record (like you are implying with getId) and there is no opportunity to lock the record. Then you do a conditional update. The conditional update is sort of like this: write the id as x, provided that when you try to do so the id is still null (or whatever the value was when you checked). These types of operations are usually done through an API.
You can also do optimistic locking in an RDBMS as follows:
SQL
UPDATE tbl
SET x = 'some value',
    last_update_timestamp = current_timestamp()
WHERE x IS NULL
  AND last_update_timestamp = 'same value as when I last checked'
In this example the second part of the WHERE clause is the critical bit, which basically says "only update the record if no one else did, and I trust that everyone else will update last_update_timestamp when they do". The "trust" bit can sometimes be replaced by triggers.
These types of database operations (if available) are guaranteed by the database engine to be "thread safe".
Which loops me back to the "how long is a piece of string" observation at the beginning of this answer...
Test-and-set is unsafe
a method that checks for an id in the db and if the id doesn't exist then inserts a value with that id.
Any test-and-set pair of operations on a shared resource is inherently unsafe, vulnerable to a race condition. If the two operations are separate (not atomic), then they must be protected as a pair. While one thread completes the test but has not yet done the set, another thread could sneak in and do both the test and the set. The first thread now completes its set without knowing a duplicate action has occurred.
Providing that necessary protection is too broad a topic for an Answer on Stack Overflow, as others have said here.
UPSERT
However, let me point out an alternative approach: make the test-and-set atomic.
In the context of a database, that can be done using the UPSERT feature. Also known as a Merge operation. For example, in Postgres 9.5 and later we have the INSERT INTO … ON CONFLICT command. See this explanation for details.
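As a sketch of what that looks like from Java via JDBC (the table and column names are hypothetical):

import java.sql.PreparedStatement;

// Atomic insert-if-absent: the database does the test-and-set in one step.
String sql = "INSERT INTO test_entity (id, payload) VALUES (?, ?) "
           + "ON CONFLICT (id) DO NOTHING";
try (PreparedStatement ps = connection.prepareStatement(sql)) {
    ps.setString(1, id);
    ps.setString(2, payload);
    int rows = ps.executeUpdate(); // 1 if inserted, 0 if the id already existed
}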
In the context of a Boolean-style flag, a semaphore makes the test-and-set atomic.
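In Java, a binary java.util.concurrent.Semaphore is one way to get that atomic test-and-set (a sketch; the guarded work is whatever your flag protects):

import java.util.concurrent.Semaphore;

Semaphore flag = new Semaphore(1); // binary semaphore: a single permit

if (flag.tryAcquire()) { // atomic test-and-set: succeeds for exactly one thread
    try {
        // ... the guarded work ...
    } finally {
        flag.release(); // hand the permit back when done
    }
}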
In general, we say "a method is thread-safe" when there is no race condition on the internal and external data structures of the object it belongs to. In other words, the order of the method calls is strictly enforced.
For example, let's say you have a HashMap object and two threads, thread_a and thread_b.
thread_a calls put("a", "a") and thread_b calls put("a", "b").
The put method is not thread-safe (refer to its documentation) in the sense that while thread_a is executing its put, thread_b can also go in and execute its own put.
A put contains a reading part and a writing part.
thread_a.read("a")
thread_b.read("a")
thread_b.write("a", "b")
thread_a.write("a", "a")
If the above sequence happens, you can say the method is not thread-safe.
The way to make a method thread-safe is to ensure the state of the whole object cannot be perturbed while the thread-safe method is executing. An easy way is to put the synchronized keyword in the method declarations.
If you are worried about performance, use manual locking with synchronized blocks on a lock object. Further performance improvement can be achieved using well-designed semaphores.
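A sketch of that manual locking with a dedicated lock object (the Counter class is hypothetical):

public class Counter {
    private final Object lock = new Object(); // private lock object
    private int value;

    public void increment() {
        synchronized (lock) { // only the critical section is single threaded
            value++;
        }
    }

    public int get() {
        synchronized (lock) { // reads also take the lock to see the current value
            return value;
        }
    }
}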
I encountered the following question in a recent System Design Interview:
Design an AppServer that interfaces with a Cache and a DB.
I came up with this:
public class AppServer {
    public Database DB;
    public Cache cache;

    public Value get(Key k) {
        Value res = cache.get(k);
        if (res == null) {
            res = DB.get(k);
            cache.set(k, res);
        }
        return res;
    }

    public void set(Key k, Value v) {
        cache.set(k, v);
        DB.set(k, v);
    }
}
This code is fine and works correctly, but the follow-ups to the question are:
What if there are multiple threads?
What if there are multiple instances of the AppServer?
Suddenly AppServer performance degrades a ton, and we find out this is because our cache is consistently missing. The cache size is fixed (already the largest it can be). How can we prevent this?
Response:
I answered that we can use locks or condition variables. In Java, we can add synchronized to each method to allow for mutual exclusion, but the interviewer mentioned that this isn't too efficient and wanted only the critical parts synchronized.
I thought that we only needed to synchronize the 2 set lines in void set(Key k, Value v) and the 1 set call in Value get(Key k); however, the interviewer pushed for also synchronizing res = DB.get(k);. I agreed with him in the end, but don't fully understand. Don't threads have independent stacks and a shared heap? So when a thread executes get, it stores res in a local variable on its stack frame; even if another thread executes get sequentially, the former thread retains its get value. Then each thread sets its respective fetched value.
How can we handle multiple instances of the AppServer?
I came up with a distributed queue solution like Kafka: every time we perform a set/get command, we queue that command. He also mentioned that set is OK because the action sets a value in the cache/DB, but how would you return the correct value for get? Can someone explain this?
Also, are there possible solutions with a versioning system and an event system?
Possible solutions:
L1, L2, L3 caches - layers and more caches
Regional / Segmentation caches - use different caches for user groups.
Any other ideas?
Will upvote all insightful responses :)
1
Although JDBC is "supposed" to be thread safe, some drivers aren't, and I'm going to assume that Cache isn't thread safe either (although most caches should be thread safe), so in that case you would need to make the following changes to your code:
Make both fields final
Synchronize the ENTIRE get(...) method
Synchronize the ENTIRE set(...) method
Assuming there is no other way to access those fields, the behavior of your get(...) method depends on 2 things: first, that updates from the set(...) method can be seen, and second, that a cache miss is then stored by only a single thread. You need to synchronize because the idea is to have only one thread perform the expensive DB query when there is a cache miss. If you do not synchronize the entire get(...) method, or you split the synchronized statement, it is possible for another thread to also see a cache miss between the lookup and the insertion.
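A minimal sketch of those changes, under the stated assumption that neither Cache nor the driver is thread safe:

public class AppServer {
    private final Database db;  // final: safely published to all threads
    private final Cache cache;

    public AppServer(Database db, Cache cache) {
        this.db = db;
        this.cache = cache;
    }

    // Fully synchronized: at most one thread performs the DB lookup on a miss.
    public synchronized Value get(Key k) {
        Value res = cache.get(k);
        if (res == null) {
            res = db.get(k);
            cache.set(k, res);
        }
        return res;
    }

    public synchronized void set(Key k, Value v) {
        cache.set(k, v);
        db.set(k, v);
    }
}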
The way I would answer this question is honestly just to toss the entire thing. I would look at how JCIP (Java Concurrency in Practice) wrote the cache and base my answer on that.
2
I think your queue solution is fine.
I believe your interviewer means that if another instance of AppServer did not have cached what was already set(...) by another instance of AppServer, then it would look up and find the correct value in the DB. This solution would be incorrect if you are using multiple threads, because it is possible for 2 threads to be set(...)ing conflicting values; then the caches would have 2 different values while, depending on the thread safety of your DB, it might not even have the value at all.
Ideally, you'd never create more than a single instance of your AppServer.
3
I don't have enough experience to evaluate this question specifically, but perhaps an LRU cache would improve performance somewhat, or using a hash ring buffer. It might be a stretch, but if you wanted to throw something out there, perhaps even using ML to determine the best values to preload or retain at certain times of the day could also work.
If you are always missing values from your cache, there is no way to improve your code. Performance would be dependent on your database.
I'm having trouble understanding the synchronized keyword. As far as I know, it is used to make sure that only one thread can access the synchronized method/block at the same time. Then, is there sometimes a reason to make some methods synchronized if only one thread calls them?
If your program is single threaded, there's no need to synchronize methods.
Another case would be that you write a library and indicate that it's not thread safe. The user would then be responsible for handling possible multi-threading use, but you could write it all without synchronization.
If you are sure your class will always be used by a single thread, there is no reason to use any synchronized methods. But the reality is that Java is an inherently multi-threaded environment. At some point in time somebody will use multiple threads. Therefore, whichever class needs thread safety should have adequately synchronized methods/synchronized blocks to avoid problems.
No, you don't need synchronization if there is single thread involved.
Always specify thread-safety policy
Actually, you never know how a class written by you is going to be used by others in the future. So it is always better to explicitly state your policy, so that if in the future someone tries to use it in a multi-threaded way, they can be aware of the implications.
And the best place to specify the thread-safety policy is in JavaDocs. Always specify in JavaDocs as to whether the class that you are creating is thread safe or not.
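For example (a hypothetical class, just to show the JavaDoc convention):

/**
 * Accumulates values supplied by callers.
 *
 * <p>Thread safety: this class is NOT thread safe. If instances are shared
 * between threads, callers must synchronize externally.
 */
public class Accumulator {
    private long total;

    public void add(long value) {
        total += value;
    }

    public long total() {
        return total;
    }
}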
When two or more threads need access to a shared resource, they need some way to ensure that the resource will be used by only one thread at a time.
A synchronized method is used to guard a shared resource against concurrent access by multiple threads.
So there is no need to apply synchronization for a single thread.
Consider that you are designing a movie ticket seller application. And let's drop all the technology capabilities that are provided these days, for the sake of visualizing the problem.
There is only one ticket left for the show, with 5 different counters selling tickets. Consider there are 2 people trying to buy that last ticket at the counters.
Consider your application workflow to be such
You take in details of the buyer, his name, and his credit card number. (this is the read operation)
Then you find out how many tickets are left for the show (this is again a read operation)
Then you book the ticket with the credit card (this is the write operation)
If this logic isn't synchronised, what would happen?
The details of Customer 1 and Customer 2 would be read up until step 2. Both will try to book the ticket and both their tickets would be booked.
If it is modified to be:
You take in details of the buyer, his name, and his credit card number. (this is the read operation)
Synchronized (
Then you find out how many tickets are left for the show (this is again a read operation)
Then you book the ticket with the credit card (this is the write operation)
)
There is no chance of overbooking the show due to a thread race condition.
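A minimal Java sketch of that synchronized workflow (the class and the charge call are hypothetical):

public class TicketSeller {
    private int ticketsLeft = 1; // one ticket left for the show

    // The read (how many tickets are left) and the write (book it) happen
    // under one lock, so two counters cannot both sell the last ticket.
    public synchronized boolean book(String buyerName, String cardNumber) {
        if (ticketsLeft == 0) {  // read operation
            return false;        // sold out
        }
        ticketsLeft--;           // write operation
        charge(cardNumber);      // book the ticket with the credit card
        return true;
    }

    private void charge(String cardNumber) { /* payment call */ }
}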
Now, consider the case where you absolutely know that there will be one and only one person booking tickets. There is no need for synchronization. The ticket seller here would be the single thread in the case of your application.
I have tried to put this in a very, very simplistic manner. There are frameworks, and constraints you can put on the DB, to avoid such a simple scenario. But the intent of the answer is to show why thread synchronization is needed, not to catalogue the ways to avoid it.
Let's say a customer has a credit card with a balance of x amount of money, and he buys an item valued y (y < x). Then he is going to buy another item which will cost z (y + z > x, but z < x).
Now I am going to simulate this scenario in Java. If all transactions happen sequentially, there is no need to panic: the customer can buy the y-valued item, and then he doesn't have enough credit to buy the other one.
But when we come into a multi-threaded environment, we have to deal with some locking mechanism or some strategy, because if some other thread reads the credit card object before the changes made by a previous thread are reflected, serious issues will arise.
As far as I can see, one way is that we can keep a copy of the original balance and check the current value just before updating the balance. If the value is the same as the original one, then we can be sure other threads didn't change the balance. If the balance is different, then we have to undo our calculation.
And again, Java synchronization is also a good solution. Now my question is: what would be the best approach to implement in such a scenario?
Additionally, if we look at the big picture, synchronization hurts the performance of the system, since the object is locked and other threads have to wait.
I would prefer a ReadWriteLock, which lets you lock separately for reading and writing. This is nice because you can have separate read and write locks for each resource:
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

ReadWriteLock readWriteLock = new ReentrantReadWriteLock();

readWriteLock.readLock().lock();
// Multiple readers can enter this section,
// as long as it is not locked for writing and
// no writers are waiting to lock it for writing.
readWriteLock.readLock().unlock();

readWriteLock.writeLock().lock();
// Only one writer can enter this section,
// and only if no threads are currently reading.
readWriteLock.writeLock().unlock();
ReadWriteLock internally keeps two Lock instances: one guarding read access and one guarding write access.
Your proposal doesn't hold up. You can't be sure a context switch doesn't happen between the check and the update.
The only way is synchronization.
What you are talking about sounds like software transactional memory. You optimistically assume that no other threads will modify the data upon which your transaction depends, but you have a mechanism to detect if they have.
The types in the java.util.concurrent.atomic package can help build a lock-free solution; they implement efficient compare-and-swap operations. For example, an AtomicInteger reference would allow you to do something like this:
AtomicInteger balance = new AtomicInteger();
…

void update(int change) throws InsufficientFundsException {
    int original, updated;
    do {
        original = balance.get();
        updated = original + change;
        if (updated < 0)
            throw new InsufficientFundsException();
    } while (!balance.compareAndSet(original, updated));
}
As you can see, such an approach is subject to livelock, where other threads continually change the balance, causing one thread to loop forever. In practice, the specifics of your application determine how likely a livelock is.
Obviously, this approach is complex and loaded with pitfalls. If you aren't a concurrency expert, it's safer to use a lock to provide atomicity. Locking usually performs well enough if code inside the synchronized block doesn't perform any blocking operations, like I/O. If code in critical sections has a definite execution time, you are probably better off using a lock.
As far as I can see one way is we can keep a copy of original balance and we can check current value just before update the balance. If value is same as original one then we can make sure other threads doesn't change the balance. If balance different then we have to undo our calculation.
Sounds like what AtomicInteger.compareAndSet() and AtomicLong.compareAndSet() do.
An easier-to-understand approach would involve using synchronized methods on your CreditCard class that your code would call to update the balance. (Only one synchronized method on an object can execute at any one time.)
In this case, it sounds like you want a public synchronized boolean makePurchase(int cost) method that returns true on success and false on failure. The goal is that no transaction on your object should require more than one method call - as you've realized, you don't want to make two method calls on CreditCard (getBalance() and later setBalance()) to do the transaction, because of potential race conditions.
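A sketch of that class (balance in cents; the constructor is an assumption):

public class CreditCard {
    private int balance; // cents

    public CreditCard(int initialBalance) {
        this.balance = initialBalance;
    }

    // The whole check-then-act is one synchronized method, so no other
    // thread can read or change the balance between the check and the update.
    public synchronized boolean makePurchase(int cost) {
        if (cost > balance) {
            return false; // insufficient funds
        }
        balance -= cost;
        return true;
    }

    public synchronized int getBalance() {
        return balance;
    }
}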
Today I was reading about how HashMap works in Java. I came across a blog, and I am quoting directly from its article. I have gone through this article on Stack Overflow. Still, I want to know the details.
So the answer is yes, there is a potential race condition while resizing a HashMap in Java: if two threads at the same time find that the HashMap needs resizing, they both try to resize it. In the process of resizing a HashMap in Java, the elements in a bucket, which are stored in a linked list, get reversed in order during their migration to the new bucket, because Java's HashMap doesn't append the new element at the tail; instead it appends the new element at the head to avoid tail traversing. If a race condition happens, you will end up with an infinite loop.
It states that, as HashMap is not thread-safe, a potential race condition can occur during resizing of the HashMap. I have seen, even in our office projects, people extensively using HashMaps knowing they are not thread safe. If it is not thread safe, why should we use HashMap then? Is it just lack of knowledge among developers, as they might not be aware of structures like ConcurrentHashMap, or is there some other reason? Can anyone shed light on this puzzle?
I can confidently say ConcurrentHashMap is a pretty ignored class. Not many people know about it and not many people care to use it. The class offers a very robust and fast method of synchronizing a Map collection. I have read a few comparisons of HashMap and ConcurrentHashMap on the web. Let me just say that they're totally wrong. There is no way you can compare the two: one offers synchronized methods to access a map while the other offers no synchronization whatsoever.
What most of us fail to notice is that while our applications, web applications especially, work fine during the development & testing phases, they usually fall over under heavy (or even moderately heavy) load. This is due to the fact that we expect our HashMaps to behave a certain way, but under load they usually misbehave. Hashtables offer concurrent access to their entries, with a small caveat: the entire map is locked to perform any sort of operation.
While this overhead is ignorable in a web application under normal load, under heavy load it can lead to delayed response times and overtaxing of your server for no good reason. This is where ConcurrentHashMaps step in. They offer all the features of Hashtable with performance almost as good as a HashMap. ConcurrentHashMaps accomplish this by a very simple mechanism.
Instead of a map-wide lock, the collection maintains a list of 16 locks by default, each of which is used to guard (or lock on) a single segment of the map. This effectively means that 16 threads can modify the collection at a single time (as long as they're all working on different segments). In fact, there is no operation performed by this collection that locks the entire map.
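For example, a quick sketch of the atomic per-key operations this buys you, with no external locking (Java 8+):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

ConcurrentMap<String, Integer> hits = new ConcurrentHashMap<>();

hits.putIfAbsent("page", 0);                    // atomic check-then-act
hits.computeIfPresent("page", (k, v) -> v + 1); // atomic per-key update
hits.merge("other", 1, Integer::sum);           // insert 1, or atomically add 1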
There are several aspects to this. First of all, most of the collections are not thread safe. If you want a thread safe collection, you can call Collections.synchronizedCollection or Collections.synchronizedMap.
But the main point is this: you want your threads to run in parallel with no synchronization at all - if possible, of course. This is something you should strive for, but of course it cannot be achieved every time you deal with multithreading.
But there is no point in making the default collection/map thread safe, because it should be an edge case that a map is shared. Synchronization means more work for the JVM.
In a multithreaded environment, you have to ensure that it is not modified concurrently, or you can run into serious memory consistency problems, because it is not synchronized in any way.
Just check the API - previously I was also thinking in the same manner.
I thought that the solution was to use the static Collections.synchronizedMap method. I was expecting it to return a better implementation, but if you look at the source code you will realize that all they do in there is wrap the map with synchronized calls on a mutex, which happens to be the same map, not allowing reads to occur concurrently.
In the Jakarta commons project, there is an implementation that is called FastHashMap. This implementation has a property called fast. If fast is true, then the reads are non-synchronized, and the writes will perform the following steps:
Clone the current structure
Perform the modification on the clone
Replace the existing structure with the modified clone
public class FastSynchronizedMap<K, V> implements Map<K, V>, Serializable {

    private final Map<K, V> m;
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    .
    .
    .

    public V get(Object key) {
        lock.readLock().lock();
        V value = null;
        try {
            value = m.get(key);
        } finally {
            lock.readLock().unlock();
        }
        return value;
    }

    public V put(K key, V value) {
        lock.writeLock().lock();
        V v = null;
        try {
            v = m.put(key, value);
        } finally {
            lock.writeLock().unlock(); // unlock, or the write lock is never released
        }
        return v;
    }

    .
    .
    .
}
Note that we use a try/finally block: we want to guarantee that the lock is released no matter what problem is encountered in the block.
This implementation works well when you have almost no write operations, and mostly read operations.
A HashMap can be used when a single thread has access to it. However, when multiple threads start accessing the HashMap, there will be 2 main problems:
1. Resizing of the HashMap is not guaranteed to work as expected.
2. A ConcurrentModificationException can be thrown. This can also be thrown when the map is accessed by a single thread that reads and writes to it at the same time (i.e. modifies it while iterating).
A workaround for using a HashMap in a multi-threaded environment is to initialize it with the expected number of entries, hence avoiding the need for a resize.
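For instance (a sketch; 1000 is an arbitrary expected size, and note that this only sidesteps the resize race - it does not make HashMap thread safe in general):

import java.util.HashMap;
import java.util.Map;

// Capacity chosen so that expectedSize entries never trigger a resize,
// given HashMap's default load factor of 0.75.
int expectedSize = 1000;
Map<String, String> map = new HashMap<>((int) (expectedSize / 0.75f) + 1);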