I have a high-traffic website and I use Hibernate. I also use Ehcache to cache some entities and queries that are needed to generate the pages.
The problem is "parallel cache misses": when the application boots and the cache regions are cold, each cache region is populated many times (instead of only once) by different threads, because the site is being hit by many users at the same time. Likewise, when a cache region is invalidated, it is repopulated many times for the same reason.
How can I avoid this?
I managed to convert one entity and one query cache to a BlockingCache by providing my own implementation to hibernate.cache.provider_class, but the semantics of BlockingCache do not seem to work. Even worse, sometimes the BlockingCache deadlocks (blocks) and the application hangs completely. A thread dump shows that processing is blocked on the mutex of BlockingCache in a get operation.
So, the question is, does Hibernate support this kind of use?
And if not, how do you solve this problem on production?
Edit: The hibernate.cache.provider_class points to my custom cache provider, which is a copy-paste of SingletonEhCacheProvider; at the end of the start() method (after line 136) I do:
Ehcache cache = manager.getEhcache("foo");
if (!(cache instanceof BlockingCache)) {
    manager.replaceCacheWithDecoratedCache(cache, new BlockingCache(cache));
}
That way, upon initialization and before anyone else touches the cache named "foo", I decorate it with BlockingCache. "foo" is a query cache and "bar" (same code, but omitted) is an entity cache for a POJO.
Edit 2: "Doesn't seem to work" means that the initial problem still exists. Cache "foo" is still being re-populated many times with the same data because of the concurrency. I validate this by stressing the site with JMeter with 10 threads. I'd expect the other 9 threads to block until the first one that requested data from "foo" finishes its job (executes the queries, stores the data in the cache), and then to get the data directly from the cache.
Edit 3: Another explanation for this problem can be seen at https://forum.hibernate.org/viewtopic.php?f=1&t=964391&start=0 but with no definite answer.
I'm not quite sure, but:
It allows concurrent read access to elements already in the cache. If the element is null, other reads will block until an element with the same key is put into the cache.
Doesn't that mean that Hibernate would wait until some other thread places the object into the cache? That's what you observe, right?
Hibernate and the cache work like this:
1. Hibernate gets a request for an object.
2. Hibernate checks whether the object is in the cache -- cache.get().
3. Not there? Hibernate loads the object from the DB and puts it into the cache -- cache.put().
So if the object is not in the cache (not placed there by some previous update operation), Hibernate would block forever at step 2.
I think you need a cache variant where the thread waits for an object only for a short time, e.g. 100 ms. If the object has not arrived, the thread should get null (and thus Hibernate will load the object from the DB and place it into the cache).
Actually, better logic would be:
1. Check whether another thread is requesting the same object.
2. If so, wait a while (say 500 ms) for the object to arrive.
3. If not, return null immediately.
(We cannot wait at step 2 forever, as the thread may fail to put the object into the cache -- due to an exception.)
If BlockingCache doesn't support this behaviour, you need to implement such a cache yourself. I did it in the past; it's not hard -- the main methods are get() and put() (though the API has apparently grown since then).
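For illustration, here is a minimal sketch of such a timed-wait cache (all names here are hypothetical; this is not Hibernate or Ehcache API): a miss elects exactly one loader, and everyone else waits a bounded time for it.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class TimedWaitCache<K, V> {

    private final ConcurrentMap<K, V> values = new ConcurrentHashMap<>();
    private final ConcurrentMap<K, CountDownLatch> loading = new ConcurrentHashMap<>();

    // Returns the cached value, or null if this caller has been elected to load it.
    public V get(K key, long timeout, TimeUnit unit) throws InterruptedException {
        V v = values.get(key);
        if (v != null) {
            return v;
        }
        CountDownLatch myLatch = new CountDownLatch(1);
        CountDownLatch theirLatch = loading.putIfAbsent(key, myLatch);
        if (theirLatch == null) {
            // nobody else is loading: the caller must load from the DB
            // and then call put() -- even on failure
            return null;
        }
        theirLatch.await(timeout, unit); // bounded wait: a failed loader cannot hang us forever
        return values.get(key);          // may still be null if the loader failed or is slow
    }

    public void put(K key, V value) {
        values.put(key, value);
        CountDownLatch latch = loading.remove(key);
        if (latch != null) {
            latch.countDown(); // wake the threads waiting in get()
        }
    }
}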
UPDATE
Actually, I just read the sources of BlockingCache. It does exactly what I said -- it locks and waits with a timeout. So you don't need to do anything; just use it:
public Element get(final Object key) throws RuntimeException, LockTimeoutException {
    Sync lock = getLockForKey(key);
    Element element;
    acquiredLockForKey(key, lock, LockType.WRITE);
    element = cache.get(key);
    if (element != null) {
        lock.unlock(LockType.WRITE);
    }
    // Note: on a miss (element == null) the write lock is deliberately NOT released here;
    // the loading thread is expected to call put(), which is what unblocks the other
    // threads waiting in get() for the same key.
    return element;
}
public void put(Element element) {
    if (element == null) {
        return;
    }
    Object key = element.getObjectKey();
    Object value = element.getObjectValue();
    getLockForKey(key).lock(LockType.WRITE);
    try {
        if (value != null) {
            cache.put(element);
        } else {
            cache.remove(key);
        }
    } finally {
        // Releasing the write lock here is what lets readers blocked in get() proceed.
        getLockForKey(key).unlock(LockType.WRITE);
    }
}
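For reference, the documented contract of BlockingCache is that after a get() that returns null the same thread must eventually put() an Element for that key (even one with a null value, e.g. on failure), otherwise every later reader of that key blocks forever -- which matches the hang described in the question. A hedged usage sketch (loadFromDatabase is a hypothetical loader):

Element element = blockingCache.get(key); // on a miss, this thread now holds the lock for the key
if (element == null) {
    try {
        Object value = loadFromDatabase(key);        // hypothetical expensive load
        blockingCache.put(new Element(key, value));  // stores the value and releases waiting readers
    } catch (RuntimeException e) {
        blockingCache.put(new Element(key, null));   // must still put, or other get() callers deadlock
        throw e;
    }
}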
So it's kind of strange that it doesn't work for you. Tell me something: in your code, this spot:
Ehcache cache = manager.getEhcache("foo");
is it synchronized? If multiple requests come in at the same time, will there be only one instance of the cache?
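If it isn't, one hedged fix is to make the check-and-decorate atomic, for example:

synchronized (manager) {
    Ehcache cache = manager.getEhcache("foo");
    if (!(cache instanceof BlockingCache)) {
        // only one thread ever wraps the cache; everyone else sees the decorated instance
        manager.replaceCacheWithDecoratedCache(cache, new BlockingCache(cache));
    }
}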
The biggest improvement on this issue is that Ehcache now (since 2.1) supports the transactional Hibernate cache policy, which vastly mitigates the problems described here.
To go a step further (blocking threads while they access the same query-cache region), one would need to implement a QueryTranslatorFactory that returns custom (extended) QueryTranslatorImpl instances, which would inspect the query and parameters and block as necessary in the list method. This of course concerns the specific use case of a query cache for HQL queries that fetch many entities.
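For the first part, switching an entity to the transactional strategy is just a cache-usage setting; a minimal sketch with Hibernate annotations (the entity name Bar is taken from the question):

import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

@Entity
@Cache(usage = CacheConcurrencyStrategy.TRANSACTIONAL)
public class Bar {

    @Id
    private Long id;

    // ... fields and accessors as before
}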
Related
I encountered the following question in a recent System Design Interview:
Design an AppServer that interfaces with a Cache and a DB.
I came up with this:
public class AppServer {

    public Database DB;
    public Cache cache;

    public Value get(Key k) {
        Value res = cache.get(k);
        if (res == null) {
            res = DB.get(k);
            cache.set(k, res);
        }
        return res;
    }

    public void set(Key k, Value v) {
        cache.set(k, v);
        DB.set(k, v);
    }
}
This code is fine and works correctly, but the follow-ups to the question are:
What if there are multiple threads?
What if there are multiple instances of the AppServer?
Suddenly AppServer performance degrades a ton, and we find out this is because our cache keeps missing. The cache size is fixed (already the largest it can be). How can we prevent this?
Response:
I answered that we can use locks or condition variables. In Java, we can add synchronized to each method to get mutual exclusion, but the interviewer mentioned that this isn't very efficient and wanted only the critical parts synchronized.
I thought that we only need to synchronize the 2 set lines in void set(Key k, Value v) and the 1 set call in Value get(Key k); however, the interviewer pushed for also synchronizing res = DB.get(k);. I agreed with him in the end, but don't fully understand why. Don't threads have independent stacks and a shared heap? So when a thread executes get, it stores res in a local variable on its stack frame; even if another thread executes get sequentially, the former thread retains its value. Then each thread sets its respective fetched value.
How can we handle multiple instances of the AppServer?
I came up with a distributed-queue solution (like Kafka): every time we perform a set/get command, we queue that command. He mentioned that set is OK, because the action sets a value in the cache/DB, but how would you return the correct value for get? Can someone explain this?
Are there also possible solutions using a versioning system or an event system?
Possible solutions:
L1, L2, L3 caches - layers and more caches
Regional / Segmentation caches - use different caches for different user groups.
Any other ideas?
Will upvote all insightful responses :)
1
Although JDBC is "supposed" to be thread-safe, some drivers aren't, and I'm going to assume that Cache isn't thread-safe either (although most caches should be), so in that case you would need to make the following changes to your code:
Make both fields final
Synchronize the ENTIRE get(...) method
Synchronize the ENTIRE set(...) method
Assuming there is no other way to access those fields, the behavior of your get(...) method depends on two things: first, that updates from the set(...) method can be seen, and second, that a cache miss is then handled by only a single thread. You need to synchronize because the idea is to have only one thread perform the expensive DB query when there is a cache miss. If you do not synchronize the entire get(...) method, or you split the synchronized statement, it is possible for another thread to also see a cache miss between the lookup and the insertion.
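A sketch of those changes applied to the code from the question:

public class AppServer {

    private final Database DB;
    private final Cache cache;

    public AppServer(Database db, Cache cache) {
        this.DB = db;
        this.cache = cache;
    }

    // The whole check-then-act is one critical section, so at most one thread
    // performs the expensive DB read for any given miss.
    public synchronized Value get(Key k) {
        Value res = cache.get(k);
        if (res == null) {
            res = DB.get(k);
            cache.set(k, res);
        }
        return res;
    }

    // Writing the cache and the DB under the same lock keeps them consistent.
    public synchronized void set(Key k, Value v) {
        cache.set(k, v);
        DB.set(k, v);
    }
}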
The way I would answer this question is honestly just to toss the entire thing. I would look at how JCIP wrote the cache and base my answer on that.
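From memory, the JCIP version (the "Memoizer") keys the map on Future<V>, so the first thread installs a task and every other thread just waits on it; roughly:

import java.util.concurrent.*;

public class Memoizer<A, V> {

    public interface Computable<A, V> {
        V compute(A arg) throws InterruptedException;
    }

    private final ConcurrentMap<A, Future<V>> cache = new ConcurrentHashMap<>();
    private final Computable<A, V> computable; // e.g. the DB lookup

    public Memoizer(Computable<A, V> computable) {
        this.computable = computable;
    }

    public V compute(final A arg) throws InterruptedException {
        while (true) {
            Future<V> f = cache.get(arg);
            if (f == null) {
                FutureTask<V> ft = new FutureTask<>(() -> computable.compute(arg));
                f = cache.putIfAbsent(arg, ft);
                if (f == null) {
                    f = ft;
                    ft.run(); // only the installing thread computes; the rest block in f.get()
                }
            }
            try {
                return f.get();
            } catch (CancellationException e) {
                cache.remove(arg, f); // retry after a cancelled computation
            } catch (ExecutionException e) {
                throw new RuntimeException(e.getCause());
            }
        }
    }
}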
2
I think your queue solution is fine.
I believe your interviewer means that if another instance of AppServer did not have cached what was already set(...) by a different instance, then it would look up and find the correct value in the DB. This solution would be incorrect with multiple threads, because it is possible for 2 threads to set(...) conflicting values; then the caches would hold 2 different values, and depending on the thread safety of your DB, the DB might not even have the value at all.
Ideally, you'd never create more than a single instance of your AppServer.
3
I don't have enough experience to evaluate this question specifically, but perhaps an LRU cache would improve performance somewhat, or a hash ring buffer. It might be a stretch, but if you wanted to throw something out there, perhaps even using ML to determine the best values to preload or retain at certain times of day could work.
If you are always missing values from your cache, there is no way to improve your code. Performance would be dependent on your database.
We have a web application that receives a few million requests per day. We audit the request counts and response statuses using an interceptor, which in turn calls a class annotated with Spring's #Async annotation; this class basically adds them to a map and persists the map after a configured interval. As we have a fixed set of APIs, we maintain a ConcurrentHashMap with the API name as key and an object holding its count and response statuses as value. So for every request to an API we check whether it exists in our map; if it exists we fetch the object against it, otherwise we create an object and put it in the map. For example:
import java.util.concurrent.ConcurrentHashMap;

class Audit {

    private final ConcurrentHashMap<String, CounterObject> APIMap = new ConcurrentHashMap<>();

    void record(String apiName) {
        CounterObject counterObject;
        if (APIMap.containsKey(apiName)) {
            // fetch the existing object
            counterObject = APIMap.get(apiName);
        } else {
            // create a new object and put it into the map (note: not atomic with the check above)
            counterObject = new CounterObject();
            APIMap.put(apiName, counterObject);
        }
        // increment count, note response status and other operations on the CounterObject received
    }
}
Then we perform some calculation on the received object (whether from map or newly created) and update counters.
We aggregate the map values over a specific interval and commit them to the database.
This works fine with fewer hits, but under high load we face some issues, like:
1. The first thread gets the object and updates the count, but before it finishes, a second thread comes along and gets the value, which is not the latest one. By this time the first thread has made its changes and committed the value, but the second thread updates the value it fetched previously. As both threads operate on the same key, the counter is overwritten by whichever thread writes last.
2. I don't want to put the synchronized keyword over the block that updates the counter. Even though the processing is async and the user gets a response before we even check the API name in the map, the application resources consumed under high load will still be higher if synchronized is used, which can result in late responses or, in the worst case, a deadlock.
Can anyone suggest a solution that can update the counters concurrently without having to use the synchronized keyword?
Note: I am already using ConcurrentHashMap, but as the lock hold and release by multiple threads is so fast under high load, the counters mismatch.
In your case you are right to look at a solution without locking (or at least with very local locking). And as long as you do simple operations you should be able to pull this off.
First of all, you have to make sure only one new CounterObject is created per key, instead of having multiple threads each create one of their own, with the last one overwriting the earlier objects.
ConcurrentHashMap has a very useful method for this: putIfAbsent. It will store the given object if no value is yet mapped to the key, and it returns the previously mapped value, or null if the key was absent (see the javadoc). It works as follows:
CounterObject fresh = new CounterObject();
CounterObject counter = APIMap.putIfAbsent("key", fresh);
if (counter == null) counter = fresh; // null means our new object was the one stored
counter.countStuff();
The downside of the above is that you always create a new CounterObject, which might be expensive. If that is the case, you can use the Java 8 computeIfAbsent, which will only call a lambda to create the object if there is nothing associated with the key.
Finally, you have to make sure your CounterObject is thread-safe, preferably without locking/synchronization (although if you have very many CounterObjects, locking on one of them is less bad than locking the full map, because fewer threads will try to lock the same object at the same time).
To make CounterObject safe without locking, you can look into classes such as AtomicInteger, which can perform many simple operations without locking.
Note that whenever I say locking here, it means either an explicit lock class or the synchronized keyword.
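Putting those pieces together, a minimal sketch (the internals of CounterObject are an assumption; the question only names the class):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

class CounterObject {

    private final AtomicLong count = new AtomicLong();

    void increment() {
        count.incrementAndGet(); // lock-free update, no synchronized needed
    }

    long count() {
        return count.get();
    }
}

class Audit {

    private final ConcurrentMap<String, CounterObject> apiMap = new ConcurrentHashMap<>();

    void record(String apiName) {
        // computeIfAbsent (Java 8) creates at most one CounterObject per key,
        // and only when the key is actually missing.
        apiMap.computeIfAbsent(apiName, k -> new CounterObject()).increment();
    }
}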
The reason for the counter mismatch is that the check-and-put in the Audit class is not atomic on the ConcurrentHashMap. You need to use the putIfAbsent method, which performs the check and put atomically. Refer to the ConcurrentHashMap javadoc for putIfAbsent.
I have some data (two HashSets and a timestamp Instant) that I'd like all requests to my JIRA (OpenSocial?) gadget/plugin to share -- because it takes a long time to generate (a couple of minutes) and because sharing it will make the requests more performant.
Occasionally (very rarely), a request might include a parameter that indicates this shared data should be refreshed. And of course the first time it's needed, it gets populated. It is okay for the data to represent a stale answer -- it is based on things that change slowly and used to visualize trends so off-by-one errors are tolerable.
I imagine that when JIRA starts up (or I upload a new version of my add-on) and multiple requests come in during the first couple of minutes, I'd need to handle the population of this expensive shared data in a thread-safe way. Currently the results look fine, but as I understand it, that's just been due to chance.
Only one thread needs to do the work of populating. On start-up, the other threads will have to wait, of course, because they can't skip ahead empty-handed. (If all threads did the expensive initialization, that would be a lot of unnecessary load on the server.)
But after the initial cost, if multiple concurrent requests come in and one of them includes the 'refresh' parameter, only that one thread needs to pay the price -- I'm fine with the other threads using an old copy of the expensive data and thereby staying performant, and including in the response that "yes someone out there is refreshing the data but here's a result using an old copy".
More about the data: The two HashSets and the timestamp are intended to represent a consistent snapshot in time. The HashSet contents depend on values in the database only, and the timestamp is just the time of the most recent refresh. None of this data depends on any earlier snapshot in time. And none of it depends on program state either. The timestamp is only used to answer the question "how old is this data" in a rough sense. Every time the data is refreshed, I'd expect the timestamp to be more recent but nothing is going to break if it's wrong. It's just for debugging and transparency. Since a snapshot doesn't depend on earlier snapshots or the program state, it could be wrapped and marked as volatile.
Is there an obvious choice for the best way to go about this? Pros and cons of alternatives?
You'll want to use locks to synchronize access to the sections of your code that only one thread may execute at a time. There are plenty of resources on SO and in the Oracle Java docs that show how to use locks in more detail, but something like this should do the trick.
The idea is that you want to maintain a copy of the most-recently generated set of results, and you always return that copy until you have a new set of data available.
import java.util.concurrent.locks.ReentrantLock;

public class MyClass
{
    private volatile MyObject completedResults;
    private final ReentrantLock resultsLock;
    private final ReentrantLock refreshLock;

    public MyClass()
    {
        // This must be a singleton class (such as a servlet) for this to work, since every
        // thread needs to be accessing the same lock.
        resultsLock = new ReentrantLock();
        refreshLock = new ReentrantLock();
    }

    public MyObject myMethodToRequestResults(boolean refresh)
    {
        MyObject resultsToReturn;

        // Serialize access to get the most-recently completed set of results; if none exists,
        // we need to generate it and all requesting threads need to wait.
        resultsLock.lock();
        try
        {
            if (completedResults == null)
            {
                completedResults = generateResults();
                refresh = false; // we just generated it, so no point in redoing it below
            }
            resultsToReturn = completedResults;
        }
        finally
        {
            resultsLock.unlock();
        }

        if (refresh)
        {
            // If someone else is regenerating, we just return the old data and tell the caller that.
            if (!refreshLock.tryLock())
            {
                // create a copy of the results to return, since we're about to modify it on the next line
                // and we don't want to change the (shared) original!
                resultsToReturn = new MyObject(resultsToReturn);
                resultsToReturn.setSomeoneElseIsRegeneratingTheStuffRightNow(true);
            }
            else
            {
                try
                {
                    completedResults = generateResults();
                    resultsToReturn = completedResults;
                }
                finally
                {
                    refreshLock.unlock();
                }
            }
        }

        return resultsToReturn;
    }
}
I've got the following problem (one important restriction: I cannot use external jars/libraries, only the Java primitives that come with a regular install):
Objects of class X are stored long-term in an SQL DB. Objects are cached for performance's sake (the cache needs to be written; I intend to base it on LinkedHashMap).
get(key):
check if object is in cache and not in use - return it.
if object is in use - sleep till it's available.
if object is not in cache - read it from DB.
putInCache(object):
update object in cache (if it's not there, add it).
if the cache is exhausted, it will trigger a saveToDB operation by the cache and remove the least recently used item from the cache.
saveToDB(object):
write object to DB (not removed from cache) and mark the object as "not changed".
There are multiple threads calling get. A thread can change the object it received from get (and the object will be marked as "changed"); when it's finished, it will call putInCache.
There is one dedicated thread that goes over the cache objects, and when it encounters a "changed" object it triggers saveToDB (the object is marked as used while the DB access is going on).
How would you recommend ensuring thread safety?
Basically I'm looking for the right Java classes that will enable:
1. get should synchronize its access to each object in the cache, so that it can check whether the object is there and, if so, whether it's in use or free for grabbing. If it's in use, the caller should sleep until it's available.
2. the dedicated thread should not lock the cache while calling saveToDB, but still make sure the whole cache is examined and no starvation occurs (the cache might change while saveToDB is running).
Just to clarify, I'm only interested in the locking/synchronization solutions; things like the cache triggering and DB access can be taken as given.
Here is an approach:
use an ExecutorService to handle DB requests;
use Futures for your map values;
use a ConcurrentHashMap as a map implementation.
Each Future should load its value from the DB, running on the ExecutorService.
When you need to manipulate one object, synchronize on the result of that future's .get(), which will be the object itself.
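A minimal sketch of that shape, with X as the class from the question and loadFromDb standing in for the real DB read:

import java.util.concurrent.*;

class X { /* the domain object from the question */ }

class XCache {

    private final ExecutorService dbExecutor = Executors.newFixedThreadPool(4); // handles DB requests
    private final ConcurrentMap<Long, Future<X>> cache = new ConcurrentHashMap<>();

    X get(final Long key) throws InterruptedException, ExecutionException {
        Future<X> f = cache.get(key);
        if (f == null) {
            FutureTask<X> task = new FutureTask<>(() -> loadFromDb(key));
            f = cache.putIfAbsent(key, task);
            if (f == null) {
                f = task;
                dbExecutor.execute(task); // only the winning thread schedules the DB read
            }
        }
        return f.get(); // every caller blocks on the same single load
    }

    private X loadFromDb(Long key) {
        // the real DB read goes here (taken as given by the question)
        return new X();
    }
}

Threads that want to modify the returned object can then do synchronized (x) { ... } around their manipulations, as suggested above.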
Also, google for "Java concurrency in practice", and buy the book ;)
I want to iterate over records in the database and update them. However, since the updating both takes some time and is prone to errors, I need to a) not keep the DB waiting (as e.g. with a ScrollableResults) and b) commit after each update.
The second thing is that this is done in multiple threads, so I need to ensure that if thread A is taking care of a record, thread B gets a different one.
How can I implement this sensibly with Hibernate?
To give a better idea, the following code would be executed by several threads, where all threads share a single instance of the RecordIterator:
Iterator<Record> iter = db.getRecordIterator();
while (iter.hasNext()) {
    Record rec = iter.next();
    // do something lengthy here
    db.save(rec);
}
So my question is how to implement the RecordIterator. If I perform a query on every next(), how do I ensure that I don't return the same record twice? If I don't query each time, which query should I use to return detached objects? Is there a flaw in the general approach (e.g. use one RecordIterator per thread and let the DB somehow handle the synchronization)? Additional info: there are way too many records to keep them locally (e.g. in a set of treated records).
Update: Because the overall process takes some time, the status of records can change in the meantime, and because of that the ordering of a query's results can change between calls. I guess that to solve this problem I have to mark records in the database once I return them for processing...
Hmmm, what about pushing your objects from a reader thread into some bounded blocking queue, and letting your updater threads read from that queue?
In your reader, do some paging with setFirstResult/setMaxResults. E.g. if your queue holds at most 1000 elements, fill it 500 at a time. When the queue is full, the next push will automatically wait until the updaters have taken the next elements.
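A rough sketch of that, assuming a Hibernate Session named session and the db.save(...) helper from the question:

import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final BlockingQueue<Record> queue = new ArrayBlockingQueue<>(1000);

// Reader thread: pages through the table; put() blocks while the queue is full,
// so the DB cursor is never held open across a long-running update.
Runnable reader = () -> {
    try {
        int first = 0;
        List<Record> page;
        do {
            page = session.createQuery("from Record")
                          .setFirstResult(first)
                          .setMaxResults(500)
                          .list();
            for (Record rec : page) {
                queue.put(rec);
            }
            first += page.size();
        } while (!page.isEmpty());
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
};

// Updater threads: each take() hands a record to exactly one thread,
// so no record is processed twice.
Runnable updater = () -> {
    try {
        while (true) {
            Record rec = queue.take();
            // do something lengthy here, then commit this single record
            db.save(rec);
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
};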
My suggestion, since you're sharing an instance of the master iterator, is to run all of your threads inside a shared Hibernate transaction, with one load at the beginning and one big save at the end. You load all of your data into a single Set, which you can iterate over from your threads (be careful of locking; you might want to split off a section for each thread, or otherwise manage the shared resource so that the threads don't overlap).
The beauty of the Hibernate solution is that the records aren't immediately saved to the database: since you're using a transaction, they are kept in Hibernate's cache, and at the end they're all written back to the database at once. This saves on the expensive database writes you're worried about, plus it gives you an actual object to work with on each iteration, instead of just a database row.
I see in your update that the status of the records may change during processing, and this could always cause a problem. If this is a constantly running or long-running process, then my advice with a Hibernate solution would be to work in smaller sets and, yes, add a flag to mark records that have been updated, so that when you move on to the next set you can pick up the ones that haven't been touched.