Jira: Thread-safe Gadget Data?

I have some data (two HashSets and an Instant timestamp) that I'd like all requests to my JIRA (OpenSocial?) gadget/plugin to share -- because it takes a long time to generate (a couple of minutes) and because sharing will make the requests more performant.
Occasionally (very rarely), a request might include a parameter indicating that this shared data should be refreshed. And of course, the first time it's needed, it gets populated. It is okay for the data to represent a stale answer -- it is based on things that change slowly and is used to visualize trends, so off-by-one errors are tolerable.
I imagine that when JIRA starts up (or I upload a new version of my add-on) and multiple requests come in during the first couple of minutes, I'd need to handle the population of this expensive shared data in a thread-safe way. Currently the results look fine, but as I understand it, that has just been due to chance.
Only one thread needs to do the work of populating. On start-up, the other threads will have to wait, of course, because they can't skip ahead empty-handed. (If all threads do the expensive initialization, that's a lot of unnecessary load on the server.)
But after the initial cost, if multiple concurrent requests come in and one of them includes the 'refresh' parameter, only that one thread needs to pay the price -- I'm fine with the other threads using an old copy of the expensive data and thereby staying performant, and including in the response that "yes someone out there is refreshing the data but here's a result using an old copy".
More about the data: The two HashSets and the timestamp are intended to represent a consistent snapshot in time. The HashSet contents depend on values in the database only, and the timestamp is just the time of the most recent refresh. None of this data depends on any earlier snapshot in time. And none of it depends on program state either. The timestamp is only used to answer the question "how old is this data" in a rough sense. Every time the data is refreshed, I'd expect the timestamp to be more recent but nothing is going to break if it's wrong. It's just for debugging and transparency. Since a snapshot doesn't depend on earlier snapshots or the program state, it could be wrapped and marked as volatile.
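A minimal sketch of that wrapped-snapshot idea, assuming illustrative names (GadgetSnapshot, SnapshotHolder) and String set elements: the two sets and the timestamp are captured in one immutable object, published through a single volatile field.

import java.time.Instant;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical immutable snapshot: the two sets plus the refresh timestamp.
final class GadgetSnapshot {
    final Set<String> setA;
    final Set<String> setB;
    final Instant refreshedAt;

    GadgetSnapshot(Set<String> setA, Set<String> setB) {
        // Defensive, unmodifiable copies: the snapshot can be handed to any
        // number of reader threads without further locking.
        this.setA = Collections.unmodifiableSet(new HashSet<>(setA));
        this.setB = Collections.unmodifiableSet(new HashSet<>(setB));
        this.refreshedAt = Instant.now();
    }
}

class SnapshotHolder {
    // volatile publish point: readers see either the old snapshot or the new
    // one in full, never a half-built mix of sets and timestamp.
    private volatile GadgetSnapshot current;

    GadgetSnapshot current() { return current; }

    void publish(GadgetSnapshot snapshot) { current = snapshot; }
}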
Is there an obvious choice for the best way to go about this? Pros and cons of alternatives?

You'll want to use Locks to synchronize access to the sections of your code that you need to have only one thread executing at once. There are plenty of resources on SO and in the Oracle Java docs that show how to use locks in more detail, but something like this should do the trick.
The idea is that you want to maintain a copy of the most-recently generated set of results, and you always return that copy until you have a new set of data available.
import java.util.concurrent.locks.ReentrantLock;

public class MyClass
{
    private volatile MyObject completedResults;
    private final ReentrantLock resultsLock;
    private final ReentrantLock refreshLock;

    public MyClass()
    {
        // This must be a singleton class (such as a servlet) for this to work, since every
        // thread needs to be accessing the same lock.
        resultsLock = new ReentrantLock();
        refreshLock = new ReentrantLock();
    }

    public MyObject myMethodToRequestResults(boolean refresh)
    {
        MyObject resultsToReturn;

        // Serialize access to get the most-recently completed set of results; if none exists,
        // we need to generate it and all requesting threads need to wait.
        resultsLock.lock();
        try
        {
            if (completedResults == null)
            {
                completedResults = generateResults();
                refresh = false; // we just generated it, so no point in redoing it below
            }
            resultsToReturn = completedResults;
        }
        finally
        {
            resultsLock.unlock();
        }

        if (refresh)
        {
            // If someone else is regenerating, we just return the old data and tell the caller that.
            if (!refreshLock.tryLock())
            {
                // Create a copy of the results to return, since we're about to modify it on the next line
                // and we don't want to change the (shared) original!
                resultsToReturn = new MyObject(resultsToReturn);
                resultsToReturn.setSomeoneElseIsRegeneratingTheStuffRightNow(true);
            }
            else
            {
                try
                {
                    completedResults = generateResults();
                    resultsToReturn = completedResults;
                }
                finally
                {
                    refreshLock.unlock();
                }
            }
        }
        return resultsToReturn;
    }
}
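Hypothetical wiring, for completeness: every request thread must go through one shared MyClass instance, otherwise each instance has its own locks and its own cached reference. GadgetDataHolder is an illustrative name.

// Hypothetical: a single shared instance, e.g. created once at plugin start-up.
public class GadgetDataHolder {
    private static final MyClass SHARED = new MyClass();

    public static MyObject results(boolean refresh) {
        return SHARED.myMethodToRequestResults(refresh);
    }
}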

Related

Design AppServer Interview Discussion

I encountered the following question in a recent System Design Interview:
Design an AppServer that interfaces with a Cache and a DB.
I came up with this:
public class AppServer {
    public Database DB;
    public Cache cache;

    public Value get(Key k) {
        Value res = cache.get(k);
        if (res == null) {
            res = DB.get(k);
            cache.set(k, res);
        }
        return res;
    }

    public void set(Key k, Value v) {
        cache.set(k, v);
        DB.set(k, v);
    }
}
This code is fine and works correctly, but the follow-ups to the question are:
What if there are multiple threads?
What if there are multiple instances of the AppServer?
Suddenly AppServer performance degrades a ton, and we find out this is because our cache is consistently missing. The cache size is fixed (already the largest it can be). How can we prevent this?
Response:
I answered that we can use locks or condition variables. In Java, we can mark each method synchronized to get mutual exclusion, but the interviewer mentioned that this isn't very efficient and wanted only the critical parts synchronized.
I thought we only need to synchronize the two set lines in void set(Key k, Value v) and the one cache.set call in Value get(Key k); however, the interviewer pushed for also synchronizing res = DB.get(k);. I agreed with him in the end, but don't fully understand why. Don't threads have independent stacks and a shared heap? When a thread executes get, it stores res in a local variable on its stack frame; even if another thread executes get concurrently, the former thread retains its own value. Then each thread sets its respective fetched value.
How can we handle multiple instances of the AppServer?
I came up with a distributed-queue solution (like Kafka): every time we perform a set/get command, we enqueue that command. He mentioned that set is OK because the action sets a value in the cache/DB, but how would you return the correct value for get? Can someone explain this?
Are there also possible solutions using a versioning system or an event system?
Possible solutions:
L1, L2, L3 caches - layers and more caches
Regional / Segmentation caches - use a different cache for different user groups.
Any other ideas?
Will upvote all insightful responses :)
1. Multiple threads
Although JDBC is "supposed" to be thread-safe, some drivers aren't, and I'm going to assume that Cache isn't thread-safe either (although most caches should be), so in that case you would need to make the following changes to your code:
Make both fields final
Synchronize the ENTIRE get(...) method
Synchronize the ENTIRE set(...) method
Assuming there is no other way to access these fields, the behavior of your get(...) method depends on two things: first, that updates from the set(...) method can be seen, and second, that a cache miss is then stored by only a single thread. You need to synchronize because the idea is to have only one thread perform the expensive DB query when there is a cache miss. If you do not synchronize the entire get(...) method, or you split the synchronized statement, it is possible for another thread to also see a cache miss between the lookup and the insertion.
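As a sketch of those changes -- Key, Value, Database, and Cache here are stand-ins for the question's placeholder types:

// Placeholder types, as in the question.
class Key {}
class Value {}
interface Database { Value get(Key k); void set(Key k, Value v); }
interface Cache { Value get(Key k); void set(Key k, Value v); }

public class AppServer {
    private final Database db;
    private final Cache cache;

    public AppServer(Database db, Cache cache) {
        this.db = db;
        this.cache = cache;
    }

    // Only one thread at a time can observe a cache miss, so the expensive
    // DB lookup for a missing key happens once, not once per thread.
    public synchronized Value get(Key k) {
        Value res = cache.get(k);
        if (res == null) {
            res = db.get(k);
            cache.set(k, res);
        }
        return res;
    }

    // The cache and DB writes happen atomically with respect to get(...),
    // so a reader never sees the pair mid-update.
    public synchronized void set(Key k, Value v) {
        cache.set(k, v);
        db.set(k, v);
    }
}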
The way I would answer this question is honestly just to toss the entire thing. I would look at how the cache in JCIP (Java Concurrency in Practice) is written and base my answer on that.
2. Multiple instances of the AppServer
I think your queue solution is fine.
I believe your interviewer means that if another instance of AppServer did not have cached what was already set(...) by a different instance, it would look up and find the correct value in the DB. This breaks down with multiple threads, because two threads can set(...) conflicting values: the caches would then hold two different values and, depending on the thread safety of your DB, the DB might not even have the value at all.
Ideally, you'd never create more than a single instance of your AppServer.
3. Cache misses
I don't have enough experience to evaluate this question specifically, but perhaps an LRU cache would improve performance somewhat (a minimal sketch follows below), or using a hash ring buffer. It might be a stretch, but if you wanted to throw something out there, perhaps even using ML to determine the best values to preload or retain at certain times of day could also work.
If your cache always misses, there is no way to improve this code; performance then depends entirely on your database.
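For what it's worth, the LRU idea can be sketched on top of LinkedHashMap's access-order mode. This is illustrative only and not thread-safe on its own (wrap it with Collections.synchronizedMap or guard it with a lock for concurrent use):

import java.util.LinkedHashMap;
import java.util.Map;

// Evicts the least-recently-used entry once the size cap is exceeded.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, not insertion order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // called by put(); returning true evicts the eldest entry
    }
}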

Get/Set the value in the cache using the AtomicReference in java

I've already posted this question on the Code Review site (https://codereview.stackexchange.com/questions/158999/get-set-the-value-in-the-cache-using-the-atomicreference-in-java), but I'm posting it here as well so that it reaches a wider audience and I can get a solution more quickly.
I have the code below, which gets and sets data in the cache using a synchronized block, and I want to know if I can optimize it:
// Shared field: the reference must outlive a single call for the cache to work.
private final AtomicReference<Integer> cachedIntRef = new AtomicReference<Integer>();

public int getValue() {
    boolean wasCached = true;
    Integer cachedInt = cachedIntRef.get();
    if (cachedInt == null) {
        synchronized (cachedIntRef) {
            cachedInt = cachedIntRef.get();
            if (cachedInt == null) {
                wasCached = false;
                // Make DB call to get the data and update the cache.
                cachedInt = baseDao.getCloudMaximumWeight();
                cachedIntRef.set(cachedInt);
            }
        }
    }
    return cachedInt;
}
I want to know whether there is any way to remove the synchronized block and optimize further, or whether this code is already optimal?
EDIT: I'll remove the question from one of the sites if I get an answer on either. Also, when I profile my application, sometimes even with a small number of threads I see threads blocking on the synchronized piece of code. That made me think that since the code is using an AtomicReference, I could somehow get rid of synchronized, or that there is some other, better way to optimize the code.
I want to know whether there is any way to remove the synchronized block and optimize further, or whether this code is already optimal?
I assume that optimizing the code means removing the synchronized block. The problem with that thinking is that most likely your dao call is significantly more expensive than synchronized. Any IO (especially to a remote database) is going to be at least 4+ orders of magnitude more expensive than the locking.
That said, you can remove the synchronized block if you don't mind multiple DAO calls when initializing the cache. If the DAO calls are inexpensive then having 2 threads making them maybe isn't a problem. There is a race condition on which one's answer will be put into the cache but chances are their results will be the same anyway. I often do this and assume that as the application starts up, the first couple of calls are going to be more expensive as the cache warms. But are 2 threads making the same DAO request ever going to be faster than 1 thread doing it and 1 waiting for the other thread to finish?
If there are a number of different DAO calls, you can try some sort of lock segregation so that not all cache requests go through the same lock. This would allow some parallelization, which might help. I can't tell whether your code is the actual case or just an example of the problem. This is how ConcurrentHashMap works, for example (see the sketch below).
But really, I would make sure this section of code has performance problems before worrying too much about it. And even if a profiler says it is a primary time sink, it may just be that the DAO calls are the most expensive part of the equation, so saving a couple of them with synchronization would be the best way to speed things up anyway. You can replace the DAO calls with a straight assignment if you need to see whether it is the synchronized block or the dao.* calls that is the problem.
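To make the lock-segregation point concrete: ConcurrentHashMap.computeIfAbsent locks only the bin holding the requested key, so loads of different keys proceed in parallel. A sketch, with a Supplier standing in for the DAO call from the question:

import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class WeightCache {
    private final ConcurrentHashMap<String, Integer> cache = new ConcurrentHashMap<>();
    private final Supplier<Integer> daoCall; // stand-in for baseDao.getCloudMaximumWeight()

    WeightCache(Supplier<Integer> daoCall) {
        this.daoCall = daoCall;
    }

    int getValue() {
        // The loader runs at most once while the entry is absent; lookups of
        // other keys are not blocked because only this key's bin is locked.
        return cache.computeIfAbsent("maxWeight", k -> daoCall.get());
    }
}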
Try using a volatile Integer instead. Maybe I am missing something, but I don't see the use case for the AtomicReference here.
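A sketch of that volatile variant; BaseDao is a simplified stand-in for the question's DAO. Note that volatile gives visibility only: two threads racing past the null check may both call the DAO, which is acceptable only if duplicate identical loads are benign, as discussed in the answer above.

class VolatileCache {
    interface BaseDao { int getCloudMaximumWeight(); } // simplified stand-in

    private final BaseDao baseDao;
    private volatile Integer cachedInt; // a write here is visible to all threads

    VolatileCache(BaseDao baseDao) {
        this.baseDao = baseDao;
    }

    int getValue() {
        Integer local = cachedInt;
        if (local == null) {
            // No lock: a second thread racing here may also hit the DAO, but
            // both calls should return the same value, so the race is benign.
            local = baseDao.getCloudMaximumWeight();
            cachedInt = local;
        }
        return local;
    }
}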

threads accessing non-synchronised methods in Java

Can someone explain to me how threads and synchronisation work in Java?
I want to write a high-performance application. Inside this application, I read data from files into some nested classes, which are basically a shell around HashMap.
After the data reading is finished, I start threads which need to go through the data and perform different checks on it. However, threads never change the data!
If I can guarantee (or at least try to guarantee;) that my threads never change the data, can I use them calling non-synchronised methods of objects containing data?
If multiple threads access the non-synchronised method, which does not change any class field, but has some internal variables, is it safe?
artificial example:
public class Data {
    // this hash map is filled before I start threads
    protected Map<Integer, Spike> allSpikes = new HashMap<Integer, Spike>();

    public Map<Integer, Spike> returnBigSpikes() {
        Map<Integer, Spike> bigSpikes = new HashMap<Integer, Spike>();
        for (Integer i : allSpikes.keySet()) {
            if (allSpikes.get(i).spikeSize > 100) {
                bigSpikes.put(i, allSpikes.get(i));
            }
        }
        return bigSpikes;
    }
}
Is it safe to call a NON-synchronised method returnBigSpikes() from threads?
I understand now that such use-cases are potentially very dangerous, because it's hard to guarantee that the data (e.g., the returned bigSpikes) will not be modified. But I have already implemented and tested it like this, and want to know if I can use the results of my application now and change the architecture later...
What happens if I make the methods synchronised? Will the application be slowed down to single-CPU performance? If so, how can I design it correctly and keep the performance?
(I read about 20-40 GB of data (log messages) into main memory and then run threads which need to go through all the data to find correlations in it; each thread gets only a part of the messages to analyse, but for the analysis, the thread must compare each message from its part with many other messages from the data; that's why I first decided to let threads read the data without synchronisation.)
Thank You very much in advance.
If allSpikes is populated before all the threads start, you could make sure it isn't changed later by saving it as an unmodifiable map.
Assuming Spike is immutable, your method would then be perfectly safe to use concurrently.
In general, if you have a bunch of threads where you can guarantee that only one thread will modify a resource and the rest will only read that resource, then access to that resource doesn't need to be synchronised. In your example, each time the method returnBigSpikes() is invoked it creates a new local copy of bigSpikes hashmap, so although you're creating a hashmap it is unique to each thread, so no sync'ing problems there.
As long as everything is practically immutable (e.g. using the final keyword) and you use an unmodifiableMap, everything is fine.
I would suggest the following UnmodifiableData:
public class UnmodifiableData {
    final Map<Integer, Spike> bigSpikes;

    public UnmodifiableData(Map<Integer, Spike> bigSpikes) {
        this.bigSpikes = Collections.unmodifiableMap(new HashMap<>(bigSpikes));
    }
    ....
}
Your plan should work fine. You do not need to synchronize reads, only writes.
If, however, in the future you wish to cache bigSpikes so that all threads get the same map then you need to be more careful about synchronisation.
If you use ConcurrentHashMap, it will do all the synchronization work for you. It's better than synchronizing around an ordinary HashMap.
Since allSpikes is initialized before you start the threads, it's safe. Concurrency problems appear only when one thread writes to a resource while others read from it.
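To make the safe-publication point concrete, a sketch (with Integer values standing in for Spike): the map is fully built and wrapped unmodifiable before any reader thread starts, so the readers need no synchronisation at all.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SpikeAnalysis {
    public static void main(String[] args) throws InterruptedException {
        Map<Integer, Integer> spikes = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            spikes.put(i, i % 200); // built completely before any thread starts
        }
        final Map<Integer, Integer> shared = Collections.unmodifiableMap(spikes);

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 4; t++) {
            pool.submit(() -> {
                // Read-only access is safe: the map was fully published before
                // the threads started and is never modified afterwards.
                long big = shared.values().stream().filter(v -> v > 100).count();
                System.out.println("big spikes seen: " + big);
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}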

Thread.sleep() in a while loop

I notice that NetBeans is warning me about using Thread.sleep() in a while loop in my Java code, so I've done some research on the subject. It seems the issue is primarily one of performance, where your while condition may become true while the thread is still sleeping, thus wasting wall-clock time as you wait for the next iteration. This all makes perfect sense.
My application has a need to contact a remote system and periodically poll for the state of an operation, waiting until the operation is complete before sending the next request. At the moment the code logically does this:
String state = fetchStateViaRpc(); // pseudocode: get state via RPC call
while (!state.equals("complete")) {
    Thread.sleep(10000); // wait 10 seconds
    state = fetchStateViaRpc(); // update state via RPC call
}
Given that the circumstance is checking a remote operation (which is a somewhat expensive process, in that it runs for several seconds), is this a valid use of Thread.sleep() in a while loop? Is there a better way to structure this logic? I've seen some examples where I could use a Timer class, but I fail to see the benefit, as it still seems to boil down to the same straightforward logic above, but with a lot more complexity thrown in.
Bear in mind that the remote system in this case is neither under my direct control, nor is it written in Java, so changing that end to be more "cooperative" in this scenario is not an option. My only option for updating my application's value for state is to create and send an XML message, receive a response, parse it, and then extract the piece of information I need.
Any suggestions or comments would be most welcome.
Unless your remote system can issue an event or otherwise notify you asynchronously, I don't think the above is at all unreasonable. You need to balance your sleep() time vs. the time/load that the RPC call makes, but I think that's the only issue and the above doesn't seem of concern at all.
Without being able to change the remote end to provide a "push" notification that it is done with its long-running process, that's about as well as you're going to be able to do. As long as the Thread.sleep time is long compared to the cost of polling, you should be OK.
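If the raw sleep loop still bothers you, a ScheduledExecutorService expresses the same polling cadence without hand-rolling the loop; fetchStateViaRpc() below is a placeholder for the XML request/response round trip the question describes.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class OperationPoller {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void pollUntilComplete(Runnable onComplete) {
        // Re-check every 10 seconds; stop the scheduler once the remote
        // operation reports completion.
        scheduler.scheduleWithFixedDelay(() -> {
            if ("complete".equals(fetchStateViaRpc())) {
                onComplete.run();
                scheduler.shutdown();
            }
        }, 0, 10, TimeUnit.SECONDS);
    }

    private String fetchStateViaRpc() {
        return "pending"; // placeholder for the real XML request/response
    }
}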
You should (almost) never use sleep, since it's very inefficient and not good practice. Always use locks and condition variables where threads signal each other. See Mike Dahlin's Coding Standards for Programming with threads.
A template is:
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class Foo {
    private final Lock lock;
    private final Condition c1;
    private final Condition c2;

    public Foo() {
        lock = new ReentrantLock(); // the standard java.util.concurrent lock
        c1 = lock.newCondition();
        c2 = lock.newCondition();
        ...
    }

    public void doIt() {
        lock.lock(); // acquire before try, so a failed lock() never reaches unlock()
        try {
            ...
            while (...) {
                c1.awaitUninterruptibly();
            }
            ...
            c2.signal();
        } finally {
            lock.unlock();
        }
    }
}

Avoiding multiple repopulations of the same cache region (due to concurrency)

I have a high traffic website and I use hibernate. I also use ehcache to cache some entities and queries which are required to generate the pages.
The problem is "parallel cache misses". The long explanation is that when the application boots and the cache regions are cold, each cache region is populated many times (instead of only once) by different threads, because the site is being hit by many users at the same time. In addition, when a cache region is invalidated, it is repopulated many times for the same reason.
How can I avoid this?
I managed to convert one entity cache and one query cache to a BlockingCache by providing my own implementation to hibernate.cache.provider_class, but the semantics of BlockingCache do not seem to work. Even worse, sometimes the BlockingCache deadlocks (blocks) and the application hangs completely. A thread dump shows that processing is blocked on the mutex of BlockingCache on a get operation.
So, the question is, does Hibernate support this kind of use?
And if not, how do you solve this problem on production?
Edit: The hibernate.cache.provider_class points to my custom cache provider, which is a copy-paste of SingletonEhCacheProvider; at the end of the start() method (after line 136) I do:
Ehcache cache = manager.getEhcache("foo");
if (!(cache instanceof BlockingCache)) {
    manager.replaceCacheWithDecoratedCache(cache, new BlockingCache(cache));
}
That way upon initialization, and before anyone else touches the cache named "foo", I decorate it with BlockingCache. "foo" is a query cache and "bar" (same code but omitted) is an entity cache for a pojo.
Edit 2: "Doesn't seem to work" means that the initial problem still exists. Cache "foo" is still being repopulated many times with the same data because of the concurrency. I validate this by stressing the site with JMeter with 10 threads. I'd expect the other 9 threads to block until the first one that requested data from "foo" finishes its job (executes the queries, stores the data in the cache), and then to get the data directly from the cache.
Edit 3: Another explanation for this problem can be seen at https://forum.hibernate.org/viewtopic.php?f=1&t=964391&start=0 but with no definite answer.
I'm not quite sure, but:
It allows concurrent read access to elements already in the cache. If the element is null, other reads will block until an element with the same key is put into the cache.
Doesn't that mean Hibernate would wait until some other thread places the object into the cache? That's what you observe, right?
Hibernate and the cache work like this:
1. Hibernate gets a request for an object.
2. Hibernate checks whether the object is in the cache -- cache.get().
3. Not there? Hibernate loads the object from the DB and puts it into the cache -- cache.put().
So if the object is not in the cache (not placed there by some previous update operation), Hibernate would wait forever at step 2, the cache.get().
I think you need a cache variant where the thread waits for an object only a short time, e.g. 100 ms. If the object has not arrived, the thread should get null (and thus Hibernate will load the object from the DB and place it into the cache).
Actually, a better logic would be:
1. Check whether another thread is requesting the same object.
2. If true, wait a while (say 500 ms) for the object to arrive.
3. If not, return null immediately.
(We cannot wait at step 2 forever, as the other thread may fail to put the object into the cache -- due to an exception.)
If BlockingCache doesn't support this behaviour, you need to implement a cache yourself. I did it in the past; it's not hard -- the main methods are get() and put() (though the API has apparently grown since then).
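For the roll-your-own route, a sketch along the lines of the Memoizer from Java Concurrency in Practice, with a generic loader function: per-key FutureTasks make the first requesting thread compute the value while later threads block on the same future, and a failed load is removed so it can be retried.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;
import java.util.function.Function;

class Memoizer<K, V> {
    private final ConcurrentHashMap<K, Future<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    Memoizer(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) throws InterruptedException, ExecutionException {
        Future<V> f = cache.get(key);
        if (f == null) {
            FutureTask<V> task = new FutureTask<>(() -> loader.apply(key));
            f = cache.putIfAbsent(key, task); // only one thread wins the race
            if (f == null) {
                f = task;
                task.run(); // the winner computes; everyone else blocks in f.get()
            }
        }
        try {
            return f.get();
        } catch (ExecutionException e) {
            cache.remove(key, f); // allow a retry after a failed load
            throw e;
        }
    }
}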
UPDATE
Actually, I just read the source of BlockingCache. It does exactly what I said -- lock and wait for a timeout. Thus you don't need to do anything; just use it:
public Element get(final Object key) throws RuntimeException, LockTimeoutException {
    Sync lock = getLockForKey(key);
    Element element;
    acquiredLockForKey(key, lock, LockType.WRITE);
    element = cache.get(key);
    if (element != null) {
        lock.unlock(LockType.WRITE);
    }
    return element;
}

public void put(Element element) {
    if (element == null) {
        return;
    }
    Object key = element.getObjectKey();
    Object value = element.getObjectValue();
    getLockForKey(key).lock(LockType.WRITE);
    try {
        if (value != null) {
            cache.put(element);
        } else {
            cache.remove(key);
        }
    } finally {
        getLockForKey(key).unlock(LockType.WRITE);
    }
}
So it's kind of strange that it doesn't work for you. Tell me something: in your code, this spot:
Ehcache cache = manager.getEhcache("foo");
is it synchronized? If multiple requests come in at the same time, will there be only one instance of the cache?
The biggest improvement on this issue is that ehcache now (since 2.1) supports the transactional hibernate cache policy. This vastly mitigates the problems described in this issue.
To go a step further (locking threads while they access the same query cache region), one would need to implement a QueryTranslatorFactory that returns custom (extended) QueryTranslatorImpl instances, which would inspect the query and parameters and block as necessary in the list method. This of course concerns the specific use case of a query cache with HQL queries that fetch many entities.
