How to make a cache thread-safe - Java

I have an instance of an object which performs a very complex operation.
The first time it is needed, I create the instance and save it in my own custom cache.
From then on, any thread that finds a ready-made object already present in the cache takes it from there, for the sake of performance.
I was worried about two threads holding the same instance. Is there a chance that the two threads can corrupt each other?
Map<String, SoftReference<CacheEntry<ClassA>>> AInstances= Collections.synchronizedMap(new HashMap<String, SoftReference<CacheEntry<ClassA>>>());

There are many possible solutions:
Use an existing caching solution like EHcache
Use the Spring framework, which provides an easy way to cache the results of a method with a simple @Cacheable annotation
Use one of the concurrent maps like ConcurrentHashMap
If you know all keys in advance, you can use lazy init code (see the sketch below). Note that everything in such code is there for a reason; change anything in get() and it will break eventually (eventually == "your unit tests will work and it will break after running one year in production without any problem whatsoever").
ConcurrentHashMap is the simplest to set up, but it has no simple way to say "initialize the value of a key once".
Don't try to implement the caching by yourself; multithreading in Java has become a very complex area with Java 5 and the advent of multi-core CPUs and memory barriers.
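If you do decide to roll it yourself despite the warning above, here is a sketch of the kind of per-key lazy-init code the list item refers to (the original snippet is not reproduced here, so the names and structure are assumptions). Every detail matters, in particular the volatile field and the second null check:

import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class LazyValue<T> {
    private final Supplier<T> supplier;
    private volatile T value;   // volatile safely publishes the computed value

    LazyValue(Supplier<T> supplier) { this.supplier = supplier; }

    T get() {
        T result = value;              // fast path: a single volatile read
        if (result == null) {
            synchronized (this) {      // slow path: only one thread computes
                result = value;
                if (result == null) {
                    value = result = supplier.get();
                }
            }
        }
        return result;
    }
}

// All keys are known in advance, so the map itself is never modified after setup
// and can be a plain HashMap, published safely (e.g. via a final field) before the threads start:
Map<String, LazyValue<ClassA>> cache = new HashMap<>();
cache.put("a", new LazyValue<>(() -> createClassA("a")));   // createClassA is a hypothetical factory
cache.put("b", new LazyValue<>(() -> createClassA("b")));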
[EDIT] yes, this might happen even though the map is synchronized. Example:
SoftReference<...> value = cache.get( key );
if( value == null ) {
    value = computeNewValue( key );
    cache.put( key, value );
}
If two threads run this code at the same time, computeNewValue() will be called twice. The individual calls to get() and put() are safe; several threads can try to put at the same time and nothing bad will happen. But that doesn't protect you from the problems which arise when you call several methods in succession and the state of the map must not change between them.
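Since Java 8, ConcurrentHashMap.computeIfAbsent provides exactly the missing "initialize the value of a key once" operation and closes the check-then-act gap shown above. A minimal sketch, reusing ClassA and computeNewValue() from the snippets above (the SoftReference wrapping is left out for brevity):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

ConcurrentMap<String, ClassA> cache = new ConcurrentHashMap<>();

ClassA getInstance(String key) {
    // The mapping function runs at most once per key, atomically,
    // even if several threads ask for the same missing key at the same time.
    return cache.computeIfAbsent(key, k -> computeNewValue(k));
}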

Assuming you are talking about singletons, simply use the "initialization-on-demand holder idiom" to make sure your "check" works across the whole JVM. This will also make sure that all threads requesting the same object concurrently wait until the initialization is over and are given back only a valid object instance.
Here I'm assuming you want a single instance of the object. If not, you might want to post some more code.
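For reference, a minimal sketch of that idiom, assuming ClassA from the question is the expensive object. The JVM's class-initialization guarantees ensure the instance is created lazily, exactly once, and published safely to every thread:

public final class ClassAHolder {

    private ClassAHolder() {}

    private static class Holder {
        // Initialized on first access to Holder, under the JVM's class-init lock.
        static final ClassA INSTANCE = new ClassA();
    }

    public static ClassA getInstance() {
        // Threads calling this while initialization is running block until it completes
        // and then all see the same, fully constructed instance.
        return Holder.INSTANCE;
    }
}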

OK, if I understand your problem correctly, you are worried that two threads changing the state of the shared object will corrupt each other.
The short answer is yes, they can.
If the object is expensive to create but is only needed in a read-only manner, I suggest you make it immutable; this way you get the benefit of fast access and thread safety at the same time.
If the state should be writable but you don't actually need threads to see each other's updates, you can simply load the object once into an immutable cache and just return copies to anyone who asks for it.
Finally, if your object needs to be writable and shared (for reasons other than it just being expensive to create), then, my friend, you need to handle thread safety. I don't know your case, but you should take a look at the synchronized keyword, Locks and the Java 5 concurrency features such as atomic types. I am sure one of them will satisfy your need, and I sincerely hope that your case is one of the first two :)
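To illustrate the first two options, a small sketch with made-up names: the expensive result is made immutable so it can be shared freely, and callers that need their own writable state get a defensive copy:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public final class ExpensiveResult {

    private final Map<String, Integer> values;

    public ExpensiveResult(Map<String, Integer> values) {
        // Defensive copy plus unmodifiable wrapper: the state can never change after construction.
        this.values = Collections.unmodifiableMap(new HashMap<>(values));
    }

    public Map<String, Integer> values() {
        return values; // safe to share between threads as-is
    }

    public Map<String, Integer> mutableCopy() {
        return new HashMap<>(values); // per-caller copy when writable state is needed
    }
}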

If you only have a single instance of the Object, have a quick look at:
Thread-safe cache of one object in java
Otherwise I can't recommend the Google Guava library enough; in particular, look at the MapMaker class.
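As a rough sketch of what that could look like with Guava's caching API (CacheBuilder has largely superseded MapMaker for caches; computeNewValue() stands in for whatever expensive creation you do, and the builder options are just examples):

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

LoadingCache<String, ClassA> cache = CacheBuilder.newBuilder()
        .softValues()       // roughly what the hand-rolled SoftReference map was doing
        .maximumSize(1000)
        .build(new CacheLoader<String, ClassA>() {
            @Override
            public ClassA load(String key) {
                return computeNewValue(key); // invoked at most once per key at a time
            }
        });

ClassA instance = cache.getUnchecked("someKey");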

Related

Is HashMap's containsKey method thread-safe if the map is initialized once, and is never modified again?

Can we use HashMap's containsKey() method without synchronizing in a multi-threaded environment?
Note: Threads are only going to read the Hashmap. The map is initialized once, and is never modified again.
It really depends on how/when your map is accessed.
Assuming the map is initialized once, and never modified again, then methods that don't modify the internal state like containsKey() should be safe.
In this case though, you should make sure your map really is immutable, and is published safely.
Now if in your particular case the state does change during the course of your program, then no, it is not safe.
From the documentation:
Note that this implementation is not synchronized.
If multiple threads access a hash map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally.
In this case, you should use ConcurrentHashMap, or synchronize externally.
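A small sketch of the "immutable and published safely" case mentioned above (class and field names are just illustrative): build the map completely, wrap it, and expose it only through a final field:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class Lookup {

    // final + fully built before the constructor returns = safe publication
    private final Map<String, String> entries;

    public Lookup(Map<String, String> source) {
        this.entries = Collections.unmodifiableMap(new HashMap<>(source));
    }

    public boolean contains(String key) {
        return entries.containsKey(key); // read-only, safe from any thread
    }
}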
You shouldn't look at a single method this way. A HashMap is not meant to be used in a multi-threaded setup.
Having said that, the one exception would be: a map that gets created once (single threaded), and afterwards is "read" only. In other words: if a map doesn't get changed anymore, then you can have as many threads reading it as you want.
From that point of view, containsKey() calls alone shouldn't cause a problem. The problem arises when the information that this method relies on changes over time.
No, it is not thread-safe for any operations. You need to synchronise all access or use something like ConcurrentHashMap.
My favourite production system troubleshooting horror story is when we found that HashMap.get went into an infinite loop locking up 100% CPU forever because of missing synchronisation. This happened because the linked lists that were used within each bucket got into an inconsistent state. The same could happen with containsKey.
You should be safe if no one modifies the HashMap after it has been initially published, but better use an implementation that guarantees this explicitly (such as ImmutableMap or, again, a ConcurrentMap).
No. (No it is not. Not at all. 30 characters?)
It's complicated, but, mostly, no.
The spec of HashMap makes no guarantees whatsoever. It therefore reserves the right to blast yankee doodle dandy from your speakers if you try: You're just not supposed to use it that way.
... however, in practice, whilst the API of HashMap makes no guarantees, generally it works out. But mind the horror story in @Thilo's answer.
... buuut, the Java Memory Model works like this: you should assume that each thread gets its own copy of each and every field across the entire heap of the VM, and that these copies are only synced up at indeterminate times. That means all sorts of code simply isn't going to work right: you add an entry to the map from one thread, and when you then access that map from another thread you may not see it, even though a lot of time has passed; that is theoretically possible. Also, internally, the map uses multiple fields and presumably these fields must be consistent with each other or you'll get weird behaviour (exceptions and wrong results). The JMM makes no guarantees about that consistency either. The way out of this dilemma is that the JMM offers 'happens-before' relationships, which give you guarantees that changes have been synced up. Using the synchronized keyword is one easy way to establish such relationships.
Why not use a ConcurrentHashMap, which has all the bells and whistles built in and does in fact guarantee that adding an entry from thread A and then querying it via containsKey from thread B gives you a consistent answer? (That answer might still be 'no, that key is not in the map', because perhaps thread B got there slightly before thread A and there's no way for you to know; but it won't throw any exceptions or do something really bizarre such as suddenly returning false for things you added ages ago.)
So, whilst it's complicated, the answer is basically: Don't do that; either use a synchronized guard, or, probably the better choice: ConcurrentHashMap.
No. Read the bold part of the HashMap documentation:
Note that this implementation is not synchronized.
So you should handle it:
If multiple threads access a hash map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally.
And suggested solutions:
This is typically accomplished by synchronizing on some object that naturally encapsulates the map. If no such object exists, the map should be "wrapped" using the Collections.synchronizedMap method
@user7294900 is right.
If your application does not structurally modify the HashMap after it has been built in a thread-safe way, and it only invokes the containsKey method, it is thread-safe.
For instance, I've used HashMap like this:
@Component
public class SpringSingletonBean {

    private Map<String, String> map = new HashMap<>();

    public void doSomething() {
        //
        if (map.containsKey("aaaa")) {
            // do something
        }
    }

    @PostConstruct
    public void init() {
        // do something to initialize the map
    }
}
It works well.

Data structure for non-blocking aggregation of Thread values?

Background:
I have a large thread pool in Java; each thread has some internal state.
I would like to gather some global information about the states. To do that, I have an associative, commutative aggregation function (e.g. sum; mine needs to be pluggable, though).
The solution needs to have a fixed memory consumption and in the best case be lock-free, not disturbing the pool at all. So no thread should need to acquire a lock (or enter a synchronized area) when writing to the data structure. The aggregated value is only read after the threads are done, so I don't need an accurate value all the time. Simply collecting all values and aggregating them after the pool is done might lead to memory problems.
The values are going to be more complex datatypes so I cannot use AtomicInteger etc.
My general idea for the solution:
Have a lock-free collection that all threads put their updates into. I don't even need the order of the events.
If it gets too big, run the aggregation function on it (compacting it) while the threads continue filling it.
My question:
Is there a data structure that allows for something like that or do I need to implement it from scratch? I couldn't find anything that directly matches my problem. If I have to implement from scratch what would be a good non-blocking collection class to start from?
If the updates are infrequent (relatively speaking) and the aggregation function is fast, I would recommend aggregating every time:
State myState;
AtomicReference<State> combinedState;

State original;
State newCombined;
do {
    original = combinedState.get();
    newCombined = aggregate(original, myState);
} while (!combinedState.compareAndSet(original, newCombined));
I don't quite understand the question, but I would, at first sight, suggest an IdentityHashMap where the keys are (references to) your thread objects and the values are where your thread objects write their statistics.
An IdentityHashMap relies only on reference equality, so there would never be any conflict between two thread objects; you could pass a reference to that map to each thread (which would then call .get(this) on the map to obtain a reference to its collecting data structure and fill in the data it wants). Otherwise you could just pass a reference to the collecting data structure to the thread object.
Such a map is inherently thread-safe for your use case, as long as you create the key/value pair for each thread before starting that thread, because no thread object will ever modify the map anyway; it won't have a reference to it. With some management smartness you can even remove entries from this map, even though the map itself is not thread-safe, once a thread is done with its work.
When all is done, you have a map whose values contain all the data collected.
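A rough sketch of that idea, with all names made up: one statistics object per worker thread, registered before the threads start, so the map itself is never modified concurrently and each thread writes only to its own slot. The values are read single-threaded after join(), which establishes the necessary happens-before relationship:

import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

class PerThreadStats {
    long count;   // each worker writes only to its own instance
    double sum;
}

List<Thread> workers = createWorkers();   // hypothetical factory for the not-yet-started worker threads
Map<Thread, PerThreadStats> statsByThread = new IdentityHashMap<>();

// Before starting the workers:
for (Thread worker : workers) {
    statsByThread.put(worker, new PerThreadStats());
}

// ... start the workers; each one updates statsByThread.get(Thread.currentThread()) ...
// ... join() all workers ...

// Aggregate single-threaded once everything is done:
long totalCount = 0;
double totalSum = 0;
for (PerThreadStats s : statsByThread.values()) {
    totalCount += s.count;
    totalSum += s.sum;
}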
Hope this helps... Reading the question again, in any case...

threads accessing non-synchronised methods in Java

Can someone explain to me how threads and synchronisation work in Java?
I want to write a high-performance application. Inside this application, I read data from files into some nested classes, which are basically a shell around HashMap.
After the data reading is finished, I start threads which need to go through the data and perform different checks on it. However, threads never change the data!
If I can guarantee (or at least try to guarantee) that my threads never change the data, can I have them call non-synchronised methods of the objects containing the data?
If multiple threads access a non-synchronised method which does not change any class field but has some internal local variables, is it safe?
artificial example:
public class Data {

    // this hash map is filled before I start threads
    protected Map<Integer, Spike> allSpikes = new HashMap<Integer, Spike>();

    public Map<Integer, Spike> returnBigSpikes() {
        Map<Integer, Spike> bigSpikes = new HashMap<Integer, Spike>();
        for (Integer i : allSpikes.keySet()) {
            if (allSpikes.get(i).spikeSize > 100) {
                bigSpikes.put(i, allSpikes.get(i));
            }
        }
        return bigSpikes;
    }
}
Is it safe to call a NON-synchronised method returnBigSpikes() from threads?
I understand now that such use-cases are potentially very dangerous, because it's hard to ensure that the data (e.g., the returned bigSpikes) will not be modified. But I have already implemented and tested it like this, and I want to know if I can use the results of my application now and change the architecture later...
What happens if I make the methods synchronised? Will the application be slowed down to single-CPU performance? If so, how can I design it correctly and keep the performance?
(I read about 20-40 GB of data (log messages) into main memory and then run threads which need to go through all the data to find correlations in it; each thread gets only a part of the messages to analyse, but for the analysis the thread has to compare each message from its part with many other messages from the data; that's why I first decided to allow the threads to read the data without synchronisation.)
Thank You very much in advance.
If allSpikes is populated before all the threads start, you could make sure it isn't changed later by saving it as an unmodifiable map.
Assuming Spike is immutable, your method would then be perfectly safe to use concurrently.
In general, if you have a bunch of threads where you can guarantee that only one thread will modify a resource and the rest will only read it, then access to that resource doesn't need to be synchronised. In your example, each invocation of returnBigSpikes() creates a new local bigSpikes HashMap, so although you're creating a HashMap, it is unique to each calling thread; no synchronisation problems there.
As long as everything is practically immutable (e.g. using the final keyword) and you use an unmodifiableMap, everything is fine.
I would suggest the following UnmodifiableData:
public class UnmodifiableData {

    final Map<Integer, Spike> bigSpikes;

    public UnmodifiableData(Map<Integer, Spike> bigSpikes) {
        this.bigSpikes = Collections.unmodifiableMap(new HashMap<>(bigSpikes));
    }
    ....
}
Your plan should work fine. You do not need to synchronize reads, only writes.
If, however, in the future you wish to cache bigSpikes so that all threads get the same map then you need to be more careful about synchronisation.
If you use ConcurrentHashMap, it will do all the synchronisation work for you. That's better than putting synchronisation around an ordinary HashMap.
Since allSpikes is initialized before you start threads it's safe. Concurrency problems appear only when a thread writes to a resource and others read from it.

In Java can I depend on reference assignment being atomic to implement copy on write?

If I have an unsynchronized java collection in a multithreaded environment, and I don't want to force readers of the collection to synchronize[1], is a solution where I synchronize the writers and use the atomicity of reference assignment feasible? Something like:
private Collection global = new HashSet(); // start threading after this

void allUpdatesGoThroughHere(Object exampleOperand) {
    // My hypothesis is that this prevents operations in the block being re-ordered
    synchronized(global) {
        Collection copy = new HashSet(global);
        copy.remove(exampleOperand);
        // Given my hypothesis, we should have a fully constructed object here. So a
        // reader will either get the old or the new Collection, but never an
        // inconsistent one.
        global = copy;
    }
}
// Do multithreaded reads here. All reads are done through a reference copy like:
// Collection copy = global;
// for (Object elm: copy) {...
// so the global reference being updated half way through should have no impact
Rolling your own solution seems to often fail in these type of situations, so I'd be interested in knowing other patterns, collections or libraries I could use to prevent object creation and blocking for my data consumers.
[1] The reasons being a large proportion of time spent in reads compared to writes, combined with the risk of introducing deadlocks.
Edit: A lot of good information in several of the answers and comments, some important points:
A bug was present in the code I posted. Synchronizing on global (a badly named variable) can fail to protect the synchronized block after a swap.
You could fix this by synchronizing on the class (moving the synchronized keyword to the method), but there may be other bugs. A safer and more maintainable solution is to use something from java.util.concurrent.
There is no "eventual consistency guarantee" in the code I posted, one way to make sure that readers do get to see the updates by writers is to use the volatile keyword.
On reflection the general problem that motivated this question was trying to implement lock free reads with locked writes in java, however my (solved) problem was with a collection, which may be unnecessarily confusing for future readers. So in case it is not obvious the code I posted works by allowing one writer at a time to perform edits to "some object" that is being read unprotected by multiple reader threads. Commits of the edit are done through an atomic operation so readers can only get the pre-edit or post-edit "object". When/if the reader thread gets the update, it cannot occur in the middle of a read as the read is occurring on the old copy of the "object". A simple solution that had probably been discovered and proved to be broken in some way prior to the availability of better concurrency support in java.
Rather than trying to roll out your own solution, why not use a ConcurrentHashMap as your set and just set all the values to some standard value? (A constant like Boolean.TRUE would work well.)
I think this implementation works well with the many-readers-few-writers scenario. There's even a constructor that lets you set the expected "concurrency level".
Update: Veer has suggested using the Collections.newSetFromMap utility method to turn the ConcurrentHashMap into a Set. Since the method takes a Map<E,Boolean>, my guess is that it does the same thing behind the scenes, setting all the values to Boolean.TRUE.
Update: Addressing the poster's example
That is probably what I will end up going with, but I am still curious about how my minimalist solution could fail. – MilesHampson
Your minimalist solution would work just fine with a bit of tweaking. My worry is that, although it's minimal now, it might get more complicated in the future. It's hard to remember all of the conditions you assume when making something thread-safe—especially if you're coming back to the code weeks/months/years later to make a seemingly insignificant tweak. If the ConcurrentHashMap does everything you need with sufficient performance then why not use that instead? All the nasty concurrency details are encapsulated away and even 6-months-from-now you will have a hard time messing it up!
You do need at least one tweak before your current solution will work. As has already been pointed out, you should probably add the volatile modifier to global's declaration. I don't know if you have a C/C++ background, but I was very surprised when I learned that the semantics of volatile in Java are actually much more complicated than in C. If you're planning on doing a lot of concurrent programming in Java then it'd be a good idea to familiarize yourself with the basics of the Java memory model. If you don't make the reference to global a volatile reference then it's possible that no thread will ever see any changes to the value of global until they try to update it, at which point entering the synchronized block will flush the local cache and get the updated reference value.
However, even with the addition of volatile there's still a huge problem. Here's a problem scenario with two threads:
We begin with the empty set, or global={}. Threads A and B both have this value in their thread-local cached memory.
Thread A obtains the synchronized lock on global and starts the update by making a copy of global and adding the new key to the set.
While Thread A is still inside the synchronized block, Thread B reads its local value of global onto the stack and tries to enter the synchronized block. Since Thread A is currently inside the monitor Thread B blocks.
Thread A completes the update by setting the reference and exiting the monitor, resulting in global={1}.
Thread B is now able to enter the monitor and makes a copy of the global={1} set.
Thread A decides to make another update, reads in its local global reference and tries to enter the synchronized block. Since Thread B currently holds the lock on {} there is no lock on {1} and Thread A successfully enters the monitor!
Thread A also makes a copy of {1} for purposes of updating.
Now Threads A and B are both inside the synchronized block and they have identical copies of the global={1} set. This means that one of their updates will be lost! This situation is caused by the fact that you're synchronizing on an object stored in a reference that you're updating inside your synchronized block. You should always be very careful which objects you use to synchronize. You can fix this problem by adding a new variable to act as the lock:
private volatile Collection global = new HashSet(); // start threading after this
private final Object globalLock = new Object(); // final reference used for synchronization

void allUpdatesGoThroughHere(Object exampleOperand) {
    // My hypothesis is that this prevents operations in the block being re-ordered
    synchronized(globalLock) {
        Collection copy = new HashSet(global);
        copy.remove(exampleOperand);
        // Given my hypothesis, we should have a fully constructed object here. So a
        // reader will either get the old or the new Collection, but never an
        // inconsistent one.
        global = copy;
    }
}
This bug was insidious enough that none of the other answers have addressed it yet. It's these kinds of crazy concurrency details that cause me to recommend using something from the already-debugged java.util.concurrent library rather than trying to put something together yourself. I think the above solution would work—but how easy would it be to screw it up again? This would be so much easier:
private final Set<Object> global = Collections.newSetFromMap(new ConcurrentHashMap<Object,Boolean>());
Since the reference is final you don't need to worry about threads using stale references, and since the ConcurrentHashMap handles all the nasty memory model issues internally you don't have to worry about all the nasty details of monitors and memory barriers!
According to the relevant Java Tutorial,
We have already seen that an increment expression, such as c++, does not describe an atomic action. Even very simple expressions can define complex actions that can decompose into other actions. However, there are actions you can specify that are atomic:
Reads and writes are atomic for reference variables and for most primitive variables (all types except long and double).
Reads and writes are atomic for all variables declared volatile (including long and double variables).
This is reaffirmed by Section §17.7 of the Java Language Specification
Writes to and reads of references are always atomic, regardless of whether they are implemented as 32-bit or 64-bit values.
It appears that you can indeed rely on reference access being atomic; however, recognize that this does not ensure that all readers will read an updated value for global after this write -- i.e. there is no memory ordering guarantee here.
If you use an implicit lock via synchronized on all access to global, then you can force some memory consistency here... but it might be better to use an alternative approach.
You also appear to want the collection in global to remain immutable... luckily, there is Collections.unmodifiableSet which you can use to enforce this. As an example, you should likely do something like the following...
private volatile Collection global = Collections.unmodifiableSet(new HashSet());
... that, or using AtomicReference,
private AtomicReference<Collection> global = new AtomicReference<>(Collections.unmodifiableSet(new HashSet()));
You would then use Collections.unmodifiableSet for your modified copies as well.
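To make the write path concrete, here is a sketch of copy-on-write through the AtomicReference variant; the compareAndSet loop also tolerates several writers racing each other, unlike a plain set():

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

private final AtomicReference<Set<Object>> global =
        new AtomicReference<Set<Object>>(Collections.unmodifiableSet(new HashSet<Object>()));

void remove(Object exampleOperand) {
    Set<Object> oldSet;
    Set<Object> newSet;
    do {
        oldSet = global.get();
        Set<Object> copy = new HashSet<Object>(oldSet);
        copy.remove(exampleOperand);
        newSet = Collections.unmodifiableSet(copy);
    } while (!global.compareAndSet(oldSet, newSet)); // retry if another writer got in first
}

// Readers: Set<Object> snapshot = global.get(); // always a complete, immutable set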
// ... All reads are done through a reference copy like:
// Collection copy = global;
// for (Object elm: copy) {...
// so the global reference being updated half way through should have no impact
You should know that making a copy here is redundant, as internally for (Object elm : global) creates an Iterator as follows...
final Iterator it = global.iterator();
while (it.hasNext()) {
    Object elm = it.next();
}
There is therefore no chance of switching to an entirely different value for global in the midst of reading.
All that aside, I agree with the sentiment expressed by DaoWen... is there any reason you're rolling your own data structure here when there may be an alternative available in java.util.concurrent? I figured maybe you're dealing with an older Java, since you use raw types, but it won't hurt to ask.
You can find copy-on-write collection semantics provided by CopyOnWriteArrayList, or its cousin CopyOnWriteArraySet (which implements a Set using the former).
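For example, a minimal CopyOnWriteArraySet sketch; it suits workloads where reads vastly outnumber writes, because every mutation copies the backing array:

import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

Set<String> listeners = new CopyOnWriteArraySet<String>();

listeners.add("a");     // each write makes a fresh copy of the underlying array
listeners.remove("a");

// Iteration works on a stable snapshot and never throws ConcurrentModificationException:
for (String listener : listeners) {
    // ...
}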
Also suggested by DaoWen, have you considered using a ConcurrentHashMap? They guarantee that using a for loop as you've done in your example will be consistent.
Similarly, Iterators and Enumerations return elements reflecting the state of the hash table at some point at or since the creation of the iterator/enumeration.
Internally, an Iterator is used for enhanced for over an Iterable.
You can craft a Set from this by utilizing Collections.newSetFromMap like follows:
final Set<E> safeSet = Collections.newSetFromMap(new ConcurrentHashMap<E, Boolean>());
...
/* guaranteed to reflect the state of the set at read-time */
for (final E elem : safeSet) {
...
}
I think your original idea was sound, and DaoWen did a good job getting the bugs out. Unless you can find something that does everything for you, it's better to understand these things than hope some magical class will do it for you. Magical classes can make your life easier and reduce the number of mistakes, but you do want to understand what they are doing.
ConcurrentSkipListSet might do a better job for you here. It could get rid of all your multithreading problems.
However, it is slower than a HashSet (usually; HashSets and SkipLists/Trees are hard to compare). If you are doing a lot of reads for every write, what you've got will be faster. More importantly, if you update more than one entry at a time, your reads could see inconsistent results. If you expect that whenever there is an entry A there is an entry B, and vice versa, the skip list could give you one without the other.
With your current solution, to the readers, the contents of the map are always internally consistent. A read can be sure there's an A for every B. It can be sure that the size() method gives the precise number of elements that will be returned by the iterator. Two iterations will return the same elements in the same order.
In other words, allUpdatesGoThroughHere and ConcurrentSkipListSet are two good solutions to two different problems.
Can you use the Collections.synchronizedSet method? From HashSet Javadoc http://docs.oracle.com/javase/6/docs/api/java/util/HashSet.html
Set s = Collections.synchronizedSet(new HashSet(...));
Replace the synchronized block by making global volatile and you'll be all right as far as the copy-on-write goes.
Although the assignment is atomic, in other threads it is not ordered with the writes to the object referenced. There needs to be a happens-before relationship which you get with a volatile or synchronising both reads and writes.
The problem of multiple updates happening at once is separate - use a single thread or whatever you want to do there.
If you used synchronized for both reads and writes then it'd be correct, but the performance may not be great, with reads needing to hand off. A ReadWriteLock may be appropriate, but you'd still have writes blocking reads.
Another approach to the publication issue is to use final field semantics to create an object that is (in theory) safe to be published unsafely.
Of course, there are also concurrent collections available.
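A sketch of the ReadWriteLock approach mentioned above, in case it helps: readers proceed in parallel, and writers take the exclusive lock (names are illustrative):

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

private final Set<Object> global = new HashSet<Object>();
private final ReadWriteLock lock = new ReentrantReadWriteLock();

boolean contains(Object o) {
    lock.readLock().lock();      // many readers may hold the read lock at once
    try {
        return global.contains(o);
    } finally {
        lock.readLock().unlock();
    }
}

void remove(Object o) {
    lock.writeLock().lock();     // exclusive: blocks readers and other writers
    try {
        global.remove(o);
    } finally {
        lock.writeLock().unlock();
    }
}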

Is this thread-safe?

I want to make my class thread-safe without large overhead.
The instances will be seldom used concurrently, but it may happen.
Most of the class is immutable, there's only one mutable member used as a cache:
private volatile SoftReference<Map<String, Something>> cache
= new SoftReference(null);
which gets assigned in the constructor (not shared) like
Map<String, Something> tmp = new HashMap<String, Something>();
tmp.put("a", new Something("a"));
tmp.put("b", new Something("b"));
cache = new SoftReference<Map<String, Something>>(tmp);
After the assignment, the map is never modified.
It's no problem when two threads compute the cache in parallel, since the value will be the same.
The additional overhead of the work being done twice is acceptable.
If a thread didn't see the value computed by another thread, it would compute it unnecessarily, and this is acceptable.
This shouldn't happen anyway, because of volatile.
When a thread sees the value computed by another thread, it's fine.
The only possible problem would be a thread seeing inconsistent state (e.g. a partly filled map).
Can this happen?
Notes:
I really want the whole map being softly referenced, there's no use for a map using soft keys or values here.
I know about ConcurrentHashMap and will maybe use it anyway, but I'm curious, if using volatile only works.
The only possible problem would be a thread seeing inconsistent state (e.g. a partly filled map). Can this happen?
No. Actions performed within a thread must be performed as if they had been executed in order. Writing a volatile variable happens-before any read of that value. Hence, initialization of the map happens-before any thread reading the reference to the map from the field.
The problem with using a soft reference is that you can lose the whole map/cache after a GC. This means the performance of your application can be hit very hard. You are better off using a cache with an eviction policy so that you never have this problem.
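As one possibility (a sketch only; Something is the value type from the question): a size-bounded LRU cache built on LinkedHashMap's removeEldestEntry hook, wrapped with Collections.synchronizedMap so it can be shared between threads:

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true);      // accessOrder = true gives least-recently-used ordering
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;  // evict the least recently used entry on overflow
    }
}

Map<String, Something> cache =
        Collections.synchronizedMap(new LruCache<String, Something>(100));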
The volatile doesn't make any operation safe here.
You haven't shown all your code, perhaps we could offer some suggestion on how you could improve your code e.g. your sample code should compile ;)
