Data structure for non-blocking aggregation of Thread values?

Background:
I have a large thread pool in Java; each thread has some internal state.
I would like to gather some global information about the states. To do that I have an associative, commutative aggregation function (e.g. sum, though mine needs to be pluggable).
The solution needs to have fixed memory consumption and be lock-free, in the best case not disturbing the pool at all. So no thread should need to acquire a lock (or enter a synchronized area) when writing to the data structure. The aggregated value is only read after the threads are done, so I don't need an accurate value all the time. Simply collecting all values and aggregating them after the pool is done might lead to memory problems.
The values are going to be more complex data types, so I cannot use AtomicInteger etc.
My general idea for the solution:
Have a lock-free collection that all threads put their updates into. I don't even need the order of the events.
If it gets too big, run the aggregation function on it (compacting it) while the threads continue filling it.
My question:
Is there a data structure that allows for something like that or do I need to implement it from scratch? I couldn't find anything that directly matches my problem. If I have to implement from scratch what would be a good non-blocking collection class to start from?

If the updates are infrequent (relatively speaking) and the aggregation function is fast, I would recommend aggregating every time:
State myState;
AtomicReference<State> combinedState;
// declared outside the loop so the while condition can see them
State original;
State newCombined;
do
{
    original = combinedState.get();
    newCombined = Aggregate(original, myState);
} while (!combinedState.compareAndSet(original, newCombined));
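For reference, here is a minimal, self-contained sketch of the same compare-and-set idea with a pluggable aggregation function. The class name CasAggregator and the demo method are made up for illustration; it assumes the aggregated values are immutable and that the function is associative, commutative and free of side effects, because it may be re-run when a CAS attempt fails. AtomicReference.accumulateAndGet (Java 8+) performs the retry loop internally.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BinaryOperator;

// Hypothetical helper: lock-free aggregation of immutable values via CAS.
final class CasAggregator<T> {
    private final AtomicReference<T> combined;
    // Must be associative, commutative and side-effect free (it may be retried).
    private final BinaryOperator<T> aggregate;

    CasAggregator(T identity, BinaryOperator<T> aggregate) {
        this.combined = new AtomicReference<>(identity);
        this.aggregate = aggregate;
    }

    // Retries internally until the compare-and-set succeeds; never takes a lock.
    void add(T value) {
        combined.accumulateAndGet(value, aggregate);
    }

    T result() {
        return combined.get();
    }

    // Demo: 8 tasks each contribute the numbers 1..1000 to a shared sum.
    static long demo() {
        CasAggregator<Long> sum = new CasAggregator<>(0L, Long::sum);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 8; t++) {
            pool.execute(() -> {
                for (long i = 1; i <= 1000; i++) {
                    sum.add(i);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum.result();
    }
}
```

Memory use stays constant because only one combined value is kept. Under heavy contention the retries can become expensive, which is why this fits best when updates are relatively infrequent.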

I don't quite understand the question but I would, at first sight, suggest an IdentityHashMap where keys are (references to) your thread objects and values are where your thread objects write their statistics.
An IdentityHashMap only relies on reference equality, as such there would never be any conflict between two thread objects; you could pass a reference to that map to each thread (which would then call .get(this) on the map to get a reference to the collecting data structure), which would then collect the data it wants. Otherwise you could just pass a reference to the collecting data structure to the thread object.
Such a map is inherently thread-safe for your use case, as long as you create the key/value pair for each thread before starting it, because no thread object will ever modify the map: they won't even have a reference to it. With some careful management you can even remove entries from this map once a thread is done with its work, even though the map itself is not thread-safe.
When all is done, you have a map whose values contain all the data collected.
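A rough sketch of the "one collecting structure per thread" idea follows. It swaps the IdentityHashMap for a plain array of slots to keep the example short, and the names PerThreadSlots and Slot are invented for illustration. The assumption is that each slot is written by exactly one thread and only read after that thread has finished.

```java
final class PerThreadSlots {
    // Hypothetical mutable per-worker state; only its owning thread writes to it.
    static final class Slot {
        long value;
    }

    static long demo() {
        int workers = 4;
        Slot[] slots = new Slot[workers];
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            Slot slot = new Slot();          // created before the thread starts
            slots[w] = slot;
            threads[w] = new Thread(() -> {
                for (int i = 1; i <= 1000; i++) {
                    slot.value += i;         // no lock: nobody else touches this slot
                }
            });
            threads[w].start();
        }
        long total = 0;
        for (int w = 0; w < workers; w++) {
            try {
                threads[w].join();           // join() makes the worker's writes visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            total += slots[w].value;         // aggregate only after the worker is done
        }
        return total;
    }
}
```

Memory is bounded by the number of workers, and the workers never contend with each other. The trade-off is that the aggregate is only available once the workers have finished.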
Hope this helps... Reading the question again, in any case...

Related

Atomic updates of values in concurrent hash map - how to?

Task is to keep track of some running processes. Keeping that information in memory is just fine, so I'm using a concurrent hash map to store that data:
ConcurrentHashMap<String, ProcessMetaData> RUNNING_PROCESSES = new ConcurrentHashMap<>();
Safely putting new objects into the map works fine. The problem is that the state of those processes changes, so I have to update ProcessMetaData from time to time. I made ProcessMetaData immutable and use ConcurrentHashMap's compute() method to update values, but ProcessMetaData is getting more complicated and keeping it immutable is becoming hard to manage. The question is: as long as I only update ProcessMetaData inside the atomic (as per the Javadoc) compute() method, can the object be mutable while things overall remain thread-safe? Is my assumption correct?

As long as you only access the value within the function passed to compute, modifications made in that function are safe.
This, however, is a pointless theoretical view. The purpose of storing values into a collection or map, is to eventually retrieve and use them. And this is where the problems start.
The compute method returns the result value just like get returns the currently stored value. Once a caller starts using that value, this use may be concurrent to subsequent compute operations on the map. The get method may even retrieve the value while a compute operation is in progress. Allowing non-blocking retrieval operations is one of ConcurrentHashMap’s main features. Therefore, all kinds of race conditions may occur.
So, using a mutable object and modifying an already stored value in compute is only safe when you use the map as write-only memory, which is a far-fetched scenario. It might work when you use a different thread-safe mechanism to ensure that all updates have been completed before starting to read the map, but your use case seems to be different.
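One way to keep the immutable approach manageable is to give the value class copy-style "with" methods, so each compute() call swaps in a new object. The sketch below is only an illustration: Meta is a made-up stand-in for ProcessMetaData, and its fields are assumptions.

```java
import java.util.concurrent.ConcurrentHashMap;

final class ImmutableUpdateDemo {
    // Hypothetical stand-in for ProcessMetaData: all fields final, updates return a copy.
    static final class Meta {
        final String state;
        final int restarts;

        Meta(String state, int restarts) {
            this.state = state;
            this.restarts = restarts;
        }

        Meta withState(String newState) {
            return new Meta(newState, restarts);
        }

        Meta withRestart() {
            return new Meta(state, restarts + 1);
        }
    }

    static String demo() {
        ConcurrentHashMap<String, Meta> running = new ConcurrentHashMap<>();
        running.put("job-1", new Meta("STARTING", 0));

        Meta before = running.get("job-1"); // a reader still holding the old value

        // compute() atomically swaps in a new object; the old one is never modified.
        running.compute("job-1", (id, old) -> old.withState("RUNNING").withRestart());

        Meta after = running.get("job-1");
        return before.state + "/" + before.restarts + " -> " + after.state + "/" + after.restarts;
    }
}
```

A reader that obtained the old object keeps seeing a consistent, unchanged snapshot, which is exactly what a mutable value cannot guarantee.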

Multiple message listeners to single data store. Efficient design

I have a data store that is written to by multiple message listeners. Each of these message listeners can also be in the hundreds of individual threads.
The data store is a PriorityBlockingQueue, as it needs to order the inserted objects by a timestamp. To make checking the queue for items efficient, rather than looping over the queue, a ConcurrentHashMap is used as a form of index.
private Map<String, SLAData> SLADataIndex = new ConcurrentHashMap<String, SLAData>();
private BlockingQueue<SLAData> SLADataQueue;
Question 1 is this an acceptable design or should I just use the single PriorityBlockingQueue.
Each message listener performs an operation, these listeners are scaled up to multiple threads.
Insert Method so it inserts into both.
this.SLADataIndex.put(dataToWrite.getMessageId(), dataToWrite);
this.SLADataQueue.add(dataToWrite);
Update Method
this.SLADataIndex.get(messageId).setNodeId(updatedNodeId);
Delete Method
SLAData data = this.SLADataIndex.get(messageId);
// remove(Object) on a PriorityBlockingQueue is a linear scan, O(n)
this.SLADataQueue.remove(data);
// remove from index
this.SLADataIndex.remove(messageId);
Question Two Using these methods is this the most efficient way? They have wrappers around them via another object for error handling.
Question Three Using a concurrent HashMap and BlockingQueue does this mean these operations are thread safe? I don't need to use a lock object?
Question Four When these methods are called by multiple threads and listeners without any sort of synchronized block, can they be called at the same time by different threads or listeners?

Question 1 is this an acceptable design or should I just use the single PriorityBlockingQueue.
Certainly you should try to use a single Queue. Keeping the two collections in sync is going to require a lot more synchronization complexity and worry in your code.
Why do you need the Map? If it is just to call setNodeId(...) then I would have the processing thread do that itself when it pulls from the Queue.
// processing thread
while (!Thread.currentThread().isInterrupted()) {
    // take() blocks until an element is available (it throws InterruptedException)
    SLAData dataToWrite = queue.take();
    dataToWrite.setNodeId(myNodeId);
    // process data
    ...
}
Question Two Using these methods is this the most efficient way? They have wrappers around them via another object for error handling.
Sure, that seems fine but, again, you will need to do some synchronization locking otherwise you will suffer from race conditions keeping the 2 collections in sync.
Question Three Using a concurrent HashMap and BlockingQueue does this mean these operations are thread safe? I don't need to use a lock object?
Both of those classes (ConcurrentHashMap and the BlockingQueue implementations) are thread-safe, yes. BUT since there are two of them, you can have race conditions where one collection has been updated but the other one has not. Most likely, you will have to use a lock object to ensure that both collections are properly kept in sync.
Question Four When these methods are called by multiple threads and listeners without any sort of synchronized block, can they be called at the same time by different threads or listeners?
That's a tough question to answer without seeing the code in question. For example, someone might be calling Insert(...) and has added the item to the Map but not the queue yet, when another thread calls Delete(...). The item would be found in the Map and removed, but queue.remove() would not find it in the queue since the Insert(...) has not finished in the other thread.
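If both collections are kept, one possible way to avoid the race described above is to guard every compound operation with a single lock. The following is only a sketch under that assumption: IndexedQueue and Item are invented names (Item stands in for SLAData), and it uses a plain HashMap and PriorityQueue because the lock already provides the thread safety. It does not offer a blocking take().

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

final class IndexedQueue {
    // Hypothetical stand-in for SLAData.
    static final class Item {
        final String messageId;
        final long timestamp;

        Item(String messageId, long timestamp) {
            this.messageId = messageId;
            this.timestamp = timestamp;
        }
    }

    private final Object lock = new Object();
    private final Map<String, Item> index = new HashMap<>();
    private final PriorityQueue<Item> queue =
            new PriorityQueue<>((a, b) -> Long.compare(a.timestamp, b.timestamp));

    // Both collections change inside one critical section, so no thread
    // can observe the item in the index but not yet in the queue.
    void insert(Item item) {
        synchronized (lock) {
            index.put(item.messageId, item);
            queue.add(item);
        }
    }

    // Returns true if the item was present and has been removed from both collections.
    boolean delete(String messageId) {
        synchronized (lock) {
            Item item = index.remove(messageId);
            return item != null && queue.remove(item);
        }
    }

    int size() {
        synchronized (lock) {
            return queue.size();
        }
    }
}
```

With this arrangement an insert and a delete for the same message can no longer interleave halfway, at the cost of serializing all access to the store.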

Iteration of ConcurrentHashMap

I was reading about ConcurrentHashMap.
I read that it provides an Iterator that requires no synchronization and even allows the Map to be modified during iteration and thus there will be no ConcurrentModificationException.
I was wondering if this is a good thing as I might not get the element, put into ConcurrentHashMap earlier, during iteration as another thread might have changed it.
Is my thinking correct? If yes, is it good or bad?

I was wondering if this is a good thing as I might not get the element, put into ConcurrentHashMap earlier, during iteration as another thread might have changed it.
I don't think this should be a concern - the same statement is true if you use synchronization and the thread doing the iteration happens to grab the lock and execute its loop prior to the thread that would insert the value.
If you need some sort of coordination between your threads to ensure that some action takes place after (and only after) another action, then you still need to manage this coordination, regardless of the type of Map used.
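To make the iterator behaviour concrete, here is a small sketch (the class and key names are made up). It modifies a ConcurrentHashMap while iterating over it: no ConcurrentModificationException is thrown, the entries that existed when iteration started are each visited once, and the entry added during iteration may or may not be visited.

```java
import java.util.concurrent.ConcurrentHashMap;

final class WeakIterationDemo {
    // Returns {number of keys visited, final map size}.
    static int[] demo() {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
        map.put("a", 1);
        map.put("b", 2);

        int seen = 0;
        for (String key : map.keySet()) {
            // Modifying the map mid-iteration is allowed and does not throw.
            map.put("added-during-iteration", 0);
            seen++;
        }
        // "a" and "b" are always visited; the new key may or may not be.
        return new int[] { seen, map.size() };
    }
}
```

The same loop over a plain HashMap would normally fail with a ConcurrentModificationException on the iteration step after the first put.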

Usually, ConcurrentHashMap's weakly consistent iterator is sufficient. If instead you want a strongly consistent iterator, then you have a couple of options:
The ctrie is a hash array mapped trie that provides constant time snapshots. There is Java source code available for the data structure.
Clojure has a PersistentHashMap that you can use - this lets you iterate over a snapshot of the data.
Use a local database, e.g. HSQLDB, to store the data instead of using a ConcurrentHashMap. Use a composite primary key of key|timestamp, and when you "update" a value, store a new entry with the current timestamp instead. To get an iterator, retrieve a result set with a where timestamp < System.currentTimeMillis() clause, and iterate over the result set.
In either case you're iterating over a snapshot, so you've got a strongly consistent iterator; in the former case you run the risk of running out of memory, while the latter case is a more complex solution.

The whole point of concurrent-anything is that you acknowledge concurrent activity, and don't trust that all access is serialized. With most collections, you cannot expect inter-element consistency without working for it.
If you don't care about seeing the latest data, but want a consistent (but possibly old) view of data, have a look at purely functional structures like Finger Trees.

How to make cache thread safe

I have an instance of an object which performs a very complex operation.
So the first time, I create an instance and save it in my own custom cache.
After that, any thread that finds a ready-made object already present in the cache takes it from the cache, which is good for performance.
I am worried about what happens if two threads have the same instance. Is there a chance that the two threads can corrupt each other?
Map<String, SoftReference<CacheEntry<ClassA>>> AInstances= Collections.synchronizedMap(new HashMap<String, SoftReference<CacheEntry<ClassA>>>());

There are many possible solutions:
Use an existing caching solution like EHcache
Use the Spring framework, which has an easy way to cache the results of a method with a simple @Cacheable annotation
Use one of the thread-safe maps like ConcurrentHashMap
If you know all keys in advance, you can use a lazy init code. Note that everything in this code is there for a reason; change anything in get() and it will break eventually (eventually == "your unit tests will work and it will break after running one year in production without any problem whatsoever").
ConcurrentHashMap is the simplest to set up, and since Java 8 its computeIfAbsent() method is a simple way to say "initialize the value of a key once".
Don't try to implement the caching by yourself; multithreading in Java has become a very complex area with Java 5 and the advent of multi-core CPUs and memory barriers.
[EDIT] yes, this might happen even though the map is synchronized. Example:
SoftReference<...> value = cache.get( key );
if( value == null ) {
    value = computeNewValue( key );
    cache.put( key, value );
}
If two threads run this code at the same time, computeNewValue() will be called twice. The method calls get() and put() are safe - several threads can try to put at the same time and nothing bad will happen, but that doesn't protect you from problems which arise when you call several methods in succession and the state of the map must not change between them.
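On Java 8 or later, one way to close this check-then-act gap is ConcurrentHashMap.computeIfAbsent, which performs the lookup and the insertion as one atomic step. The sketch below is an assumption-laden illustration: ComputeOnceCache and its counter are invented, the SoftReference wrapper from the question is left out for brevity, and computeNewValue is a placeholder for the expensive work.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class ComputeOnceCache {
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    // Counts how often the expensive computation actually ran (for the demo only).
    final AtomicInteger computations = new AtomicInteger();

    // Hypothetical expensive computation.
    private String computeNewValue(String key) {
        computations.incrementAndGet();
        return key.toUpperCase();
    }

    String get(String key) {
        // The mapping function is applied at most once per absent key;
        // other threads asking for the same key wait for the result.
        return cache.computeIfAbsent(key, this::computeNewValue);
    }

    // Demo: 8 threads request the same key at the same time.
    static int demo() {
        ComputeOnceCache cache = new ComputeOnceCache();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> cache.get("report"));
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return cache.computations.get();
    }
}
```

The mapping function should be short and must not update other mappings of the same map, since other updates to the map may be blocked while it runs.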

Assuming you are talking about singletons, simply use the "initialization-on-demand holder idiom" to make sure your "check" works across all JVMs. This will also make sure that all threads requesting the same object concurrently wait until the initialization is over and are only given a valid object instance.
Here I'm assuming you want a single instance of the object. If not, you might want to post some more code.
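For readers unfamiliar with the idiom, a minimal sketch looks like this. ExpensiveService is a made-up class name, and the construction counter exists only to make the example checkable.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ExpensiveService {
    // Demo-only counter showing how many times the constructor ran.
    private static final AtomicInteger CONSTRUCTIONS = new AtomicInteger();

    private ExpensiveService() {
        CONSTRUCTIONS.incrementAndGet(); // stands in for the expensive setup
    }

    // The nested class is not initialized until getInstance() first touches it.
    // The JVM guarantees that class initialization runs once and is thread-safe.
    private static final class Holder {
        static final ExpensiveService INSTANCE = new ExpensiveService();
    }

    static ExpensiveService getInstance() {
        return Holder.INSTANCE;
    }

    static int constructions() {
        return CONSTRUCTIONS.get();
    }
}
```

No explicit locking or volatile field is needed, because the laziness and the thread safety both come from the class-loading rules of the language.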

OK, if I understand your problem correctly, you are worried that two threads changing the state of the shared object will corrupt each other.
The short answer is yes, they will.
If the object is expensive to create but is only needed in a read-only manner, I suggest you make it immutable; this way you get fast access and thread safety at the same time.
If the state should be writable but you don't actually need threads to see each other's updates, you can simply load the object once into an immutable cache and return copies to anyone who asks for the object.
Finally, if your object needs to be writable and shared (for reasons other than it just being expensive to create), then my friend, you need to handle thread safety. I don't know your case, but you should take a look at the synchronized keyword, Locks, and the Java 5 concurrency features such as atomic types. I am sure one of them will satisfy your need, and I sincerely wish that your case is one of the first two :)

If you only have a single instance of the Object, have a quick look at:
Thread-safe cache of one object in java
Otherwise, I can't recommend the Google Guava library enough; in particular, look at the MapMaker class.

Is this thread-safe?

I want to make my class thread-safe without large overhead.
The instances will be seldom used concurrently, but it may happen.
Most of the class is immutable, there's only one mutable member used as a cache:
private volatile SoftReference<Map<String, Something>> cache
= new SoftReference(null);
which gets assigned in the constructor (not shared) like
Map<String, Something> tmp = new HashMap<String, Something>();
tmp.put("a", new Something("a");
tmp.put("b", new Something("b");
cache = new SoftReference(tmp);
After the assignment, the map is never modified.
It's no problem when two threads compute the cache in parallel, since the value will be the same.
The additional overhead of the work being done twice is acceptable.
If a thread didn't see the value computed by another thread, it would compute it unnecessarily, and this is acceptable.
This shouldn't happen because of volatile.
When a thread sees the value computed by another thread, it's fine.
The only possible problem would be a thread seeing inconsistent state (e.g. a partly filled map).
Can this happen?
Notes:
I really want the whole map being softly referenced, there's no use for a map using soft keys or values here.
I know about ConcurrentHashMap and will maybe use it anyway, but I'm curious, if using volatile only works.

The only possible problem would be a thread seeing inconsistent state (e.g. a partly filled map). Can this happen?
No. Actions performed within a thread must be performed as if they had been executed in program order. A write to a volatile variable happens-before every subsequent read of that variable. Hence, the initialization of the map happens-before any thread reads the reference to the map from the field.
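The pattern under discussion can be sketched as follows. This is an illustration rather than the asker's actual class: SoftCache is an invented name, String values replace Something, and the map is wrapped as unmodifiable to make the "never modified after assignment" assumption explicit.

```java
import java.lang.ref.SoftReference;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class SoftCache {
    // volatile: a reader that sees the new reference also sees the fully built map.
    private volatile SoftReference<Map<String, String>> cache = new SoftReference<>(null);

    Map<String, String> get() {
        Map<String, String> map = cache.get(); // volatile read, then soft dereference
        if (map == null) {
            // Either never computed or cleared by the GC: rebuild it.
            // Two threads may both do this; they produce equal maps.
            Map<String, String> tmp = new HashMap<>();
            tmp.put("a", "A");
            tmp.put("b", "B");
            map = Collections.unmodifiableMap(tmp); // fully populated before publication
            cache = new SoftReference<>(map);       // volatile write publishes it
        }
        return map;
    }
}
```

The guarantee depends on the map being completely filled before the volatile write and never changed afterwards; mutating it later would bring the data race back.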

The problem with using a soft reference is that you can lose the whole map/cache after a GC. This means the performance of your application can be hit very hard. You are better off using a cache with an eviction policy so that you never have this problem.
The volatile doesn't make any operation safe here.
You haven't shown all your code, perhaps we could offer some suggestion on how you could improve your code e.g. your sample code should compile ;)
