Java concurrency: CopyOnWriteArrayList behavior

I need to store some objects in a database.
First of all,
I keep them in memory (in a collection).
When one of them has been stored correctly in the database, I remove it from the collection.
So:
public class AuditService {
    private CopyOnWriteArrayList<Audit> copyWrite;
    public void flush(Audit... audits) {
        Collection<Audit> auditCollection = Arrays.asList(audits);
        this.copyWrite.addAll(auditCollection);
        this.copyWrite.forEach(audit -> {
            // save audit object on database
            this.copyWrite.remove(audit);
        });
    }
}
It has to be thread-safe: AuditService is a singleton class, and several threads can reach the flush method at the same time.
My question is:
How exactly does CopyOnWriteArrayList work in order to handle concurrency?
Is this code correct?

CopyOnWriteArrayList offers thread safety by copying the underlying array when data changes. Mutator operations like addAll() in your example are synchronized internally by CopyOnWriteArrayList.
However, your code makes little sense, since the copyWrite field is not accessed outside of the flush() method. Local method variables are thread-safe, so your code can be simplified to:
public void flush(Audit... audits) {
    for (Audit a : audits) {
        // save audit object on database
    }
}
The problem is what happens if an Audit object gets modified. Hopefully you made them immutable as it makes little sense to change Audit events.
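To make the copy-on-write behavior concrete, here is a minimal single-threaded sketch. The class name CowSnapshotDemo and the drain helper are made up for illustration. It shows that forEach on a CopyOnWriteArrayList walks the array as it was when the call began, so removing elements inside the loop neither throws ConcurrentModificationException nor skips elements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical demo class; not part of the question's code.
class CowSnapshotDemo {

    /** Iterates over the list, removing each element as it is visited. */
    static List<String> drain(CopyOnWriteArrayList<String> list) {
        List<String> seen = new ArrayList<>();
        // forEach reads the backing array once, when the call starts.
        // Each remove() swaps in a fresh copy of the array, which this
        // iteration never looks at.
        list.forEach(item -> {
            seen.add(item);
            list.remove(item);
        });
        return seen;
    }

    public static void main(String[] args) {
        CopyOnWriteArrayList<String> list =
                new CopyOnWriteArrayList<>(List.of("a", "b", "c"));
        List<String> seen = drain(list);
        System.out.println(seen + " remaining=" + list.size());
    }
}
```

Note what this implies for the original flush method: if several threads call it at once on a shared list, each forEach walks every element in the list at that moment, including audits added by other threads, so the same audit could be saved more than once.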

Related

When and how should I use additional synchronization of ConcurrentHashMap?

I need to know when I should add some synchronization block to my code when using ConcurrentHashMap. Let's say I have a method like:
private static final ConcurrentMap<String, MyObjectWrapper> myObjectsCache = new ConcurrentHashMap<>(CACHE_INITIAL_CAPACITY);
public List<MyObject> aMethod(List<String> ids, boolean b) {
    List<MyObject> result = new ArrayList<>(ids.size());
    for (String id : ids) {
        if (id == null) {
            continue;
        }
        MyObjectWrapper myObjectWrapper = myObjectsCache.get(id);
        if (myObjectWrapper == null) {
            continue;
        }
        if (myObjectWrapper.getObject() instanceof MyObjectSub) {
            ((MyObjectSub) myObjectWrapper.getObject()).clearAField();
            myObjectWrapper.getObject().setTime(System.currentTimeMillis());
        }
        result.add(myObjectWrapper.getObject());
        if (b) {
            final MyObject obj = new MyObject(myObjectWrapper.getObject());
            addObjectToDb(obj);
        }
    }
    return result;
}
How should I efficiently make this method concurrent?
I think that the "get" is safe, but once I get the value from the cache and update the cached object's fields, there can be problems because another thread could get the same wrapper and try to update the same underlying object. Should I add synchronization? And if so, should I synchronize from the "get" to the end of the loop iteration, or the entire loop?
Maybe someone could share some more specific guidelines of proper and efficient use of ConcurrentHashMap when some more operations need to be done on the map keys/values inside loops etc...
I would be really grateful.
EDIT:
Some context for the question:
I'm currently refactoring some DAO classes in production code, and a few of the classes used HashMaps for caching data retrieved from the database. All methods that used the cache (for writes or reads) had their entire content inside a synchronized(cache) block (playing safe?). I don't have much experience with concurrency and I really want to use this opportunity to learn. I naively changed the HashMaps to ConcurrentHashMaps and now want to remove the synchronized blocks where they're not necessary. All caches are used for writes and reads. The presented method is based on one of the methods that I've changed, and now I'm trying to learn when and to what extent to synchronize. The method clearAField just changes the value of one of the fields of the wrapped POJO, and addObjectToDb tries to add the object to the database.
Other example would be refilling of the cache:
public void findAll() throws SQLException {
    // get data from database into a list
    List<Data> data = getAllDataFromDatabase();
    cacheCHM.clear();
    cacheCHM.putAll(data);
}
In which case I should put the clear and putAll inside a synchronized(cacheCHM) block, right?
I've tried to find and read some posts/articles about the proper and efficient usage of CHM but most deal with single operations, without loops etc.... The best I've found would be:
http://www.javamadesoeasy.com/2015/04/concurrenthashmap-in-java.html
You've not mentioned what concurrency you expect to happen within your app, so I'm going to assume you have multiple threads calling aMethod, and nothing else.
You only have a single call to the ConcurrentHashMap: myObjectsCache.get(id), and this is fine. In fact, since nothing is writing data into your cache [see assumption above], you don't even need a ConcurrentHashMap! You'd be fine with any immutable collection. You have a suspicious line at the end: addObjectToDb(obj). Does this method also affect your cache? If so, it's still safe (probably; we'd have to see the method to be certain), but then you definitely need the ConcurrentHashMap.
The danger is where you change the objects, here:
myObjectWrapper.getObject().clearAField();
myObjectWrapper.getObject().setTime(System.currentTimeMillis());
It's possible for multiple threads to call these methods on the same object at the same time. Without knowing what these methods do, we can't say if this is safe or not. If these methods are both marked synchronised, or if you took care to ensure that it was safe for these methods to run concurrently then you're fine (but beware there's scope for these methods to run in different orders to what you might intuitively expect!). If you weren't so careful then there is a potential for data corruption.
A better approach to thread safety and caches is to use immutable objects. Here's what the MyObjectSub class might look like if it were immutable [not sure why you need the wrapper; I'd omit that completely if possible]:
//Provided by way of example. You should consider generating these
//using http://immutables.github.io/ or similar
public class MyImmutableObject {
    //If all members are final primitives or immutable objects
    //then this class is threadsafe.
    final String field;
    final long time;
    public MyImmutableObject(String field, long time) {
        this.field = field;
        this.time = time;
    }
    public MyImmutableObject clearField() {
        //Since final fields can never be changed, our only option is to
        //return a copy.
        return new MyImmutableObject("", this.time);
    }
    public MyImmutableObject setTime(long newtime) {
        return new MyImmutableObject(this.field, newtime);
    }
}
If your objects are immutable then thread safety is a lot simpler. Your method would look something like this:
public List<Result> typicialCacheUsage(String key) {
    MyImmutableObject obj = myObjectsCache.get(key);
    obj = obj.clearField();
    obj = obj.setTime(System.currentTimeMillis());
    //If you need to put the object back in the cache you can do this:
    myObjectsCache.put(key, obj);
    List<Result> res = generateResultFromObject(obj);
    return res;
}
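One caveat with the get-then-put sequence above: another thread can replace the entry between the two calls, and the later put would silently overwrite it. A possible way around that, sketched here under assumptions, is to let the map perform the read-modify-write itself via ConcurrentHashMap.computeIfPresent. The class ImmutableCacheDemo, its nested Entry type (a stand-in for MyImmutableObject) and the touch method are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch; names are illustrative, not from the question.
class ImmutableCacheDemo {

    /** Immutable value type standing in for MyImmutableObject. */
    static final class Entry {
        final String field;
        final long time;

        Entry(String field, long time) {
            this.field = field;
            this.time = time;
        }

        Entry clearField() {
            return new Entry("", time);
        }

        Entry withTime(long newTime) {
            return new Entry(field, newTime);
        }
    }

    private final ConcurrentMap<String, Entry> cache = new ConcurrentHashMap<>();

    void put(String key, Entry entry) {
        cache.put(key, entry);
    }

    /**
     * Replaces the cached entry with a cleared, re-stamped copy.
     * Returns the new entry, or null if the key is not cached.
     */
    Entry touch(String key, long now) {
        // computeIfPresent applies the function atomically for this key,
        // so no other update to the same key can slip in between the
        // read of the old value and the write of the new one.
        return cache.computeIfPresent(key, (k, old) -> old.clearField().withTime(now));
    }
}
```

The remapping function should stay short and must not touch other keys of the same map, since other updates to that key wait while it runs.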

Is this HashMap usage thread safe?

I have a static HashMap which will cache objects identified by unique integers; it will be accessed from multiple threads. I will have multiple instances of the type HashmapUser running in different threads, each of which will want to utilize the same HashMap (which is why it's static).
Generally, the HashmapUsers will be retrieving from the HashMap. If it is empty, though, it needs to be populated from a database. Also, in some cases the HashMap will be cleared because the data has changed and it needs to be repopulated.
So, I just made all interactions with the map synchronized. But I'm not positive that this is safe, smart, or that it works for a static variable.
Is the below implementation of this thread safe? Any suggestions to simplify or otherwise improve it?
public class HashmapUser {
    private static HashMap<Integer, AType> theMap = new HashMap<>();
    public HashmapUser() {
        //....
    }
    public void performTask(boolean needsRefresh, Integer id) {
        //....
        AType x = getAtype(needsRefresh, id);
        //....
    }
    private synchronized AType getAtype(boolean needsRefresh, Integer id) {
        if (needsRefresh) {
            theMap.clear();
        }
        if (theMap.size() == 0) {
            // populate the map
        }
        return theMap.get(id);
    }
}
As it is, it is definitely not thread-safe. Each instance of HashmapUser will use a different lock (this), which does nothing useful. You have to synchronize on the same object, such as the HashMap itself.
Change getAtype to:
private AType getAtype(boolean needsRefresh, Integer id) {
    synchronized (theMap) {
        if (needsRefresh) {
            theMap.clear();
        }
        if (theMap.size() == 0) {
            // populate the map
        }
        return theMap.get(id);
    }
}
Edit:
Note that you can synchronize on any object, provided that all instances use the same object for synchronization. You could synchronize on HashmapUser.class, which also allows other objects to lock access to the map (though it is typically best practice to use a private lock).
Because of this, simply making your getAtype method static would work, since the implied lock would then be HashmapUser.class instead of this. However, this exposes your lock, which may or may not be what you want.
No, this won't work at all.
If you don't specify a lock object, e.g. by declaring the method synchronized, the implicit lock is the instance, unless the method is static, in which case the lock is the class. Since there are multiple instances, there are also multiple locks, which I doubt is desired.
What you should do is create another class which is the only class with the access to HashMap.
Clients of HashMap, such as the HashMapUser must not even be aware that there is synchronization in place. Instead, thread safety should be assured by the proper class wrapping the HashMap hiding the synchronization from the clients.
This lets you easily add additional clients to the HashMap since synchronization is hidden from them, otherwise you would have to add some kind of synchronization between the different client types too.
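A minimal sketch of such a wrapper follows. The class name SharedCache, the private lock field and the Supplier-based loader are assumptions made for illustration; the real population logic would replace the loader. Clients only call get and never see the lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical wrapper: the only class that touches the HashMap.
class SharedCache<V> {
    // Private lock object, so no outside code can synchronize on it.
    private final Object lock = new Object();
    private final Map<Integer, V> map = new HashMap<>();
    // Stand-in for "populate from the database".
    private final Supplier<Map<Integer, V>> loader;

    SharedCache(Supplier<Map<Integer, V>> loader) {
        this.loader = loader;
    }

    /** Clears, reloads if empty, and reads, all under one lock. */
    V get(boolean needsRefresh, Integer id) {
        synchronized (lock) {
            if (needsRefresh) {
                map.clear();
            }
            if (map.isEmpty()) {
                map.putAll(loader.get());
            }
            return map.get(id);
        }
    }
}
```

A single shared instance (for example held in a static final field) would then be used by every HashmapUser. Because clear, reload and read happen inside one synchronized block, no thread can observe the map half-populated.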
I would suggest you go with either ConcurrentHashMap or a map wrapped with Collections.synchronizedMap.
More info here: http://crunchify.com/hashmap-vs-concurrenthashmap-vs-synchronizedmap-how-a-hashmap-can-be-synchronized-in-java/
ConcurrentHashMap is more suitable for high-concurrency scenarios. This implementation doesn't synchronize on the whole object but does so in an optimised way, so different threads accessing different keys can proceed simultaneously.
The Collections.synchronizedMap wrapper is simpler and synchronizes at the object level, so access to the instance is serialized.
I think you need performance, so I think you should probably go with ConcurrentHashMap.

Java - Is calling a synchronized getter function during a synchronized setter function the right way to manipulate a shared variable?

I have several threads trying to increment a counter for a certain key in a custom data structure that is not thread-safe (which you can imagine to be similar to a HashMap). I was wondering what the right way to increment the counter in this case would be.
Is it sufficient to synchronize the increment function or do I also need to synchronize the get operation?
public class Example {
    private MyDataStructure<Key, Integer> datastructure = new CustomDataStructure<Key, Integer>();
    private class MyThread implements Runnable {
        private synchronized void incrementCnt(Key key) {
            // from the datastructure documentation: if a value already exists for the given key, the
            // previous value will be replaced by this value
            datastructure.put(key, getCnt(key) + 1);
            // or can I do it without using the getCnt() function? like this:
            datastructure.put(key, datastructure.get(key) + 1);
        }
        private synchronized int getCnt(Key key) {
            return datastructure.get(key);
        }
        // run method...
    }
}
If I have two threads t1, t2 for example, I would do something like:
t1.incrementCnt();
t2.incrementCnt();
Can this lead to any kind of deadlock? Is there a better way to solve this?
The main issue with this code is that it is likely to fail to provide synchronized access to datastructure, because the methods synchronize on this of the inner class. That is a different object for each instance of MyThread, so no mutual exclusion will happen.
A more correct way is to make datastructure a final field and then synchronize on it:
private final MyDataStructure<Key, Integer> datastructure = new CustomDataStructure<Key, Integer>();
private class MyThread implements Runnable {
    private void incrementCnt(Key key) {
        synchronized (datastructure) {
            // get and put run under the same lock, so getCnt() is not needed
            datastructure.put(key, datastructure.get(key) + 1);
        }
    }
    // run method...
}
As long as all data access is done using synchronized (datastructure), code is thread-safe and it's safe to just use datastructure.get(...). There should be no dead-locks, since deadlocks can occur only when there's more than one lock to compete for.
As the other answer told you, you should synchronize on your data structure, rather than on the thread/runnable object. It is a common mistake to try to use synchronized methods in the thread or runnable object. Synchronization locks are instance-based, not class-based (unless the method is static), and when you are running multiple threads, this means that there are actually multiple thread instances.
It's less clear-cut about Runnables: you could be using a single instance of your Runnable class with several threads. So in principle you could synchronize on it. But I still think it's bad form because in the future you may want to create more than one instance of it, and get a really nasty bug.
So the general best practice is to synchronize on the actual item that you are accessing.
Furthermore, the design conundrum of whether or not to use two methods should be solved by moving the whole thing into the data structure itself, if you can do so (if the class source is under your control). This is an operation that is confined to the data structure and applies only to it, and doing the increment outside of it is not good encapsulation. If your data structure exposes a synchronized incrementCnt method, then:
It synchronizes on itself, which is what you wanted.
It can use its own private fields directly, which means you don't actually need to call a getter and a setter.
It is free to have the implementation changed to one of the atomic structures in the future if it becomes possible, or add other implementation details (such as logging increment operations separately from setter access operations).
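As a rough illustration of that encapsulated design, the sketch below puts the increment inside the data structure. CounterStore is a hypothetical stand-in for the custom structure, backed here by a plain HashMap, and runDemo is only a small driver to exercise it from several threads.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the custom data structure.
class CounterStore<K> {
    private final Map<K, Integer> counts = new HashMap<>();

    /** Read-modify-write happens under the structure's own lock. */
    synchronized void increment(K key) {
        counts.merge(key, 1, Integer::sum);
    }

    synchronized int get(K key) {
        return counts.getOrDefault(key, 0);
    }

    /** Starts the given number of threads, each incrementing one shared key. */
    static int runDemo(int threads, int incrementsPerThread) {
        CounterStore<String> store = new CounterStore<>();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    store.increment("hits");
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return store.get("hits");
    }
}
```

Because every thread shares the one CounterStore instance, the synchronized methods all lock the same object, which is exactly what the per-thread synchronized methods in the question failed to do.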

achieving synchronized addAll to a list in java

Updated the question; please check the second part of the question.
I need to build up a master list of book ids. I have multiple threaded tasks, each of which brings up a subset of book ids. As soon as each task's execution is completed, I need to add its ids to the super list of book ids. Hence I am planning to pass an instance of the aggregator class below to all of my execution tasks and have them call the updateBookIds() method. To ensure it's thread-safe, I have kept the addAll code in a synchronized block.
Can anyone tell me whether this is the same as a synchronized list? Can I just say Collections.newSynchronizedList and call addAll to that list from all thread tasks? Please clarify.
public class SynchronizedBookIdsAggregator {
    private List<String> bookIds;
    public SynchronizedBookIdsAggregator(){
        bookIds = new ArrayList<String>();
    }
    public void updateBookIds(List<String> ids){
        synchronized (this) {
            bookIds.addAll(ids);
        }
    }
    public List<String> getBookIds() {
        return bookIds;
    }
    public void setBookIds(List<String> bookIds) {
        this.bookIds = bookIds;
    }
}
Thanks,
Harish
Second Approach
So after below discussions, I am currently planning to go with below approach. Please let me know if I am doing anything wrong here:-
public class BooksManager{
    private static Logger logger = LoggerFactory.getLogger();
    private List<String> fetchMasterListOfBookIds(){
        List<String> masterBookIds = Collections.synchronizedList(new ArrayList<String>());
        List<String> libraryCodes = getAllLibraries();
        ExecutorService libraryBookIdsExecutor = Executors.newFixedThreadPool(BookManagerConstants.LIBRARY_BOOK_IDS_EXECUTOR_POOL_SIZE);
        for(String libraryCode : libraryCodes){
            LibraryBookIdsCollectionTask libraryTask = new LibraryBookIdsCollectionTask(libraryCode, masterBookIds);
            libraryBookIdsExecutor.execute(libraryTask);
        }
        libraryBookIdsExecutor.shutdown();
        //Now the fetching of master list is complete.
        //So I will just continue my processing of the master list
    }
}
public class LibraryBookIdsCollectionTask implements Runnable {
    private String libraryCode;
    private List<String> masterBookIds;
    public LibraryBookIdsCollectionTask(String libraryCode,List<String> masterBookIds){
        this.libraryCode = libraryCode;
        this.masterBookIds = masterBookIds;
    }
    public void run(){
        List<String> bookids = new ArrayList<String>();//TODO get this list from iconnect call
        synchronized (masterBookIds) {
            masterBookIds.addAll(bookids);
        }
    }
}
Thanks,
Harish
Can I just say Collections.newSynchronizedList and call addAll to that list from all thread tasks?
If you're referring to Collections.synchronizedList, then yes, that would work fine. That will give you an object that implements the List interface where all of the methods from that interface are synchronized, including addAll.
Consider sticking with what you have, though, since it's arguably a cleaner design. If you pass the raw List to your tasks, then they get access to all of the methods on that interface, whereas all they really need to know is that there's an addAll method. Using your SynchronizedBookIdsAggregator keeps your tasks decoupled from design dependence on the List interface, and removes the temptation for them to call something other than addAll.
In cases like this, I tend to look for a Sink interface of some sort, but there never seems to be one around when I need it...
The code you have implemented does not create a synchronization point for someone who accesses the list via getBookIds(), which means they could see inconsistent data. Furthermore, someone who has retrieved the list via getBookIds() must perform external synchronization before accessing the list. Your question also doesn't show how you are actually using the SynchronizedBookIdsAggregator class, which leaves us with not enough information to fully answer your question.
Below would be a safer version of the class:
public class SynchronizedBookIdsAggregator {
    private List<String> bookIds;
    public SynchronizedBookIdsAggregator() {
        bookIds = new ArrayList<String>();
    }
    public void updateBookIds(List<String> ids){
        synchronized (this) {
            bookIds.addAll(ids);
        }
    }
    public List<String> getBookIds() {
        // synchronized here for memory visibility of the bookIds field
        synchronized(this) {
            return bookIds;
        }
    }
    public void setBookIds(List<String> bookIds) {
        // synchronized here for memory visibility of the bookIds field
        synchronized(this) {
            this.bookIds = bookIds;
        }
    }
}
As alluded to earlier, the above code still has a potential problem with some thread accessing the ArrayList after it has been retrieved by getBookIds(). Since the ArrayList itself is not synchronized, accessing it after retrieving it should be synchronized on the chosen guard object:
public class SomeOtherClass {
    public void run() {
        SynchronizedBookIdsAggregator aggregator = getAggregator();
        List<String> bookIds = aggregator.getBookIds();
        // Access to the bookIds list must happen while synchronized on the
        // chosen guard object -- in this case, aggregator
        synchronized(aggregator) {
            <work with the bookIds list>
        }
    }
}
I can imagine using Collections.synchronizedList as part of the design of this aggregator, but it is not a panacea. Concurrency design really requires an understanding of the underlying concerns, more than "picking the right tool / collection for the job" (although the latter is not unimportant).
Another potential option to look at is CopyOnWriteArrayList.
As skaffman alluded to, it might be better to not allow direct access to the bookIds list at all (e.g., remove the getter and setter). If you enforce that all access to the list must run through methods written in SynchronizedBookIdsAggregator, then SynchronizedBookIdsAggregator can enforce all concurrency control of the list. As my answer above indicates, allowing consumers of the aggregator to use a "getter" to get the list creates a problem for the user of that list: to write correct code they must have knowledge of the synchronization strategy / guard object, and furthermore they must also use that knowledge to actively synchronize externally and correctly.
Regarding your second approach. What you have shown looks technically correct (good!).
But, presumably you are going to read from masterBookIds at some point, too? And you don't show or describe that part of the program! So when you start thinking about when and how you are going to read masterBookIds (i.e. the return value of fetchMasterListOfBookIds()), just remember to consider concurrency concerns there too! :)
If you make sure all tasks/worker threads have finished before you start reading masterBookIds, you shouldn't have to do anything special.
But, at least in the code you have shown, you aren't ensuring that.
Note that libraryBookIdsExecutor.shutdown() returns immediately. So if you start using the masterBookIds list immediately after fetchMasterListOfBookIds() returns, you will be reading masterBookIds while your worker threads are actively writing data to it, and this entails some extra considerations.
Maybe this is what you want -- maybe you want to read the collection while it is being written to, to show realtime results or something. But then you must consider synchronizing properly on the collection if you want to iterate over it while it is being written to.
If you would just like to make sure all writes to masterBookIds by worker threads have completed before fetchMasterListOfBookIds() returns, you could use ExecutorService.awaitTermination (in combination with .shutdown(), which you are already calling).
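A minimal sketch of that shutdown-then-wait pattern is below. The class MasterListDemo, the pool size of 4, the one-minute timeout and the fake per-library ids are all assumptions for illustration; the real tasks would do the actual lookups.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of waiting for all workers before reading the list.
class MasterListDemo {

    static List<String> fetchMasterList(List<String> libraryCodes) {
        List<String> master = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String code : libraryCodes) {
            // Stand-in for the real lookup: each task contributes two ids.
            pool.execute(() -> master.addAll(List.of(code + "-1", code + "-2")));
        }
        // shutdown() only stops new submissions; it does not wait.
        pool.shutdown();
        try {
            // Block until every submitted task has finished, or time out.
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return master;
    }
}
```

Once awaitTermination returns true, no worker is still writing, so the caller can read the returned list without further coordination. The order of ids across libraries is still not deterministic.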
Collections.SynchronizedList (which is the wrapper type you'd get) would synchronize almost every method on either itself or a mutex object you pass to the constructor (or Collections.synchronizedList(...) ). Thus it would basically be the same as your approach.
All the methods called through the wrapper returned by Collections.synchronizedList() are synchronized. This means that the wrapper's addAll method, which delegates to the normal List, looks something like this:
public boolean addAll(Collection<? extends E> c) { synchronized (mutex) { return list.addAll(c); } }
So, every method call for the list (using the reference returned and not the original reference) will be synchronized.
However, there is no synchronization between different method calls.
Consider following code snippet :-
List<String> l = Collections.synchronizedList(new ArrayList<String>());
l.add("Hello");
l.add("World");
If multiple threads run this code, it is quite possible that after Thread A has added "Hello", Thread B starts and adds both "Hello" and "World" to the list, and only then does Thread A resume. The list would then contain ["Hello", "Hello", "World", "World"] instead of the expected ["Hello", "World", "Hello", "World"]. This is just an example to show that the list is not thread-safe across separate method calls. If we want the above code to produce the desired result, it should be inside a synchronized block that locks on the list (or this).
However, with your design there is only one method call, so it is the same as using Collections.synchronizedList().
Moreover, as Mike Clark rightly pointed out, you should also synchronize getBookIds() and setBookIds(). Synchronizing on the List itself would be clearer, since it is like locking the list before operating on it and unlocking it afterwards, so that nothing in between can use the List.
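The "Hello"/"World" interleaving described in this answer can be avoided by holding the list's own lock across both calls. The sketch below is illustrative only: PairAppender and its methods are made-up names, and pairsStayAdjacent is just a small multi-threaded check.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a compound operation on a synchronized list.
class PairAppender {
    private final List<String> list = Collections.synchronizedList(new ArrayList<>());

    /** Adds both elements as one unit; no other thread can add in between. */
    void appendPair(String first, String second) {
        // The wrapper returned by synchronizedList locks on itself, so
        // holding that same lock here makes the two add() calls atomic.
        synchronized (list) {
            list.add(first);
            list.add(second);
        }
    }

    /** Copies the list under the lock, since iteration also needs it. */
    List<String> snapshot() {
        synchronized (list) {
            return new ArrayList<>(list);
        }
    }

    /** Runs several threads and checks that every pair stayed together. */
    static boolean pairsStayAdjacent(int threads, int pairsPerThread) {
        PairAppender appender = new PairAppender();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < pairsPerThread; j++) {
                    appender.appendPair("Hello", "World");
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        List<String> result = appender.snapshot();
        if (result.size() != threads * pairsPerThread * 2) {
            return false;
        }
        for (int i = 0; i < result.size(); i += 2) {
            if (!result.get(i).equals("Hello") || !result.get(i + 1).equals("World")) {
                return false;
            }
        }
        return true;
    }
}
```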

Sharing Static Data Outside of the Class

First, here is a motivating example:
public class Algorithm
{
    public static void compute(Data data)
    {
        List<Task> tasks = new LinkedList<Task>();
        Client client = new Client();
        int totalTasks = 10;
        for(int i = 0; i < totalTasks; i++)
            tasks.add(new Task(data));
        client.submit(tasks);
    }
}
// AbstractTask implements Serializable
public class Task extends AbstractTask
{
    private final Data data;
    public Task(Data data)
    {
        this.data = data;
    }
    public void run()
    {
        // Do some stuff with the data.
    }
}
So, I am doing some parallel programming and have a method which creates a large number of tasks. The tasks share the data that they will operate on, but I am having problems giving each task a reference to the data. The problem is, when the tasks are serialized, a copy of the data is made for each task. Now, in this task class, I could make a static reference to the data so that it is only stored once, but doing this doesn't really make much sense in the context of the task class. My idea is to store the object as a static in another external class and have the tasks request the object from the class. This can be done before the tasks are sent, likely, in the compute method in the example posted above. Do you think that this is appropriate? Can anyone offer any alternative solutions or tips regarding the idea suggested? Thanks!
Can you explain more about this serialization situation you're in? How do the Tasks report a result, and where does it go -- do they modify the Data? Do they produce some output? Do all tasks need access to all the Data? Are any of the Tasks written to the same ObjectOutputStream?
Abstractly, I guess I can see two classes of solutions.
If the Tasks don't all need access to all the Data, I would try to give each Task only the data that it needs.
If they do all need all of it, then instead of having the Task contain the Data itself, I would have it contain an ID of some kind that it can use to get the data. How to get just one copy of the Data transferred to each place a Task could run, and give the Task access to it, I'm not sure, without better understanding the overall situation. But I would suggest trying to manage the Data separately.
I'm not sure I fully understand the question, but it sounds to me as though Tasks are actually serialized for later execution.
If this is the case, an important question would be whether all of the Task objects are written to the same ObjectOutputStream. If so, the Data will only be serialized the first time it is encountered. Later "copies" will just reference the same object handle from the stream.
Perhaps one could take advantage of that to avoid static references to the data (which can cause a number of problems in OO design).
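The shared-handle behavior of a single ObjectOutputStream can be checked with a small round trip. The sketch below uses hypothetical stand-ins for Data and Task (nested in a made-up SharedStreamDemo class) and an in-memory byte stream instead of a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical sketch; Data and Task here are simplified stand-ins.
class SharedStreamDemo {

    static class Data implements Serializable {
        private static final long serialVersionUID = 1L;
        final int[] payload = new int[1000];
    }

    static class Task implements Serializable {
        private static final long serialVersionUID = 1L;
        final Data data;

        Task(Data data) {
            this.data = data;
        }
    }

    /** Writes two tasks sharing one Data to one stream, then reads them back. */
    static boolean sharedAfterRoundTrip() {
        try {
            Data data = new Data();
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(new Task(data));
                // The second write emits only a back-reference to data,
                // because this stream has already serialized it.
                out.writeObject(new Task(data));
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                Task first = (Task) in.readObject();
                Task second = (Task) in.readObject();
                // Same object identity after deserialization.
                return first.data == second.data;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

This only helps when the tasks really do travel through the same stream. If each task is serialized on its own stream, each one carries its own copy of the data.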
Edit: The answer below is not actually relevant, due to a misunderstanding about what was being asked. Leaving it here pending more details from the question's author.
This is precisely why the transient keyword was invented.
Declares that an instance field is not part of the default serialized form of an object. When an object is serialized, only the values of its non-transient instance fields are included in the default serial representation. When an object is deserialized, transient fields are initialized only to their default value.
public class Task extends AbstractTask {
    private final transient Data data;
    public Task(Data data) {
        this.data = data;
    }
    public void run() {
        // Do some stuff with the data.
    }
}
Have you considered making a singleton instead of making it static?
My idea is to store the object as a static in another external class and have the tasks request the object from the class.
Forget about this idea. When the tasks are serialized and sent over the network, that object will not be sent; static data is not (and cannot be) shared in any way between JVMs.
Basically, if your Tasks are serialized separately, the only way to share the data is to send it separately, or send it only in one task and somehow have the others acquire it on the receiving machine. This could happen via a static field that the one task that has the data sets and the others query, but of course that requires that one task to be run first. And it could lead to synchronization problems.
But actually, it sounds like you are using some sort of processing queue that assumes tasks to be self-contained. By trying to have them share data, you are going against that concept. How big is your data anyway? Is it really absolutely necessary to share the data?
