First, here is a motivating example:
public class Algorithm
{
public static void compute(Data data)
{
List<Task> tasks = new LinkedList<Task>();
Client client = new Client();
int totalTasks = 10;
for(int i = 0; i < totalTasks; i++)
tasks.add(new Task(data));
client.submit(tasks);
}
}
// AbstractTask implements Serializable
public class Task extends AbstractTask
{
private final Data data;
public Task(Data data)
{
this.data = data;
}
public void run()
{
// Do some stuff with the data.
}
}
So, I am doing some parallel programming and have a method which creates a large number of tasks. The tasks share the data that they will operate on, but I am having problems giving each task a reference to the data. The problem is, when the tasks are serialized, a copy of the data is made for each task. Now, in this task class, I could make a static reference to the data so that it is only stored once, but doing this doesn't really make much sense in the context of the task class. My idea is to store the object as a static in another external class and have the tasks request the object from the class. This can be done before the tasks are sent, likely, in the compute method in the example posted above. Do you think that this is appropriate? Can anyone offer any alternative solutions or tips regarding the idea suggested? Thanks!
Can you explain more about this serialization situation you're in? How do the Tasks report a result, and where does it go -- do they modify the Data? Do they produce some output? Do all tasks need access to all the Data? Are any of the Tasks written to the same ObjectOutputStream?
Abstractly, I guess I can see two classes of solutions.
If the Tasks don't all need access to all the Data, I would try to give each Task only the data that it needs.
If they do all need all of it, then instead of having the Task contain the Data itself, I would have it contain an ID of some kind that it can use to get the data. How to get just one copy of the Data transferred to each place a Task could run, and give the Task access to it, I'm not sure, without better understanding the overall situation. But I would suggest trying to manage the Data separately.
I'm not sure I fully understand the question, but it sounds to me as though Tasks are actually serialized for later execution.
If this is the case, an important question would be whether all of the Task objects are written to the same ObjectOutputStream. If so, the Data will only be serialized the first time it is encountered. Later "copies" will just reference the same object handle from the stream.
Perhaps one could take advantage of that to avoid static references to the data (which can cause a number of problems in OO design).
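To illustrate the point about a shared ObjectOutputStream, here is a minimal sketch. The Data and Task classes below are simplified, hypothetical stand-ins for the ones in the question, and an in-memory byte array stands in for whatever transport the real client uses. It only shows the behaviour when every task goes through one stream; tasks written to separate streams would each carry their own copy.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

class SharedStreamDemo {
    // Hypothetical stand-in for the question's Data class.
    static class Data implements Serializable {
        private static final long serialVersionUID = 1L;
        final int[] payload = new int[1000];
    }

    // Hypothetical stand-in for the question's Task class.
    static class Task implements Serializable {
        private static final long serialVersionUID = 1L;
        final Data data;
        Task(Data data) { this.data = data; }
    }

    // Writes `count` tasks that share one Data object to a single stream,
    // then reads them back from a single input stream.
    static List<Task> roundTrip(int count) throws IOException, ClassNotFoundException {
        Data shared = new Data();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (int i = 0; i < count; i++) {
                // After the first write, Data is emitted as a back-reference only.
                out.writeObject(new Task(shared));
            }
        }
        List<Task> result = new ArrayList<>();
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            for (int i = 0; i < count; i++) {
                result.add((Task) in.readObject());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        List<Task> tasks = roundTrip(3);
        // The deserialized tasks refer to one and the same Data instance.
        System.out.println(tasks.get(0).data == tasks.get(2).data); // prints "true"
    }
}
```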
Edit: The answer below is not actually relevant, due to a misunderstanding about what was being asked. Leaving it here pending more details from the question's author.
This is precisely why the transient keyword was invented.
Declares that an instance field is not part of the default serialized form of an object. When an object is serialized, only the values of its non-transient instance fields are included in the default serial representation. When an object is deserialized, transient fields are initialized only to their default value.
public class Task extends AbstractTask {
private final transient Data data;
public Task(Data data) {
this.data = data;
}
public void run() {
// Do some stuff with the data.
}
}
Have you considered making a singleton instead of making it static?
My idea is to store the object as a static in another external class and have the tasks request the object from the class.
Forget about this idea. When the tasks are serialized and sent over the network, that object will not be sent; static data is not, and cannot be, shared in any way between JVMs.
Basically, if your Tasks are serialized separately, the only way to share the data is to send it separately, or send it only in one task and somehow have the others acquire it on the receiving machine. This could happen via a static field that the one task that has the data sets and the others query, but of course that requires that one task to be run first. And it could lead to synchronization problems.
But actually, it sounds like you are using some sort of processing queue that assumes tasks to be self-contained. By trying to have them share data, you are going against that concept. How big is your data anyway? Is it really absolutely necessary to share the data?
Related
I have this simple server-socket application using TCP that allows multiple clients to connect to a server and store and retrieve data from an ArrayList.
Every time a client connection is accepted by the server, a new thread is created to handle the client activity, and since all clients must interact with the same data structure I decided to create a static ArrayList as follows:
public class Repository {
private static List<String> data;
public static synchronized void insert(String value){
if(data == null){
data = new ArrayList<>();
}
data.add(value);
}
public static synchronized List<String> getData() {
if(data == null){
data = new ArrayList<>();
}
return data;
}
}
So, every time a client inserts a value or reads the list, they just call Repository.insert(value) or Repository.getData() from their respective threads.
My questions are:
Is making these methods synchronized enough to make the operations thread-safe?
Are there any architectural or performance issues with this static List approach? I could also create an instance of the List in the server class (the one that accepts the connections) and pass a reference via the constructor to the threads instead of using the static. Is this any better?
Can Collections.synchronizedList() add any value to such a simple task? Is it necessary?
How could I test the thread-safety in this scenario? I tried just creating multiple clients and make them access the data and everything seems to work, but I'm just not convinced... Here is a short snippet of this test:
IntStream.range(0,10).forEach(i->{
Client client = new Client();
client.ConnectToServer();
try {
client.SendMessage("Hello from client "+i);
} catch (IOException e) {
e.printStackTrace();
}});
//assertions on the size of the array
Thanks in advance! :)
Yes, see https://stackoverflow.com/a/2120409/3080094
Yes (see comments further below) and yes. Using one object (a singleton if you like) is preferred over static methods (the latter are, in general, harder to maintain).
Not necessary but preferred: it avoids you making mistakes. Also, instead of:
private static List<String> data;
you can use
private static final List<String> data = Collections.synchronizedList(new ArrayList<>());
which has three benefits: final ensures all threads see this value (thread-safe), you can remove the code for checking on a null-value (which can be error-prone) and you no longer need to use synchronized in your code since the list itself is now synchronized.
Yes, there are ways to improve this to make it more likely you find bugs but when it comes to multi-threading you can never be entirely sure.
About the " architectural or performance issues": every read of the list has to be synchronized so when multiple clients want to read the list, they will all be waiting for a lock to read the entire list. Since you are only inserting at the end and reading the list, you can use a ConcurrentLinkedQueue. This "concurrent" type (i.e. no need to use synchronized - the queue is thread-safe) will not lock when the entire list is read, multiple threads can read the data at the same time. In addition, you should hide the details of your implementation which you can do, for example, with an Iterator:
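import java.util.Iterator;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
public class Repository {
    private final Queue<String> dataq = new ConcurrentLinkedQueue<>();
    // Expose only an iterator so callers do not depend on the queue type.
    public Iterator<String> getData() {
        return dataq.iterator();
    }
}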
About "testing the thread-safety": focus on the code that needs to be thread-safe. Using clients to connect to a server to test if the Repository code is thread-safe is inefficient: most of the test will just be waiting on I/O and not actually using Repository at the same time. Write a dedicated (unit) test just for the Repository class. Keep in mind that your operating system determines when threads start and run (i.e. your code for starting and running threads is just a hint to the operating system), so you will need the test to run for a while (e.g. 30 seconds) and provide some output (logging) to ensure threads are running at the same time. Output to console should only be shown at the end of the test: in Java, output to console (System.out) is synchronized, which in turn can make threads work one after the other (i.e. not at the same time on the class under test).
Finally, you can improve the test using a java.util.concurrent.CountDownLatch to let all threads synchronize before executing the next statements concurrently (this improves the chance of finding race-conditions). That is a bit much for this already long answer to explain, so I'll leave you with a (admittedly complicated) example (focus on how the tableReady variable is used).
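As a simpler illustration of the CountDownLatch idea, here is a minimal sketch. It does not use the question's Repository class; a Collections.synchronizedList stands in for the code under test, and the class and method names are hypothetical. One latch waits until every worker has started, a second releases them all at once, and a third waits for them to finish.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

class LatchStressTest {
    // Starts `threads` workers, releases them together, and returns the
    // number of elements that ended up in the shared list.
    static int runInserts(int threads, int perThread) throws InterruptedException {
        List<String> data = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch ready = new CountDownLatch(threads); // all workers started
        CountDownLatch start = new CountDownLatch(1);       // the starting gun
        CountDownLatch done = new CountDownLatch(threads);  // all workers finished

        for (int t = 0; t < threads; t++) {
            final int id = t;
            new Thread(() -> {
                ready.countDown();
                try {
                    start.await(); // block until every worker is ready
                    for (int i = 0; i < perThread; i++) {
                        data.add(id + ":" + i);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }

        ready.await();     // wait until all workers are parked on `start`
        start.countDown(); // release them at the same moment
        done.await();      // wait for all inserts to complete
        return data.size();
    }

    public static void main(String[] args) throws InterruptedException {
        // With a thread-safe list no insert is lost, so this prints 8000.
        System.out.println(runInserts(8, 1000));
    }
}
```

Swapping the synchronized list for a plain ArrayList in this sketch is a quick way to see the test fail, which gives some confidence that it actually exercises concurrent access.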
I have several threads trying to increment a counter for a certain key in a non-thread-safe custom data structure (which you can imagine to be similar to a HashMap). I was wondering what the right way to increment the counter in this case would be.
Is it sufficient to synchronize the increment function or do I also need to synchronize the get operation?
public class Example {
    private MyDataStructure<Key, Integer> datastructure = new CustomDataStructure<Key, Integer>();
    private class MyThread implements Runnable {
        private synchronized void incrementCnt(Key key) {
            // from the datastructure documentation: if a value already exists for the given key, the
            // previous value will be replaced by this value
            datastructure.put(key, getCnt(key) + 1);
            // or can I do it without using the getCnt() function? like this:
            // datastructure.put(key, datastructure.get(key) + 1);
        }
        private synchronized int getCnt(Key key) {
            return datastructure.get(key);
        }
        // run method...
    }
}
If I have two threads t1, t2 for example, I would do something like:
t1.incrementCnt();
t2.incrementCnt();
Can this lead to any kind of deadlock? Is there a better way to solve this?
The main issue with this code is that it is likely to fail to provide synchronized access to datastructure, because the accessing code synchronizes on this of the inner class. That is a different object for each instance of MyThread, so no mutual exclusion will happen.
A more correct way is to make datastructure a final field and then synchronize on it:
private final MyDataStructure<Key, Integer> datastructure = new CustomDataStructure<Key, Integer>();
private class MyThread implements Runnable {
    private void incrementCnt(Key key) {
        synchronized (datastructure) {
            // get and put run under the same lock, so no separate getCnt() is needed
            datastructure.put(key, datastructure.get(key) + 1);
        }
    }
    // run method...
}
As long as all data access is done using synchronized (datastructure), code is thread-safe and it's safe to just use datastructure.get(...). There should be no dead-locks, since deadlocks can occur only when there's more than one lock to compete for.
As the other answer told you, you should synchronize on your data structure, rather than on the thread/runnable object. It is a common mistake to try to use synchronized methods in the thread or runnable object. Synchronization locks are instance-based, not class-based (unless the method is static), and when you are running multiple threads, this means that there are actually multiple thread instances.
It's less clear-cut about Runnables: you could be using a single instance of your Runnable class with several threads. So in principle you could synchronize on it. But I still think it's bad form because in the future you may want to create more than one instance of it, and get a really nasty bug.
So the general best practice is to synchronize on the actual item that you are accessing.
Furthermore, the design conundrum of whether or not to use two methods should be solved by moving the whole thing into the data structure itself, if you can do so (if the class source is under your control). This is an operation that is confined to the data structure and applies only to it, and doing the increment outside of it is not good encapsulation. If your data structure exposes a synchronized incrementCnt method, then:
It synchronizes on itself, which is what you wanted.
It can use its own private fields directly, which means you don't actually need to call a getter and a setter.
It is free to have the implementation changed to one of the atomic structures in the future if it becomes possible, or add other implementation details (such as logging increment operations separately from setter access operations).
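A minimal sketch of that suggestion follows. Since the custom data structure from the question is not shown, a hypothetical CounterTable class backed by a plain HashMap stands in for it; the increment lives inside the structure and synchronizes on the structure itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the question's custom data structure.
class CounterTable<K> {
    private final Map<K, Integer> counts = new HashMap<>();

    // The read-modify-write happens under the table's own lock.
    public synchronized void increment(K key) {
        counts.merge(key, 1, Integer::sum);
    }

    public synchronized int get(K key) {
        return counts.getOrDefault(key, 0);
    }

    // Small demonstration helper: several threads increment the same key.
    static int hammer(int threads, int perThread) throws InterruptedException {
        CounterTable<String> table = new CounterTable<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    table.increment("hits");
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return table.get("hits");
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(hammer(4, 10_000)); // prints 40000
    }
}
```

Callers never see a getter followed by a setter, so there is no window in which another thread can slip in between the two.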
I have a utility class in Java which accesses a big file system to read files.
Some files are huge, so the Utility class is taking a lot of time to access them and I am facing a performance issue.
I plan to implement multithreading to improve performance, but I am a bit confused as to how to do that. Below is the structure of the Utility class.
public class Utility {
public static void Method1(ArrayList values){
//do some processing
for(int i=0; i< values.size();i++){
ArrayList<String> details= MethodAccessFileSystem();
CreateFileInDir(details);
}
}
public ArrayList<String> MethodAccessFileSystem(){
//Code to access the file system. This is taking hell lot of time.
}
public void CreateFileInDir(ArrayList<String> values){
//Do some processing here.
}
}
I used to call this Utility class from a standalone class using the following syntax:
Utility.Method1(values); //values is an ArrayList.
Now I need to convert the above code into multithreaded code.
I know how to create a thread by extending the Thread class or implementing Runnable.
I have a basic idea about that.
But what I need to know is whether I should convert this whole Utility class to implement Runnable,
or whether parts of the Utility class need to be separated and made into Runnable tasks.
My issue is with the for() loop, as these methods are called in a loop.
If I separate out MethodAccessFileSystem() and make it a task, will this work?
If MethodAccessFileSystem() is taking a long time, will the JVM automatically start another thread if I use a ThreadPoolExecutor to schedule a fixed number of threads?
Do I need to suspend this method, or is that not required because the JVM will take care of it?
The main issue is with the for loop.
In the end, what I need is for the Utility class to be multithreaded while the call to the method stays the same as above:
Utility.Method1(values); //values is an ArrayList.
I am thinking about how I can implement that.
Can you please help me with this and provide your suggestions and feedback on the design changes that need to be made?
Thanks
Vikeng
In my opinion, the chunk of work in your class that fits parallel execution is the loop below.
// do some processing
for (int i = 0; i < values.size(); i++) {
    new Thread(new Runnable() {
        @Override
        public void run() {
            ArrayList<String> details = MethodAccessFileSystem();
            CreateFileInDir(details);
        }
    }).start(); // start() is required; constructing a Thread does not run it
}
Before you make the change make sure that multiple threads will help. Run the method and as best you can check CPU and disk i/o activity. Also check to see if there's any garbage collection going on.
If any of those conditions exist then adding threads really won't help. You'll have to address that specific condition in order to get any throughput improvements.
Having said that the trick to making the code thread safe is to not have any instance variables on the class that are used to hold state during the method execution. For each existing instance variable, you need to decide whether to make it a local variable declared within the method or a method parameter.
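One way to apply this while keeping a single static entry point is a fixed-size thread pool. The sketch below is an assumption-laden illustration, not the asker's code: ParallelUtility, POOL_SIZE, and the bodies of accessFileSystem and createFileInDir are hypothetical placeholders, and the methods take and return values instead of relying on instance fields, so no state is shared between tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelUtility {
    private static final int POOL_SIZE = 4; // assumed value; tune for the actual disk

    // Same call shape as Utility.Method1(values): one static call, work done in parallel.
    static List<String> method1(List<String> values)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String value : values) {
                // Each task only touches its own local data.
                Callable<String> job = () -> createFileInDir(accessFileSystem(value));
                pending.add(pool.submit(job));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get()); // waits for that task to finish
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    // Placeholder for the slow file-system read.
    static List<String> accessFileSystem(String value) {
        return Arrays.asList(value, value.toUpperCase());
    }

    // Placeholder for the file creation step; returns a label instead of writing a file.
    static String createFileInDir(List<String> details) {
        return String.join("-", details);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(method1(Arrays.asList("a", "b", "c"))); // prints [a-A, b-B, c-C]
    }
}
```

The pool does not grow when a task is slow; it simply keeps up to POOL_SIZE tasks running and queues the rest, so nothing needs to be suspended by hand.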
public class ObjectA {
private void foo() {
MutableObject mo = new MutableObject();
Runnable objectB = new ObjectB(mo);
new Thread(objectB).start();
}
}
public class ObjectB implements Runnable {
private MutableObject mo;
public ObjectB(MutableObject mo) {
this.mo = mo;
}
public void run() {
//read some field from mo
}
}
As you can see from the code sample above, I pass a mutable object to a class that implements Runnable and will use the mutable object in another thread. This is dangerous because ObjectA.foo() can still alter the mutable object's state after starting the new thread. What is the preferred way to ensure thread safety here? Should I make copy of the MutableObject when passing it to ObjectB? Should the mutable object ensure proper synchronization internally? I've come across this many times before, especially when trying to use SwingWorker in a number of GUI applications. I usually try to make sure that ONLY immutable object references are passed to a class that will use them in another thread, but sometimes this can be difficult.
This is a hard question, and the answer, unfortunately, is 'it depends'. You have three choices when it comes to thread-safety of your class:
Make it Immutable, then you don't have to worry. But this isn't what you're asking.
Make it thread-safe. That is, provide enough concurrency control internal to the class that client code doesn't have to worry about concurrent threads modifying the object.
Make it not-thread safe, and force client code to have some kind of external synchronization.
You're essentially asking whether you should use #2 or #3. You are worried about the case where another developer uses the class and doesn't know that it requires external synchronization. I like using the JCIP annotations @ThreadSafe, @Immutable and @NotThreadSafe as a way to document the concurrency intentions. This isn't bullet-proof, as developers still have to read the documentation, but if everyone on the team understands these annotations and consistently applies them, it does make things clearer.
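For your example, if you want to make the class not thread-safe, you could use AtomicReference to make the sharing explicit. Note that AtomicReference only makes reads and writes of the reference itself atomic and visible; it does not synchronize access to the fields of the MutableObject it holds.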
public class ObjectA {
private void foo() {
MutableObject mo = new MutableObject();
Runnable objectB = new ObjectB(new AtomicReference<>( mo ) );
new Thread(objectB).start();
}
}
public class ObjectB implements Runnable {
private AtomicReference<MutableObject> mo;
public ObjectB(AtomicReference<MutableObject> mo) {
this.mo = mo;
}
public void run() {
//read some field from mo
mo.get().readSomeField();
}
}
I think you are overcomplicating it. If it is as in the example (a local variable to which no other reference is kept), you should trust that nobody will try to write to it. If it is more complicated (A.foo() has more LOC), then if possible create the object only in order to pass it to the thread.
new Thread(new ObjectB(new MutableObject())).start();
If not (due to initializations), declare it in a block so it gets out of scope immediately, even maybe in a separate private method.
{
MutableObject mo = new MutableObject();
Runnable objectB = new ObjectB(mo);
new Thread(objectB).start();
}
....
Copy the object. You won't have any weird visibility problems because you pass the copy to a new Thread. Thread.start always happens before the new thread enters its run method. If you change this code to pass the object to an existing thread, you need proper synchronization. I recommend a blocking queue from java.util.concurrent.
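A minimal sketch of both ideas together, under stated assumptions: MutableObject here is a hypothetical class with a single int field and a copy constructor, and an ArrayBlockingQueue hands the copy to a thread that is already running.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class CopyHandoff {
    // Hypothetical mutable class with a copy constructor.
    static class MutableObject {
        int value;
        MutableObject(int value) { this.value = value; }
        MutableObject(MutableObject other) { this.value = other.value; }
    }

    // Hands a copy to an already running thread and returns what that thread saw.
    static int handOff() throws InterruptedException {
        BlockingQueue<MutableObject> queue = new ArrayBlockingQueue<>(1);
        int[] seen = new int[1];

        Thread consumer = new Thread(() -> {
            try {
                // take() blocks until an object arrives; the queue publishes it safely.
                seen[0] = queue.take().value;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        MutableObject mo = new MutableObject(1);
        queue.put(new MutableObject(mo)); // hand over a private copy
        mo.value = 99;                    // later changes do not affect the copy

        consumer.join();                  // join() makes seen[0] visible to this thread
        return seen[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(handOff()); // prints 1
    }
}
```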
Without knowing your exact situation, this question will be difficult to answer precisely. The answer totally depends on what the MutableObject represents, how many other threads may modify it simultaneously, and whether or not the threads that read the object care whether its state changes while they are reading it.
With respect to thread-safety, internally synchronizing all reads and writes to MutableObject is probably the "safest" thing to do, but it comes at the cost of performance. If contention is really high on reads and writes, then your program may suffer performance issues. You can get better performance by sacrificing some guarantees on mutual exclusion; whether those sacrifices are worth the performance increases totally depends on the specific problem you're trying to solve.
You can also play some games with how you go about "internally synchronizing" your MutableObject, if that's what you end up doing. If you haven't already, I'd recommend reading up on the differences between volatile and synchronized and understand how each can be used to ensure thread safety for different situations.
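As a rough illustration of that difference, the sketch below (with hypothetical names) uses volatile for a simple stop flag, where only visibility matters, and synchronized for a counter, where the read-modify-write must not be interleaved.

```java
class VolatileVsSynchronized {
    // volatile: a write by one thread is visible to later reads by other threads.
    // It does not make compound operations such as counter++ atomic.
    private volatile boolean done;

    // Plain field: every access goes through the synchronized methods below.
    private int counter;

    synchronized void increment() { counter++; }

    synchronized int count() { return counter; }

    static int run(int threads, int perThread) throws InterruptedException {
        VolatileVsSynchronized state = new VolatileVsSynchronized();

        // The watcher only reads the flag, so volatile is sufficient here.
        Thread watcher = new Thread(() -> {
            while (!state.done) {
                Thread.yield();
            }
        });
        watcher.start();

        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    state.increment();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }

        state.done = true; // the watcher sees this write and exits its loop
        watcher.join();
        return state.count();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(4, 5_000)); // prints 20000
    }
}
```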
Assume we have two threads, thread A and thread B.
Thread A runs the main method and contains large data structures.
Is it possible to create a second thread and pass the address (a pointer) of the data structure (local to thread A) to thread B, so that both threads can read from the data structure?
The point of this is to avoid the need to duplicate the entire data structure on thread B, or to spend a lot of time pulling relevant information from the data structure for thread B to use.
Keep in mind that neither thread is modifying the data.
In Java, the term used is reference rather than pointer.
It is possible to pass it, as any other object, to another thread.
As any (non-final) class in Java, you can extend it, add members, add constructors etc.
(If you need to modify the data) You need to make sure that there are no concurrency issues.
It's known as a reference in java, as you don't have access directly to a pointer in a conventional sense. (For most cases it's "safe" to think of it as every reference is a pointer that is always passed by value and the only legal operation is to dereference it. It is NOT the same as a C++ 'reference.')
You can certainly share references among threads. Anything that's on the heap can be seen and used by any thread that can get a reference to it. You can either put it in a static location, or set the value of a reference on your Runnable to point to the data.
public class SharedDataTest {
private static class SomeWork implements Runnable {
private Map<String, String> dataTable;
public SomeWork(Map<String, String> dataTable) {
this.dataTable = dataTable;
}
@Override
public void run() {
//do some stuff with dataTable
}
}
public static void main(String[] args) {
Map<String, String> dataTable = new ConcurrentHashMap<String, String>();
Runnable work1 = new SomeWork(dataTable);
Runnable work2 = new SomeWork(dataTable);
new Thread(work1).start();
new Thread(work2).start();
}
}
Yes it is possible and is a usual thing to do but you need to make sure that you use proper synchronization to ensure that both threads see an up to date version of the data.
It is safe to share a reference to immutable object. Roughly speaking, immutable object is the object that doesn't change its state after construction. Semantically immutable object should contain only final fields which in turn reference immutable objects.
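As a minimal sketch of that definition, here is a hypothetical Snapshot class: it is final, all fields are final, and the list passed in is defensively copied and wrapped so that nothing can change after construction. The sumInThreads helper shows several threads reading it without any locking.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical immutable value class.
final class Snapshot {
    private final String name;
    private final List<Integer> values;

    Snapshot(String name, List<Integer> values) {
        this.name = name;
        // Copy first, then wrap: the caller's list can change without affecting us.
        this.values = Collections.unmodifiableList(new ArrayList<>(values));
    }

    String name() { return name; }

    List<Integer> values() { return values; }

    int sum() {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Several threads read the same Snapshot; no synchronization is needed.
    static int[] sumInThreads(Snapshot shared, int threads) throws InterruptedException {
        int[] sums = new int[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int slot = t;
            workers[t] = new Thread(() -> sums[slot] = shared.sum());
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join(); // join() makes each worker's result visible here
        }
        return sums;
    }

    public static void main(String[] args) throws InterruptedException {
        Snapshot s = new Snapshot("demo", Arrays.asList(1, 2, 3));
        System.out.println(Arrays.toString(sumInThreads(s, 3))); // prints [6, 6, 6]
    }
}
```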
If you want to share reference to mutable object you need to use proper synchronization, for example by using synchronized or volatile keywords.
Easy way to share data safely would be to use utilities from java.util.concurrent package such as AtomicReference or ConcurrentHashMap, however you still have to be very careful if objects you share are mutable.
If you are not doing any modification in the shared data you can have a shared reference and there will be no significant overhead.
Be careful however when you start modifying the shared object concurrently, in this case you can use the data structures provided in java (see for instance factory methods in Collections), or use a custom synchronisation scheme, for instance with java.util.concurrent.locks.ReentrantLock.
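For the custom synchronisation scheme, a minimal ReentrantLock sketch could look like this. LockedList is a hypothetical class; the lock is always released in a finally block, and readers get a copy so they never iterate over the list while another thread is modifying it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical list wrapper guarded by an explicit lock.
class LockedList {
    private final ReentrantLock lock = new ReentrantLock();
    private final List<String> items = new ArrayList<>();

    void add(String item) {
        lock.lock();
        try {
            items.add(item);
        } finally {
            lock.unlock(); // always release, even if add() throws
        }
    }

    // Returns a copy taken under the lock, safe to iterate without locking.
    List<String> snapshot() {
        lock.lock();
        try {
            return new ArrayList<>(items);
        } finally {
            lock.unlock();
        }
    }

    // Demonstration helper: several threads add concurrently.
    static int fill(int threads, int perThread) throws InterruptedException {
        LockedList list = new LockedList();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    list.add(id + ":" + i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return list.snapshot().size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(fill(4, 1_000)); // prints 4000
    }
}
```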