is hadoop DistributedFileSystem thread safe?

I am using hadoop for writing data I scrape.
I have a spring service that is called from multiple threads to write some content to the HDFS.
@Service
public class WriteService
{
    public void write(String path, String content)
    {
        FileSystem fs = FileSystem.get(conf);
    }
}
I am not sure whether the FileSystem object can be a member of the WriteService, and I can't find out whether it is thread-safe or not.
I am using the DistributedFileSystem object.
Do you know whether it is thread-safe, and whether I can use it as a member of my service?
Thank you

Hadoop DFS uses a so-called WORM (write once, read many) model. This makes it more robust when it comes to concurrency issues.
But, to answer the question: it is not thread-safe in general. You still need to think about your concurrency control requirements.

If you call config.setBoolean("fs.hdfs.impl.disable.cache", true); first, FileSystem.get(config) returns a new instance on each call instead of a shared cached one, so each thread can work with its own FileSystem.

Related

How I can implement AWS S3 client in a thread safe manner?

Hi, I have a method that is executed by multiple threads concurrently to connect to S3 bucket objects and read metadata. All of those calls use a single S3 client object. According to the Amazon Java SDK documentation, S3 clients are thread-safe objects. Can the following implementation cause any deadlock or performance issue? Is this the correct way to use an S3 client from multiple threads?
public class S3Client {
    // static method that returns the S3 client for all the requests
    public static AmazonS3 getS3Client() {
        return AmazonS3ClientBuilder.standard().withRegion(Regions.DEFAULT_REGION).build();
    }
}
There is another class (RequestHandler, with its readObject method) that is executed by multiple threads concurrently, so it runs for each and every request.
public class RequestHandler {
    // Multiple concurrent threads are accessing this method
    public void readObject() {
        AmazonS3 s3Client = S3Client.getS3Client();
        ListObjectsV2Result result = s3Client.listObjectsV2("bucket_name");
    }
}
Please advise. Thanks in advance!

Let's go one by one:
The builders in the AWS SDK for Java are generally not thread-safe, so try not to call S3Client#getS3Client() from multiple threads.
AmazonS3Client is annotated with @ThreadSafe, an annotation in the AWS SDK for Java that marks a class as thread-safe. So there is no need for an object factory like the one you wrote; a single AmazonS3Client singleton per application is enough. In the example above you create a new instance on each and every invocation of RequestHandler#readObject(). That is not only unsafe, it will likely cause a performance issue, since creating a lot of AmazonS3Client objects puts extra pressure on your application's garbage collector.
You can solve pretty much all of this by using the singleton pattern, i.e. create AmazonS3Client as a singleton object, either through Spring, through any other IoC framework, or by yourself, for example via double-checked locking. This way you achieve thread safety along with relatively good performance (in comparison to the code in the question).
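As a rough illustration of the double-checked locking mentioned above, here is a minimal sketch. ExpensiveClient and ClientHolder are hypothetical stand-ins (not part of the AWS SDK) for any thread-safe client that is costly to build; in a real application the constructor call would be replaced by whatever builder produces the client.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a thread-safe, expensive-to-build client.
final class ExpensiveClient {
    static final AtomicInteger CREATED = new AtomicInteger();

    ExpensiveClient() {
        CREATED.incrementAndGet(); // count constructions so reuse is observable
    }
}

// Hypothetical holder that builds the client lazily, at most once.
final class ClientHolder {
    // volatile is required for double-checked locking to be safe
    private static volatile ExpensiveClient client;

    private ClientHolder() {}

    static ExpensiveClient get() {
        ExpensiveClient local = client;      // first check, without the lock
        if (local == null) {
            synchronized (ClientHolder.class) {
                local = client;              // second check, under the lock
                if (local == null) {
                    local = new ExpensiveClient();
                    client = local;
                }
            }
        }
        return local;
    }
}
```

Every caller of ClientHolder.get() then shares one instance. If the application already uses Spring, a singleton-scoped bean gives the same result with less code.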
Hope it helped, have a nice day!)

Attempt execution uniqueness in a java project?

I am working on a Java library that has a singleton class with two methods: createTask() and addPointsToTask().
The library is meant to be used in any Java service that executes multiple requests.
The service should be able to call createTask only once while processing a single request. Any further calls to createTask in the same thread of execution should fail. addPointsToTask can be called any number of times.
As the library owner, how can I restrict this method to be called only once per thread?
I have explored ThreadLocal, but I don't think it fits my purpose.
One solution is to ask the service that uses the library to set a unique id in a ThreadLocal, but since this 'set-to-thread-local' step happens outside the boundary of the library, it is not a foolproof solution.
Any hints?

Short answer: you won't get a "fool-proof" solution; i.e. a solution that someone can't subvert.
Unless you are running your library on a JVM platform that you control, users of your library will be able to find a way to subvert your "only once per thread" restriction if they try hard enough. For example:
They could use reflection to access the private state of the objects or classes that implement the restriction.
They could use bytecode injection to subvert your code.
They could decompile and replace your code.
They could modify their JVM to do something funky with your code. (The OpenJDK source code is available to anyone.)
Ask yourself the following:
Is this restriction reasonable from the perspective of the programmer you are trying to restrict?
Would a sensible programmer have good reason to try to break it?
Have you considered possible use-cases for your library where it would be reasonable to call createTask() multiple times? For example, use-cases that involve using thread pools?
If you are doing this because you think allowing multiple createTask() calls will break your library, my advice would be:
Tell the programmer via the javadocs and other documentation what is likely to break if they do the thing that you are trying to prevent.
Implement a "soft" check, and provide an easy way for a programmer to disable the check. (But do the check by default, if you think that is appropriate.)
The point is that a sensible programmer won't knowingly subvert restrictions unless they have good reason to. If they do, and they hurt themselves, that is not your problem.
On the other hand, if you are implementing this restriction for "business reasons" or to stop "cheating" or something like that, my advice would be to recognize that a determined user will be able to subvert any restrictions you attempt to embed in your code when they run it on their platform. If this fundamentally breaks your model, look for a different model.
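The "soft" check with an easy off switch suggested above could look roughly like the following sketch. The class name TaskLibrary, the endRequest() method and the system property tasklib.strict are all made up for illustration; the check is on by default and can be disabled with -Dtasklib.strict=false.

```java
// Hypothetical library class; all names here are illustrative assumptions.
final class TaskLibrary {
    // Per-thread flag recording whether createTask() has already run.
    private static final ThreadLocal<Boolean> CREATED =
            ThreadLocal.withInitial(() -> false);

    private TaskLibrary() {}

    // The check is on unless the (assumed) property is set to "false".
    static boolean strict() {
        return Boolean.parseBoolean(System.getProperty("tasklib.strict", "true"));
    }

    static void createTask() {
        if (strict() && CREATED.get()) {
            throw new IllegalStateException(
                    "createTask() was already called on this thread");
        }
        CREATED.set(true);
    }

    // Must be called when a request ends, otherwise pooled threads keep the flag.
    static void endRequest() {
        CREATED.remove();
    }
}
```

The need for endRequest() is the weak point of any per-thread check: with thread pools, the flag outlives the request unless someone resets it.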

You will not be able to prohibit multiple calls from the same request, simply because your library has no concept of what a "request" actually is. This very much depends on the service using the library. Some services may use a single thread per request, but others may not. Using thread-locals is error-prone especially when you are working in multi-threaded or reactive applications where code processing a request can execute on multiple parallel threads.
If your requirement is that addPointsToTask is only called for a task that was actually started by some code that is processing the current request, you could set up your API like that. E.g. createTask could return a context object that is required to call addPointsToTask later.
public TaskContext createTask() {
}
public void addPointsToTask(TaskContext context, ....) {
}
This way you can track task context even over multiple different threads executing code for the same request and points will not get added to a task created by another request.
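To make the idea concrete, here is one possible shape for such a context object. This is only a sketch: the fields of TaskContext, the TaskApi class name and the long points parameter are assumptions, since the original signature is elided.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical context object: holding one is the only way to add points.
final class TaskContext {
    // AtomicLong so that several threads serving one request can add points safely.
    private final AtomicLong points = new AtomicLong();

    void add(long n) {
        points.addAndGet(n);
    }

    long points() {
        return points.get();
    }
}

// Hypothetical library facade; no thread-local state is involved.
final class TaskApi {
    TaskContext createTask() {
        return new TaskContext();
    }

    void addPointsToTask(TaskContext context, long points) {
        Objects.requireNonNull(context, "context");
        context.add(points);
    }
}
```

Because the caller has to pass the context explicitly, points can only land on a task the caller actually holds, regardless of which thread runs the code.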

You could add a method to your singleton which runs some piece of Service-Code in the context of a request.
Dummy implementation:
package stackoverflow;
import java.util.concurrent.Callable;
public enum YourLibrarySingleton {
    INSTANCE;
    private final ThreadLocal<Task> threadLocalTask;
    YourLibrarySingleton() {
        this.threadLocalTask = new ThreadLocal<>();
    }
    public void createTask() {
        this.threadLocalTask.set(new Task() {});
    }
    public void addPointsToTask() {
        Task task = this.threadLocalTask.get();
        // add points to that task
    }
    public <T> T handleRequest(Callable<T> callable) throws Exception {
        try {
            return callable.call();
        } finally {
            this.threadLocalTask.remove();
        }
    }
}
Which could be used like this:
package stackoverflow;
public class ServiceCode {
    public void handleRequest() throws Exception {
        YourLibrarySingleton.INSTANCE.handleRequest(() -> {
            YourLibrarySingleton.INSTANCE.createTask();
            YourLibrarySingleton.INSTANCE.addPointsToTask();
            YourLibrarySingleton.INSTANCE.addPointsToTask();
            return "result";
        });
    }
}

Have these two java.io.File thread safety issues been evaded?

Assuming a Win32FileSystem, and that beginMultiThreading runs many times simultaneously on a shared MultiThreadingClass object, what is the most likely way that this could cause a data race or some other threading issue? I know that this is probably not thread-safe, because (1) the argument to setPath gets reused. I also see that (2) path is not a final variable in java.io.File. However, I can't seem to find a place where this code could fail on its own due to a threading issue.
public class MultiThreadingClass {
    private Holder h = new Holder();
    private String path = "c:\\somepath";
    public void beginMultiThreading() {
        h.setPath(new File(path));
        h.begin();
    }
}
public class Holder {
    private File path;
    public void setPath(File path) {
        this.path = path;
    }
    public void begin() {
        System.out.println(path.getCanonicalPath() + "some string");
    }
}

As @Duncan says, the code is currently thread-safe. But it doesn't do any file writing at this time. Since you are using File objects, I expect that you will be dealing with files. Once you start writing files, there are further considerations:
Writing to a single file from multiple threads needs to be synchronized. To my knowledge, this is not "out of the box" functionality.
Writing to the same file from different JVMs or even from different class loaders in the same JVM is much harder. (With most web frameworks, writing to a logging file from multiple web apps is an example of writing to a single file from different class loaders). You are back to using a lock file or a platform-specific mutex of some sort.
Caveat: It is a while since I have had to do this, so there may be more support in the latest Java concurrency package or NIO package that someone else can expand on.

Your example code has no multi-threading at all. So I'll assume that either multiple threads are operating on their own MultiThreadingClass instance, or that they are sharing a common instance between them.
Either way, this code is thread safe. The only shared state is a private string object, which is not adjusted as part of your methods.

Synchronize file object

From what I know and have researched, the synchronized keyword in Java lets you synchronize a method or a block of code to handle multi-threaded access. If I want to lock a file for writing in a multi-threaded environment, I should use the classes in the Java NIO package to get the best results. Yesterday I asked a question about handling a shared servlet for file I/O operations, and BalusC's comments helped with the solution, but the code in this answer confuses me. I'm not asking the community to "burn that post" or "downvote him" (note: I haven't downvoted it, and I have nothing against the answer); I'm asking for an explanation of whether the code fragment can be considered good practice:
private static File theFile = new File("theonetoopen.txt");
private void someImportantIOMethod(Object stuff) {
    /*
      This is the line that confuses me. You can use any object as a lock, but
      is it good to use a File object for this purpose?
    */
    synchronized (theFile) {
        // Your file output writing code here.
    }
}

The problem is not about locking on a File object - you can lock on any object and it does not really matter (to some extent).
What strikes me is that you are using a non-final monitor, so if another part of your code reassigns theFile (theFile = new File(...);), the next thread that comes along will lock on a different object, and you no longer have any guarantee that your code won't be executed by two threads simultaneously.
Had theFile been final, the code would be ok, although it is preferable to use private monitors, just to make sure there is not another piece of code that uses it for other locking purposes.
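A small sketch of the private, final monitor described above. GuardedWriter is a made-up class, and the in-memory list only stands in for the real file-writing code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class: the in-memory list stands in for real file output.
final class GuardedWriter {
    // private and final: no other code can lock on it, and it can never be reassigned
    private final Object lock = new Object();
    private final List<String> lines = new ArrayList<>();

    void write(String line) {
        synchronized (lock) {
            lines.add(line); // only one thread at a time gets here
        }
    }

    int size() {
        synchronized (lock) {
            return lines.size();
        }
    }
}
```

Since the lock object never escapes the class, no unrelated code can accidentally contend on it or swap it out.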

If you only need to lock the file within a single application then it's OK (assuming final is added).
Note that the solution won't work if you load the class more than once using different class loaders. For example, if you have a web application that is deployed twice in the same web server, each instance of the application will have its own lock object.
As you mention, if you want the locking to be robust and have the file locked from other programs too, you should use FileLock (see the docs, on some systems it is not guaranteed that all programs must respect the lock).
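For the FileLock route, a minimal sketch might look like this. LockedAppend and its append method are illustrative names; the NIO calls themselves (FileChannel.open, FileChannel.lock) are standard.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper that appends text while holding an exclusive file lock.
final class LockedAppend {
    private LockedAppend() {}

    static void append(Path file, String text) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.APPEND);
             FileLock lock = channel.lock()) { // blocks until the lock is granted
            ByteBuffer buffer = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
        } // the lock is released first, then the channel is closed
    }
}
```

File locks are held on behalf of the whole JVM, so two threads in the same process that try to lock the same file can get an OverlappingFileLockException. Threads inside one application still need an in-process lock such as the final monitor discussed above.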

Had you seen: final Object lock = new Object() would you be asking?
As @assylias pointed out, the problem is that the lock is not final here.

Every object in Java can act as a lock for synchronization. They are called intrinsic locks. Only one thread at a time can execute a block of code guarded by a given lock.
More on that: http://docs.oracle.com/javase/tutorial/essential/concurrency/locksync.html
Using the synchronized keyword on a whole method can have a performance impact on your application. That's why it is sometimes better to use a synchronized block around only the critical section.
Remember that the lock reference must not change while it is in use. The best way to guarantee that is to declare it final.

Concurrency file system

I need to create a File System Manager (more or less) which can read or write data to files.
My problem is how do I handle concurrency?
I can do something like
public class FileSystemManager {
    private ReadWriteLock readWriteLock = new ReentrantReadWriteLock();
    public byte[] read(String path) {
        readWriteLock.readLock().lock();
        try {
            ...
        } finally {
            readWriteLock.readLock().unlock();
        }
    }
    public void write(String path, byte[] data) {
        readWriteLock.writeLock().lock();
        try {
            ...
        } finally {
            readWriteLock.writeLock().unlock();
        }
    }
}
But this would mean that all writes (for example) are serialized behind one lock, even if the first invocation targets /tmp/file1.txt and the second targets /tmp/file2.txt.
Any ideas how to go about this?

Suggest Message Passing For Concurrency Not Threads
In general, this kind of locking happens beneath the java level. Are you really planning on reading and writing the same files and directories? Implementing directories?
Right now there is lots of unsafe threading code that may start blowing up as threads start really running together on multicore hardware.
My suggestion would be to manage concurrency via message passing. You can roll your own if this is an educational exercise, or use one of the zillions of queuing and messaging systems out there. In this kind of system you have only one reader/writer process, but possibly many clients. You avoid the need for file and thread concurrency management.
One advantage of the message passing paradigm is that it will be much, much easier to create a distributed system for load balancing.
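A "roll your own" version of this idea, inside a single JVM, can be sketched with a BlockingQueue and one worker thread. SingleWriter is a hypothetical class, and the in-memory list stands in for the actual file writes; a real system would more likely use a separate process or a message broker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical single-writer: many clients submit, one thread does all the I/O.
final class SingleWriter {
    // Unique sentinel object, compared by identity, that tells the worker to stop.
    private static final String STOP = new String("stop");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    // Stand-in for real file output; touched only by the worker thread.
    private final List<String> written = new ArrayList<>();
    private final Thread worker = new Thread(this::drain, "single-writer");

    void start() {
        worker.start();
    }

    // Safe to call from any number of client threads.
    void submit(String message) {
        queue.add(message);
    }

    // Stops the worker and returns what it wrote; join() makes the list safe to read.
    List<String> stopAndGet() throws InterruptedException {
        queue.add(STOP);
        worker.join();
        return written;
    }

    private void drain() {
        try {
            for (String m = queue.take(); m != STOP; m = queue.take()) {
                written.add(m);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Clients never touch the file themselves, so no file-level locking is needed; the queue is the only shared structure.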

Can't you create a different object for each path and then use synchronized blocks that synchronize on "this"?

You can store the ReadWriteLock instances in a map keyed on path, just make sure that you get concurrent access to the map correct (possibly using ConcurrentHashMap).
If you actually care about locking the file using operating system primitives, you might try looking into java.nio.channels.FileChannel. It supports fine-grained locking of file regions, among other things. Also check out java.nio.channels.FileLock.
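The map of locks keyed on path could be sketched as follows. PathLocks and its methods are hypothetical names; the sketch assumes paths are already normalized, so that the same file always maps to the same key, and it never evicts entries.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

// Hypothetical helper: one ReadWriteLock per path instead of one global lock.
final class PathLocks {
    private final ConcurrentMap<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

    // computeIfAbsent is atomic, so all threads asking for a path share one lock.
    ReadWriteLock lockFor(String path) {
        return locks.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
    }

    <T> T read(String path, Supplier<T> action) {
        ReadWriteLock lock = lockFor(path);
        lock.readLock().lock();
        try {
            return action.get();
        } finally {
            lock.readLock().unlock();
        }
    }

    void write(String path, Runnable action) {
        ReadWriteLock lock = lockFor(path);
        lock.writeLock().lock();
        try {
            action.run();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

With this, a write to /tmp/file1.txt no longer blocks a write to /tmp/file2.txt. The map grows with the number of distinct paths, so a long-running service would need some cleanup strategy.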

I would look deeply into Java 5 and the java.util.concurrent package. I'd also recommend reading Brian Goetz' "Java Concurrency in Practice".
