Synchronize file processing across a cluster - Java

I run a cluster containing two or more instances of the same microservice.
Each of them accesses files on a shared data share, which is mounted as a local folder on every server running the microservice. Each file may be processed only once in the entire cluster.
I want those files to be processed in parallel by the nodes, so that no file is processed more than once across the cluster.
I'm looking for ideas on how to solve this.
I have already thought about one node reading the files and putting their filenames into a queue, so that the other nodes can read from that queue.
I also thought about synchronizing via a database, where each node uses the database to coordinate with the other nodes before processing a file.
Any idea how to solve this in a good manner?
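One sketch of the database idea, just to make it concrete (the processed_files table with a unique constraint on filename is a made-up example, not an existing schema):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

// Each node calls this before touching a file; only the node whose INSERT
// succeeds goes on to process that file.
boolean tryClaim(Connection conn, String filename) throws SQLException {
    try (PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO processed_files (filename) VALUES (?)")) {
        ps.setString(1, filename);
        ps.executeUpdate();
        return true;    // this node won the race
    } catch (SQLIntegrityConstraintViolationException e) {
        return false;   // another node already claimed this file
    }
}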

Something like this might work:
String pathToFile = "/tmp/foo.txt";
try {
    Files.createFile(FileSystems.getDefault().getPath(pathToFile + ".claimed"));
    processFile(pathToFile);
} catch (FileAlreadyExistsException e) {
    // some other app instance has already claimed this file
}
and you'll need these imports:
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
The idea is that each app instance agrees to work on any given file only if it is first able to create a ".claimed" file in the same shared filesystem. This works because of the behavior of Files.createFile:
Creates a new and empty file, failing if the file already exists. The check for the existence of the file and the creation of the new file if it does not exist are a single operation that is atomic with respect to all other filesystem activities that might affect the directory.
(from this Javadoc:
https://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#createFile(java.nio.file.Path,%20java.nio.file.attribute.FileAttribute...) )
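Putting that together, each instance could simply scan the shared folder and only work on the files it manages to claim; a rough sketch (the directory path and processFile are placeholders, and whether createFile is truly atomic on your shared mount depends on the network filesystem, e.g. NFS, so verify that for your setup):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path shared = Paths.get("/mnt/shared/incoming");   // the mounted data share
try (DirectoryStream<Path> stream = Files.newDirectoryStream(shared, "*.txt")) {
    for (Path file : stream) {
        Path claim = Paths.get(file.toString() + ".claimed");
        try {
            Files.createFile(claim);           // atomic: only one node succeeds
            processFile(file.toString());      // placeholder for the real work
        } catch (FileAlreadyExistsException e) {
            // another node already claimed this file - skip it
        }
    }
} catch (IOException e) {
    // handle scan errors
}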

Related

How to copy multiple files atomically from src to dest in java?

In one requirement, I need to copy multiple files from one location to another network location.
Let's assume that I have the following files present in the /src location:
a.pdf, b.pdf, a.doc, b.doc, a.txt and b.txt
I need to copy a.pdf, a.doc and a.txt atomically into the /dest location, all at once.
Currently I am using the java.nio.file.Files class, with code as follows:
Path srcFile1 = Paths.get("/src/a.pdf");
Path destFile1 = Paths.get("/dest/a.pdf");
Path srcFile2 = Paths.get("/src/a.doc");
Path destFile2 = Paths.get("/dest/a.doc");
Path srcFile3 = Paths.get("/src/a.txt");
Path destFile3 = Paths.get("/dest/a.txt");
Files.copy(srcFile1, destFile1);
Files.copy(srcFile2, destFile2);
Files.copy(srcFile3, destFile3);
but with this approach the files are copied one after another.
As an alternative, in order to make the whole process atomic, I am thinking of zipping all the files, moving the archive to /dest, and unzipping it at the destination.
Is this approach correct for making the whole copy process atomic? Has anyone dealt with a similar problem and resolved it?
You can copy the files to a new temporary directory and then rename that directory.
Before renaming the temporary directory, you need to delete the destination directory.
If the destination directory already contains other files that you don't want to lose, you can instead move all the files from the temporary directory into the destination directory.
This is not completely atomic, however.
With removing /dest:
String tmpPath = "/tmp/in/same/partition/as/dest";   // rename only works within one partition
File tmp = new File(tmpPath);
tmp.mkdirs();
Path srcFile1 = Paths.get("/src/a.pdf");
Path destFile1 = Paths.get(tmpPath + "/a.pdf");
Path srcFile2 = Paths.get("/src/a.doc");
Path destFile2 = Paths.get(tmpPath + "/a.doc");
Path srcFile3 = Paths.get("/src/a.txt");
Path destFile3 = Paths.get(tmpPath + "/a.txt");
Files.copy(srcFile1, destFile1);
Files.copy(srcFile2, destFile2);
Files.copy(srcFile3, destFile3);
delete(new File("/dest"));
tmp.renameTo(new File("/dest"));   // ideally check the boolean return value
void delete(File f) throws IOException {
    if (f.isDirectory()) {
        for (File c : f.listFiles())
            delete(c);
    }
    if (!f.delete())
        throw new FileNotFoundException("Failed to delete file: " + f);
}
With just overwriting the files:
String tmpPath = "/tmp/in/same/partition/as/dest";   // moves only stay cheap within one partition
File tmp = new File(tmpPath);
tmp.mkdirs();
Path srcFile1 = Paths.get("/src/a.pdf");
Path destFile1 = Paths.get("/dest/a.pdf");
Path tmp1 = Paths.get(tmpPath + "/a.pdf");
Path srcFile2 = Paths.get("/src/a.doc");
Path destFile2 = Paths.get("/dest/a.doc");
Path tmp2 = Paths.get(tmpPath + "/a.doc");
Path srcFile3 = Paths.get("/src/a.txt");
Path destFile3 = Paths.get("/dest/a.txt");
Path tmp3 = Paths.get(tmpPath + "/a.txt");
Files.copy(srcFile1, tmp1);
Files.copy(srcFile2, tmp2);
Files.copy(srcFile3, tmp3);
// Start of non-atomic section (it can be redone if necessary)
Files.deleteIfExists(destFile1);
Files.deleteIfExists(destFile2);
Files.deleteIfExists(destFile3);
Files.move(tmp1, destFile1);
Files.move(tmp2, destFile2);
Files.move(tmp3, destFile3);
// end of non-atomic section
Even though the second method contains a non-atomic section, the copy itself goes through a temporary directory, so the destination files are not overwritten until the copies are complete.
If the process aborts while moving the files, the move can simply be completed afterwards.
See https://stackoverflow.com/a/4645271/10871900 as a reference for moving files and https://stackoverflow.com/a/779529/10871900 for recursively deleting directories.
First of all, there are several ways to copy a file or a directory. Baeldung gives a very nice overview of the different possibilities. You can also use FileCopyUtils from Spring. Unfortunately, none of these methods are atomic.
I found an older post and adapted it a little. You can try using Spring's low-level transaction management support: you wrap the method in a transaction and define what should happen on rollback. There is also a nice article from Baeldung about this.
@Autowired
private PlatformTransactionManager transactionManager;

@Transactional(rollbackOn = IOException.class)
public void copy(List<File> files) throws IOException {
    TransactionDefinition transactionDefinition = new DefaultTransactionDefinition();
    TransactionStatus transactionStatus = transactionManager.getTransaction(transactionDefinition);
    TransactionSynchronizationManager.registerSynchronization(new TransactionSynchronization() {
        @Override
        public void afterCompletion(int status) {
            if (status == STATUS_ROLLED_BACK) {
                // try to delete the created files
            }
        }
    });
    try {
        // copy files
        transactionManager.commit(transactionStatus);
    } catch (Exception e) {
        transactionManager.rollback(transactionStatus);
        throw e;
    }
}
Or you can use a simple try-catch block: if an exception is thrown, you delete the files created so far.
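A minimal sketch of that try/catch variant (the source and destination paths are simply the ones from the question):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

List<Path> sources = Arrays.asList(
        Paths.get("/src/a.pdf"), Paths.get("/src/a.doc"), Paths.get("/src/a.txt"));
List<Path> copied = new ArrayList<>();
try {
    for (Path src : sources) {
        Path dest = Paths.get("/dest").resolve(src.getFileName());
        Files.copy(src, dest);
        copied.add(dest);
    }
} catch (IOException e) {
    // roll back: remove whatever was already copied
    for (Path dest : copied) {
        try {
            Files.deleteIfExists(dest);
        } catch (IOException ignored) {
            // best-effort cleanup
        }
    }
    throw e;
}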
Your question doesn't state what level of atomicity you actually need. Even unzipping is never atomic: the VM might crash with an OutOfMemoryError right in the middle of inflating the blocks of the second file, leaving one file complete, a second partial and a third entirely missing.
The only thing I can think of is a two-phase commit, like all the suggestions with a temporary destination that suddenly becomes the real target. This way you can be sure that the second operation either never occurs or creates the final state.
Another approach would be to write a sort of cheap checksum file into the target afterwards. That makes it easy for an external process to watch for the creation of such files and verify their content against the files found.
The latter is much like offering a container/ZIP/archive right away instead of piling files into a directory; most archive formats have or support integrity checks.
(Operating systems and filesystems also differ in behaviour if directories disappear while being written to. Some accept it and write all data to a recoverable buffer. Others still accept writes but don't change anything. Others fail immediately upon the first write, since the target block on the device is unknown.)
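A rough sketch of that checksum-file idea, assuming MD5 and an MD5SUMS-style file written last so a watcher can treat its appearance as the "copy finished" signal (the file name and format are purely illustrative):

import java.io.IOException;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Writes "checksum  filename" lines for every regular file in destDir,
// then drops the MD5SUMS file last.
static void writeChecksumFile(Path destDir) throws IOException, NoSuchAlgorithmException {
    StringBuilder lines = new StringBuilder();
    try (DirectoryStream<Path> files = Files.newDirectoryStream(destDir)) {
        for (Path file : files) {
            if (!Files.isRegularFile(file)) {
                continue;
            }
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(Files.readAllBytes(file));
            String hex = String.format("%032x", new BigInteger(1, md5.digest()));
            lines.append(hex).append("  ").append(file.getFileName()).append('\n');
        }
    }
    Files.write(destDir.resolve("MD5SUMS"),
            lines.toString().getBytes(StandardCharsets.UTF_8));
}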
FOR ATOMIC WRITE:
There is no multi-file atomicity concept in standard filesystems, so you need to reduce the whole update to a single action, and that single action is what will be atomic.
Therefore, to write several files atomically, you can create a folder with, say, a timestamp in its name and copy the files into that folder.
Then you either rename it to the final destination or create a symbolic link to it.
You can use anything similar to this, such as file-based volumes on Linux, etc.
Remember that deleting an existing symbolic link and creating a new one will never be atomic, so you would need to handle that situation in your code and switch to the renamed/linked folder once it's available, instead of removing and recreating the link. Under normal circumstances, though, removing and creating a new link is a very fast operation.
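A rough sketch of the timestamped-folder idea (all paths are placeholders; whether the final move atomically replaces an existing link is platform-dependent, so treat this as a starting point rather than a guarantee):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// 1. Stage everything into a timestamped folder next to the final location.
Path staging = Paths.get("/data/dest-" + System.currentTimeMillis());
Files.createDirectories(staging);
Files.copy(Paths.get("/src/a.pdf"), staging.resolve("a.pdf"));
Files.copy(Paths.get("/src/a.doc"), staging.resolve("a.doc"));

// 2a. If the final directory does not exist yet, a single rename publishes it:
// Files.move(staging, Paths.get("/data/current"), StandardCopyOption.ATOMIC_MOVE);

// 2b. Otherwise, point a symlink at the staged folder: create the link under a
//     temporary name, then rename it over the old link (deleting and recreating
//     the link in place would not be atomic).
Path link = Paths.get("/data/current");
Path tmpLink = Paths.get("/data/current.tmp");
Files.deleteIfExists(tmpLink);
Files.createSymbolicLink(tmpLink, staging);
// On Linux this maps to rename(2), which replaces the old link atomically;
// the Javadoc calls the replace-on-atomic-move behaviour implementation-specific.
Files.move(tmpLink, link, StandardCopyOption.ATOMIC_MOVE);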
FOR ATOMIC READ:
Well, the problem is not in the code, but at the operating system/filesystem level.
Some time ago, I ran into a very similar situation. A database engine was running and changing several files "at once"; I needed to copy the current state, but the second file was already changed before the first one had been copied.
There are two different options:
Use a filesystem with support for snapshots. At some moment, you create a snapshot and then copy files from it.
You can lock the filesystem (on Linux) using fsfreeze --freeze, and unlock it later with fsfreeze --unfreeze. When the filesystem is frozen, you can read the files as usual, but no process can change them.
Neither of these options worked for me, as I couldn't change the filesystem type and locking the filesystem wasn't possible (it was the root filesystem).
So I created an empty file, mounted it as a loop device, and formatted it. From that moment on, I could fsfreeze just my virtual volume without touching the root filesystem.
My script first called fsfreeze --freeze /my/volume, then performed the copy action, and then called fsfreeze --unfreeze /my/volume. For the duration of the copy, the files couldn't be changed, so the copied files were all from exactly the same moment in time - for my purpose it was like an atomic operation.
By the way, be sure not to fsfreeze your root filesystem :-). I did, and a restart was the only way out.
DATABASE-LIKE APPROACH:
Even databases cannot rely on atomic operations, and so they first write the change to WAL (write-ahead log) and flush it to the storage. Once it's flushed, they can apply the change to the data file.
If there is any problem/crash, the database engine first loads the data file, checks whether there are any unapplied transactions in the WAL, and applies them if necessary.
This is also called journaling, and it's used by some filesystems (ext3, ext4).
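A very simplified sketch of that write-ahead idea applied to a file copy: record the intent, force it to disk, do the work, then clear the journal (the journal path and format are made up purely for illustration):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

Path journal = Paths.get("/dest/.copy-journal");
Path src = Paths.get("/src/a.pdf");
Path dst = Paths.get("/dest/a.pdf");

// 1. Write the intended operation to the journal and force it to disk
//    (force(true) is the moral equivalent of the WAL flush).
try (FileChannel ch = FileChannel.open(journal,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
    ch.write(ByteBuffer.wrap(("COPY " + src + " -> " + dst + "\n")
            .getBytes(StandardCharsets.UTF_8)));
    ch.force(true);
}

// 2. Apply the change.
Files.copy(src, dst);

// 3. Remove the journal entry once the copy is done.
Files.delete(journal);

// On startup, a recovery step would read any leftover journal and redo
// (or discard) the operations it describes.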
I hope this solution is useful. As I understand it, you need to copy the files from one directory to another directory, so my solution is as follows:
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class CopyFilesDirectoryProgram {

    public static void main(String[] args) throws IOException {
        String sourcedirectoryName = "//mention your source path";
        String targetdirectoryName = "//mention your destination path";
        File sdir = new File(sourcedirectoryName);
        File tdir = new File(targetdirectoryName);
        // call the method for execution
        copy(sdir, tdir);
    }

    private static void copy(File sdir, File tdir) throws IOException {
        if (sdir.isDirectory()) {
            copyFilesfromDirectory(sdir, tdir);
        } else {
            Files.copy(sdir.toPath(), tdir.toPath());
        }
    }

    private static void copyFilesfromDirectory(File source, File target) throws IOException {
        if (!target.exists()) {
            target.mkdir();
        }
        // recurse into the directory whether or not it had to be created
        for (String item : source.list()) {
            copy(new File(source, item), new File(target, item));
        }
    }
}

Multiple directories as Input format in hadoop map reduce

I am trying to run a graph verifier app on a distributed system using Hadoop.
I have the input in the following format:
Directory1
---file1.dot
---file2.dot
…..
---filen.dot
Directory2
---file1.dot
---file2.dot
…..
---filen.dot
Directory670
---file1.dot
---file2.dot
…..
---filen.dot
.dot files are files storing the graphs.
Is it enough for me to add the input directories path using FileInputFormat.addInputPath()?
I want Hadoop to process the contents of each directory on the same node, because the files in each directory contain data that depends on the presence of the other files in the same directory.
Will the Hadoop framework take care of distributing the directories evenly across the nodes of the cluster (e.g. directory1 to node1, directory2 to node2, ... and so on) and process them in parallel?
The files in each directory depend on each other for data (to be precise,
each directory contains a file main.dot which holds an acyclic graph whose vertices are the names of the rest of the files,
so my verifier will traverse each vertex of the graph in main.dot, look for the file of the same name in the same directory and, if found, process the data in that file.
All the files are processed similarly, and the combined output after processing every file in the directory is displayed;
the same procedure applies to the rest of the directories.)
To cut a long story short:
As in the famous word-count application (if the input is a single book), Hadoop will split the input and distribute the tasks to the nodes in the cluster, where each mapper processes a line and counts the relevant words.
How can I split the task here (do I need to split it at all)?
How can I leverage Hadoop's power for this scenario? Some sample code template would certainly help :)
The solution given by Alexey Shestakov will work, but it does not leverage MapReduce's distributed processing framework: probably only one map task will read the file (the file containing the paths of all input files) and then process that input data.
How can we assign all the files in a directory to a single mapper, so that the number of mappers equals the number of directories?
One solution could be the "org.apache.hadoop.mapred.lib.MultipleInputs" class.
Use MultipleInputs.addInputPath() to add each directory together with the map class for that directory path. Each mapper then gets one directory and processes all the files within it, as sketched below.
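A rough driver sketch of that idea, here using the new-API variant org.apache.hadoop.mapreduce.lib.input.MultipleInputs (VerifierDriver, DirectoryMapper and the paths are placeholders; note that how many map tasks each directory actually gets still depends on the InputFormat's split behaviour):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class VerifierDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "graph-verifier");
        job.setJarByClass(VerifierDriver.class);

        // one addInputPath call per directory; each could even use its own mapper class
        MultipleInputs.addInputPath(job, new Path("/input/Directory1"),
                TextInputFormat.class, DirectoryMapper.class);
        MultipleInputs.addInputPath(job, new Path("/input/Directory2"),
                TextInputFormat.class, DirectoryMapper.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/output/verifier"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}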
You can create a file with list of all directories to process:
/path/to/directory1
/path/to/directory2
/path/to/directory3
Each mapper would process one directory, for example:
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    for (FileStatus status : fs.listStatus(new Path(value.toString()))) {
        // process file
    }
}
Will the Hadoop framework take care of distributing the directories evenly across the nodes of the cluster (e.g. directory1 to node1, directory2 to node2, ... and so on) and process them in parallel?
No, it won't. Files are not distributed to each node in the sense that the files are copied to the node to be processed. Instead, to put it simply, each node is given a set of file paths to process, with no guarantee about location or data locality. The datanode then pulls the file from HDFS and processes it.
There's no reason why you can't just open other files you may need directly from HDFS.
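For example, sticking with the one-directory-per-input-line approach above, a mapper could read main.dot and its sibling files straight from HDFS (the class name and output are illustrative):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DirectoryMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path directory = new Path(value.toString());   // one directory per input line
        // main.dot drives which other files in the same directory are needed
        try (FSDataInputStream mainDot = fs.open(new Path(directory, "main.dot"))) {
            // parse main.dot, then open each referenced file with
            // fs.open(new Path(directory, referencedName)) and verify it
        }
        context.write(new Text(directory.toString()), new Text("verified"));   // placeholder output
    }
}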

Parallelization of file and network I/O operations

The Questions:
Main question: What's the best strategy to parallelize these jobs?
Ideas: How can I speed up the process using other mechanisms, such as a second checksum (Adler32)?
The Scenario:
I'm writing a kind of synchronization tool in Java. Basically, it downloads a repository from a webserver which describes the file/directory structure on the local machine and lists sources for the needed files in compressed form, combined with hash values to verify the files. A basic thing, I guess.
Requirements:
Multi-platform java desktop application
Best possible speed and parallelization
Example structure (best described using mods of a game):
Example Repository File
{"name":"subset1", "mods":[
{
"modfolder":"mod1",
"modfiles":[
{
"url":"http://www.example.com/file2.7z",
"localpath":"mod1/file2",
"size":5,
"sizecompressed":3,
"checksum":"46aabad952db3e21e273ce"
},
{
"url":"http://www.example.com/file1.7z",
"localpath":"mod1/file1",
"size":9,
"sizecompressed":4,
"checksum":"862f90bafda118c4d3c5ee6477"
}
]
},
{
"modfolder":"mod2",
"modfiles":[
{
"url":"http://www.example.com/file3.7z",
"localpath":"mod2/file3",
"size":8,
"sizecompressed":4,
"checksum":"cb1e69de0f75a81bbeb465ee0cdd8232"
},
{
"url":"http://www.example.com/file1.7z",
"localpath":"mod2/file1",
"size":9,
"sizecompressed":4,
"checksum":"862f90bafda118c4d3c5ee6477"
}
]
}
]}
Client file structure, as it should be after sync
mod1/
file2
file1
mod2/
file3
file1
// mod1/file1 == mod2/file1 (same checksum)
A special thing about the repository:
The repository received from the server represents only a subset of a bigger repository, because the user only needs a subtree, and that subtree changes over time (the subtrees can also overlap).
Sometimes the repository consists of mod1 and mod2, sometimes of mod1 and mod3, and so on.
Work to be done:
Download Repository and parse it (Net I/O)
Mark files not in the repository for deletion at the end of the process (files may be copied because of same checksum) (File I/O)
If file exists: Check checksum of existing file (checksum cache) (File I/O)
If file not exists: Check checksumcache for identical files in other subtrees to copy the file instead of downloading it (Light file I/O)
Download single file in compressed form (Net I/O)
Extract compressed file (File I/O)
Checksum of uncompressed file (File I/O)
Cache checksum associated with file. (Light file I/O)
My solution: (many different producers/consumers)
The checksum cache uses MapDB's persistent maps.
At the moment only MD5 checksums are used.
Queues: every worker type has a blocking queue (producer/consumer).
Thread pools: every worker type has a fixed thread pool, e.g. 3 downloaders, 2 checksum workers, ...
Workers hand the current job on to the next queue: Downloader -> Extract -> Checksum
Workertypes:
Localfile worker: checks the local file structure (using the checksum cache),
redirects work to the Download and Delete workers
Copy: copies a file with the same checksum to the destination
Download: downloads a file
Checksum: checksums a file and inserts the result into the checksum cache
Delete: deletes a file
Extract: extracts a compressed file
What's the best strategy to parallelize these jobs?
You have I/O, and, probably, if one job is already in progress on one directory, another job cannot run on the same directory at the same time.
So you need locking here. Recommendation: use a locking directory on the filesystem, and use directories, not files, to lock. Why? Because directory creation is atomic (first reason), and because Java 6 does not support atomic file creation (second reason). In fact, you may even need two locking directories: one for content download, another for content processing. A minimal sketch follows.
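A minimal sketch of the lock-directory idea (the lock path is a placeholder, and its parent directory must already exist, otherwise mkdir() also returns false):

import java.io.File;

File lockDir = new File("/path/to/locks/mod1.downloading");   // one lock per unit of work
if (lockDir.mkdir()) {      // atomic: false means someone else already holds the lock
    try {
        // download / process mod1 here
    } finally {
        lockDir.delete();   // release the lock
    }
} else {
    // another worker owns this directory - skip it or retry later
}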
The separation of download vs processing you have already done, so I have nothing more to say here ;)
I am not sure why you want to cache checksums, though? It doesn't look that useful to me...
Also, I don't know how big the files you have to deal with are, but why bother checking the existing directory contents etc. versus extracting into a new directory and renaming? I.e.:
extract new directory in newdir;
checksums;
move dstdir to dstdir.old;
move newdir to dstdir;
scrap dstdir.old.
This even means you could parallelize scrapping, but that is too much I/O parallelization... You'll have to limit the number of threads doing actual I/O.
EDIT Here is how I would separate processing:
first of all, no checksums anymore on the archive itself, but there is a file in the archive which contains the MD5 sums of each file (for instance, MD5SUMS);
two blocking queues: download -> replace, replace -> scrapping;
one processor takes care of downloading; when it is done, it fills the download -> replace queue;
another processor picks a task from the download -> replace queue; this task performs, in order, unarchiving and checksumming; if both succeed, it renames the existing directory as mentioned above, renames the extracted directory to the expected name, and puts a scrapping task on the replace -> scrapping queue;
the third and last processor picks a task from the scrapping queue and deletes the previous (renamed) directory.
Note that the checksumming, if it is that heavy, could be parallelized.
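A skeletal version of that three-stage pipeline with blocking queues (queue capacities, pool sizes and the String task type are just placeholders; each stage would really loop over all mods, and executor shutdown is omitted):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

BlockingQueue<String> downloadToReplace = new ArrayBlockingQueue<>(100);
BlockingQueue<String> replaceToScrap = new ArrayBlockingQueue<>(100);

ExecutorService downloaders = Executors.newFixedThreadPool(3);
ExecutorService replacers = Executors.newFixedThreadPool(2);
ExecutorService scrappers = Executors.newSingleThreadExecutor();

// Stage 1: download an archive, then hand the mod name to the replace stage.
downloaders.submit(() -> {
    String mod = "mod1";             // placeholder: normally taken from the repository
    // download the mod archive ...
    downloadToReplace.put(mod);
    return null;
});

// Stage 2: unarchive, verify MD5SUMS, swap directories, queue the old dir for scrapping.
replacers.submit(() -> {
    String mod = downloadToReplace.take();
    // extract to <mod>.new, verify MD5SUMS, rename <mod> -> <mod>.old, <mod>.new -> <mod>
    replaceToScrap.put(mod + ".old");
    return null;
});

// Stage 3: delete the replaced directory.
scrappers.submit(() -> {
    String oldDir = replaceToScrap.take();
    // recursively delete oldDir
    return null;
});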

How to put a serialized object into the Hadoop DFS and get it back inside the map function?

I'm new to Hadoop and recently I was asked to do a test project using Hadoop.
So while I was reading about Big Data, I happened to learn about Pail. What I want to do is something like this: first create a simple object, serialize it using Thrift, and put it into HDFS using Pail. Then I want to get that object back inside the map function and do whatever I want with it. But I have no idea how to get that object inside the map function.
Can someone please tell me of any references or explain how to do that?
I can think of three options:
Use the -files option and name the file in HDFS (preferable as the task tracker will download the file once for all jobs running on that node)
Use the DistributedCache (similar logic to the above), but you configure the file via some API calls rather than through the command line
Load the file directly from HDFS (less efficient as you're pulling the file over HDFS for each task)
As for some code, put the load logic into your mapper's setup(...) or configure(..) method (depending on whether you're using the new or old API) as follows:
protected void setup(Context context) throws IOException {
    // the -files option makes the named file available in the local working directory
    File file = new File("filename.dat");
    // open file and load contents ...

    // or load the file directly from HDFS
    FileSystem fs = FileSystem.get(context.getConfiguration());
    InputStream hdfsInputStream = fs.open(new Path("/path/to/file/in/hdfs/filename.dat"));
    // load file contents from stream...
}
DistributedCache has some example code in the Javadocs

How to convert a Hadoop Path object into a Java File object

Is there a way to convert a valid, existing Hadoop Path object into a useful Java File object? Is there a nice way of doing this, or do I need to bludgeon the code into submission? The more obvious approaches don't work, and it seems like it would be a common bit of code.
void func(Path p) {
    if (p.isAbsolute()) {
        File f = new File(p.toURI());
    }
}
This doesn't work because Path::toURI() returns a URI with the "hdfs" scheme, and Java's File(URI uri) constructor only accepts URIs with the "file" scheme.
Is there a way to get Path and File to work together?
OK, how about a specific, limited example.
Path[] paths = DistributedCache.getLocalCacheFiles(job);
DistributedCache is supposed to provide a localized copy of a file, but it returns a Path. I assume that DistributedCache makes a local copy of the file on the same disk. Given this limited example, where HDFS is hopefully not in the equation, is there a way for me to reliably convert a Path into a File?
I recently had this same question, and there really is a way to get a file from a path, but it requires downloading the file temporarily. Obviously, this won't be suitable for many tasks, but if time and space aren't essential for you, and you just need something to work using files from Hadoop, do something like the following:
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class PathToFileConverter {
    public static File makeFileFromPath(Path some_path, Configuration conf) throws IOException {
        FileSystem fs = FileSystem.get(some_path.toUri(), conf);
        File temp_data_file = File.createTempFile(some_path.getName(), "");
        temp_data_file.deleteOnExit();
        fs.copyToLocalFile(some_path, new Path(temp_data_file.getAbsolutePath()));
        return temp_data_file;
    }
}
If you get a LocalFileSystem
final LocalFileSystem localFileSystem = FileSystem.getLocal(configuration);
You can pass your hadoop Path object to localFileSystem.pathToFile
final File localFile = localFileSystem.pathToFile(<your hadoop Path>);
Not that I'm aware of.
To my understanding, a Path in Hadoop represents an identifier for a node in their distributed filesystem. This is a different abstraction from a java.io.File, which represents a node on the local filesystem. It's unlikely that a Path could even have a File representation that would behave equivalently, because the underlying models are fundamentally different.
Hence the lack of a translation. I presume from your assertion that File objects are "[more] useful" that you want an object of this class in order to use existing library methods? For the reasons above, this isn't going to work very well. If it's your own library, you could rewrite it to work cleanly with Hadoop Paths and then convert any Files into Path objects (this direction works, as Paths are a strict superset of Files). If it's a third-party library, you're out of luck: the authors of that method didn't take into account the effects of a distributed filesystem and only wrote the method to work with plain old local files.
