Parallelization of file and network I/O operations - java

The Questions:
Main Question: What's the best strategy to parallelize these jobs?
Ideas: How to speed up the process using other mechanisms like a second checksum (Adler32?)
The Scenario:
I'm writing a kind of synchronization tool in Java. Basically, it downloads a repository from a web server. The repository represents the file/directory structure on the local machine and defines sources for the needed files in compressed form, combined with hash values to verify the files. A basic thing, I guess.
Requirements:
Multi-platform java desktop application
Best possible speed and parallelization
Example structure: (best described using mods of a game)
Example Repository File
{"name":"subset1", "mods":[
{
"modfolder":"mod1",
"modfiles":[
{
"url":"http://www.example.com/file2.7z",
"localpath":"mod1/file2",
"size":5,
"sizecompressed":3,
"checksum":"46aabad952db3e21e273ce"
},
{
"url":"http://www.example.com/file1.7z",
"localpath":"mod1/file1",
"size":9,
"sizecompressed":4,
"checksum":"862f90bafda118c4d3c5ee6477"
}
]
},
{
"modfolder":"mod2",
"modfiles":[
{
"url":"http://www.example.com/file3.7z",
"localpath":"mod2/file3",
"size":8,
"sizecompressed":4,
"checksum":"cb1e69de0f75a81bbeb465ee0cdd8232"
},
{
"url":"http://www.example.com/file1.7z",
"localpath":"mod2/file1",
"size":9,
"sizecompressed":4,
"checksum":"862f90bafda118c4d3c5ee6477"
}
]
}
]}
Client file structure, as it should be after sync
mod1/
file2
file1
mod2/
file3
file1
// mod1/file1 == mod2/file1 (same checksum)
A special thing about the repository:
The repository received from the server represents only a subset of a bigger repository, because the user only needs a subtree, and that subtree changes over time (subtrees may also overlap).
Sometimes the Repository consists of mod1 and mod2, sometimes mod1 and mod3 and so on.
Work to be done:
Download Repository and parse it (Net I/O)
Mark files not in the repository for deletion at the end of the process (files may be copied because of same checksum) (File I/O)
If the file exists: Check the checksum of the existing file (checksum cache) (File I/O)
If the file does not exist: Check the checksum cache for identical files in other subtrees, to copy the file instead of downloading it (Light file I/O)
Download single file in compressed form (Net I/O)
Extract compressed file (File I/O)
Checksum of uncompressed file (File I/O)
Cache checksum associated with file. (Light file I/O)
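The decision between keeping, copying and downloading a file could be sketched roughly as follows. This is only an illustration: `SyncPlanner`, `Action` and the in-memory `Map` standing in for the checksum cache are hypothetical names, not part of the tool described here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

/** Hypothetical helper: decides what to do for one repository entry. */
final class SyncPlanner {
    enum Action { KEEP, COPY, DOWNLOAD }

    /**
     * @param target           local path the repository entry should end up at
     * @param expectedChecksum checksum listed in the repository file
     * @param checksumByPath   stand-in for the checksum cache: local path -> checksum
     */
    static Action plan(Path target, String expectedChecksum, Map<Path, String> checksumByPath) {
        // Existing file whose cached checksum matches: nothing to do.
        if (Files.exists(target) && expectedChecksum.equals(checksumByPath.get(target))) {
            return Action.KEEP;
        }
        // Another local file with the same checksum: copy it instead of downloading.
        for (Map.Entry<Path, String> entry : checksumByPath.entrySet()) {
            Path candidate = entry.getKey();
            if (expectedChecksum.equals(entry.getValue())
                    && !candidate.equals(target)
                    && Files.exists(candidate)) {
                return Action.COPY;
            }
        }
        // Otherwise the file has to be fetched from the server.
        return Action.DOWNLOAD;
    }
}
```

A real implementation would probably also return which file to copy from; that is left out to keep the sketch short.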
My solution: (many different producers/consumers)
The checksum cache uses MapDB's persistent maps.
At the moment only an MD5 checksum is used.
Queues: Every Workertype has a blocking queue (producer/consumer)
Thread Pools: Every Workertype has a fixed Threadpool e.g. 3 Downloader, 2 Checksum, ...
Workers distribute the current job to other queues: Downloader -> Extract -> Checksum
Workertypes:
Localfile Worker: Checks local file structure (using checksum cache),
redirects work to Download-Worker, Delete-Worker
Copy: Copies a file with same checksum to destination
Download: Downloads a file
Checksum: Checksum a file and inserts in checksumcache
Delete: Delete a file
Extract: Extracts a compressed file
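A minimal sketch of the Downloader -> Extract -> Checksum chain. It swaps in a different mechanism from the one described above: instead of hand-written blocking queues, each stage uses a fixed thread pool whose internal work queue plays the role of the blocking queue, and `CompletableFuture` passes a job from one stage to the next. The class name `SyncPipeline`, the pool sizes and the `String` job type are assumptions made for illustration only.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical three-stage pipeline: one fixed thread pool per worker type. */
final class SyncPipeline implements AutoCloseable {
    // Pool sizes are illustrative; each pool's internal queue buffers waiting jobs.
    private final ExecutorService downloadPool = Executors.newFixedThreadPool(3);
    private final ExecutorService extractPool = Executors.newFixedThreadPool(2);
    private final ExecutorService checksumPool = Executors.newFixedThreadPool(2);

    private final Function<String, String> download;
    private final Function<String, String> extract;
    private final Function<String, String> checksum;

    SyncPipeline(Function<String, String> download,
                 Function<String, String> extract,
                 Function<String, String> checksum) {
        this.download = download;
        this.extract = extract;
        this.checksum = checksum;
    }

    /** Runs one job through all three stages, each stage on its own pool. */
    CompletableFuture<String> submit(String job) {
        return CompletableFuture
                .supplyAsync(() -> download.apply(job), downloadPool)
                .thenApplyAsync(extract, extractPool)
                .thenApplyAsync(checksum, checksumPool);
    }

    /** Lets queued work finish, then releases the worker threads. */
    @Override
    public void close() {
        downloadPool.shutdown();
        extractPool.shutdown();
        checksumPool.shutdown();
    }
}
```

With this shape, the number of concurrent downloads, extractions and checksum runs is bounded by the respective pool size, which is the same knob the fixed thread pools per worker type provide.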

What's the best strategy to parallelize these jobs?
You have I/O. And, probably, if one job is already in progress on one directory, another job cannot be run on the same directory at the same time.
So, you need locking here. Recommendation: use a locking directory on the filesystem, and use directories, not files, to lock. Why? Because directory creation is atomic (first reason), and because Java 6 does not support atomic file creation (second reason). In fact, you may even need two locking directories: one for content download, another for content processing.
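A directory-based lock could look roughly like this. The sketch assumes Java 7 or later (it uses `java.nio.file`), and `DirLock` is a hypothetical name. `Files.createDirectory` checks for existence and creates the directory in one atomic step, so only one caller can succeed.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical lock that uses the existence of a directory as the lock flag. */
final class DirLock {
    /** Tries to take the lock; returns false if the lock directory already exists. */
    static boolean tryAcquire(Path lockDir) throws IOException {
        try {
            Files.createDirectory(lockDir); // atomic check-and-create
            return true;
        } catch (FileAlreadyExistsException e) {
            return false; // somebody else holds the lock
        }
    }

    /** Releases the lock by removing the (empty) lock directory. */
    static void release(Path lockDir) throws IOException {
        Files.delete(lockDir);
    }
}
```

A caller would wrap the protected work in `try { ... } finally { DirLock.release(lockDir); }` after a successful `tryAcquire`. A lock left behind by a crashed process has to be cleaned up by other means.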
The separation of download vs processing you have already done, so I have nothing more to say here ;)
I am not sure why you want to cache checksums however? It doesn't look that useful to me...
Also, I don't know how big the files you have to deal with are, but why bother checking the existing directory contents etc. instead of extracting the new directory and renaming it? I.e.:
extract new directory in newdir;
checksums;
move dstdir to dstdir.old;
move newdir to dstdir;
scrap dstdir.old.
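The rename steps above can be sketched as follows, assuming the new directory has already been extracted and verified and that all directories live on the same filesystem, so a move is a plain rename. `DirSwap` is a hypothetical name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper that replaces dstDir with an already prepared newDir. */
final class DirSwap {
    static void swap(Path newDir, Path dstDir) throws IOException {
        Path oldDir = dstDir.resolveSibling(dstDir.getFileName() + ".old");
        deleteRecursively(oldDir);          // remove leftovers of an earlier run
        if (Files.exists(dstDir)) {
            Files.move(dstDir, oldDir);     // move dstdir to dstdir.old
        }
        Files.move(newDir, dstDir);         // move newdir to dstdir
        deleteRecursively(oldDir);          // scrap dstdir.old
    }

    /** Deletes a directory tree, children before parents; does nothing if it is absent. */
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

The final deletion is the step that could be handed to a separate, lower-priority worker.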
This even means you could parallelize scrapping, but that is too much I/O parallelization... You'll have to limit the number of threads doing actual I/O.
EDIT Here is how I would separate processing:
first of all, no checksums anymore on the archive itself, but there is a file in the archive which contains the MD5 sums of each file (for instance, MD5SUMS);
two blocking queues: download -> replace, replace -> scrapping;
one processor takes care of downloading; when it is done, it fills the download -> replace queue;
another processor picks a task from the download -> replace queue; this task performs, in order, unarchive and checksumming; if both are correct, as mentioned above, it renames the existing directory, renames the extracted directory to the expected directory, and puts a scrapping task on the replace -> scrapping queue;
the third, and last, processor, picks a task from the scrapping queue and performs deletion of the previous archive.
Note that the checksumming, if it is that heavy, could be parallelized.
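If checksumming does turn out to be the bottleneck, a parallel version might look like this sketch. `ParallelChecksum` and the thread count parameter are illustrative assumptions; the MD5 computation itself uses the standard `java.security.MessageDigest`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: computes MD5 sums of several files on a fixed thread pool. */
final class ParallelChecksum {
    /** Streams one file through MD5 and returns the lowercase hex digest. */
    static String md5Hex(Path file) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is required on every Java platform
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Submits one checksum task per file and collects the results in input order. */
    static Map<Path, String> md5All(List<Path> files, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<Path, Future<String>> pending = new LinkedHashMap<>();
            for (Path f : files) {
                pending.put(f, pool.submit(() -> md5Hex(f)));
            }
            Map<Path, String> result = new LinkedHashMap<>();
            for (Map.Entry<Path, Future<String>> e : pending.entrySet()) {
                result.put(e.getKey(), e.getValue().get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

On a single spinning disk, more threads may not help because the reads compete for the same device, so the thread count is worth measuring rather than guessing.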

Related

How to copy multiple files atomically from src to dest in java?

For one requirement, I need to copy multiple files from one location to another network location.
Let's assume that the following files are present in the /src location:
a.pdf, b.pdf, a.doc, b.doc, a.txt and b.txt
I need to copy the a.pdf, a.doc and a.txt files atomically into the /dest location at once.
Currently I am using the java.nio.file.Files class, with code as follows:
Path srcFile1 = Paths.get("/src/a.pdf");
Path destFile1 = Paths.get("/dest/a.pdf");
Path srcFile2 = Paths.get("/src/a.doc");
Path destFile2 = Paths.get("/dest/a.doc");
Path srcFile3 = Paths.get("/src/a.txt");
Path destFile3 = Paths.get("/dest/a.txt");
Files.copy(srcFile1, destFile1);
Files.copy(srcFile2, destFile2);
Files.copy(srcFile3, destFile3);
But with this approach the files are copied one after another.
As an alternative, in order to make the whole process atomic,
I am thinking of zipping all the files, moving the archive to /dest and unzipping it at the destination.
Is this approach correct for making the whole copy process atomic? Has anyone dealt with a similar problem and resolved it?
You can copy the files to a new temporary directory and then rename the directory.
Before renaming your temporary directory, you need to delete the destination directory.
If other files are already in the destination directory that you don't want to overwrite, you can move all files from the temporary directory to the destination directory.
This is not completely atomic, however.
With removing /dest:
// Note: the final rename generally only succeeds if tmpPath is on the same partition as /dest
String tmpPath="/tmp/in/same/partition/as/source";
File tmp=new File(tmpPath);
tmp.mkdirs();
Path srcFile1 = Paths.get("/src/a.pdf");
Path destFile1 = Paths.get(tmpPath+"/a.pdf");
Path srcFile2 = Paths.get("/src/a.doc");
Path destFile2 = Paths.get(tmpPath+"/a.doc");
Path srcFile3 = Paths.get("/src/a.txt");
Path destFile3 = Paths.get(tmpPath+"/a.txt");
// Copy everything into the temporary directory first
Files.copy(srcFile1, destFile1);
Files.copy(srcFile2, destFile2);
Files.copy(srcFile3, destFile3);
delete(new File("/dest"));
// The temporary directory becomes the new /dest
if (!tmp.renameTo(new File("/dest")))
    throw new IOException("Failed to rename " + tmp + " to /dest");
// Recursively deletes a file or directory
void delete(File f) throws IOException {
    if (f.isDirectory()) {
        for (File c : f.listFiles())
            delete(c);
    }
    if (!f.delete())
        throw new FileNotFoundException("Failed to delete file: " + f);
}
With just overwriting the files:
String tmpPath="/tmp/in/same/partition/as/source";
File tmp=new File(tmpPath);
tmp.mkdirs();
Path srcFile1 = Paths.get("/src/a.pdf");
Path destFile1 = Paths.get("/dest/a.pdf");
Path tmp1 = Paths.get(tmpPath+"/a.pdf");
Path srcFile2 = Paths.get("/src/a.doc");
Path destFile2 = Paths.get("/dest/a.doc");
Path tmp2 = Paths.get(tmpPath+"/a.doc");
Path srcFile3 = Paths.get("/src/a.txt");
Path destFile3 = Paths.get("/dest/a.txt");
Path tmp3 = Paths.get(tmpPath+"/a.txt");
Files.copy(srcFile1, tmp1);
Files.copy(srcFile2, tmp2);
Files.copy(srcFile3, tmp3);
//Start of non-atomic section (it can be done again if necessary)
Files.deleteIfExists(destFile1);
Files.deleteIfExists(destFile2);
Files.deleteIfExists(destFile3);
Files.move(tmp1,destFile1);
Files.move(tmp2,destFile2);
Files.move(tmp3,destFile3);
//end of non-atomic section
Even if the second method contains a non-atomic section, the copy process itself uses a temporary directory so that the files are not overwritten.
If the process aborts during moving the files, it can easily be completed.
See https://stackoverflow.com/a/4645271/10871900 as reference for moving files and https://stackoverflow.com/a/779529/10871900 for recursively deleting directories.
First there are several possibilities to copy a file or a directory. Baeldung gives a very nice insight into different possibilities. Additionally you can also use the FileCopyUtils from Spring. Unfortunately, all these methods are not atomic.
I found an older post and adapted it a little bit. You can try using the low-level transaction management support. That means you make a transaction out of the method and define what should be done in a rollback. There is also a nice article from Baeldung.
@Autowired
private PlatformTransactionManager transactionManager;
@Transactional(rollbackOn = IOException.class)
public void copy(List<File> files) throws IOException {
    TransactionDefinition transactionDefinition = new DefaultTransactionDefinition();
    TransactionStatus transactionStatus = transactionManager.getTransaction(transactionDefinition);
    TransactionSynchronizationManager.registerSynchronization(new TransactionSynchronization() {
        @Override
        public void afterCompletion(int status) {
            if (status == STATUS_ROLLED_BACK) {
                // try to delete created files
            }
        }
    });
    try {
        // copy files
        transactionManager.commit(transactionStatus);
    } catch (Exception e) {
        // roll back only on failure; rolling back an already committed transaction would throw
        transactionManager.rollback(transactionStatus);
        throw e;
    }
}
Or you can use a simple try-catch-block. If an exception is thrown you can delete the created files.
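The try-catch variant could be sketched like this. `CopyWithCleanup` is a hypothetical name, and the sketch assumes the destination files must not exist beforehand. It gives best-effort cleanup, not real atomicity: other processes can still observe the intermediate state.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: copies files and removes the copies again if any copy fails. */
final class CopyWithCleanup {
    static void copyAll(List<Path> sources, Path destDir) throws IOException {
        List<Path> created = new ArrayList<>();
        try {
            for (Path src : sources) {
                Path dst = destDir.resolve(src.getFileName());
                Files.copy(src, dst);   // throws if src is missing or dst already exists
                created.add(dst);
            }
        } catch (IOException e) {
            // Undo the copies that succeeded. A partially written file left by the
            // failing copy itself is not tracked in this sketch.
            for (Path p : created) {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException cleanupFailure) {
                    e.addSuppressed(cleanupFailure);
                }
            }
            throw e;
        }
    }
}
```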
Your question lacks the goal of atomicity. Even unzipping is never atomic, the VM might crash with OutOfMemoryError right in between inflating the blocks of the second file. So there's one file complete, a second not and a third entirely missing.
The only thing I can think of is a two phase commit, like all the suggestions with a temporary destination that suddenly becomes the real target. This way you can be sure, that the second operation either never occurs or creates the final state.
Another approach would be to write a sort of cheap checksum file in the target afterwards. This would make it easy for an external process to listen for creation of such files and verify their content with the files found.
The latter would be the same as offering the container/ZIP/archive right away instead of piling files in a directory. Most archive formats have or support integrity checks.
(Operating systems and file systems also differ in behaviour if directories or folders disappear while being written. Some accept it and write all data to a recoverable buffer. Others still accept writes but don't change anything. Others fail immediately upon first write since the target block on the device is unknown.)
FOR ATOMIC WRITE:
Standard filesystems have no concept of atomicity across several operations, so you need to arrange things so that the final switch is a single action, because only a single action can be atomic.
Therefore, for writing more files in an atomic way, you need to create a folder with, let's say, the timestamp in its name, and copy files into this folder.
Then, you can either rename it to the final destination or create a symbolic link.
You can use anything similar to this, like file-based volumes on Linux, etc.
Remember that deleting the existing symbolic link and creating a new one will never be atomic, so you would need to handle the situation in your code and switch to the renamed/linked folder once it's available instead of removing/creating a link. However, under normal circumstances, removing and creating a new link is a really fast operation.
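One way to avoid the delete-then-create gap on POSIX-like systems is to create the new link under a temporary name and rename it over the existing link. The sketch below assumes a filesystem with symbolic link support and an operating system where an atomic move replaces an existing target (the Java documentation leaves that implementation specific). `SymlinkSwitch` and the `.tmp` suffix are illustrative choices.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: repoints a symbolic link without a window where it is missing. */
final class SymlinkSwitch {
    static void pointTo(Path link, Path target) throws IOException {
        Path tmp = link.resolveSibling(link.getFileName() + ".tmp");
        Files.deleteIfExists(tmp);               // leftover from an interrupted run
        Files.createSymbolicLink(tmp, target);   // new link under a temporary name
        // The rename moves the link itself, not the directory it points to.
        Files.move(tmp, link, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

For example, `SymlinkSwitch.pointTo(Paths.get("/dest"), Paths.get("/data/20240101"))` would make `/dest` resolve to the new timestamped folder; both paths are made up for the example.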
FOR ATOMIC READ:
Well, the problem is not in the code, but at the operating system/filesystem level.
Some time ago, I got into a very similar situation. There was a database engine running and changing several files "at once". I needed to copy the current state, but the second file was already changed before the first one was copied.
There are two different options:
Use a filesystem with support for snapshots. At some moment, you create a snapshot and then copy files from it.
You can lock the filesystem (on Linux) using fsfreeze --freeze, and unlock it later with fsfreeze --unfreeze. When the filesystem is frozen, you can read the files as usual, but no process can change them.
None of these options worked for me as I couldn't change the filesystem type, and locking the filesystem wasn't possible (it was root filesystem).
I created an empty file, mounted it as a loop filesystem, and formatted it. From that moment on, I could fsfreeze just my virtual volume without touching the root filesystem.
My script first called fsfreeze --freeze /my/volume, then performed the copy action, and then called fsfreeze --unfreeze /my/volume. For the duration of the copy action, the files couldn't be changed, and so the copied files were all exactly from the same moment in time - for my purpose, it was like an atomic operation.
Btw, be sure not to fsfreeze your root filesystem :-). I did, and a restart was the only solution.
DATABASE-LIKE APPROACH:
Even databases cannot rely on atomic operations, and so they first write the change to WAL (write-ahead log) and flush it to the storage. Once it's flushed, they can apply the change to the data file.
If there is any problem/crash, the database engine first loads the data file, checks whether there are any unapplied transactions in the WAL, and applies them if so.
This is also called journaling, and it's used by some filesystems (ext3, ext4).
I hope this solution is useful. As I understand it, you need to copy the files from one directory to another directory.
My solution is as follows:
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class CopyFilesDirectoryProgram {
    public static void main(String[] args) throws IOException {
        String sourcedirectoryName="//mention your source path";
        String targetdirectoryName="//mention your destination path";
        File sdir=new File(sourcedirectoryName);
        File tdir=new File(targetdirectoryName);
        //call the method for execution
        abc(sdir,tdir);
    }
    // Copies a single file, or recurses into a directory
    private static void abc(File sdir, File tdir) throws IOException {
        if(sdir.isDirectory()) {
            copyFilesfromDirectory(sdir,tdir);
        } else {
            Files.copy(sdir.toPath(), tdir.toPath());
        }
    }
    private static void copyFilesfromDirectory(File source, File target) throws IOException {
        // Create the target directory if needed, then copy every entry
        if(!target.exists()) {
            target.mkdir();
        }
        for(String items:source.list()) {
            abc(new File(source,items),new File(target,items));
        }
    }
}

Synchronize files processing across cluster

I run a cluster containing 2 or more instances of the same microservice.
Each of them accesses files on a shared data share, which is mounted as a local folder on both servers running the microservices. Each file may be processed only once (in the entire cluster).
I want to have those files processed in parallel by the nodes, so that no file is processed
more than once in the entire cluster.
I'm looking for ideas on how to solve this.
I already thought about one node reading the files and putting their filenames into a queue, so that the nodes can read them from the queue.
I also thought about synchronizing via a database, where each node that tries to process a file uses the database to synchronize with the other nodes.
Any idea how to solve this in a good manner?
Something like this might work:
String pathToFile = "/tmp/foo.txt";
try {
Files.createFile(FileSystems.getDefault().getPath(pathToFile + ".claimed"));
processFile(pathToFile);
} catch (FileAlreadyExistsException e) {
// some other app has already claimed "filename"
}
and you'll need these imports:
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
The idea is that each app instance agrees to work on any given file only if it is first able to create a ".claimed" file in the same shared filesystem. This works because of the behavior of Files.createFile:
Creates a new and empty file, failing if the file already exists. The check for the existence of the file and the creation of the new file if it does not exist are a single operation that is atomic with respect to all other filesystem activities that might affect the directory.
(from this Javadoc:
https://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#createFile(java.nio.file.Path,%20java.nio.file.attribute.FileAttribute...) )

Upload retry mechanism using JSch library

I have a file to upload (say abc.pdf). The very first time, I want to upload this file as a temp file (say abc.pdf.temp). Then, if the file is successfully (fully) transferred, I need to rename it to its original name (abc.pdf). But if the file is not fully transferred, I need to delete the temp file that I uploaded initially, since I don't want to keep a corrupted file on the server. Is this achievable using the JSch library? Below is the sample code. Does this code make sense for achieving this?
Sample Code:
originalFile = 'abc.pdf';
tempFile = 'abc.pdf.temp';
fileInputStream = createobject("java", "java.io.FileInputStream").init('C:\abc.pdf');
SftpChannel.put(fileInputStream,tempFile);
// Comparing remote file size with local file
if(SftpChannel.lstat(tempFile).getSize() NEQ localFileSize){
// Allow to Resume the file transfer since the file size is different
SftpChannel.put(fileInputStream,tempFile,SftpChannel.RESUME);
if(SftpChannel.lstat(tempFile).getSize() NEQ localFileSize){
// Check again if the file is not fully transferred (During RESUME) then
// deleting the file since dont want to keep a corrupted file in the server.
SftpChannel.rm(tempFile);
}
}else{//assuming file is fully transferred
SftpChannel.rename(tempFile ,originalFile);
}
It's very unlikely that after the put finishes without throwing, the file size won't match. It can hardly happen. Even if it happens, it makes little sense to call RESUME. If something catastrophic goes wrong that is not detected by put, RESUME is not likely to help.
And even if you want to try with RESUME, it does not make sense to try once. If you believe it makes sense to retry, you have to keep retrying until you succeed, not only once.
You should catch the exception and resume/delete/whatever. That's the primary recovery mechanism. A thrown exception is 100x more likely to happen than the size mismatch described in the first point.
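An exception-driven version of the upload could be sketched as below. To keep the example self-contained it does not use JSch directly: `Remote` is a hypothetical interface whose three methods mirror the `put`, `rename` and `rm` calls of JSch's `ChannelSftp`, and the retry count is an arbitrary parameter. Each retry simply restarts the upload instead of resuming it.

```java
/** Hypothetical helper: upload under a temporary name, rename on success, clean up on failure. */
final class TempUpload {
    /** Hypothetical minimal view of an SFTP channel. */
    interface Remote {
        void put(String localPath, String remotePath) throws Exception;
        void rename(String from, String to) throws Exception;
        void rm(String path) throws Exception;
    }

    /** Returns true once the file is available under its final name. */
    static boolean upload(Remote remote, String localPath, String finalName, int maxAttempts) {
        String tempName = finalName + ".temp";
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                remote.put(localPath, tempName);    // throws if the transfer fails
                remote.rename(tempName, finalName); // publish only a complete file
                return true;
            } catch (Exception e) {
                // The exception is the failure signal; try again from scratch.
            }
        }
        try {
            remote.rm(tempName); // do not leave a partial file on the server
        } catch (Exception ignored) {
            // the temp file may not exist if no byte was ever written
        }
        return false;
    }
}
```

With JSch, the `Remote` implementation would delegate each method to the `ChannelSftp` method of the same name.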

Multiple directories as Input format in hadoop map reduce

I am trying to run a graph verifier app in distributed system using hadoop.
I have the input in the following format:
Directory1
---file1.dot
---file2.dot
…..
---filen.dot
Directory2
---file1.dot
---file2.dot
…..
---filen.dot
Directory670
---file1.dot
---file2.dot
…..
---filen.dot
.dot files are files storing the graphs.
Is it enough for me to add the input directories path using FileInputFormat.addInputPath()?
I want Hadoop to process the contents of each directory on the same node, because the files in each directory contain data that depends on the presence of other files in the same directory.
Will the hadoop framework take care of distributing the directories equally to various nodes of the cluster(e.g. directory 1 to node1 , directory 2 to node2....so on) and process in parallel?
The files in each directory depend on each other for data. To be precise, each directory contains a file, main.dot, which holds an acyclic graph whose vertices are the names of the rest of the files.
My verifier traverses each vertex of the graph in main.dot, searches for the file of the same name in the same directory and, if found, processes the data in that file.
All the files are processed this way, and the combined output after processing each file in the directory is displayed.
The same procedure applies to the rest of the directories.
Cutting a long story short:
As in the famous word count application (if the input is a single book), Hadoop will split the input and distribute the task to each node in the cluster, where the mapper processes each line and counts the relevant words.
How can I split the task here (do I need to split it at all)?
How can I leverage Hadoop's power for this scenario? A sample code template would certainly help :)
The solution given by Alexey Shestakov will work. But it does not leverage MapReduce's distributed processing framework. Probably only one map process will read the file (the file containing the paths of all input files) and then process the input data.
How can we allocate all the files in a directory to a mapper, so that the number of mappers equals the number of directories?
One solution could be to use the "org.apache.hadoop.mapred.lib.MultipleInputs" class.
Use MultipleInputs.addInputPath() to add the directories and a map class for each directory path. Now each mapper can get one directory and process all files within it.
You can create a file with list of all directories to process:
/path/to/directory1
/path/to/directory2
/path/to/directory3
Each mapper would process one directory, for example:
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
FileSystem fs = FileSystem.get(context.getConfiguration());
for (FileStatus status : fs.listStatus(new Path(value.toString()))) {
// process file
}
}
Will the hadoop framework take care of distributing the directories equally to various nodes of the cluster(e.g. directory 1 to node1 , directory 2 to node2....so on) and process in parallel?
No, it won't. Files are not distributed to each node in the sense that the files are copied to the node to be processed. Instead, to put it simply, each node is given a set of file paths to process with no guarantee on location or data locality. The datanode then pulls that file from HDFS and processes it.
There's no reason why you can't just open other files you may need directly from HDFS.

Copy task does not put copied files into its TaskOutputs - why?

With one file build.gradle in directory in and the following tasks:
task cpy(type: Copy) {
from 'in'
into 'out'
}
task testIn << {
println cpy.inputs.files.files
}
task testOut << {
println cpy.outputs.files.files
}
Why does gradle testOut only print:
[...\out]
when gradle testIn prints:
[...\in\build.gradle]
Clearly there's an inconsistency here. The task input specifies the exact files that it has copied, but the output only specifies the directory to which it has copied the files, not the files themselves. Is this on purpose?
I can think of numerous cases where knowing the final paths of the copied files is useful. One would be when undoing a copy operation; without the actual file paths after the copy, one has to construct them manually by traversing the input files and appending their names to the output path. And what about Gradle's "up-to-date" functionality? If cpy.outputs is the whole directory, even though it only copied one file, then the snapshot taken by Gradle covers way more than it should.
The outputs of the Copy task are currently defined as a single output directory. It's a known limitation.
