Multiple directories as Input format in hadoop map reduce

I am trying to run a graph verifier app on a distributed system using Hadoop.
I have the input in the following format:
Directory1
---file1.dot
---file2.dot
…..
---filen.dot
Directory2
---file1.dot
---file2.dot
…..
---filen.dot
Directory670
---file1.dot
---file2.dot
…..
---filen.dot
The .dot files store the graphs.
Is it enough to add the input directory paths using FileInputFormat.addInputPath()?
I want Hadoop to process the contents of each directory on the same node, because the files in each directory contain data that depends on the other files in that directory.
Will the Hadoop framework take care of distributing the directories evenly across the nodes of the cluster (e.g. directory 1 to node 1, directory 2 to node 2, and so on) and process them in parallel?
The files in each directory depend on each other for data. To be precise, each directory contains a file, main.dot, which holds an acyclic graph whose vertices are the names of the other files in that directory.
My verifier traverses each vertex of the graph in main.dot, searches for a file of the same name in the same directory and, if found, processes the data in that file.
All the files are processed this way, and the combined output for the directory is displayed.
The same procedure applies to the rest of the directories.
To cut a long story short:
In the famous word count application (if the input is a single book), Hadoop splits the input and distributes the tasks to the nodes in the cluster, where each mapper processes its lines and counts the relevant words.
How can I split the task here (do I need to split it at all)?
How can I leverage Hadoop for this scenario? A sample code template would help for sure :)

The solution given by Alexey Shestakov will work, but it does not leverage MapReduce's distributed processing framework. Probably only one map process will read the file (the file containing the paths of all input directories) and then process the input data.
How can we allocate all the files in a directory to one mapper, so that the number of mappers equals the number of directories?
One solution could be the "org.apache.hadoop.mapred.lib.MultipleInputs" class.
Use MultipleInputs.addInputPath() to add each directory together with the map class for that directory path, so that each directory is handled by its own mapper class. Note that this only binds a mapper class to each path; how the files under a path are split among map tasks is still decided by the input format.

You can create a file with list of all directories to process:
/path/to/directory1
/path/to/directory2
/path/to/directory3
Each mapper would process one directory, for example:
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // each input line (value) is the path of one directory to process
    FileSystem fs = FileSystem.get(context.getConfiguration());
    for (FileStatus status : fs.listStatus(new Path(value.toString()))) {
        // process file
    }
}

Will the Hadoop framework take care of distributing the directories evenly across the nodes of the cluster (e.g. directory 1 to node 1, directory 2 to node 2, and so on) and process them in parallel?
No, it won't. Files are not distributed to nodes in the sense of being copied to the node that will process them. Instead, to put it simply, each node is given a set of file paths to process, with no guarantee on location or data locality. The node running the task then pulls that file from HDFS and processes it.
There's no reason why you can't just open other files you may need directly from HDFS.

Related

Synchronize files processing across cluster

I run a cluster containing two or more instances of the same microservice.
Each of them accesses files on a shared data share, which is mounted as a local folder on both servers running the microservices. Each file may be processed only once (in the entire cluster).
I want those files to be processed in parallel by the nodes, so that no file is processed more than once in the entire cluster.
I am looking for ideas on how to solve this.
I have already thought about one node reading the files and putting their filenames into a queue, so that the other nodes can read them from the queue.
I have also thought about synchronizing via a database, where each node that tries to process a file uses the database to synchronize with the other nodes.
Any idea how to solve this in a good manner?

something like this might work:
String pathToFile = "/tmp/foo.txt";
try {
Files.createFile(FileSystems.getDefault().getPath(pathToFile + ".claimed"));
processFile(pathToFile);
} catch (FileAlreadyExistsException e) {
// some other app has already claimed "filename"
}
and you'll need these imports:
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
The idea is that each app instance agrees to work on any given file only if it is first able to create a ".claimed" file in the same shared filesystem. This works because of behavior of Files.createFile:
Creates a new and empty file, failing if the file already exists. The check for the existence of the file and the creation of the new file if it does not exist are a single operation that is atomic with respect to all other filesystem activities that might affect the directory.
(from this Javadoc:
https://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#createFile(java.nio.file.Path,%20java.nio.file.attribute.FileAttribute...) )
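To see how the claim idea extends from one file to a whole shared folder, here is a minimal sketch. The class name ClaimingWorker, the method names and the ".claimed" suffix handling are hypothetical, and it assumes the shared mount actually honors the atomic-create guarantee quoted above. Each instance walks the folder and only processes the files it manages to claim.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

// Hypothetical helper: every service instance runs processUnclaimed() on the shared folder.
final class ClaimingWorker {
    static final String CLAIM_SUFFIX = ".claimed";

    /** Returns true only for the single caller that manages to create the marker file. */
    static boolean tryClaim(Path file) throws IOException {
        Path marker = file.resolveSibling(file.getFileName() + CLAIM_SUFFIX);
        try {
            Files.createFile(marker); // atomic check-and-create
            return true;
        } catch (FileAlreadyExistsException e) {
            return false; // another instance got there first
        }
    }

    /** Processes every file in dir that this caller can claim; returns how many it handled. */
    static int processUnclaimed(Path dir, Consumer<Path> processor) throws IOException {
        int handled = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                // skip the marker files themselves and anything that is not a regular file
                if (file.getFileName().toString().endsWith(CLAIM_SUFFIX)) continue;
                if (!Files.isRegularFile(file)) continue;
                if (tryClaim(file)) {
                    processor.accept(file);
                    handled++;
                }
            }
        }
        return handled;
    }
}
```

One thing this sketch does not handle: if an instance crashes after claiming a file, the marker stays behind and the file is never processed, so some cleanup or timeout policy would be needed on top.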

Reading a file from tar.gz archive in Spark

I have a bunch of tar.gz files which I would like to process with Spark without decompressing them.
A single archive is about 700MB and contains 10 different files, but I'm interested in only one of them (which is ~7GB after decompression).
I know that context.textFile supports tar.gz, but I'm not sure whether it is the right tool when an archive contains more than one file. What happens is that Spark returns the content of all files in the archive (line by line), including file names with some binary data.
Is there any way to select which file from the tar.gz I would like to map?

As far as I know, the sc.binaryFiles method is the way to go; please see the doc excerpt below. It gives you both the file name and the file content, so you can map over them, pick the file you want and process that.
public RDD<scala.Tuple2<String,PortableDataStream>> binaryFiles(String path, int minPartitions)
Get an RDD for a Hadoop-readable dataset as PortableDataStream for each file (useful for binary data)
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn
Do val rdd = sparkContext.binaryFiles("hdfs://a-hdfs-path"),
then rdd contains
(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)
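Once you have the archive as a stream (for instance from PortableDataStream.open()), you still have to pick the one entry you want. The JDK has no tar reader, so in practice a library such as Apache Commons Compress (TarArchiveInputStream) is the usual choice. Just to show the idea with the JDK only, here is a rough sketch; the class name TarGzEntryCopier is hypothetical, and it assumes plain tar headers with entry names under 100 characters and sizes stored in octal (no long-name or base-256 extensions).

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: streams one named entry out of a tar.gz without unpacking the rest.
final class TarGzEntryCopier {
    private static final int BLOCK = 512; // tar works in 512-byte blocks

    /** Copies the first entry called entryName to out; returns false if it is not found. */
    static boolean copyEntry(InputStream tarGz, String entryName, OutputStream out) throws IOException {
        DataInputStream in = new DataInputStream(new GZIPInputStream(tarGz));
        byte[] header = new byte[BLOCK];
        while (true) {
            try {
                in.readFully(header);
            } catch (EOFException e) {
                return false; // ran out of archive
            }
            if (header[0] == 0) {
                return false; // empty name: zero-filled end-of-archive block
            }
            String name = text(header, 0, 100);
            long size = Long.parseLong(text(header, 124, 12).trim(), 8); // octal size field
            long padding = (BLOCK - size % BLOCK) % BLOCK;
            if (name.equals(entryName)) {
                transfer(in, out, size);
                return true;
            }
            transfer(in, null, size + padding); // skip an entry we do not want
        }
    }

    // Reads a NUL-terminated ASCII field from the header.
    private static String text(byte[] buf, int off, int len) {
        int end = off;
        while (end < off + len && buf[end] != 0) end++;
        return new String(buf, off, end - off, StandardCharsets.US_ASCII);
    }

    // Reads exactly `remaining` bytes; writes them to out unless out is null.
    private static void transfer(InputStream in, OutputStream out, long remaining) throws IOException {
        byte[] buf = new byte[8192];
        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) throw new EOFException("truncated archive");
            if (out != null) out.write(buf, 0, n);
            remaining -= n;
        }
    }
}
```

Keep in mind that gzip is not splittable, so each archive is read sequentially by a single task; the parallelism comes from having many archives.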

Merging multiple LZO compressed files on HDFS

Let's say I have this structure on HDFS:
/dir1
/dir2
/Name1_2015/
file1.lzo
file2.lzo
file3.lzo
/Name2_2015
file1.lzo
file2.lzo
Name1_2015.lzo
I would like to merge each file of each directory in 'dir2' and append the result to the file in /dir1/DirName.lzo
For example, for /dir1/dir2/Name1_2015, I want to merge file1.lzo, file2.lzo, file3.lzo and append it to /dir1/Name1_2015.lzo
Each file is LZO-compressed.
How can I do it?
Thanks

If you don't care much about parallelism here's a bash one-liner:
for d in `hdfs dfs -ls /dir2 | grep -oP '(?<=/)[^/]+$'` ; do hdfs dfs -cat /dir2/$d/*.lzo | lzop -d | lzop | hdfs dfs -put - /dir1/$d.lzo ; done
You can extract all the files in parallel using MapReduce. But how do you create one archive from multiple files in parallel? As far as I know, it is not possible to write to a single HDFS file from multiple processes concurrently. Since that is not possible, we end up with a single-node solution anyway.

I would do this with Hive, as follows:
Rename the subdirectories to name=1_2015 and name=2_2015
CREATE EXTERNAL TABLE sending_table
(
all_content string
)
PARTITIONED BY (name string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY {a column delimiter that you know doesn't show up in any of the lines}
LOCATION "/dir1/dir2";
Make a second table that looks like the first, named "receiving", but with no partitions, and in a different directory.
Run this:
-- a single reducer guarantees it'll make one file
SET mapreduce.job.reduces=1;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
insert into table receiving
select all_content from sending_table;

You can try to archive all the individual LZO files into a HAR (Hadoop Archive). I think it's unnecessary overhead to merge all the files into a single LZO file.

Parallelization of file and network I/O operations

The Questions:
Main Question: What's the best strategy to parallel these jobs?
Ideas: How to speed up the process using other mechanisms like a second checksum (Adler32?)
The Scenario:
I'm writing a kind of synchronization tool in Java. Basically, it downloads a repository from a web server. The repository represents the file/directory structure on the local machine and defines sources for the needed files in compressed form, combined with hash values to verify the files. A basic thing, I guess.
Requirements:
Multi-platform java desktop application
Best possible speed and parallelization
Example structure: (best described using mods of a game)
Example Repository File
{"name":"subset1", "mods":[
{
"modfolder":"mod1",
"modfiles":[
{
"url":"http://www.example.com/file2.7z",
"localpath":"mod1/file2",
"size":5,
"sizecompressed":3,
"checksum":"46aabad952db3e21e273ce"
},
{
"url":"http://www.example.com/file1.7z",
"localpath":"mod1/file1",
"size":9,
"sizecompressed":4,
"checksum":"862f90bafda118c4d3c5ee6477"
}
]
},
{
"modfolder":"mod2",
"modfiles":[
{
"url":"http://www.example.com/file3.7z",
"localpath":"mod2/file3",
"size":8,
"sizecompressed":4,
"checksum":"cb1e69de0f75a81bbeb465ee0cdd8232"
},
{
"url":"http://www.example.com/file1.7z",
"localpath":"mod2/file1",
"size":9,
"sizecompressed":4,
"checksum":"862f90bafda118c4d3c5ee6477"
}
]
}
]}
Client file structure, as it should be after sync
mod1/
file2
file1
mod2/
file3
file1
// mod1/file1 == mod2/file1
A special thing about the repository:
The repository received from the server represents only a subset of a bigger repository, because the user only needs a subtree, and that subtree changes (and may overlap with others).
Sometimes the Repository consists of mod1 and mod2, sometimes mod1 and mod3 and so on.
Work to be done:
Download Repository and parse it (Net I/O)
Mark files not in the repository for deletion at the end of the process (files may be copied because of same checksum) (File I/O)
If the file exists: Check the checksum of the existing file (checksum cache) (File I/O)
If the file does not exist: Check the checksum cache for identical files in other subtrees, to copy the file instead of downloading it (Light file I/O)
Download single file in compressed form (Net I/O)
Extract compressed file (File I/O)
Checksum of uncompressed file (File I/O)
Cache checksum associated with file. (Light file I/O)
My solution: (many different producers/consumers)
The Checksum cache is using MapDBs persistent maps.
ATM only md5 checksum is used.
Queues: Every Workertype has a blocking queue (producer/consumer)
Thread Pools: Every Workertype has a fixed Threadpool e.g. 3 Downloader, 2 Checksum, ...
Workers distribute the current job to other queues: Downloader -> Extract -> Checksum
Workertypes:
Localfile Worker: Checks local file structure (using checksum cache),
redirects work to Download-Worker, Delete-Worker
Copy: Copies a file with same checksum to destination
Download: Downloads a file
Checksum: Checksum a file and inserts in checksumcache
Delete: Delete a file
Extract: Extracts a compressed file
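To make the "second checksum" idea concrete, here is a minimal sketch of what a Checksum worker could compute. The class name FileChecksums is hypothetical, and it assumes the repository would also publish an Adler32 value per file, which the example repository above does not do yet. The file is read only once and every buffer is fed to both MD5 and Adler32.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.Adler32;

// Hypothetical value class holding both checksums of one file.
final class FileChecksums {
    final String md5Hex;
    final long adler32;

    private FileChecksums(String md5Hex, long adler32) {
        this.md5Hex = md5Hex;
        this.adler32 = adler32;
    }

    /** Reads the file once and feeds every buffer to both checksums. */
    static FileChecksums of(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        Adler32 adler = new Adler32();
        byte[] buf = new byte[64 * 1024];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md5.update(buf, 0, n);
                adler.update(buf, 0, n);
            }
        }
        // render the MD5 digest as lowercase hex, like the "checksum" values in the repository
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return new FileChecksums(hex.toString(), adler.getValue());
    }
}
```

Adler32 is cheap but weak, so it would only work as a quick pre-filter: a differing Adler32 proves the file changed, while a matching one still needs the MD5 comparison.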

What's the best strategy to parallel these jobs?
You have I/O. And, probably, if one job is already in progress on one directory, another job cannot be run on the same directory at the same time.
So, you need locking here. Recommendation: use a locking directory on the filesystem, and use directories, not files, to lock. Why? Because directory creation is atomic (first reason), and because Java 6 does not support atomic file creation (second reason). In fact, you may even need two locking directories: one for content download, another for content processing.
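A minimal sketch of such a directory lock, using java.nio.file (so Java 7 or later; on Java 6, File.mkdir() returning true would play the same role). The class name DirectoryLock and the lock directory naming are hypothetical.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical lock: whoever creates the directory holds the lock until close().
final class DirectoryLock implements AutoCloseable {
    private final Path lockDir;

    private DirectoryLock(Path lockDir) {
        this.lockDir = lockDir;
    }

    /** Returns a held lock, or null if someone else already holds it. */
    static DirectoryLock tryAcquire(Path lockDir) throws IOException {
        try {
            Files.createDirectory(lockDir); // fails if the directory already exists
            return new DirectoryLock(lockDir);
        } catch (FileAlreadyExistsException e) {
            return null;
        }
    }

    /** Releases the lock by removing the directory. */
    @Override
    public void close() throws IOException {
        Files.delete(lockDir);
    }
}
```

With two locking directories per mod folder (say, one for download and one for processing), a worker would call tryAcquire first and simply move on to another job when it gets null.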
The separation of download vs processing you have already done, so I have nothing more to say here ;)
I am not sure why you want to cache checksums however? It doesn't look that useful to me...
Also, I don't know how big the files you have to deal with are, but why bother with checking the existing directory contents etc vs extract the new directory and rename? Ie:
extract new directory in newdir;
checksums;
move dstdir to dstdir.old;
move newdir to dstdir;
scrap dstdir.old.
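The rename steps above could look roughly like this. It is only a sketch: the class name DirectorySwap is hypothetical, the extraction and checksum steps are left out, and it assumes newdir and dstdir live on the same filesystem so that moving a non-empty directory is a plain rename.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper for the "move old aside, move new in, scrap old" sequence.
final class DirectorySwap {
    /** Replaces dstDir with newDir, then deletes the previous contents. */
    static void replace(Path newDir, Path dstDir) throws IOException {
        Path oldDir = dstDir.resolveSibling(dstDir.getFileName() + ".old");
        boolean hadOld = Files.exists(dstDir);
        if (hadOld) {
            Files.move(dstDir, oldDir); // move dstdir to dstdir.old
        }
        Files.move(newDir, dstDir);     // move newdir to dstdir
        if (hadOld) {
            deleteRecursively(oldDir);  // scrap dstdir.old
        }
    }

    /** Deletes a directory tree bottom-up: files first, then the emptied directories. */
    static void deleteRecursively(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException e) throws IOException {
                if (e != null) throw e;
                Files.delete(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```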
This even means you could parallelize scrapping, but that is too much I/O parallelization... You'll have to limit the number of threads doing actual I/O.
EDIT Here is how I would separate processing:
first of all, no checksums anymore on the archive itself, but there is a file in the archive which contains the MD5 sums of each file (for instance, MD5SUMS);
two blocking queues: download -> replace, replace -> scrapping;
one processor takes care of downloading; when it is done, it fills the download -> replace queue;
another processor picks a task from the download -> replace queue; this task performs, in order, unarchive and checksumming; if both are correct, as mentioned above, it renames the existing directory, renames the extracted directory to the expected directory, and puts a scrapping task on the replace -> scrapping queue;
the third, and last, processor, picks a task from the scrapping queue and performs deletion of the previous archive.
Note that the checksumming, if it is that heavy, could be parallelized.
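As a rough illustration of the three processors and two queues described above, here is a minimal sketch. The class name SyncPipeline is hypothetical, the real download, unarchive and delete work is replaced by log entries, and a sentinel value (a "poison pill", my addition) tells each stage that no more work is coming.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical pipeline: downloader -> replacer -> scrapper, one thread each.
final class SyncPipeline {
    // Sentinel compared by identity, so it cannot clash with a real mod name.
    private static final String DONE = new String("DONE");

    /** Runs all mods through the three stages and returns a log of what happened. */
    static List<String> run(List<String> mods) throws InterruptedException {
        BlockingQueue<String> toReplace = new LinkedBlockingQueue<>(); // download -> replace
        BlockingQueue<String> toScrap = new LinkedBlockingQueue<>();   // replace -> scrapping
        List<String> log = Collections.synchronizedList(new ArrayList<>());

        Thread downloader = new Thread(() -> {
            try {
                for (String mod : mods) {
                    log.add("downloaded " + mod); // placeholder for the real download
                    toReplace.put(mod);
                }
                toReplace.put(DONE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread replacer = new Thread(() -> stage(toReplace, toScrap, "replaced ", log));
        Thread scrapper = new Thread(() -> stage(toScrap, null, "scrapped ", log));

        downloader.start();
        replacer.start();
        scrapper.start();
        downloader.join();
        replacer.join();
        scrapper.join();
        return log;
    }

    // Takes tasks from in until the sentinel arrives, passing each one on to out (if any).
    private static void stage(BlockingQueue<String> in, BlockingQueue<String> out,
                              String verb, List<String> log) {
        try {
            for (String mod = in.take(); mod != DONE; mod = in.take()) {
                log.add(verb + mod); // placeholder for unarchive+checksum+rename, or deletion
                if (out != null) out.put(mod);
            }
            if (out != null) out.put(DONE); // tell the next stage we are finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

If checksumming turns out to be the bottleneck, the replacer stage is the one to back with a small thread pool, while downloading and scrapping stay single-threaded to keep I/O contention down.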

How to put a serialized object into the Hadoop DFS and get it back inside the map function?

I'm new to Hadoop, and recently I was asked to do a test project using Hadoop.
While I was reading about Big Data, I came across Pail. What I want to do is something like this: first create a simple object, then serialize it using Thrift and put it into HDFS using Pail. Then I want to get that object inside the map function and do whatever I want with it. But I have no idea how to get that object inside the map function.
Can someone please point me to any references or explain how to do that?

I can think of three options:
Use the -files option and name the file in HDFS (preferable as the task tracker will download the file once for all jobs running on that node)
Use the DistributedCache (similar logic to the above), but you configure the file via some API calls rather than through the command line
Load the file directly from HDFS (less efficient as you're pulling the file over HDFS for each task)
As for some code, put the load logic into your mapper's setup(...) or configure(..) method (depending on whether you're using the new or old API) as follows:
@Override
protected void setup(Context context) throws IOException {
    // Option 1: the -files option makes the named file available in the local directory
    File file = new File("filename.dat");
    // open file and load contents ...

    // Option 2: load the file directly from HDFS
    FileSystem fs = FileSystem.get(context.getConfiguration());
    InputStream hdfsInputStream = fs.open(new Path("/path/to/file/in/hdfs/filename.dat"));
    // load file contents from stream...
}
DistributedCache has some example code in the Javadocs
