Which replica of an input block is processed in the mapper? (Java)

I am building a simple I/O monitoring system for MapReduce jobs written in Java, and at the Map stage of a job I want to log information about the locations of the data being processed.
At the Map stage a MapReduce job processes an input split, which consists of one or more file blocks in HDFS.
Each of those blocks has several (usually 3) replicas.
Is it possible to know which replica of each block was actually read by the Mapper?
In other words, can I get the full path of the particular local-file-system file the Mapper reads from?

In HDFS the blocks are replicated, and the namenode does not single out any copy as "the" replica; all copies are equivalent. Which replica is used for an operation is chosen based on network latency and the load on each machine.
A file in HDFS is divided into blocks. The full HDFS path of the file is stored as namenode metadata, and each block is identified by a block ID.
The value of the property dfs.datanode.data.dir in hdfs-site.xml gives the location where the blocks are stored on each datanode (dfs.namenode.name.dir, by contrast, is where the namenode keeps its own metadata).
Based on your requirement, if you want the local file system path where a block is stored, read the value of that property, identify the block ID from the namenode metadata, and you will then be able to find programmatically the exact block file in the local file system that backs the HDFS file.
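As a starting point, the HDFS Java API can at least report which datanodes hold a replica of each block of a file; note that it does not reveal which replica a running mapper actually read. A minimal sketch, assuming a default Configuration that points at your cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationLogger {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]); // HDFS path to inspect

        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block; getHosts() lists the datanodes
        // that hold a replica of that block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}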

Related

Spark Dataframe Write to CSV creates _temporary directory file in Standalone Cluster Mode

I am running a Spark job in a cluster which has 2 worker nodes. I am using the code below (Spark Java) to save the computed dataframe as CSV to the worker nodes.
dataframe.write().option("header","false").mode(SaveMode.Overwrite).csv(outputDirPath);
I am trying to understand how Spark writes multiple part files on each worker node.
Run 1) worker1 has part files and _SUCCESS; worker2 has _temporary/task*/part* (each task directory holds its own part files).
Run 2) worker1 has part files and also a _temporary directory; worker2 has multiple part files.
Can anyone help me understand this behavior?
1) Should I consider the records in outputDir/_temporary part of the output, along with the part files in outputDir?
2) Is the _temporary dir supposed to be deleted after the job runs, with its part files moved to outputDir?
3) Why can't it create the part files directly under the output dir?
coalesce(1) and repartition(1) are not options, since the outputDir file itself will be around 500 GB.
Spark 2.0.2 and 2.1.3, Java 8, no HDFS.
After analysis, I observed that my Spark job was using FileOutputCommitter algorithm version 1, which is the default.
I then configured it to use FileOutputCommitter version 2 instead and tested on a 10-node Spark standalone cluster in AWS. All part-* files were generated directly under the outputDirPath specified in dataframe.write().option("header","false").mode(SaveMode.Overwrite).csv(outputDirPath).
The property can be set either by passing --conf 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2' on the spark-submit command line, or on the Hadoop configuration via javaSparkContext.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2").
I understand the consequences in case of failures, as outlined in the Spark docs, but I achieved the desired result!
From the Spark configuration documentation:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version (default: 1): the file output committer algorithm version; valid version numbers are 1 or 2. Version 2 may have better performance, but version 1 may handle failures better in certain situations, as per MAPREDUCE-4815.
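For completeness, a minimal Java sketch of setting the committer version at session creation (spark.hadoop.* settings are copied into the underlying Hadoop Configuration; the app name is hypothetical, and dataframe/outputDirPath are as in the question):

import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("csv-writer") // hypothetical app name
        // Forwarded to Hadoop as mapreduce.fileoutputcommitter.algorithm.version=2
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .getOrCreate();

dataframe.write().option("header", "false").mode(SaveMode.Overwrite).csv(outputDirPath);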
TL;DR To properly write (or read, for that matter) data using a file-system-based source, you need shared storage.
The _temporary directory is part of the basic commit mechanism used by Spark: data is first written to a temporary directory, and once all tasks have finished, it is atomically moved to the final destination. You can read more about this process in Spark _temporary creation reason.
For this process to succeed you need a shared file system (HDFS, NFS, and so on) or equivalent distributed storage (like S3). Since you don't have one, the failure to clean up temporary state is expected; see Saving dataframe to local file system results in empty results.
The behavior you observed (data partially committed and partially not) can occur when some executors are co-located with the driver and share the driver's file system, enabling a full commit for that subset of the data.
The multiple part files correspond to your dataframe's partitions. The number of files written depends on the number of partitions the DataFrame has at the time you write it out; by default, one file is written per partition.
You can control this with coalesce or repartition, to decrease or increase the number of partitions.
If you coalesce to 1, you won't see multiple part files, but this prevents the data from being written in parallel.
// outputDirPath = /tmp/multiple.csv
dataframe
    .coalesce(1)
    .write()
    .option("header", "false")
    .mode(SaveMode.Overwrite)
    .csv(outputDirPath);
As for your question of how to refer to the output: refer to /tmp/multiple.csv as a single dataset; it covers all the parts below it:
/tmp/multiple.csv/part-00000.csv
/tmp/multiple.csv/part-00001.csv
/tmp/multiple.csv/part-00002.csv
/tmp/multiple.csv/part-00003.csv
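For example, reading the output back just points at the directory and Spark picks up every part file underneath it (a sketch, assuming a SparkSession named spark):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read()
        .option("header", "false")
        .csv("/tmp/multiple.csv"); // the directory, not an individual part file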

Processing a small file with MapReduce in Hadoop

I have a 456 KB file which is read from HDFS and given as input to the mapper function. Every line contains an integer, for which I download some files and store them on the local system. I have Hadoop set up on a two-node cluster, and the split size is changed in the program so that 8 mappers are opened:
Configuration configuration = new Configuration();
configuration.setLong("mapred.max.split.size", 60000L);
configuration.setLong("mapred.min.split.size", 60000L);
8 mappers are created, but the same data is downloaded on both servers. I think this happens because the block size is still set to the default 256 MB, so the input file is processed twice. So my question is: can we process a small file with MapReduce?
If your file downloads take time, you might be suffering from what is called speculative execution in Hadoop, which is enabled by default. It is just a guess, though, based on your report that the same files are downloaded more than once.
With speculative execution turned on, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job come to a close, the Hadoop platform schedules redundant copies of the remaining tasks across nodes that have no other work to perform.
You can disable speculative execution for the mappers and reducers by setting the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false, respectively, as sketched below.
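A minimal sketch from the driver code, using the old mapred property names cited above (newer Hadoop releases expose the same switches as mapreduce.map.speculative and mapreduce.reduce.speculative):

import org.apache.hadoop.conf.Configuration;

Configuration configuration = new Configuration();
// Stop Hadoop from scheduling duplicate (speculative) attempts
// of the same map/reduce task on other nodes.
configuration.setBoolean("mapred.map.tasks.speculative.execution", false);
configuration.setBoolean("mapred.reduce.tasks.speculative.execution", false);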

Hadoop input from different servers

I have one master node and two data nodes on different servers. Each of the two data nodes holds a log file in its HDFS storage. Now I want to run Hadoop to do a map/reduce on the master node, and the input should be the two log files from the two data nodes. Can I do this? If I can, how do I set the input path? (e.g. hadoop jar wordcount.jar datanode1/input/logfile1 datanode2/input/logfile2 output ... like this?) Is it possible to take input from different data nodes' HDFS storage on different servers?
When you say Hadoop, there is no such thing as a node's own HDFS. HDFS is a distributed FS, spread across all the machines of a Hadoop cluster and functioning as a single FS.
You just have to put both files inside one HDFS directory and give that directory as input to your MapReduce job (see also the multi-path variant sketched below):
FileInputFormat.addInputPath(job, new Path("/path/to/the/input/directory"));
The same holds true for MapReduce jobs: although you submit your job to the JobTracker, the job actually runs in a distributed fashion on all the nodes of your cluster where the data to be processed is present.
Oh, one more thing: a file in HDFS is not stored as a whole on any particular machine. It gets chopped into blocks of 64 MB (configurable), and these blocks are stored on different machines across your cluster.
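If you would rather keep the two log files in separate directories, FileInputFormat also accepts multiple input paths; calls to addInputPath accumulate (the paths here are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

Job job = Job.getInstance(new Configuration(), "wordcount");
// Each call adds another input path to the same job.
FileInputFormat.addInputPath(job, new Path("/input/logfile1"));
FileInputFormat.addInputPath(job, new Path("/input/logfile2"));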

Access to HDFS files from all computers of a cluster

My Hadoop program was originally launched in local mode, and now my goal is to launch it in fully distributed mode. For this it is necessary to provide access, from all computers of the cluster, to the files that are read in the mapper and reducer functions, which is why I asked a question at http://answers.mapr.com/questions/4444/syntax-of-option-files-in-hadoop-script (also, since it is not known on which computer the mapper function will be executed (by the program's logic there will be only one mapper, and the program will be launched with a single mapper), access to the file arriving at the mapper's input must also be provided across the whole cluster). My question, therefore: is it possible to use HDFS files directly, i.e. to copy the files beforehand from the Linux file system into HDFS (whereby, as I assume, they become available on all computers of the cluster; please correct me if that is not so) and then use the HDFS Java API to read those files in the mapper and reducer functions executing on the cluster's computers?
If the answer is positive, please give an example of copying from the Linux file system into HDFS, and of reading those files in a Java program by means of the HDFS Java API, including reading their contents into a Java string.
Copy all your input files to the master node (this can be done using scp).
Then log in to your master node (ssh) and execute something like the following to copy files from the local file system to HDFS:
hadoop fs -put $localfilelocation $destination
Now in your Hadoop jobs, you may use hdfs:///$destination as the input. There is no need to use any extra API to read from HDFS.
If you really do want to read files from HDFS and use them as additional information beyond the input files, then refer to this.
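Since the question explicitly asks for a Java example, here is a sketch of the same put plus reading the file's contents into a String via the HDFS Java API (the paths are placeholders):

import java.io.ByteArrayOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Equivalent of `hadoop fs -put $localfilelocation $destination`.
fs.copyFromLocalFile(new Path("/local/path/input.txt"), new Path("/user/hadoop/input.txt"));

// Read the HDFS file back into a Java String.
ByteArrayOutputStream out = new ByteArrayOutputStream();
try (FSDataInputStream in = fs.open(new Path("/user/hadoop/input.txt"))) {
    IOUtils.copyBytes(in, out, conf, false);
}
String contents = out.toString("UTF-8");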

Need to get rid of part-m-0000* files in HDFS

In HDFS processing, after each job, empty files with names like part-m-0000* are created. Each of these files is empty, but they are consuming 64 MB of disk space each, because that is the default block size.
I need to make code changes to skip the creation of these files. How do I do this?
Note: I am using org.apache.hadoop.mapreduce.lib.output.MultipleOutputs<KEYOUT,VALUEOUT> to write the output records, not Context, so I end up with the output records in files like "successful-m-00000" anyway.
According to Hadoop: The Definitive Guide, the underlying file system does not consume a full HDFS block for an empty file:
"Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage."
To suppress the output files when they are empty, use LazyOutputFormat#setOutputFormatClass, as sketched below. Here is the Apache documentation for the same.
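A minimal sketch of wiring that in during job setup, with TextOutputFormat standing in for whichever real output format you use:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Wrap the real output format; a part file is only created when the
// first record is actually written, so empty ones never appear.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);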
