How to distribute the initial input files to nodes in Hadoop MapReduce? - java

I have a Hadoop cluster with two computers, one as the master and the other as a slave. My input data is present on the local disk of the master, and I have also copied the input data files into HDFS. Now my question is: if I run a MapReduce job on this cluster, the whole input file is present on only one system [which I think goes against MapReduce's basic principle of "data locality"]. I would like to know whether there is any mechanism to distribute/partition the initial files so that the input can be spread across the different nodes of the cluster.

Let's say your cluster is composed of Node 1 and Node 2. If Node 1 is the master, then there is no DataNode running on that node. So you only have a DataNode on Node 2, and I'm not sure what you mean by "so that the input files can be distributed on the different nodes of the cluster": with your current setup, you have just one node on which data can be stored.
But if you consider a generic n-node cluster, then when you copy data into HDFS, the data is distributed onto the different nodes of the cluster by Hadoop itself, so you don't have to worry about that.
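If you want to verify how HDFS has spread a file across the DataNodes, here is a minimal sketch using the HDFS Java API (all paths are illustrative, not from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS (illustrative paths)
        Path src = new Path("/home/hadoop/data.txt");
        Path dst = new Path("/user/hadoop/input/data.txt");
        fs.copyFromLocalFile(src, dst);

        // Ask the NameNode where each block of the file ended up
        FileStatus status = fs.getFileStatus(dst);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Offset " + block.getOffset() + " hosted on "
                    + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}

On a multi-node cluster you should see different blocks reported on different hosts; on your two-node setup every block will be on the single DataNode.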

Related

How to set the master node's memory higher than the worker nodes' memory in H2O?

In my algorithm, the master node needs more memory (say 20GB) while the worker nodes need much less (say 3GB). However, as far as I know, in H2O it is only possible to give the master node the same memory as the worker nodes, using -mapperXmx.
In Apache Spark, it is possible to specify the driver memory with the --driver-memory argument. However, I have not been able to find an equivalent way to set the "master/driver" node's memory in H2O.
I am running H2O (not Sparkling Water) on a Hadoop cluster (essentially on a YARN cluster) using this command:
hadoop jar h2o-hadoop-3/h2o-cdh6.3-assembly/build/libs/h2odriver-3.33.1.jar -nodes 5 -mapperXmx 3g -output my/output/dir/on/hdfs. This way I am able to specify the worker nodes' memory as 3GB. However, I could not find an argument to specify the master node's memory. Is it possible to give the master node 20GB?
The master node is also known as the "driver", so yes, you can set the driver memory with spark.driver.memory. Here's a complete list of settings you can tweak.
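As a hedged illustration for a Spark application (which is what this answer refers to, not the plain H2O-on-Hadoop driver): the driver memory has to be supplied at submit time, after which you can confirm the effective value from the running session. The class name below is illustrative.

import org.apache.spark.sql.SparkSession;

public class DriverMemoryCheck {
    public static void main(String[] args) {
        // Supply the driver memory up front, e.g.:
        //   spark-submit --driver-memory 20g --executor-memory 3g ... DriverMemoryCheck.jar
        // Setting spark.driver.memory inside the application is too late,
        // because the driver JVM is already running by then.
        SparkSession spark = SparkSession.builder()
                .appName("driver-memory-check")
                .getOrCreate();

        // Confirm the value picked up from spark-submit / spark-defaults.conf
        String driverMem = spark.conf().get("spark.driver.memory", "not set");
        System.out.println("spark.driver.memory = " + driverMem);

        spark.stop();
    }
}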

Spark Dataframe Write to CSV creates _temporary directory file in Standalone Cluster Mode

I am running a Spark job on a cluster which has 2 worker nodes. I am using the code below (Spark Java) to save the computed DataFrame as CSV on the worker nodes.
dataframe.write().option("header","false").mode(SaveMode.Overwrite).csv(outputDirPath);
I am trying to understand how Spark writes multiple part files on each worker node.
Run 1) worker1 has part files and _SUCCESS; worker2 has _temporary/task*/part* (each task directory contains its own part files).
Run 2) worker1 has part files and also a _temporary directory; worker2 has multiple part files.
Can anyone help me understand why this behavior occurs?
1) Should I consider the records in outputDir/_temporary as part of the output, along with the part files in outputDir?
2) Is the _temporary dir supposed to be deleted after the job runs, with its part files moved to outputDir?
3) Why can't it create the part files directly under the output dir?
coalesce(1) and repartition(1) cannot be options, since the output itself will be around 500GB.
Spark 2.0.2 and 2.1.3, Java 8, no HDFS.
After analysis, I observed that my Spark job was using FileOutputCommitter version 1, which is the default.
Then I switched the config to FileOutputCommitter version 2 and tested on a 10-node Spark standalone cluster in AWS. All part-* files were generated directly under the outputDirPath specified in dataframe.write().option("header","false").mode(SaveMode.Overwrite).csv(outputDirPath).
We can set the property either
by including --conf 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2' in the spark-submit command,
or by setting it through the SparkContext: javaSparkContext.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2")
I understand the consequences in case of failures, as outlined in the Spark docs, but I achieved the desired result!
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version (default value: 1)
The file output committer algorithm version, valid algorithm version number: 1 or 2. Version 2 may have better performance, but version 1 may handle failures better in certain situations, as per MAPREDUCE-4815.
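For completeness, a minimal sketch of wiring up both approaches from Java code (assuming a SparkSession-based application; the class name is illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class CommitterV2Example {
    public static void main(String[] args) {
        // Option 1: set it as a Spark property (equivalent to the --conf flag)
        SparkConf conf = new SparkConf()
                .setAppName("committer-v2-example")
                .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2");

        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();

        // Option 2: set it directly on the underlying Hadoop configuration
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2");

        // ... dataframe.write().csv(...) as in the question ...
        spark.stop();
    }
}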
TL;DR: To properly write (or read, for that matter) data using a file-system-based source, you'll need shared storage.
The _temporary directory is part of the basic commit mechanism used by Spark: data is first written to a temporary directory, and once all tasks have finished, it is atomically moved to the final destination. You can read more about this process in "Spark _temporary creation reason".
For this process to succeed you need a shared file system (HDFS, NFS, and so on) or equivalent distributed storage (like S3). Since you don't have one, the failure to clean up temporary state is expected; see "Saving dataframe to local file system results in empty results".
The behavior you observed (data partially committed and partially not) can occur when some executors are co-located with the driver and share a file system with it, enabling a full commit for that subset of the data.
The multiple part files correspond to your DataFrame's partitions. The number of files written depends on the number of partitions the DataFrame has at the time you write out the data; by default, one file is written per partition.
You can control this with coalesce or repartition, which decrease or increase the number of partitions.
If you coalesce to 1, you won't see multiple part files, but this limits the write to a single task and so affects writing the data in parallel.
[outputDirPath = /tmp/multiple.csv ]
dataframe
    .coalesce(1)
    .write()
    .option("header", "false")
    .mode(SaveMode.Overwrite)
    .csv(outputDirPath);
On your question of how to refer to the output:
refer to /tmp/multiple.csv for all of the parts below.
/tmp/multiple.csv/part-00000.csv
/tmp/multiple.csv/part-00001.csv
/tmp/multiple.csv/part-00002.csv
/tmp/multiple.csv/part-00003.csv
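To make the "refer to the directory" point concrete, a small sketch of reading the result back: you point Spark at the output directory and it picks up all the part files in it (paths are the illustrative ones from above).

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadPartsExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("read-parts-example")
                .getOrCreate();

        // Reading the directory picks up part-00000.csv, part-00001.csv, ...
        Dataset<Row> df = spark.read()
                .option("header", "false")
                .csv("/tmp/multiple.csv");

        System.out.println("Rows read: " + df.count());
        spark.stop();
    }
}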

How does Hadoop run the Java reduce function on the DataNodes?

I am confused about how the DataNodes in a Hadoop cluster run the Java code for the reduce function of a job. How does Hadoop send Java code to another computer to execute?
Does Hadoop inject the Java code into the nodes? If so, where is that code located in Hadoop?
Or are the reduce functions run on the master node rather than the DataNodes?
Help me trace the code path where the master node sends the Java code for the reduce function to a DataNode.
Here is what happens when you submit a job:
1) You run the job on the client using the hadoop jar command, in which you pass the jar file name, the class name, and other parameters such as input and output paths.
2) The client gets a new application id and then copies the jar file and other job resources to HDFS with a high replication factor (by default 10 on large clusters).
3) The client then actually submits the application through the ResourceManager.
4) The ResourceManager keeps track of cluster utilization and launches the ApplicationMaster (which coordinates the job execution).
5) The ApplicationMaster talks to the NameNode to determine where the blocks of the input are located, and then works with the NodeManagers to submit the tasks (in the form of containers).
6) Containers are nothing but JVMs, and they run the map and reduce tasks (the mapper and reducer classes). When a JVM is bootstrapped, the job resources that are on HDFS are copied to it. For mappers, these JVMs are typically created on the same nodes on which the data exists, so once processing starts, the jar file is executed to process the data locally on that machine.
To answer your question: the reducer will run on one or more DataNodes as part of the containers. The Java code is copied as part of the bootstrap process (when the JVM is created), and the data is fetched from the mappers over the network.
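For context, the mechanism that tells Hadoop which jar to ship to the nodes lives in the job driver itself. A minimal sketch of a standard MapReduce driver (class and argument names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "my job");

        // This call is what lets Hadoop locate the jar that holds your code;
        // that jar is then copied to HDFS and shipped to the task containers.
        job.setJarByClass(MyJobDriver.class);

        // Plug in your own classes here, e.g.:
        // job.setMapperClass(MyMapper.class);
        // job.setReducerClass(MyReducer.class);
        // job.setOutputKeyClass(Text.class);
        // job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}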
No, the reduce functions are not run on the master node; they are executed on the data nodes.
Hadoop transfers the packaged code (jar files) to the data nodes that are going to process the data. At run time, the data nodes fetch this code and run the task.

Hadoop input from different servers

I have one master node and two data nodes which are on different servers. Each of the two data nodes has a log file in HDFS. Now I want to run a MapReduce job from the master node, with the input being the two log files from the two data nodes. Can I do this? If so, how should I set the input path? (e.g. hadoop jar wordcount.jar datanode1/input/logfile1 datanode2/input/logfile2 output ... like this?) Is it possible for the input to come from different data nodes' HDFS, which are on different servers?
There is no such thing as a data node's "own" HDFS. HDFS is a distributed FS, spread across all the machines in a Hadoop cluster and functioning as a single FS.
You just have to put both files inside one HDFS directory and give this directory as input to your MapReduce job:
FileInputFormat.addInputPath(job, new Path("/path/to/the/input/directory"));
The same holds true for MapReduce jobs: although you submit your job to the JobTracker, the job actually runs in a distributed fashion on all the nodes of your cluster where the data to be processed is present.
Oh, one more thing: a file in HDFS is not stored as a whole on any particular machine. It gets chopped into blocks of 64 MB (configurable), and these blocks are stored on different machines across your cluster.
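Alternatively, if you would rather keep the two log files in separate HDFS directories, FileInputFormat also accepts multiple input paths. A small sketch (the paths are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MultiInputExample {
    public static void configureInputs(Job job) throws Exception {
        // Either add each file/directory individually...
        FileInputFormat.addInputPath(job, new Path("/input/logs1/logfile1"));
        FileInputFormat.addInputPath(job, new Path("/input/logs2/logfile2"));

        // ...or set several paths at once (comma-separated):
        // FileInputFormat.setInputPaths(job, "/input/logs1/logfile1,/input/logs2/logfile2");
    }
}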

Access to HDFS files from all computers of a cluster

My Hadoop program was originally launched in local mode, and now my goal is to run it in fully distributed mode. For this purpose it is necessary to provide access, from all computers of the cluster, to the files that are read inside the mapper and reducer functions; I asked a related question at http://answers.mapr.com/questions/4444/syntax-of-option-files-in-hadoop-script (also, since it is not known in advance on which computer the mapper function will be executed — the program's logic uses only one mapper and will be launched with only one mapper — it is also necessary to provide cluster-wide access to the file arriving as input to the mapper function). In this regard I have a question: is it possible to use HDFS files directly, i.e. to copy the files beforehand from the Linux file system into HDFS (so that, as I assume, they become available on all computers of the cluster; please correct me if this is not so) and then use the HDFS Java API to read these files inside the reducer and mapper functions executing on the computers of the cluster?
If the answer to this question is positive, please give an example of copying from the Linux file system into HDFS, and of reading those files in the Java program by means of the HDFS Java API and reading their contents into a Java String.
Copy all your input files to the master node (this can be done using scp).
Then log in to your master node (ssh) and execute something like the following to copy the files from the local filesystem to HDFS:
hadoop fs -put $localfilelocation $destination
Now, in your Hadoop jobs, you may use hdfs:///$destination as the input. There is no need to use any extra API to read from HDFS.
If you really want to read files from HDFS and use them as additional information other than the input files, then refer to this.
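Since the question explicitly asks for an HDFS Java API example, here is a minimal sketch of copying a local file into HDFS and reading its contents back into a Java String (all paths are illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyAndRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml for fs.defaultFS
        FileSystem fs = FileSystem.get(conf);

        // Copy from the local Linux filesystem into HDFS (illustrative paths)
        Path local = new Path("/home/user/side-data.txt");
        Path inHdfs = new Path("/user/hadoop/side-data.txt");
        fs.copyFromLocalFile(local, inHdfs);

        // Read the HDFS file back into a String (fine for small files)
        StringBuilder contents = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(inHdfs), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                contents.append(line).append('\n');
            }
        }
        System.out.println(contents.toString());
        fs.close();
    }
}

Inside a mapper or reducer you would do the same thing in setup(), obtaining the Configuration from the task context (context.getConfiguration()) instead of constructing a new one.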
