I am confused about how the DataNodes in a Hadoop cluster run the Java code for a job's reduce function. How does Hadoop send Java code to another computer to execute?
Does Hadoop inject the Java code into the nodes? If so, where is that code located in Hadoop?
Or are the reduce functions run on the master node rather than on the DataNodes?
Help me trace where the master node sends the Java code for the reduce function to a DataNode.
As shown in the picture, here is what happens:
You run the job on the client with the hadoop jar command, passing the jar file name, the class name, and other parameters such as the input and output paths.
The client gets a new application ID and then copies the jar file and other job resources to HDFS with a high replication factor (by default 10 on large clusters).
The client then submits the application to the resource manager.
The resource manager keeps track of cluster utilization and launches the application master, which coordinates the job execution.
The application master talks to the NameNode to determine where the input blocks are located, then works with the node managers to launch the tasks (in the form of containers).
Containers are essentially JVMs that run the map and reduce tasks (the mapper and reducer classes). When a container is bootstrapped, the job resources stored on HDFS are copied to the node it runs on. For mappers, the containers are created, where possible, on the same nodes that hold the data. Once processing starts, the code in the jar file is executed to process the data, typically locally on that machine.
To answer your question: reducers run on one or more data nodes, inside containers. The Java code is copied as part of the bootstrap process (when the JVM is created). The data is fetched from the mappers over the network.
No. Reduce functions are executed on data nodes.
Hadoop transfers the packaged code (jar files) to the data nodes that are going to process the data. At run time, the data nodes download this code and run their tasks.
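To make the "code is shipped as a jar" idea concrete, here is a minimal plain-Java sketch of what the task JVM does once the jar is on its classpath: the job configuration carries only the reducer's class name, and the task loads and instantiates that class by reflection. SumReducer and TaskJvmSketch are hypothetical names used for illustration; a real reducer would extend org.apache.hadoop.mapreduce.Reducer, and the map used here only stands in for Hadoop's job configuration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical stand-in for a user-written reducer packaged in the job jar.
class SumReducer implements BinaryOperator<Integer> {
    @Override
    public Integer apply(Integer a, Integer b) {
        return a + b;
    }
}

final class TaskJvmSketch {
    // Mimics the task side: look up the class name in the job configuration,
    // load the class from the classpath, and create an instance by reflection.
    @SuppressWarnings("unchecked")
    static BinaryOperator<Integer> loadReducer(Map<String, String> jobConf) {
        String className = jobConf.get("mapreduce.job.reduce.class");
        try {
            Class<?> cls = Class.forName(className);
            return (BinaryOperator<Integer>) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // This is what happens when the jar never reached the node.
            throw new IllegalStateException("Reducer class not on classpath: " + className, e);
        }
    }

    public static void main(String[] args) {
        // The client only records the class name; the bytecode travels in the jar.
        Map<String, String> jobConf = new HashMap<>();
        jobConf.put("mapreduce.job.reduce.class", SumReducer.class.getName());

        BinaryOperator<Integer> reducer = loadReducer(jobConf);
        System.out.println(reducer.apply(2, 3));
    }
}
```

The point of the sketch is that nothing is "injected" into a running process: the node starts a fresh JVM with the downloaded jar on its classpath, and the class is resolved by name.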
I am running a Spark job on a cluster that has 2 worker nodes. I am using the code below (Spark Java) to save the computed DataFrame as CSV on the worker nodes.
dataframe.write().option("header","false").mode(SaveMode.Overwrite).csv(outputDirPath);
I am trying to understand how spark writes multiple part files on each worker node.
Run 1) worker1 has part files and _SUCCESS; worker2 has _temporary/task*/part*, with the part files under each task directory.
Run 2) worker1 has part files and also a _temporary directory; worker2 has multiple part files.
Can anyone help me understand this behavior?
1) Should I consider the records in outputDir/_temporary as part of the output, along with the part files in outputDir?
2) Is the _temporary dir supposed to be deleted after the job runs, with the part files moved to outputDir?
3) Why can't it create the part files directly under the output dir?
coalesce(1) and repartition(1) are not an option, since the output itself will be around 500 GB.
Spark 2.0.2. 2.1.3 and Java 8, no HDFS
After analysis, I observed that my Spark job was using FileOutputCommitter version 1, which is the default.
I then added configuration to use FileOutputCommitter version 2 instead of version 1 and tested on a 10-node Spark standalone cluster in AWS. All part-* files are generated directly under the outputDirPath specified in dataframe.write().option("header","false").mode(SaveMode.Overwrite).csv(outputDirPath)
We can set the property
by passing --conf 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2' to the spark-submit command,
or by setting it through the Spark context: javaSparkContext.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version","2")
I understand the consequences in case of failures, as outlined in the Spark docs, but I achieved the desired result!
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version, defaultValue is 1
The file output committer algorithm version, valid algorithm version number: 1 or 2. Version 2 may have better performance, but version 1 may handle failures better in certain situations, as per MAPREDUCE-4815.
TL;DR To properly write (or read, for that matter) data using a file-system-based source, you need shared storage.
The _temporary directory is part of the basic commit mechanism used by Spark: data is first written to a temporary directory and, once all tasks have finished, atomically moved to the final destination. You can read more about this process in Spark _temporary creation reason
For this process to succeed you need a shared file system (HDFS, NFS, and so on) or equivalent distributed storage (like S3). Since you don't have one, failure to clean up the temporary state is expected; see Saving dataframe to local file system results in empty results.
The behavior you observed (data partially committed and partially not) can occur when some executors are co-located with the driver and share a file system with it, enabling a full commit for that subset of the data.
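The write-then-move idea can be sketched in plain Java with java.nio. This is a simplified illustration, not Spark's or Hadoop's actual committer: the class name TempDirCommitSketch, the task-N directory layout, and the file naming are assumptions made for the example. Tasks write under outputDir/_temporary, and a single commit step (the driver's job in Spark) moves the files into place, removes _temporary, and writes _SUCCESS.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class TempDirCommitSketch {
    // Task side: write the part file under _temporary/task-<id>, never directly
    // into the final directory.
    static Path writeTaskOutput(Path outputDir, int taskId, String data) {
        try {
            Path taskDir = outputDir.resolve("_temporary").resolve("task-" + taskId);
            Files.createDirectories(taskDir);
            Path part = taskDir.resolve(String.format("part-%05d.csv", taskId));
            Files.write(part, data.getBytes(StandardCharsets.UTF_8));
            return part;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Commit side: only files visible in THIS process's view of outputDir can be
    // moved, which is why every writer must share one file system.
    static void commitJob(Path outputDir) {
        Path tmp = outputDir.resolve("_temporary");
        try {
            List<Path> parts;
            try (Stream<Path> walk = Files.walk(tmp)) {
                parts = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            for (Path part : parts) {
                Files.move(part, outputDir.resolve(part.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
            // Delete the now-empty _temporary tree, children before parents.
            List<Path> leftovers;
            try (Stream<Path> walk = Files.walk(tmp)) {
                leftovers = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            }
            for (Path p : leftovers) {
                Files.delete(p);
            }
            Files.write(outputDir.resolve("_SUCCESS"), new byte[0]);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("commit-sketch");
        writeTaskOutput(out, 0, "a,b\n");
        writeTaskOutput(out, 1, "c,d\n");
        commitJob(out);
        try (Stream<Path> files = Files.list(out)) {
            files.forEach(System.out::println);
        }
    }
}
```

If writeTaskOutput ran on a different machine with its own local disk, commitJob would find nothing to move there, and that machine would be left with an orphaned _temporary directory, which matches the symptoms described in the question.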
The multiple part files correspond to your DataFrame's partitions. The number of files written depends on the number of partitions the DataFrame has at the time you write out the data. By default, one file is written per partition of the data.
You can control this with coalesce or repartition, which reduce or increase the number of partitions.
If you use coalesce(1), you won't see multiple part files, but this prevents writing the data in parallel.
[outputDirPath = /tmp/multiple.csv ]
dataframe
    .coalesce(1)
    .write()
    .option("header","false")
    .mode(SaveMode.Overwrite)
    .csv(outputDirPath);
As for your question on how to refer to the output:
use /tmp/multiple.csv to refer to all of the parts below.
/tmp/multiple.csv/part-00000.csv
/tmp/multiple.csv/part-00001.csv
/tmp/multiple.csv/part-00002.csv
/tmp/multiple.csv/part-00003.csv
I'm running Java code to copy files from HDFS to the local file system, using Spark in cluster mode via spark-submit.
The job runs fine in Spark local mode but fails in cluster mode.
It throws a java.io.IOException: Target /mypath/ is a directory.
I don't understand why it fails in cluster mode, when I don't receive any exceptions in local mode.
That behaviour occurs because in the first case (local) your driver runs on the same machine from which you launch the whole Spark job. In the second case (cluster), your driver program is shipped to one of your workers and executes from there.
In general, when you want to run Spark jobs in cluster mode and need to pre-process local files such as JSON or XML, you need to ship them along with the executable using the --files <myfile> option. Your driver program will then be able to see that file. To include multiple files, separate them with commas (,).
The approach is the same when you want to add jar dependencies: use --jars <myJars>.
For more details about this, check this thread.
I'm running a MapReduce job against about 3 million small files on Hadoop (I know, I know, but there's nothing we can do about it - it's the nature of our source system).
Our code is nothing special - it uses CombineFileInputFormat to wrap a bunch of these files together, then parses the file name to add it into the contents of the file, and spits out some results. Easy peasy.
So, we have about 3 million ~7kb files in HDFS. If we run our task against a small subset of these files (one folder, maybe 10,000 files), we get no trouble. If we run it against the full list of files, we get an out of memory error.
The error comes out on STDOUT:
#
# java.lang.OutOfMemoryError: GC overhead limit exceeded
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 15690"...
I'm assuming what's happening is this - whatever JVM is running the process that defines the input splits is getting totally overwhelmed trying to handle 3 million files, it's using too much memory, and YARN is killing it. I'm willing to be corrected on this theory.
So, what I need to know how to do is to increase the memory limit for YARN for the container that's calculating the input splits, not for the mappers or reducers. Then, I need to know how to make this take effect. (I've Googled pretty extensively on this, but with all the iterations of Hadoop over the years, it's hard to find a solution that works with the most recent versions...)
This is Hadoop 2.6.0, using the MapReduce API, YARN framework, on AWS Elastic MapReduce 4.2.0.
I would spin up a new EMR cluster and throw a larger master instance at it to see if that is the issue.
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.4xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=m3.xlarge
If the master is running out of memory when configuring the input splits, you can modify the configuration:
EMR Configuration
Instead of running the MapReduce on 3 million individual files, you can merge them into manageable bigger files using any of the following approaches.
1. Create Hadoop Archive (HAR) files from the small files.
2. Create a sequence file for every 10K-20K files using a MapReduce program.
3. Create a sequence file from your individual small files using forqlift tool.
4. Merge your small files into bigger files using Hadoop-Crush.
Once you have the bigger files ready, you can run the MapReduce on your whole data set.
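Whichever merge tool is used, the grouping step itself is simple. The sketch below is plain Java and independent of Hadoop: it greedily packs files, given only by their sizes, into batches capped at a target size. SmallFileBatcher and batchCounts are hypothetical names, and the 128 MB cap is an assumed target, not something taken from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SmallFileBatcher {
    // Greedily groups files (sizes in bytes, in listing order) into batches whose
    // total size stays at or below maxBatchBytes. Returns the number of files in
    // each batch. A single file larger than the cap gets a batch of its own.
    static List<Integer> batchCounts(long[] fileSizes, long maxBatchBytes) {
        List<Integer> counts = new ArrayList<>();
        long currentBytes = 0;
        int currentCount = 0;
        for (long size : fileSizes) {
            if (currentCount > 0 && currentBytes + size > maxBatchBytes) {
                counts.add(currentCount); // close the current batch
                currentBytes = 0;
                currentCount = 0;
            }
            currentBytes += size;
            currentCount++;
        }
        if (currentCount > 0) {
            counts.add(currentCount); // last, possibly partial, batch
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed workload from the question: 3 million files of about 7 KB each.
        long[] sizes = new long[3_000_000];
        Arrays.fill(sizes, 7L * 1024);
        List<Integer> counts = batchCounts(sizes, 128L * 1024 * 1024);
        System.out.println(counts.size() + " batches, first holds " + counts.get(0) + " files");
    }
}
```

Under those assumed numbers the 3 million files collapse into roughly 160 merged files of about 18,700 small files each, which is a far lighter load for whatever process computes the input splits.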
I need some help understanding how HDFS and Storm are integrated. Storm can process an incoming stream of data using many nodes. My data is, let's say, log entries from different machines. So how do I store all of that? Ideally I'd like to store the logs from one machine in one or more files dedicated to that machine. But how does it work? Will I be able to append to the same file in HDFS from many different Storm nodes?
PS: I'm still working on getting all this running, so I can't test this physically... but it does bother me.
Write a file in hdfs with Java
No, you cannot write to the same file from more than one task at a time. Each task would need to write to its own file in a directory, and then you could process them using directory/* if you are using Hadoop.
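The one-file-per-task layout can be sketched as follows. This uses java.nio on a local directory purely as a stand-in for HDFS, and the class name PerTaskLogFiles and the <machine>-task<id>.log naming scheme are assumptions for illustration. Each task appends only to its own file, and a reader later picks up every file for a machine with a glob, much like directory/*.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class PerTaskLogFiles {
    // Each task appends to a file that no other task touches, so there is never
    // more than one writer per file.
    static void append(Path dir, String machine, int taskId, String line) {
        try {
            Files.createDirectories(dir);
            Path file = dir.resolve(machine + "-task" + taskId + ".log");
            Files.write(file, (line + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads every per-task file for one machine and returns the lines sorted.
    // Assumes the machine name contains no glob metacharacters.
    static List<String> readAll(Path dir, String machine) {
        List<String> lines = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, machine + "-task*.log")) {
            for (Path file : files) {
                lines.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(lines);
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("per-task-logs");
        append(dir, "web01", 0, "10:00 started");
        append(dir, "web01", 1, "10:01 request served");
        append(dir, "web02", 0, "10:02 started");
        System.out.println(readAll(dir, "web01"));
    }
}
```

Putting the machine name in the file name keeps logs from one machine together while still giving every task exclusive ownership of the file it writes.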
My Hadoop program was originally run in local mode, and now I want to run it in fully distributed mode. For this, the files that are read in the mapper and reducer functions must be accessible from every computer in the cluster, which is why I asked a question at http://answers.mapr.com/questions/4444/syntax-of-option-files-in-hadoop-script. (Since it is not known in advance on which computer the mapper function will run, the mapper's input file must also be accessible across the whole cluster. By the program's logic there is only one mapper, and the program will be launched with only one mapper.) This raised a question: is it possible to use HDFS files directly? That is, can I copy the files beforehand from the Linux file system into HDFS (so that, as I assume, they become available on all computers of the cluster; please correct me if that is not so) and then use the HDFS Java API to read these files in the mapper and reducer functions running on the cluster's computers?
If the answer to this question is yes, please give an example of copying files from the Linux file system to HDFS, and of reading those files in a Java program through the HDFS Java API and storing their contents in a Java String.
Copy all your input files to the master node (this can be done using scp).
Then log in to your master node (ssh) and run something like the following to copy the files from the local file system to HDFS:
hadoop fs -put $localfilelocation $destination
Now, in your Hadoop jobs, you can use hdfs:///$destination as the input. There is no need to use any extra API to read from HDFS.
If you really want to read files from HDFS and use them as additional information alongside the input files, then refer to this.