Hadoop input from different servers - java

I have one master node and two data nodes running on different servers. Each of the two data nodes has a log file in what I think of as its own HDFS. Now I want to run a Hadoop map/reduce job on the master node with the two log files from the two data nodes as input. Can I do this? If so, how do I set the input path? (e.g. hadoop jar wordcount.jar datanode1/input/logfile1 datanode2/input/logfile2 output ... like this?) Is it possible to take input from the HDFS of different data nodes that sit on different servers?

In Hadoop there is no such thing as a data node's "own HDFS". HDFS is a distributed filesystem spread across all the machines in a Hadoop cluster, functioning as a single filesystem.
You just have to put both files inside one HDFS directory and give that directory as input to your MapReduce job.
FileInputFormat.addInputPath(job, new Path("/path/to/the/input/directory"));
The same holds true for MapReduce jobs. Although you submit your job to the JobTracker, the job actually runs in a distributed fashion on all the nodes of your cluster where the data to be processed is present.
One more thing: a file in HDFS is not stored as a whole on any particular machine. It gets chopped into blocks of 64 MB (configurable), and these blocks are spread across different machines in your cluster.
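For completeness, here is a minimal driver sketch (the paths and class name are hypothetical) showing how to point the job either at a single directory holding both log files or at each file explicitly:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCountDriver.class);
        // Mapper, reducer and output key/value classes omitted for brevity.

        // Option 1: one HDFS directory containing both log files.
        FileInputFormat.addInputPath(job, new Path("/input/logs"));

        // Option 2: add each file explicitly instead (illustrative paths).
        // FileInputFormat.addInputPath(job, new Path("/input/logfile1"));
        // FileInputFormat.addInputPath(job, new Path("/input/logfile2"));

        FileOutputFormat.setOutputPath(job, new Path("/output/wordcount"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}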

Related

How does Hadoop run the Java reduce function on the DataNodes?

I am confused about how the DataNodes in a Hadoop cluster run the Java code for the reduce function of a job. How does Hadoop send Java code to another computer to execute?
Does Hadoop inject Java code into the nodes? If so, where is that Java code located in Hadoop?
Or are the reduce functions run on the master node rather than on the DataNodes?
Help me trace the code path by which the master node sends the Java code for the reduce function to a DataNode.
Here is what happens, step by step:
You run the job on the client using the hadoop jar command, passing the jar file name, the class name, and other parameters such as the input and output paths.
The client gets a new application id and then copies the jar file and other job resources to HDFS with a high replication factor (by default 10 on large clusters).
The client then actually submits the application through the ResourceManager.
The ResourceManager keeps track of cluster utilization and launches the application master (which coordinates the job execution).
The application master talks to the NameNode to determine where the blocks of the input are located, and then works with the NodeManagers to submit the tasks (in the form of containers).
Containers are nothing but JVMs, and they run the map and reduce tasks (the mapper and reducer classes). When a JVM is bootstrapped, the job resources that were placed on HDFS are copied to it. For mappers, these JVMs are created on the same nodes on which the data exists; once processing starts, the jar file is executed to process the data locally on that machine (in the typical case).
To answer your question: the reducer will run on one or more data nodes as part of these containers. The Java code is copied as part of the bootstrap process (when the JVM is created), and the data is fetched from the mappers over the network.
No. Reduce functions are executed on data nodes.
Hadoop transfers the packaged code (jar files) to the data nodes that are going to process the data. At run time, the data nodes download this code and run their tasks.
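The mechanism hinges on the driver telling Hadoop which jar contains the user code; a minimal sketch (the class and job names are illustrative, input/output setup omitted):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class JobSubmissionSketch {

    // Trivial pass-through mapper and reducer, just to have classes to ship.
    public static class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> { }
    public static class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> { }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "example");
        // setJarByClass tells Hadoop which jar to ship: the framework finds the
        // jar containing this class, uploads it to HDFS as a job resource, and
        // the task JVMs on the data nodes pull it onto their classpath before
        // instantiating the mapper and reducer classes.
        job.setJarByClass(JobSubmissionSketch.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        // Input and output paths omitted for brevity.
        job.waitForCompletion(true);
    }
}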

Which replica of an input block is processed in the mapper?

I am building a simple I/O monitoring system for MapReduce jobs written in Java, and at the map stage of a job I want to log information about the locations of the processed data.
At the map stage, a MapReduce job processes an input split, which consists of one or more file blocks in HDFS.
Those blocks have several replicas (usually 3).
Is it possible to know which replicas of these blocks were used while reading in the mapper?
In other words, can I get the full path of the particular file in the local filesystem from which the mapper reads?
In HDFS the blocks are replicated, and the NameNode does not record which replica was used for a read; a replica is chosen for an operation based on network latency and the load on each machine.
A file in HDFS is divided into blocks. The full path of the file in HDFS is stored as NameNode metadata, and each block is identified by a block id.
The property dfs.datanode.data.dir in hdfs-site.xml gives the directories on each DataNode's local filesystem where the blocks are stored (dfs.namenode.name.dir, by contrast, is where the NameNode keeps its metadata).
Based on your requirement, if you want the local-filesystem path where a block is stored, read the value of this property, identify the block id from the NameNode metadata, and you will then be able to find programmatically the exact local file that backs a given block of an HDFS file.
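The standard Java API will at least tell you which hosts hold replicas of each block of a file (though not which replica a given read actually hit); a minimal sketch, with a hypothetical NameNode address and file path:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        FileStatus status = fs.getFileStatus(new Path("/input/logfile1"));

        // One BlockLocation per block; each lists the DataNodes holding replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}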

Moving data to HDFS with Storm

I need some help understanding how HDFS and Storm are integrated. Storm can process an incoming stream of data using many nodes. My data is, let's say, log entries from different machines. So how do I store all of that? Ideally I'd like to store the logs from one machine in one or more files dedicated to that machine. How does this work? Will I be able to append to the same file in HDFS from many different Storm nodes?
PS: I'm still working on getting all of this running, so I can't test it physically... but it does bother me.
No, you cannot write to the same file from more than one task at a time. Each task would need to write to its own file in a directory, and then you could process them using directory/* if you are using Hadoop.
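A minimal sketch of that per-task approach using the plain HDFS Java API (the NameNode URI, directory and task id are hypothetical; in practice the storm-hdfs integration is the usual way to do this):
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerTaskHdfsWriter {
    private final FSDataOutputStream out;

    // Each Storm task opens its own file, named after its task id,
    // so no two writers ever touch the same HDFS file.
    public PerTaskHdfsWriter(String hdfsUri, String dir, int taskId) throws Exception {
        FileSystem fs = FileSystem.get(URI.create(hdfsUri), new Configuration());
        this.out = fs.create(new Path(dir, "logs-task-" + taskId + ".txt"));
    }

    public void write(String line) throws Exception {
        out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
    }

    public void close() throws Exception {
        out.close();
    }
}
A downstream MapReduce job can then take the whole directory (e.g. /storm/logs/*) as its input.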

How to distribute the initial input files to nodes in Hadoop MapReduce?

I have a Hadoop cluster with two computers, one as the master and the other as a slave. My input data is present on the local disk of the master, and I have also copied the input data files into HDFS. Now my question is: if I run a MapReduce job on this cluster, the whole input file is present on only one system (which I think goes against MapReduce's basic principle of "data locality"). I would like to know whether there is any mechanism to distribute/partition the initial files so that the input can be spread across the different nodes of the cluster.
Let's say your cluster is composed of Node 1 and Node 2. If Node 1 is the master, then there is no DataNode running on that node. So you have a DataNode only on Node 2, and I'm not sure what you mean by "so that the input files can be distributed on the different nodes of the cluster", because with your current setup you have just one node on which data can be stored.
But if you consider a generic n-node cluster, then when you copy data into HDFS, the data is distributed onto the different nodes of the cluster by Hadoop itself, so you don't have to worry about that.
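For reference, that copy-into-HDFS step can also be done from Java; a minimal sketch (NameNode URI and paths are hypothetical), after which HDFS itself spreads the blocks across the available DataNodes:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://master:8020"), conf);

        // Copy a file from the master's local disk into HDFS; block placement
        // across the DataNodes is handled by HDFS, not by this client.
        fs.copyFromLocalFile(new Path("/data/local/input.txt"),
                             new Path("/user/hadoop/input/input.txt"));
        fs.close();
    }
}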

Access to HDFS files from all computers of a cluster

My Hadoop program was originally launched in local mode, and now my goal is to run it in fully distributed mode. To do that, the files read in the reducer and mapper functions must be accessible from all computers of the cluster, which is why I asked the question at http://answers.mapr.com/questions/4444/syntax-of-option-files-in-hadoop-script (also, since it is not known on which computer the mapper function will be executed (in the program's logic there is only one mapper, and the program will be launched with only one mapper), the file arriving at the mapper's input must also be accessible on the whole cluster). So my question is: is it possible to use HDFS files directly, i.e. to copy the files from the Linux filesystem into HDFS beforehand (thereby, as I assume, making them available on all computers of the cluster; please correct me if that is not so) and then use the HDFS Java API to read these files in the reducer and mapper functions executing on the cluster's computers?
If the answer is positive, please give an example of copying files from the Linux filesystem into HDFS and of reading those files in the Java program by means of the HDFS Java API, reading their contents into a Java string.
Copy all your input files to the master node (this can be done using scp).
Then log in to your master node (ssh) and execute something like the following to copy files from the local filesystem to HDFS:
hadoop fs -put $localfilelocation $destination
Now in your Hadoop jobs, you can use hdfs:///$destination as the input. There is no need for any extra API just to read the job's input from HDFS.
If you really do want to read files from HDFS and use them as additional information beyond the input files, then refer to this.
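Since the question explicitly asks for it, here is a minimal sketch (hypothetical NameNode URI and file path) of reading an HDFS file into a Java String with the HDFS Java API; the copy step itself is just the hadoop fs -put command shown above:
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadHdfsFileToString {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Open the HDFS file and copy its bytes into an in-memory buffer.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (FSDataInputStream in = fs.open(new Path("/user/hadoop/side-data.txt"))) {
            IOUtils.copyBytes(in, buffer, conf, false);
        }
        String contents = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        System.out.println(contents);
        fs.close();
    }
}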
