Hi, I have written a MapReduce job which generically parses XML files. I am able to parse an XML file and get all the key-value pairs generated properly. I have 6 different keys and their corresponding values, so I am running 6 reducers in parallel.
The problem I am facing is that the reducer is putting two different key-value pairs in the same file and the remaining 4 keys in individual files. So in short, out of the 6 reducer output files I am getting 4 files with a single key-value pair, 1 file with two key-value pairs, and 1 file with nothing.
I did some research on Google and various forums, and the only thing I concluded is that I need a partitioner to solve this problem. I am new to Hadoop, so can someone shed some light on this issue and help me solve it?
I am working on a single-node pseudo-distributed cluster and using Java as the programming language. I am not able to share the code here, but I have tried to describe the problem in brief.
Let me know if more information is needed, and thanks in advance.
Having only 6 keys for 6 reducers isn't the best utilisation of Hadoop - while it would be nice for each of the 6 keys to go to a separate reducer, that isn't guaranteed.
Keys cannot be split across reducers, so if you had fewer than 6 keys only a subset of your reducers would have any work to do. You should consider rethinking your key assignment (and perhaps whether your input files are a good fit for Hadoop) so that there are enough keys to be spread somewhat evenly amongst the reducers.
EDIT: I believe what you might be after is MultipleOutputFormat, which has the method generateFileNameForKeyValue(key, value, name), allowing you to generate one output file per key rather than just one file per reducer.
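For illustration, a minimal sketch of that approach with the old mapred API (where MultipleOutputFormat lives) could look like the following; the class name and the Text/Text key-value types are my own assumptions, not taken from your job:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // Prefix the default part name with the key, so each key lands in its own file,
        // e.g. "title-part-00000" instead of just "part-00000".
        return key.toString() + "-" + name;
    }
}

You would register it with conf.setOutputFormat(KeyBasedOutputFormat.class) on the JobConf; if you are on the newer mapreduce API, MultipleOutputs offers a similar effect.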
By default, Hadoop uses HashPartitioner, which looks something like this:
public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {

  public void configure(JobConf job) {}

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
The expression (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks returns a number between 0 and numReduceTasks - 1, so in your case the range is 0 to 5, since numReduceTasks = 6.
The catch is in that line itself - two different keys may produce the same number.
And, as a result, two different keys can go to the same reducer.
For example,
("go".hashCode() & Integer.MAX_VALUE) % 6
will return you 4 and,
("hello".hashCode() & Integer.MAX_VALUE) % 6
will also return you 4.
So, what I would suggest here is that if you want to be sure that all 6 of your keys get processed by 6 different reducers, you need to create your own partitioner to get what you desire.
Check out this link for creating a custom partitioner if you have any confusion, and register your custom partitioner on the Job class like this:
job.setPartitionerClass(YourPartitioner.class);
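For illustration, a minimal sketch of such a partitioner might look like the following; it assumes the six key names are known up front, and the names below are placeholders rather than your actual keys:

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SixKeyPartitioner extends Partitioner<Text, Text> {

    // One slot per expected key; the position decides which reducer gets which key.
    private static final List<String> KEYS =
            Arrays.asList("key1", "key2", "key3", "key4", "key5", "key6");

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        int index = KEYS.indexOf(key.toString());
        if (index >= 0) {
            return index % numReduceTasks;
        }
        // Fall back to hash partitioning for any unexpected key.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Together with job.setNumReduceTasks(6), this sends each of the six keys to its own reducer.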
Hope this helps.
Related
I came across the following code snippet in Apache Spark:
JavaRDD<String> lines = new JavaSparkContext(sparkSession.sparkContext()).textFile("src\\main\\resources\\data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2(s, 1));
System.out.println(pairs.collect());
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
System.out.println("Reduced data: " + counts.collect());
My data.txt is as follows:
Mahesh
Mahesh
Ganesh
Ashok
Abnave
Ganesh
Mahesh
The output is:
[(Mahesh,1), (Mahesh,1), (Ganesh,1), (Ashok,1), (Abnave,1), (Ganesh,1), (Mahesh,1)]
Reduced data: [(Ganesh,2), (Abnave,1), (Mahesh,3), (Ashok,1)]
While I understand how the first line of output is obtained, I don't understand how the second line is obtained, that is, how the JavaPairRDD<String, Integer> counts is formed by reduceByKey.
I found that the signature of reduceByKey() is as follows:
public JavaPairRDD<K,V> reduceByKey(Function2<V,V,V> func)
The signature of Function2.call() (see http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/api/java/function/Function2.html#call(T1, T2)) is as follows:
R call(T1 v1, T2 v2) throws Exception
The explanation of reduceByKey() reads as follows:
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/ parallelism level.
Now this explanation sounds somewhat confusing to me. Maybe there is something more to the functionality of reduceByKey(). By looking at the input and output of reduceByKey() and Function2.call(), I feel that reduceByKey() somehow sends values of the same key to call() in pairs, but that simply isn't clear to me. Can anyone explain precisely how reduceByKey() and Function2.call() work together?
As its name implies, reduceByKey() reduces data based on the lambda function you pass to it.
In your example, this function is a simple adder: for a and b, return a + b.
The best way to understand how the result is formed is to imagine what happens internally. The ByKey() part groups your records by their key values. In your example, you'll have 4 different sets of pairs:
Set 1: ((Mahesh, 1), (Mahesh, 1), (Mahesh, 1))
Set 2: ((Ganesh, 1), (Ganesh, 1))
Set 3: ((Ashok, 1))
Set 4: ((Abnave, 1))
Now, the reduce part will try to reduce the previous 4 sets using the lambda function (the adder):
For Set 1: (Mahesh, 1 + 1 + 1) -> (Mahesh, 3)
For Set 2: (Ganesh, 1 + 1) -> (Ganesh, 2)
For Set 3: (Ashok , 1) -> (Ashok, 1) (nothing to add)
For Set 4: (Abnave, 1) -> (Abnave, 1) (nothing to add)
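To make the reduce step concrete for Set 1: Spark combines the values two at a time with your Function2, roughly call(1, 1) = 2 and then call(2, 1) = 3, which yields (Mahesh, 3). The exact order and placement of those calls isn't guaranteed (part of them may run map-side), which is why the function has to be associative.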
Function signatures can sometimes be confusing as they tend to be more generic.
I'm assuming you understand groupByKey? groupByKey groups all values for a certain key into a list (or iterable) so that you can do something with that - like, say, sum (or count) the values. Basically, what sum does is reduce a list of many values into a single value. It does so by iteratively adding two values to yield one value, and that is what Function2 needs to do when you write your own: it needs to take in two values and return one value.
reduceByKey does the same as groupByKey, BUT it performs what is called a "map-side reduce" before shuffling data around. Because Spark distributes data across many different machines to allow for parallel processing, there is no guarantee that data with the same key is placed on the same machine. Spark thus has to shuffle data around, and the more data that needs to be shuffled the longer our computations take, so it's a good idea to shuffle as little data as possible.
In a map-side reduce, Spark will first sum all the values for a given key locally on the executors before it sends (shuffles) the result around for the final sum to be computed. This means that much less data - a single value instead of a list of values - needs to be sent between the different machines in the cluster, and for this reason reduceByKey is usually preferable to groupByKey.
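As a rough sketch, the two approaches look like this on the pairs RDD from the question (the variable names here are mine):

// reduceByKey: values for a key are combined two at a time, partly on the map side
JavaPairRDD<String, Integer> reduced = pairs.reduceByKey((a, b) -> a + b);

// groupByKey + sum: all values are shuffled first, then added up afterwards
JavaPairRDD<String, Integer> grouped = pairs.groupByKey().mapValues(vals -> {
    int sum = 0;
    for (int v : vals) {
        sum += v;
    }
    return sum;
});

Both produce the same counts; reduceByKey just moves less data across the network.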
For a more detailed description, I can recommend this article :)
Suppose I have a big tsv file with this kind of information:
2012-09-22 00:00:01.0 249342258346881024 47268866 0 0 0 bo
2012-09-22 00:00:02.0 249342260934746115 1344951 0 0 4 ot
2012-09-22 00:00:02.0 249342261098336257 346095334 1 0 0 ot
2012-09-22 00:05:02.0 249342261500977152 254785340 0 1 0 ot
I want to implement a MapReduce job that enumerates time intervals of five minutes and filter some information of the tsv inputs. The output file would look like this:
0 47268866 bo
0 134495 ot
0 346095334 ot
1 254785340 ot
The key is the number of the interval, e.g., 0 refers to the interval from 2012-09-22 00:00:00.0 to 2012-09-22 00:04:59.
I don't know if this problem doesn't fit the MapReduce approach or if I'm not thinking about it right. In the map function, I'm just passing the timestamp as the key and the filtered information as the value. In the reduce function, I count the intervals using global variables and produce the output mentioned above.
i. Does the framework determine the number of reducers automatically, or is it user defined? With one reducer, I think there is no problem with my approach, but I'm wondering if a single reducer can become a bottleneck when dealing with really large files, can it?
ii. How can I solve this problem with multiple reducers?
Any suggestions would be really appreciated!
Thanks in advance!
EDIT:
The first question was answered by @Olaf, but the second still leaves me with some doubts regarding parallelism. The output of my map function is currently this (I'm just passing the timestamp with minute precision):
2012-09-22 00:00 47268866 bo
2012-09-22 00:00 344951 ot
2012-09-22 00:00 346095334 ot
2012-09-22 00:05 254785340 ot
So in the reduce function I receive inputs where the key represents the minute when the information was collected and the values are the information itself, and I want to enumerate five-minute intervals beginning with 0. I'm currently using a global variable to store the beginning of the interval, and when a key goes beyond it I increment the interval counter (which is also a global variable).
Here is the code:
private long stepRange = TimeUnit.MINUTES.toMillis(5);
private long stepInitialMillis = 0;
private int stepCounter = 0;

@Override
public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    long millis = Long.valueOf(key.toString());
    if (stepInitialMillis == 0) {
        stepInitialMillis = millis;
    } else {
        if (millis - stepInitialMillis > stepRange) {
            stepCounter = stepCounter + 1;
            stepInitialMillis = millis;
        }
    }
    for (Text value : values) {
        context.write(new Text(String.valueOf(stepCounter)),
                new Text(key.toString() + "\t" + value));
    }
}
So, with multiple reducers, my reduce function will run on two or more nodes, in two or more JVMs, and I will lose the control provided by the global variables, and I can't think of a workaround for my case.
The number of reducers depends on the configuration of the cluster, although you can limit the number of reducers used by your MapReduce job.
A single reducer would indeed become a bottleneck in your MapReduce job if you are dealing with any significant amount of data.
The Hadoop MapReduce engine guarantees that all values associated with the same key are sent to the same reducer, so your approach should work with multiple reducers. See the Yahoo! tutorial for details: http://developer.yahoo.com/hadoop/tutorial/module4.html#listreducing
EDIT: To guarantee that all values for the same time interval go to the same reducer, you would have to use some unique identifier of the time interval as the key. You would have to do it in the mapper. I'm reading your question again and, unless you want to somehow aggregate the data between the records corresponding to the same time interval, you don't need any reducer at all.
EDIT: As @SeanOwen pointed out, the number of reducers depends on the configuration of the cluster. Usually it is configured to between 0.95 and 1.75 times the maximum number of reduce tasks per node times the number of data nodes. If the mapred.reduce.tasks value is not set in the cluster configuration, the default number of reducers is 1.
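If you want to set it yourself, a quick sketch of the usual options (the value 4 is just an example) would be:

// in the driver, before submitting the job
job.setNumReduceTasks(4);

// or on the command line, using the old-style property name
// hadoop jar myjob.jar ... -D mapred.reduce.tasks=4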
It looks like you're wanting to aggregate some data by five-minute blocks. Map-reduce with Hadoop works great for this sort of thing! There should be no reason to use any "global variables". Here is how I would set it up:
The mapper reads one line of the TSV. It grabs the timestamp, and computes which five-minute bucket it belongs in. Make that into a string, and emit it as the key, something like "20120922:0000", "20120922:0005", "20120922:0010", etc. As for the value that is emitted along with that key, just keep it simple to start with, and send on the whole tab-delimited line as another Text object.
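As a hedged sketch of such a mapper (the timestamp format, the assumption that the timestamp is the first tab-separated field, and the class name all come from your sample data rather than tested code):

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IntervalMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final long FIVE_MINUTES = TimeUnit.MINUTES.toMillis(5);
    private final SimpleDateFormat inFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S");
    private final SimpleDateFormat outFormat = new SimpleDateFormat("yyyyMMdd:HHmm");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String timestamp = line.split("\t", 2)[0];  // first tab-separated field
        try {
            long millis = inFormat.parse(timestamp).getTime();
            long bucketStart = (millis / FIVE_MINUTES) * FIVE_MINUTES;  // round down to the 5-minute boundary
            // Emit the bucket's start time as the key, e.g. "20120922:0000"
            context.write(new Text(outFormat.format(new Date(bucketStart))), value);
        } catch (ParseException e) {
            // skip malformed lines in this sketch
        }
    }
}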
Now that the mapper has determined how the data needs to be organized, it's the reducer's job to do the aggregation. Each reducer will get a key (one of the five-minute buckets), along with the list of all the lines that fall into that bucket. It can iterate over that list and extract whatever it wants from it, writing output to the context as needed.
As for the mappers, just let Hadoop figure that part out. Set the number of reducers to the number of nodes you have in the cluster as a starting point. It should run just fine.
Hope this helps.
Just for learning, I tried to modify the word count example and added a partitioner. I understood that by writing a customized partitioner we can control the number of reduce tasks that get created. This is good.
But one question I am not able to understand is whether the number of output files generated in HDFS depends on the number of reduce tasks or on the number of reduce() calls made within each reduce task.
(For each reduce task there can be many reduce() calls.)
Let me know if any other detail is needed. The code is very basic so I am not posting it.
I think your perception that writing a customized partitioner can control the number of reduce tasks that get created is wrong. Please check the following explanation:
The partitioner actually determines which reducer a key and its list of values are sent to, based on the hash value of the key, as shown below.
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
Now, the number of output files generated depends on the number of reduce tasks you have asked the job to run. So say you configured 3 reduce tasks for the job, and say you wrote a custom partitioner which ends up sending keys to only 2 of the reducers. In this case you will find an empty part-r-00002 output file for the third reducer, as it did not receive any records after partitioning. This empty part file can be avoided using LazyOutputFormat.
Ex: import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
I hope this clears your doubt.
Background:
I’m trying to make a “document-term” matrix in Java on Hadoop using MapReduce. A document-term matrix is like a huge table where each row represents a document and each column represents a possible word/term.
Problem Statement:
Assuming that I already have a term index list (so that I know which term is associated with which column number), what is the best way to look up the index for each term in each document so that I can build the matrix row-by-row (i.e., document-by-document)?
So far I can think of two approaches:
Approach #1:
Store the term index list on the Hadoop distributed file system. Each time a mapper reads a new document for indexing, spawn a new MapReduce job -- one job for each unique word in the document, where each job queries the distributed terms list for its term. This approach sounds like overkill, since I am guessing there is some overhead associated with starting up a new job, and since this approach might call for tens of millions of jobs. Also, I’m not sure if it’s possible to call a MapReduce job within another MapReduce job.
Approach #2:
Append the term index list to each document so that each mapper ends up with a local copy of the term index list. This approach is pretty wasteful with storage (there will be as many copies of the term index list as there are documents). Also, I’m not sure how to merge the term index list with each document -- would I merge them in a mapper or in a reducer?
Question Update 1
Input File Format:
The input file will be a CSV (comma separated value) file containing all of the documents (product reviews). There is no column header in the file, but the values for each review appear in the following order: product_id, review_id, review, stars. Below is a fake example:
"Product A","1","Product A is very, very expensive.","2"
"Product G","2","Awesome product!!","5"
Term Index File Format:
Each line in the term index file consists of the following: an index number, a tab, and then a word. Each possible word is listed only once in the index file, such that the term index file is analogous to what could be a list of primary keys (the words) for an SQL table. For each word in a particular document, my tentative plan is to iterate through each line of the term index file until I find the word. The column number for that word is then defined as the column/term index associated with that word. Below is an example of the term index file, which was constructed using the two example product reviews mentioned earlier.
1 awesome
2 product
3 a
4 is
5 very
6 expensive
Output File Format:
I would like the output to be in the “Matrix Market” (MM) format, which is the industry standard for compressing matrices with many zeros. This is the ideal format because most reviews will contain only a small proportion of all possible words, so for a particular document it is only necessary to specify the non-zero columns.
The first row in the MM format has three tab separated values: the total number of documents, the total number of word columns, and the total number of lines in the MM file excluding the header. After the header, each additional row contains the matrix coordinates associated with a particular entry, and the value of the entry, in this order: reviewID, wordColumnID, entry (how many times this word appears in the review). For more details on the Matrix Market format, see this link: http://math.nist.gov/MatrixMarket/formats.html.
Each review’s ID will equal its row index in the document-term matrix. This way I can preserve the review’s ID in the Matrix Market format so that I can still associate each review with its star rating. My ultimate goal -- which is beyond the scope of this question -- is to build a natural language processing algorithm to predict the number of stars in a new review based on its text.
Using the example above, the final output file would look like this (I can't get Stackoverflow to show tabs instead of spaces):
2 6 7
1 2 1
1 3 1
1 4 1
1 5 2
1 6 1
2 1 1
2 2 1
Well, you can use something analogous to an inverted index.
I'm suggesting this because I'm assuming both files are big, so comparing them against each other record by record would be a real performance bottleneck.
Here's a way that can be used -
You can feed both the Input File Format csv file(s) (say, datafile1, datafile2) and the term index file (say, term_index_file) as input to your job.
Then in each mapper, you branch on the source file name, something like this -
Pseudo code for mapper -
map(key, row, context) {
    String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
    if (filename.startsWith("datafile")) {
        // split the review_id and the words out of the row
        ....
        context.write(new Text(word), new Text("-1" + "|" + review_id));
    } else if (filename.startsWith("term_index_file")) {
        // split the index and the word
        ....
        context.write(new Text(word), new Text(index + "|" + "0"));
    }
}
e.g. output from different mappers
Key Value source
product -1|1 datafile
very 5|0 term_index_file
very -1|1 datafile
product -1|2 datafile
very -1|1 datafile
product 2|0 term_index_file
...
...
Explanation (the example):
As the example clearly shows, the key will be your word and the value will be made of two parts separated by the delimiter "|".
If the source is a datafile then you emit key=product and value=-1|1, where -1 is a dummy element and 1 is a review_id.
If the source is a term_index_file then you emit key=product and value=2|0, where 2 is the index of the word 'product' and 0 is a dummy review_id, which we will use for sorting - explained later.
Definitely, no duplicate index will be processed by two different mappers as long as we provide the term_index_file as a normal input file to the job.
So, 'product', 'very' or any other indexed word in the term_index_file will only be available to one mapper. Note this is only valid for the term_index_file, not the datafile.
Next step:
The Hadoop MapReduce framework, as you might well know, will group values by key.
So, you will have something like this going to different reducers,
reduce-1: key=product, value=<-1|1, -1|2, 2|0>
reduce-2: key=very, value=<5|0, -1|1, -1|1>
But we have a problem in the above case: we want the values sorted by the part after the '|', i.e. in reduce-1 -> 2|0, -1|1, -1|2 and in reduce-2 -> 5|0, -1|1, -1|1.
To achieve that you can use a secondary sort, implemented with a sort comparator. Please google for it; here's a link that might help - covering it here would get really lengthy.
In reduce-1, since the values are sorted as above, the first iteration gives us the '0' value and with it index_id = 2, which can then be used in the subsequent iterations. In the next two iterations we get review ids 1 and 2 consecutively, and we use a counter to keep track of repeated review ids. A repeated review id means that the word appeared more than once in the same review_id row. We reset the counter only when we find a different review_id, and at that point we emit the previous review_id's details for the particular index_id, something like this -
previous_review_id + "\t" + index_id + "\t" + count
When the loop ends, we'll be left with one last previous_review_id, which we finally emit in the same fashion.
Pseudo code for reducer -
reduce(key, Iterable values, context) {
    String index_id = null;
    int count = 1;
    String previousReview_id = null;
    for (value : values) {
        String[] split = value.toString().split("\\|");
        ....
        // when consecutive review_ids are the same, we increment count,
        // and as soon as the review_id differs, we emit, reset the counter
        // and remember the new review_id.
        if (split[0].equals("-1") && split[1].equals(previousReview_id)) {
            count++;
        } else if (split[0].equals("-1") && !split[1].equals(previousReview_id)) {
            if (previousReview_id != null) {
                context.write(previousReview_id + "\t" + index_id + "\t" + count);
            }
            previousReview_id = split[1]; // resetting with the new review_id
            count = 1;                    // resetting count for the new review_id
        } else {
            index_id = split[0];
        }
    }
    // the last previousReview_id will be left out,
    // so we write it now, after the loop completes
    context.write(previousReview_id + "\t" + index_id + "\t" + count);
}
This job is done with multiple reducers in order to leverage Hadoop for what it is best known for - performance. As a result, the final output will be scattered, something like the following, deviating from your desired output.
1 4 1
2 1 1
1 5 2
1 2 1
1 3 1
1 6 1
2 2 1
But if you want everything to be sorted according to the review_id (as in your desired output), you can write one more job that does that for you, using a single reducer and the output of the previous job as input. At the same time it can calculate the header line 2 6 7 and put it at the front of the output.
This is just an approach (or an idea) that I think might help you. You will definitely want to modify it, plug in a better algorithm and use it in whatever way benefits you.
You can also use composite keys for better clarity than using a delimiter such as "|".
I am open to any clarification. Please ask if you think it might be useful to you.
Thank you!
You can load the term index list into the Hadoop distributed cache so that it is available to the mappers and reducers. For instance, in Hadoop Streaming, you can run your job as follows:
$ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myMapper.py \
-reducer myReducer.py \
-file myMapper.py \
-file myReducer.py \
-file myTermIndexList.txt
Now in myMapper.py you can load the file myTermIndexList.txt and use it for your purposes. If you give a more detailed description of your input and desired output I can give you more details.
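If you end up writing the job in Java instead of streaming, a rough sketch of the same idea with the Hadoop 2 distributed-cache API would be the following (the path and names are mine, not something from your setup):

// In the driver, before submitting the job:
job.addCacheFile(new URI("/shared/myTermIndexList.txt"));

// In the mapper, load the cached copy once per task:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    URI[] cacheFiles = context.getCacheFiles();
    // cacheFiles[0] points at myTermIndexList.txt, localized on each task's node;
    // open it and read it into a Map<String, Integer> of term -> column index.
}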
Approach #1 is not good, but it is very common if you don't have much Hadoop experience. Starting jobs is very expensive. What you are going to want to do is have 2-3 jobs that feed each other to get the desired result. A common solution to similar problems is to have the mapper tokenize the input and output pairs, group them in the reducer while executing some kind of calculation, and then feed that into job 2. In the mapper of job 2 you invert the data in some way, and in its reducer you do some other calculation.
I would highly recommend learning more about Hadoop through a training course. Interestingly, Cloudera's developer course has a very similar problem to the one you are trying to address. Alternatively, or perhaps in addition to a course, I would look at "Data-Intensive Text Processing with MapReduce", specifically the sections on "Computing Relative Frequencies" and "Inverted Indexing for Text Retrieval":
http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
Okay, so I have been reading a lot about Hadoop and MapReduce, and maybe it's because I'm not as familiar with iterators as most, but I have a question I can't seem to find a direct answer to. Basically, as I understand it, the map function is executed in parallel by many machines and/or cores. Thus, whatever you are working on must not depend on prior code having been executed for the program to make any kind of speed gains. This works perfectly for me, but what I'm doing requires me to test information in small batches. Basically I need to send batches of lines of a .csv as arrays of 32, 64, 128 or however many lines each. Like lines 0-127 go to core1's execution of the map function, lines 128-255 go to core2's, etc. Also I need to have the contents of each batch available as a whole inside the function, as if I had passed it an array. I read a little about how the new Java API allows for something called push and pull, and that this allows things to be sent in batches, but I couldn't find any example code. I don't know; I'm going to continue researching, and I'll post anything I find, but if anyone knows, could they please post in this thread. I would really appreciate any help I might receive.
edit
If you could simply ensure that the chunks of the .csv are sent in sequence, you could perform it this way. I guess this also assumes that there are globals in MapReduce.
//** concept, not code **//
GLOBAL_COUNTER = 0;
GLOBAL_ARRAY = NEW ARRAY();

map() {
    GLOBAL_ARRAY[GLOBAL_COUNTER] = ITERATOR_VALUE;
    GLOBAL_COUNTER++;
    if (GLOBAL_COUNTER == 128) {
        // EXECUTE TEST WITH AN ARRAY OF 128 VALUES FOR COMPARISON
        GLOBAL_COUNTER = 0;
    }
}
If you're trying to get a chunk of lines from your CSV file into the mapper, you might consider writing your own InputFormat/RecordReader and potentially your own WritableComparable object. With the custom InputFormat/RecordReader you'll be able to specify how objects are created and passed to the mapper based on the input you receive.
If the mapper is doing what you want, but you need these chunks of lines sent to the reducer, make the output key for the mapper the same for each line you want in the same reduce function.
The default TextInputFormat will give input to your mapper like this (the keys/offsets in this example are just random numbers):
0 Hello World
123 My name is Sam
456 Foo bar bar foo
Each of those lines will be read into your mapper as a key,value pair. Just modify the key to be the same for each line you need and write it to the output:
0 Hello World
0 My name is Sam
1 Foo bar bar foo
The first time the reduce function is called, it will receive a key-value pair with the key being "0" and the value being an Iterable object containing "Hello World" and "My name is Sam". You'll be able to access both of these values in the same reduce method call by using the Iterable object.
Here is some pseudo code:
int count = 0

map(key, value) {
    int newKey = count / 2
    context.write(newKey, value)
    count++
}

reduce(key, values) {
    for value in values
        // Do something to each line
}
Hope that helps. :)
If the end goal of what you want is to force certain sets to go to certain machines for processing you want to look into writing your own Partitioner. Otherwise, Hadoop will split data automatically for you depending on the number of reducers.
I suggest reading the tutorial on the Hadoop site to get a better understanding of M/R.
If you simply want to send N lines of input to a single mapper, you can use the NLineInputFormat class. You could then do the line parsing (splitting on commas, etc.) in the mapper.
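For example, a minimal sketch of the driver wiring might look like this (the batch size of 128 is just taken from your question):

// each map task receives 128 input lines, still delivered one record at a time
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 128);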
If you want to have access to the lines before and after the line the mapper is currently processing, you may have to write your own input format. Subclassing FileInputFormat is usually a good place to start. You could create an InputFormat that reads N lines, concatenates them, and sends them as one block to a mapper, which then splits the input into N lines again and begins processing.
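A rough sketch of that idea follows; the class names are mine, the batch size is hard-coded, and split boundaries are glossed over, so treat it as a starting point rather than a finished implementation:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultiLineInputFormat extends TextInputFormat {

    private static final int LINES_PER_RECORD = 128;  // batch size

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new MultiLineRecordReader();
    }

    // Wraps the standard LineRecordReader and concatenates LINES_PER_RECORD lines
    // into a single value, separated by '\n'; the mapper can split them apart again.
    public static class MultiLineRecordReader extends RecordReader<LongWritable, Text> {

        private final LineRecordReader lineReader = new LineRecordReader();
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            lineReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            StringBuilder batch = new StringBuilder();
            int lines = 0;
            while (lines < LINES_PER_RECORD && lineReader.nextKeyValue()) {
                if (lines == 0) {
                    key.set(lineReader.getCurrentKey().get());  // offset of the first line
                }
                batch.append(lineReader.getCurrentValue().toString()).append('\n');
                lines++;
            }
            if (lines == 0) {
                return false;  // no more input in this split
            }
            value.set(batch.toString());
            return true;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException { return lineReader.getProgress(); }

        @Override
        public void close() throws IOException { lineReader.close(); }
    }
}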
As far as globals in Hadoop go, you can specify some custom parameters when you create the job configuration, but as far as I know, you cannot change them in a worker and expect the change to propagate throughout the cluster. To set a job parameter that will be visible to workers, do the following where you are creating the job:
job.getConfiguration().set(Constants.SOME_PARAM, "my value");
Then, to read the parameter's value in the mapper or reducer:
public void map(Text key, Text value, Context context) {
    Configuration conf = context.getConfiguration();
    String someParam = conf.get(Constants.SOME_PARAM);
    // use someParam in processing input
}
Hadoop has support for basic types such as int, long, String, boolean, etc. to be used as parameters.