I/O time when reading from HDFS in Hadoop

I/O time when reading from HDFS in Hadoop - java

I would like to measure the time taken for map and reduce when performing I/O (reading from HDFS) in Hadoop. I am using Yarn.
Hadoop 2.6.0.
What are the options for that?

If you need exact measurments - you could use btrace, add it as a javaagent to your tasks via mapreduce.{map,reduce}.java.opts - and then write script which measures whatever you like. Sample of btrace scripts are here.
Also there is HTrace - that might also be helpful.

One rough estimation could be creating custom counters. For both mapper and reducer you could collect the timestamp when mapper(or reducer) starts processing and when it ends. From starting and ending timestamp, calculate and add it to custom counters, i.e mappers add to MAPPER_RUNNING_TIME and reducers add to REDUCER_RUNNING_TIME (or whatever name you would like to give it). When the execution is finished, subtract the aggregated value of your counters from MILLIS_MAPS and MILLIS_REDUCES respectively. You might need to look into Hadoop source code though to confirm if the staging time is or is not being included into MILLIS_MAPS and MILLIS_REDUCES. With this estimation you would need to take into account that the tasks are being executed concurrently, so the time will be rather total (or aggregated for all mappers and reducers).
I have not done this personally, but I think this solution could work unless you find better one.

Related

Modify mapper size (split size) in MapReduce to have a faster performance

Are there any ways to improve the MapReduce performance by changing the number of map tasks or changing the split sizes of each mapper?
For example, I have a 100GB text file and 20 nodes. I want to run a WordCount job on the text file, what is the ideal number of mappers or the ideal split size so that it can be done faster?
Would it be faster with more mappers?
Would it be faster with a smaller split size?
EDIT
I am using hadoop 2.7.1, just so you know there is YARN.

It is not necessarily faster when you use more mappers. Each mapper has a start up and setup time. In the early days of hadoop when mapreduce was the de facto standard it was said that a mapper should run ~10 minutes. Today the documentations recommends 1 minute. You can vary the number of map tasks by using setNumMapTasks(int) which you can define within the JobConf. IN the documentation of the method are very good information about the mapper count:
How many maps?
The number of maps is usually driven by the total size
of the inputs i.e. total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps
per-node, although it has been set up to 300 or so for very cpu-light
map tasks. Task setup takes awhile, so it is best if the maps take at
least a minute to execute.
The default behavior of file-based InputFormats is to split the input
into logical InputSplits based on the total size, in bytes, of input
files. However, the FileSystem blocksize of the input files is treated
as an upper bound for input splits. A lower bound on the split size
can be set via mapreduce.input.fileinputformat.split.minsize.
Thus, if you expect 10TB of input data and have a blocksize of 128MB,
you'll end up with 82,000 maps, unless setNumMapTasks(int) is used to
set it even higher.
Your question is probably related to this SO question.
To be honest, try to have a look at modern frameworks as well, like Apache Spark and Apache Flink.

Set the number of map tasks

While configuring a Map Reduce job, I know that one can set the number of reduce tasks by using the method job.setNumReduceTasks(2);.
Can we set the number of map tasks?
I don't see any methods to do this.
If there is no such functionality, does someone know why this framework has the ability to have more than 1 reduce task, but not more than 1 map task?

There used to be property for setting the number of map tasks, which was setNumMapTasks. Bur it was merely a hint to the framework, and could not guarantee that you'll get only the specified number of maps. The map creation is actually governed by the InputFormat you are using in your job. And this is the reason it is not supported anymore.
If you are not happy with the number of mappers created by the framework, you could try tweaking the values of following 2 properties as per your requirements :
- mapred.min.split.size
- mapred.max.split.size

Number of map tasks is not something the programmer sets,rather its something that the hadoop framework,in particular the TaskTracker that creates as many mappers as the number of input splits(generally of 64mb but can be changed) of the InputFile from HDFS...

Processing different files on separate nodes using Hadoop MapReduce

I have used Pig and Hive before but am new to Hadoop MapReduce. I need to write an application which has multiple small sized files as input (say 10). They have different file structures, so I want to process them parallelly on separate nodes so that they can be processed quickly. I know that the strong point of Hadoop is processing large data but these input files, though small, require a lot of processing so I was hoping to leverage Hadoop's parallel computing prowess. Is this possible?

It is possible but you're probably not going to get much value. You have these forces against you:
Confused input
You'll need to write a mapper which can handle all of the different input formats (either by detecting the input format, or using the filename of the input to decide which format to expect)
Multiple outputs
You need to either use the slightly tricky multiple output file handling functionality of Hadoop or write your output as a side effect of the reducer (or mapper if you can be sure that each file will go to a different node)
High Cost of initialization
Every hadoop map reduce job comes with a hefty start up cost, about 30 seconds on a small cluster, much more on a larger cluster. This point alone probably will lose you more time than you could ever hope to gain by parallelism.

In brief: give a try to NLineInputFormat.
There is no problem to copy all your input files to all nodes (you can put them to distributed cache if you like). What you really want to distribute is check processing.
With Hadoop you can create (single!) input control file in the format (filename,check2run) or (filename,format,check2run) and use NLineInputFormat to feed specified number of checks to your nodes (mapreduce.input.lineinputformat.linespermap controls number of lines feed to each mapper).
Note: Hadoop input format determines how splits are calculated; NLineInputFormat (unlike TextInputFormat) does not care about blocks.
Depending on the nature of your checks you may be able to compute linespermap value to cover all files/checks in one wave of mappers (or may be unable to use this approach at all :) )

Hadoop: Chaining jobs in 0.20.203

I currently have a task where i need to chain a few jobs in Hadoop.
What i am dong right now is that i have 2 jobs. My first job has a map function,a combiner and a reducer. Well i need one more phase of reduce so i created a second job with a simple map task that passes the output of the previous reducer to the final reducer.
I find that this is a bit "stupid" because there has to be a way to simply chain this. Moreover i think the I/Os would be decreased that way.
I am using the 0.20.203 version and i only find deprecated examples of ChainMapper and ChainReducer using JobConf.
I have found these:
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/lib/chain/ChainMapper.html
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/lib/chain/ChainReducer.html
that seems to work with Job class and not with the JobConf which is deprecated in 203, but there isn't any package that contains these classes in 203.

You can consider using oozie. Creating a workflow would be much easier.

How to combine multiple Hadoop MapReduce Jobs into one?

I have a massive amount of input data (that's why I use Hadoop) and there are multiple tasks that can be solved with various MapReduce steps of which the first mapper needs all the data as input.
My goal: Compute these different tasks as fast as possible.
I currently let them run sequentially each reading in all the data. I assume it will be faster when combining the tasks and executing their similar parts (like feeding all data to the mapper) only once.
I was wondering if and how I can combine these tasks. For every input key/value pair the mapper could emit a "super key" that includes a task id and the task specific key data along with a value. This way reducers would get key/value pairs for a task and a task-specific key and could decide when seeing the "superkey" which task to perform on the included key and values.
In pseudo code:
map(key, value):
emit(SuperKey("Task 1", IncludedKey), value)
emit(SuperKey("Task 2", AnotherIncludedKey), value)
reduce(key, values):
if key.taskid == "Task 1":
for value in values:
// do stuff with key.includedkey and value
else:
// do something else
The key could be a WritableComparable which can include all the necessary information.
Note: the pseudo code suggests a terrible architecture and it can definitely be done in a smarter way.
My questions are:
Is this a sensible approach?
Are there better alternatives?
Does it have some terrible drawback?
Would I need a custom Partitioner class for this approach?
Context: The data consists of some millions of RDF quadruples and the tasks are to calculate clusters, statistics and similarities. Some tasks can be solved easily with just Hadoop Counters in a reducer, but some need multiple MapReduce steps.
The computation will eventually take place on Amazon's Elastic MapReduce. All tasks are to be computed on the whole dataset and as fast as possible.

Is this a sensible approach?
There's nothing inherently wrong with it, other than the coupling of the maintenance of the different jobs' logic. I believe it will save you on some disk I/O, which could be a win if your disk is a bottleneck for your process (on small clusters this can be the case).
Are there better alternatives?
It may be prudent to write a somewhat framework-y Mapper and Reducer which each accept as configuration parameters references to the classes to which they should defer for the actual mapping and reducing. This may solve the aforementioned coupling of the code (maybe you've already thought of this).
Does it have some terrible drawback?
The only thing I can think of is that if one of the tasks' map logic fails to complete its work in a timely manner, the scheduler may fire up another node to process that piece of input data; this could result in duplicate work, but without knowing more about your process, it's hard to say whether this would matter much. The same would hold for the reducers.
Would I need a custom Partitioner class for this approach?
Probably, depending on what you're doing. I think in general if you're writing a custom output WritableComparable, you'll need custom partitioning as well. There may be some library Partitioner that could be configurable for your needs, though (such as KeyFieldBasedPartitioner, if you make your output of type Text and using String field-separators instead of rolling your own).
HTH. If you can give a little more context, maybe I could offer more advice. Good luck!

You can use:
Cascading
Oozie
Both are used to write workflows in hadoop.

I think Oozie is the best option for this. Its a workflow scheduler, where you can combine multiple hadoop jobs, where the output of one action node will be the input to the next action node. And if any of the action fails, then next time when u execute it again ,the scheduler starts from the point where the error was encountered.
http://www.infoq.com/articles/introductionOozie

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.