MapReduce: Increase number of concurrent mapper tasks - java

I am using AWS EMR to run a MapReduce job. My input set contains 1 million files of around 15KB each. Since the input files are very small, this leads to a huge number of mappers. So I changed the S3 block size to 20KB and used 5 r3.2xlarge instances, but the number of concurrent tasks running is still just 30. Shouldn't the job run more concurrent mappers now that the block size is reduced, or is the memory taken by each mapper still the same even after reducing the block size?
How can I limit the memory usage of each mapper or increase the number of concurrent mapper tasks? The current expected completion time is 100 hours. Will combining these files into a smaller number of bigger files, say 400MB each, increase the processing time?

Reducing the block size can increase the number of mappers required for a particular job, but it will not increase the number of mappers your cluster can run in parallel at any given point, nor will it change the memory used by each of those mappers.
used 5 r3.2xlarge instances but number of concurrent tasks running is still just 30
To find how many parallel maps/reducers a Hadoop 2 EMR cluster can support, please see this article: AWS EMR Parallel Mappers?
Ex: r3.2xlarge * 5 core nodes:
mapreduce.map.memory.mb = 3392
yarn.scheduler.maximum-allocation-mb = 54272
yarn.nodemanager.resource.memory-mb = 54272
One core node can have 54272 / 3392 = 16 mappers.
So the cluster can have a total of 16 * 5 = 80 mappers in parallel.
So if your job spins up, say, 1000 mappers, the cluster can launch 80 of them with the memory and heap preconfigured on your nodes, and the other mappers will simply be queued up.
If you want more parallel mappers, you might want to configure less memory (based on that math) and a smaller heap per mapper.
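Here is a minimal sketch of that idea in a job driver. The 1696 MB container size and -Xmx1356m heap are illustrative values chosen so the math works out (54272 / 1696 = 32 mappers per core node, i.e. 160 in parallel on five core nodes); they are not EMR defaults, so verify your mapper actually fits in that footprint.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SmallMapperJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Half of the 3392 MB default container: 54272 / 1696 = 32 mappers per core node.
        conf.setInt("mapreduce.map.memory.mb", 1696);
        // Keep the JVM heap below the container size (roughly 80% of it).
        conf.set("mapreduce.map.java.opts", "-Xmx1356m");

        Job job = Job.getInstance(conf, "small-files-job");
        job.setJarByClass(SmallMapperJob.class);
        // Identity mapper/reducer used here; plug in your own classes as needed.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}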

What you are looking for is CombineFileInputFormat.
Do remember that the map split size defaults to the HDFS block size, but they are separate settings: changing one will not affect the other.
Please follow this link: http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/
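If your input is plain text, one hedged sketch is to switch the job driver to CombineTextInputFormat (the text-oriented subclass of CombineFileInputFormat) and cap how many bytes get packed into a single split; the 128 MB cap below is purely illustrative. These lines drop into a standard Job driver:
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

// Combine many small files into a few large splits.
job.setInputFormatClass(CombineTextInputFormat.class);
// Upper bound on the bytes packed into one split (128 MB here, illustrative).
CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
With a million 15KB files (roughly 15 GB of input), this would bring the job down to around 120 splits instead of a million.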

Related

How to optimize Mapreduce Job

So I have a job that does in-mapper computing. With each task taking about 0.08 seconds, a 360,026-line file would take about 8 hours just for this step if it were done on one node. File sizes will generally be about the size of 1-2 blocks (often 200 MB or less).
Assuming the code is optimized, is there any way to tweak the settings? Should I be using a smaller block size, for example? I am currently using AWS EMR with c4.large instances and autoscaling on YARN, but it only went up to 4 extra task nodes, as the load wasn't too high. Even though YARN memory wasn't too high, it still took over 7 hours to complete (which is way too long).

How to configure jobTracker in hazelcast.xml to get optimized performance?

Below is the default configuration in hazelcast.xml:
<jobtracker name="default">
  <max-thread-size>0</max-thread-size>
  <!-- Queue size 0 means number of partitions * 2 -->
  <queue-size>0</queue-size>
  <retry-count>0</retry-count>
  <chunk-size>1000</chunk-size>
  <communicate-stats>true</communicate-stats>
  <topology-changed-strategy>CANCEL_RUNNING_OPERATION</topology-changed-strategy>
</jobtracker>
How should I update this configuration to get better performance for MapReduce in a Java application?
The values you normally want to tune are chunk-size and communicate-stats. The first property depends heavily on the way your MR job works and needs some trial and error; the goal is to keep the reducers busy all the time (so, depending on the reducing operation, use a bigger chunk size for heavy operations or smaller chunks for light ones). Setting communicate-stats to false deactivates the transmission of statistical information, which is normally not used anyway.
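For reference, a minimal sketch of the same tuning done programmatically, assuming the Hazelcast 3.x JobTrackerConfig API; the chunk size of 5000 is only an illustrative starting point for trial and error:
import com.hazelcast.config.Config;
import com.hazelcast.config.JobTrackerConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class TunedJobTrackerNode {
    public static void main(String[] args) {
        Config config = new Config();
        JobTrackerConfig jobTracker = config.getJobTrackerConfig("default");
        // Illustrative value: bigger chunks for heavy reduce operations,
        // smaller chunks for light ones, per the advice above.
        jobTracker.setChunkSize(5000);
        // Skip transmitting per-job statistics if you never read them.
        jobTracker.setCommunicateStats(false);
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}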

can map and reduce jobs be on different machines?

I'm working on a very distinct solution for computational offloading. I can do this very well with custom programming in C++/Java, but I'm searching for whether the same can be done in Hadoop or any other framework. I searched a lot but found nothing worthwhile.
As we know, a normal Hadoop job is made up of a Map and a Reduce phase, both of which run on machines of roughly the same power. For the map phase we don't need that power, and it could be offloaded to cheap commodity hardware like a Raspberry Pi, while reduce should run on a strong machine.
So is it possible to isolate these two phases and make them machine-aware?
On each node you can create a mapred-site.xml file to override any default settings. These settings will then only apply to that node (task tracker).
For each node you can then specify values for
mapreduce.tasktracker.reduce.tasks.maximum
mapreduce.tasktracker.map.tasks.maximum
On nodes where you only want to run reduce tasks, set the maximum map tasks to 0, and the other way around.
Here is the list of configuration options
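To sanity-check what a given node will actually pick up, a small sketch (assuming the node-local config files are on the classpath, e.g. under $HADOOP_CONF_DIR, and using a default of 2 when the property is absent) could load mapred-site.xml and print the effective slot counts:
import org.apache.hadoop.conf.Configuration;

public class NodeSlotCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Pull in the node-local override file from the classpath.
        conf.addResource("mapred-site.xml");
        System.out.println("map slots    = "
                + conf.getInt("mapreduce.tasktracker.map.tasks.maximum", 2));
        System.out.println("reduce slots = "
                + conf.getInt("mapreduce.tasktracker.reduce.tasks.maximum", 2));
    }
}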
Reducer tasks can run on different nodes, but what is the advantage of running the reducer on a more powerful machine?
You can use the same commodity hardware configuration for both map and reduce nodes.
Fine-tuning a MapReduce job is the trickier part, and depends on:
1) Your input size
2) The time taken for the mapper to complete its map task
3) The number of map and reduce tasks you set
etc.
Apart from the config changes suggested by Gerhard, have a look at some of these tips for fine-tuning job performance.
Tips to Tune the number of map and reduce tasks appropriately
Diagnostics/symptoms:
1) Each map or reduce task finishes in less than 30-40 seconds.
2) A large job does not utilize all available slots in the cluster.
3) After most mappers or reducers are scheduled, one or two remain pending and then run all alone.
Tuning the number of map and reduce tasks for a job is important. Some tips:
1) If each task takes less than 30-40 seconds, reduce the number of tasks.
2) If a job has more than 1TB of input, consider increasing the block size of the input dataset to 256M or even 512M so that the number of tasks will be smaller.
3) So long as each task runs for at least 30-40 seconds, increase the number of mapper tasks to some multiple of the number of mapper slots in the cluster.
4) Don't schedule too many reduce tasks: for most jobs, the number of reduce tasks should be equal to or a bit less than the number of reduce slots in the cluster.
If you still want to have a different configuration per node, have a look at this question and the Wiki link.
EDIT:
Configure mapred.map.tasks in 1.x (mapreduce.job.maps in the 2.x version) and mapred.reduce.tasks in 1.x (mapreduce.job.reduces in the 2.x version) accordingly on your nodes, depending on the hardware configuration. Configure more reducers on the better hardware nodes. But before configuring these parameters, make sure you have taken care of the input size, map processing time, etc.
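As a hedged sketch of those 2.x knobs inside a driver's main method (the counts 80 and 10 are illustrative; the map count is only a hint, since the real number of map tasks is driven by the input splits):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// ... inside the driver's main method ...
Configuration conf = new Configuration();
// Hint only: the actual number of map tasks follows the input splits.
conf.setInt("mapreduce.job.maps", 80);
Job job = Job.getInstance(conf, "tuned-job");
// 2.x equivalent of mapred.reduce.tasks; this one is honored.
job.setNumReduceTasks(10);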

How can I read counter (for example, the number of output records) of each reduce task

I am running iterative Hadoop/MapReduce jobs to analyze certain data (Apache Hadoop version 1.1.0), and I need to know the number of output records of each reduce task in order to run the next iteration of the M/R job.
I can read the consolidated counters after each M/R job, but I cannot find a way to read the counters of each task separately.
Please advise me regarding this.
Choi
That's not how counters work: each task reports its metrics to a central point, so there is no way of knowing the counter values from individual tasks.
From here: http://www.thecloudavenue.com/2011/11/retrieving-hadoop-counters-in-mapreduce.html
Counters can be incremented using the Reporter for the Old MapReduce API or by using the Context using the New MapReduce API. These counters are sent to the TaskTracker and the TaskTracker will send to the JobTracker and the JobTracker will consolidate the Counters to produce a holistic view for the complete Job. The consolidated Counters are not relayed back to the Map and the Reduce tasks by the JobTracker. So, the Map and Reduce tasks have to contact the JobTracker to get the current value of the Counter.
I suppose you could create a task-specific counter (prefix the counter name, for example), but you would then end up with a lot of different counters, and, as they are designed to be lightweight, you might run into problems (although the threshold is fairly high: I once tested the limit and the node crashed when I reached something like a million counters!).
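A minimal sketch of that "prefix the counter name" idea, using the reduce task's own ID as the counter name so the consolidated job counters still break down per task (the group name PER_TASK_OUTPUT_RECORDS and the Text/LongWritable types are just illustrative):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CountingReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        context.write(key, new LongWritable(sum));
        // One counter per reduce task: the task ID becomes the counter name,
        // so each task's output-record count shows up as its own counter.
        context.getCounter("PER_TASK_OUTPUT_RECORDS",
                context.getTaskAttemptID().getTaskID().toString()).increment(1);
    }
}
After the job finishes, the driver can iterate over job.getCounters().getGroup("PER_TASK_OUTPUT_RECORDS") to decide whether another iteration is needed.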

Setting the number of hadoop tasks/node

I am running a Hadoop job on a cluster that is shared by several of our applications. We have about 40 nodes and 4 mapper slots per node. Whenever my job (which is map-only) runs, it takes up all 160 slots and blocks other jobs from running. I have tried setting the property "mapred.tasktracker.map.tasks.maximum=1" from within the job and also setting "mapred.map.tasks" to 30 (to limit it to only 30 nodes) from within the task code.
conf.setInt("mapred.tasktracker.map.tasks.maximum", 1);            // per-tasktracker map slot limit
conf.setInt("mapred.map.tasks", 30);                               // requested number of map tasks
conf.setBoolean("mapred.map.tasks.speculative.execution", false);  // no speculative map attempts
I have 2 questions:
a. When the job runs, the job.xml reflects the "mapred.tasktracker.map.tasks.maximum=1", but the job still ends up taking 160 slots.
b. The mapred.map.tasks in the job.xml is not 30. It is still a big number (like 800).
Any help would be appreciated.
I've found it's best to control the maximum number of mappers by setting the input files' block size when moving data into HDFS. For example, if you set the block size to 1/30 of the total size, you'll end up with 30 blocks, and therefore, a maximum of 30 map tasks.
hadoop fs -D fs.local.block.size=134217728 -put local_name remote_location
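A quick worked example of that rule of thumb (numbers purely illustrative): for a 6 GB (6442450944 byte) input, a block size of 6442450944 / 30 ≈ 214748365 bytes (about 205 MB) yields 30 blocks and therefore at most 30 map tasks. The 134217728 in the command above is simply 128 MB.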
We can specify maximum and minimum map tasks for a job, but Hadoop doesn't guarantee to honor them the way it does for reducers. Hadoop uses the min and max map task values as an estimate and does its best to keep the number of tasks close to them. For your problem you should use a scheduler, such as the Fair Scheduler, in the cluster. Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time.
You cannot limit the number of mappers.
The number of mappers is determined by your data size and the block size. If your data is very large, you can only increase the block size to reduce the number of mappers.
This is because, if you limited the number, the remaining mappers would simply block, waiting for all the other mappers to finish.
