I have a massive amount of input data (that's why I use Hadoop) and there are multiple tasks that can be solved with various MapReduce steps of which the first mapper needs all the data as input.
My goal: Compute these different tasks as fast as possible.
I currently let them run sequentially each reading in all the data. I assume it will be faster when combining the tasks and executing their similar parts (like feeding all data to the mapper) only once.
I was wondering if and how I can combine these tasks. For every input key/value pair the mapper could emit a "super key" that includes a task id and the task specific key data along with a value. This way reducers would get key/value pairs for a task and a task-specific key and could decide when seeing the "superkey" which task to perform on the included key and values.
In pseudo code:
map(key, value):
emit(SuperKey("Task 1", IncludedKey), value)
emit(SuperKey("Task 2", AnotherIncludedKey), value)
reduce(key, values):
if key.taskid == "Task 1":
for value in values:
// do stuff with key.includedkey and value
else:
// do something else
The key could be a WritableComparable which can include all the necessary information.
Note: the pseudo code suggests a terrible architecture and it can definitely be done in a smarter way.
My questions are:
Is this a sensible approach?
Are there better alternatives?
Does it have some terrible drawback?
Would I need a custom Partitioner class for this approach?
Context: The data consists of some millions of RDF quadruples and the tasks are to calculate clusters, statistics and similarities. Some tasks can be solved easily with just Hadoop Counters in a reducer, but some need multiple MapReduce steps.
The computation will eventually take place on Amazon's Elastic MapReduce. All tasks are to be computed on the whole dataset and as fast as possible.
Is this a sensible approach?
There's nothing inherently wrong with it, other than the coupling of the maintenance of the different jobs' logic. I believe it will save you on some disk I/O, which could be a win if your disk is a bottleneck for your process (on small clusters this can be the case).
Are there better alternatives?
It may be prudent to write a somewhat framework-y Mapper and Reducer which each accept as configuration parameters references to the classes to which they should defer for the actual mapping and reducing. This may solve the aforementioned coupling of the code (maybe you've already thought of this).
Does it have some terrible drawback?
The only thing I can think of is that if one of the tasks' map logic fails to complete its work in a timely manner, the scheduler may fire up another node to process that piece of input data; this could result in duplicate work, but without knowing more about your process, it's hard to say whether this would matter much. The same would hold for the reducers.
Would I need a custom Partitioner class for this approach?
Probably, depending on what you're doing. I think in general if you're writing a custom output WritableComparable, you'll need custom partitioning as well. There may be some library Partitioner that could be configurable for your needs, though (such as KeyFieldBasedPartitioner, if you make your output of type Text and using String field-separators instead of rolling your own).
HTH. If you can give a little more context, maybe I could offer more advice. Good luck!
You can use:
Cascading
Oozie
Both are used to write workflows in hadoop.
I think Oozie is the best option for this. Its a workflow scheduler, where you can combine multiple hadoop jobs, where the output of one action node will be the input to the next action node. And if any of the action fails, then next time when u execute it again ,the scheduler starts from the point where the error was encountered.
http://www.infoq.com/articles/introductionOozie
Related
I would like to measure the time taken for map and reduce when performing I/O (reading from HDFS) in Hadoop. I am using Yarn.
Hadoop 2.6.0.
What are the options for that?
If you need exact measurments - you could use btrace, add it as a javaagent to your tasks via mapreduce.{map,reduce}.java.opts - and then write script which measures whatever you like. Sample of btrace scripts are here.
Also there is HTrace - that might also be helpful.
One rough estimation could be creating custom counters. For both mapper and reducer you could collect the timestamp when mapper(or reducer) starts processing and when it ends. From starting and ending timestamp, calculate and add it to custom counters, i.e mappers add to MAPPER_RUNNING_TIME and reducers add to REDUCER_RUNNING_TIME (or whatever name you would like to give it). When the execution is finished, subtract the aggregated value of your counters from MILLIS_MAPS and MILLIS_REDUCES respectively. You might need to look into Hadoop source code though to confirm if the staging time is or is not being included into MILLIS_MAPS and MILLIS_REDUCES. With this estimation you would need to take into account that the tasks are being executed concurrently, so the time will be rather total (or aggregated for all mappers and reducers).
I have not done this personally, but I think this solution could work unless you find better one.
I want to implement a word length program that categorize words in 4 categories on large corpus by using local aggregation methods but I don't have deepest knowledge about how these methods work. Because I am so new in MapReduce field. For example what is the sharpest differences between combiner and in-mapper combiner? In addition I should add a combiner and in-mapper combiner to my code and should measure the differences between them. But I don't have any idea where I should start, if someone help me, I appreciate it.
Implementing an in-map combiner (as best-described here) is the process of writing code within the scope of a map() method which stores multiple key-value pairs and performs some kind of aggregation function before outputting. This is different from typical map() methods which tend to deal with only a single key-value pair at once. This is quite risky as the developer is required to be very careful with memory allocation.
In-map combiners are typically used for ranking lists - i.e. an ArrayList is used to store the X highest-scoring entries to the mapper, and are output once all key-value pairs have entered the mapper. There's obviously little risk of running out of memory (unless X or the key or value are very large), and so lots of data can be immediately discarded.
Alternately, regular combiners are basically reducers that are executed immediately after a map phase finishes, and on the same node. The advantage is that the developer doesn't have to worry about implementing their own groupings (unlike the in-map combiner), and therefore memory issues are less-likely. The main disadvantage is that you can't guarantee that a combiner will run.
Regular combiners are used often for things such as counts - the WordCount with a combiner (such as this is the classic example.
For your case, I would always look to a regular combiner. Let it do all the work of grouping your categories, and avoid worrying about memory.
I've been looking into replacing our Oracle database of currently-executing commands with a Hazelcast distributed map implementation. To do this, I need to replace our SQL queries with the Hazelcast equivalent. Hazelcast provides some built-in aggregations, such as a count. I've been happily using this, but when I came to writing my own aggregations, I had a look at the source code for the CountAggregation. It can be found here: http://grepcode.com/file/repo1.maven.org/maven2/com.hazelcast/hazelcast/3.3-RC2/com/hazelcast/mapreduce/aggregation/impl/CountAggregation.java
Aggregations in Hazelcast are implemented using the MapReduce algorithm. But to me, the source above seems to be really inefficient. For the Mapper stage of the algorithm, they use a SupplierConsumingMapper, which simply emits mappings using the same key as the supplied key. What this then means is that the reducing stage doesn't actually reduce anything, because all of the emitted keys are different, and you end up with a whole load of 1's to count up at the final collation stage, rather than a number of partial counts to add together.
Surely what they should be doing is using a mapper which always emits the same key? Then the combiners and reducers could actually do some combining and reducing. It seems to me that the source code above is incorrectly using the MapReduce model, although the result you end up with is correct. Have I misunderstood something?
Hey you're absolutely correct. The implementation is a bit to simple at that place :) Can you please file an issue at github so we won't forget to fix that one. Thanks Chris
I have used Pig and Hive before but am new to Hadoop MapReduce. I need to write an application which has multiple small sized files as input (say 10). They have different file structures, so I want to process them parallelly on separate nodes so that they can be processed quickly. I know that the strong point of Hadoop is processing large data but these input files, though small, require a lot of processing so I was hoping to leverage Hadoop's parallel computing prowess. Is this possible?
It is possible but you're probably not going to get much value. You have these forces against you:
Confused input
You'll need to write a mapper which can handle all of the different input formats (either by detecting the input format, or using the filename of the input to decide which format to expect)
Multiple outputs
You need to either use the slightly tricky multiple output file handling functionality of Hadoop or write your output as a side effect of the reducer (or mapper if you can be sure that each file will go to a different node)
High Cost of initialization
Every hadoop map reduce job comes with a hefty start up cost, about 30 seconds on a small cluster, much more on a larger cluster. This point alone probably will lose you more time than you could ever hope to gain by parallelism.
In brief: give a try to NLineInputFormat.
There is no problem to copy all your input files to all nodes (you can put them to distributed cache if you like). What you really want to distribute is check processing.
With Hadoop you can create (single!) input control file in the format (filename,check2run) or (filename,format,check2run) and use NLineInputFormat to feed specified number of checks to your nodes (mapreduce.input.lineinputformat.linespermap controls number of lines feed to each mapper).
Note: Hadoop input format determines how splits are calculated; NLineInputFormat (unlike TextInputFormat) does not care about blocks.
Depending on the nature of your checks you may be able to compute linespermap value to cover all files/checks in one wave of mappers (or may be unable to use this approach at all :) )
I have an list of list whose indices reaches upto 100's of millions.Lets say each od the list inside list is an sentence of a text. I would like to partition this data for processing in different threads. I used subList to split
data and send it in different threads for processing. Is this a standard approach for paritioning data? If not , could you please suggest me some standard approch for it?
This will work as long as you do not "structurally modify" the list or any of these sub-lists. Read-only processing is fine.
There are many other "big data" approaches to handling 100s of millions of records, because there are other problems you might hit:
If your program fails (e.g. OutOfMemoryError), you probably don't want to have to start over from the beginning.
You might want to throw >1 machine at the problem, at which point you can't share the data within a single JVM's memory.
After you've processed each sentence, are you building some intermediate result and then processing that as a step 2? You may need to put together a pipeline of steps where you re-partition the data before each step.
You might find you have too many sentences to fit them all into memory at once.
A really common tool for this kind of work is Hadoop. You'd copy the data into HDFS, run a map-reduce job (or more than one job) on the data and then copy the data out of HDFS when you're done.
A simpler approach to implement is just to use a database and assign different ranges for the integer sentence_id column to different threads and build your output in another table.