I have installed Hadoop MapReduce on a single node, and I have a top-ten problem.
Let's say I have 10k (key, value) pairs and want to find the 10 pairs with the best values.
I first wrote a simple program that iterates over the whole data set, and it needs just a couple of minutes to get the answer.
Then I wrote a MapReduce application using the top-ten design pattern to solve the same problem, and it needs more than 4 hours to get the answer. (Obviously, I use the same machine and the same sorting algorithm.)
I think this probably happens because MapReduce needs more services to run, more network activity, and more effort to read from and write to HDFS.
Are there any other factors that explain why MapReduce (under these conditions) is slower than not using MapReduce?
MapReduce is slower on a single-node setup because only one mapper and one reducer can work on it at any given time. The mapper has to iterate through each of the splits, and the reducer works on two mapper outputs simultaneously, then on two such reducer outputs, and so on.
So, in terms of complexity:
for a normal project: t(n) = n => O(n)
for MapReduce: t(n) = (n/x)*t(n/2x) => O((n/x)log(n/x)), where x is the number of nodes
Which do you think is bigger, for a single node and for multiple nodes?
Explanation of the MapReduce complexity:
time for one iteration: n
number of simultaneous map functions: x, since only one can work on each node
time required to map the complete data: n/x, since n is the time one mapper takes for the complete data
for the reduce job, half the time is required compared to the previous map, since it works on two mapper outputs simultaneously; therefore time = n/2x for x reducers on x nodes
hence the equation, in which every next step takes half the time of the previous one:
t(n) = (n/x)*t(n/2x)
Solving this recursion, we get O((n/x)log(n/x)).
This is not supposed to be exact, only an approximation.
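For reference, the top-ten pattern mentioned in the question can be sketched in plain Python. This is only a simulation of the idea (each split keeps its own ten best pairs, then one final step merges those short lists), not Hadoop code; the names local_top_n and global_top_n, the generated data and the choice of 4 splits are assumptions made for illustration.

```python
import heapq
from typing import Iterable, List, Tuple

Pair = Tuple[str, int]  # (key, value)

def local_top_n(split: Iterable[Pair], n: int = 10) -> List[Pair]:
    """Mapper side: keep only the n best pairs of one input split."""
    return heapq.nlargest(n, split, key=lambda kv: kv[1])

def global_top_n(partials: Iterable[List[Pair]], n: int = 10) -> List[Pair]:
    """Reducer side: merge the short per-split lists and pick the n best."""
    merged = [kv for part in partials for kv in part]
    return heapq.nlargest(n, merged, key=lambda kv: kv[1])

# Hypothetical data: 10k pairs with distinct values.
data = [(f"k{i}", (i * 7919) % 10007) for i in range(10_000)]
splits = [data[i::4] for i in range(4)]  # pretend there are 4 input splits
top_ten = global_top_n(local_top_n(s) for s in splits)
```

On 10k pairs the actual work is tiny either way, which is consistent with the point above that the running time on one node is dominated by framework overhead rather than by the algorithm.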
The problem:
We are given a set of n tasks, each having an integer start time and
end time. What is the maximum amount of tasks running in parallel at
any given time?
The algorithm should run in O(n log n) time.
This is a school assignment, so I don't need a direct answer, but any code snippets are welcome as long as they are in Java or Scala (the assignment is supposed to be written in Scala).
Some of the hints say that I should take advantage of priority queues. I read the documentation, but I'm not really sure how to use them, so any code snippets are welcome.
The input data could, for instance, be Array[Pair[Int,Int]] = Array((1000,2000),(1500,2200)) and so on.
I'm really struggling to set the Ordering of the priority queue, so if nothing else I hope someone can help me with that.
PS:
The priority queue is supposed to be initialized with PriorityQueue()(ord).
Edit: I came up with a solution using priority queues, but thank you for all the answers. You guys helped me figure out the logic!
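The asker's own priority-queue solution is not shown, so the following is only one possible shape of it, sketched in Python rather than Scala. It keeps a min-heap of the end times of running tasks; the function name max_parallel and the rule that a task ending at time t does not overlap a task starting at t are assumptions.

```python
import heapq
from typing import List, Tuple

def max_parallel(tasks: List[Tuple[int, int]]) -> int:
    """Largest number of tasks running at once, using a min-heap of end times."""
    running: List[int] = []  # end times of the tasks that are still running
    best = 0
    for start, end in sorted(tasks):  # sorting by start time is O(n log n)
        # Assumption: a task that ends at time t no longer overlaps one starting at t.
        while running and running[0] <= start:
            heapq.heappop(running)
        heapq.heappush(running, end)
        best = max(best, len(running))
    return best
```

Each task is pushed and popped at most once, so the heap work is also O(n log n). In Scala the same idea would need an Ordering that makes the smallest end time come out first.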
Solution without using a priority queue.
Consider the array of tasks as follows:
[(1,2), (1,5), (2,4), ....] // (a,b) : (start_time, end_time)
Step 1 : Construct an array considering start_time and end_time together.
[1,2,1,5,2,4....]
Step 2 : Maintain another array to know whether the time at index i is start_time or end_time
[S,E,S,E,S,E...] // S:Start_Time, E:End_Time
Step 3 : Sort the first array, and reorder the second array in the same way so that each S/E marker stays with its time.
Step 4 : Maintain two variables, parallel_ryt_now and max_parallel_till_now, both starting at 0, and traverse the second array as follows:
for marker in second_array:
    if marker == "S":
        parallel_ryt_now += 1
        if parallel_ryt_now > max_parallel_till_now:
            max_parallel_till_now = parallel_ryt_now
    else:
        parallel_ryt_now -= 1
Logic :
While traversing the sorted array, when you encounter a start_time, a task has started, so increment parallel_ryt_now; when you encounter an end_time, a task has completed, so decrement parallel_ryt_now.
This way, at every moment parallel_ryt_now stores the number of tasks running in parallel.
Time Complexity = Sort + Traverse = O(nlogn) + O(n) = O(nlogn)
Space Complexity = O(n) (To store the extra array for info about whether time at index i is start_time or end_time )
I hope it helped.
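The steps above can be sketched as a short Python function. It stores each time together with a +1/-1 marker instead of keeping two parallel arrays, which is a small simplification of the description; the function name and the tie-breaking rule (an end at time t is processed before a start at t) are assumptions.

```python
from typing import List, Tuple

def max_parallel_sweep(tasks: List[Tuple[int, int]]) -> int:
    """Sort all start/end events and sweep over them, counting running tasks."""
    events = []
    for start, end in tasks:
        events.append((start, 1))   # a task starts
        events.append((end, -1))    # a task ends
    # Tuples sort by time first; at equal times -1 comes before +1, so a task
    # ending at t is treated as finished before another starts at t (assumption).
    events.sort()
    running = best = 0
    for _, delta in events:
        running += delta
        best = max(best, running)
    return best
```

If touching tasks should count as overlapping, the markers could be swapped so that starts sort before ends at the same time.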
I am able to do a few preprocessing steps for data mining using Hadoop MapReduce.
One such step is normalization.
Say, turning
100,1:2:3
101,2:3:4
into
100 1
100 2
100 3
101 2
101 3
101 4
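As a plain-Python illustration of that transformation (not Hadoop code; the function name normalize is made up), each input line is split into an id and a colon-separated list, and one (id, value) row is produced per value, much as a map function would emit them:

```python
from typing import Iterator, Tuple

def normalize(line: str) -> Iterator[Tuple[str, str]]:
    """Turn '100,1:2:3' into ('100', '1'), ('100', '2'), ('100', '3')."""
    record_id, packed = line.split(",", 1)
    for value in packed.split(":"):
        yield record_id, value

rows = [pair for line in ["100,1:2:3", "101,2:3:4"] for pair in normalize(line)]
```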
Likewise, am I able to do binning for numerical data, say iris.csv?
I worked out the maths behind it.
Iris data set: http://archive.ics.uci.edu/ml/datasets/Iris
Find the minimum and maximum values of each attribute in the data set.
    | Sepal Length | Sepal Width | Petal Length | Petal Width
Min | 4.3          | 2.0         | 1.0          | 0.1
Max | 7.9          | 4.4         | 6.9          | 2.5
Then we divide the data values of each attribute into "n" buckets.
Say, n = 5.
Bucket width = (Max - Min) / n
Sepal Length = (7.9 - 4.3) / 5 = 0.72
So, for Sepal Length, the intervals will be as follows:
4.3 - 5.02
5.02 - 5.74
5.74 - 6.46
6.46 - 7.18
7.18 - 7.9
Continue likewise for all attributes.
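The arithmetic above can be checked with a small Python sketch. The function names are made up for illustration, bucket edges are rounded to two decimals to match the figures in the question, and putting the maximum value into the last bucket is an assumption.

```python
from typing import List, Tuple

def equal_width_bins(lo: float, hi: float, n: int) -> List[Tuple[float, float]]:
    """Return the n (lower, upper) intervals of width (hi - lo) / n."""
    width = (hi - lo) / n
    edges = [round(lo + i * width, 2) for i in range(n)] + [hi]
    return list(zip(edges[:-1], edges[1:]))

def bin_index(x: float, lo: float, hi: float, n: int) -> int:
    """Index of the bucket that x falls into; the maximum goes into the last one."""
    if not lo <= x <= hi:
        raise ValueError("value outside the [min, max] range")
    width = (hi - lo) / n
    return min(int((x - lo) / width), n - 1)

sepal_length_bins = equal_width_bins(4.3, 7.9, 5)
```

Once the min and max of an attribute are known, bin_index is a per-record computation, so it would fit naturally into a map step.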
Are we able to do the same in MapReduce?
Please suggest.
I am not sure I understood your question, but what you want is to get the maximum and minimum of each attribute in the data set and then divide the ranges into buckets, all in the same job, right? In order to divide the attributes, you need to feed the reducer with the max and min values rather than relying on the reducer to do all the work for you. I am guessing this is where your trouble starts.
However, there is one thing you could do: a MapReduce design pattern called the in-mapper combiner. When each mapper has finished its job, it calls a method called cleanup. You can implement the cleanup method so that it emits the max and min values of each attribute for that map task. This way, you give the reducer (only one reducer) a collection of just X values per attribute, X being the number of mappers in your cluster.
Then the reducer computes the max and min values for each attribute; since it will be a very short collection, there won't be any problems. Finally, you divide each attribute into the 'n' buckets.
There is plenty of information about this pattern on the web; an example could be this tutorial. Hope it helps.
EDIT: you need to create an instance variable in the mapper in which you store each of the values seen in the map method, so that they are available in the cleanup method, which is called only once. A HashMap, for example, will do. Remember that you must not write the values to the context in the map method; do that in the cleanup method, after iterating through the HashMap and finding the max and min value of each column. As for the key, I don't think it really matters in this case, so yes, you could use the CSV header, and as for the value, you are correct, you need to store the whole column.
Once the reducer receives the output from the mappers, you can't calculate the buckets just yet. Bear in mind that you will receive one "column" from each mapper, so if you have 20 mappers, you will receive 20 max values and 20 min values for each attribute. Therefore you need to calculate the max and min again, just as you did in the cleanup method of the mappers, and once this is done, you can finally calculate the buckets.
You may be wondering: "if I still need to find the max and min values in the reducer, then I could omit the cleanup method and do everything in the reducer; after all, the code would be more or less the same". However, to do what you are asking you can only work with one reducer, so if you omit the cleanup method and leave all the work to the reducer, the throughput would be the same as working on one machine without Hadoop.
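To make the flow concrete, here is a plain-Python simulation of the pattern. It is not the Hadoop Mapper API: the class MinMaxMapper, the function reduce_min_max and the sample rows are all made up. It also keeps a running (min, max) per column instead of storing every value in the HashMap as described above, which is a swapped-in simplification with the same result.

```python
from typing import Dict, Iterable, Tuple

MinMax = Dict[str, Tuple[float, float]]  # column name -> (min, max)

class MinMaxMapper:
    """Stand-in for one map task; only mimics the map/cleanup life cycle."""

    def __init__(self) -> None:
        self.local: MinMax = {}  # instance state that survives across map() calls

    def map(self, row: Dict[str, float]) -> None:
        # Nothing is emitted here; only the per-task state is updated.
        for column, value in row.items():
            lo, hi = self.local.get(column, (value, value))
            self.local[column] = (min(lo, value), max(hi, value))

    def cleanup(self) -> MinMax:
        # Called once per task: emit one (min, max) pair per column.
        return self.local

def reduce_min_max(partials: Iterable[MinMax]) -> MinMax:
    """Single reducer: combine the per-mapper extremes into global ones."""
    result: MinMax = {}
    for partial in partials:
        for column, (lo, hi) in partial.items():
            cur_lo, cur_hi = result.get(column, (lo, hi))
            result[column] = (min(cur_lo, lo), max(cur_hi, hi))
    return result

# Two hypothetical input splits with a single column each.
splits = [
    [{"sepal_length": 5.1}, {"sepal_length": 4.3}],
    [{"sepal_length": 7.9}, {"sepal_length": 6.0}],
]
partials = []
for split in splits:
    mapper = MinMaxMapper()
    for row in split:
        mapper.map(row)
    partials.append(mapper.cleanup())
global_min_max = reduce_min_max(partials)
```

The reducer sees one small dictionary per mapper rather than every record, which is the whole point of doing the first pass in cleanup.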
Do you know of any parallel modified moving average algorithm?
I want to calculate a moving average quickly, but not with a sequential algorithm. I want to use a parallel algorithm, but I have not found a solution yet.
The best algorithm I have found is a sequential one, the modified moving average used for measuring computer performance:
new_avg = alfa(new_time, previous_time) * new_value + (1-alfa(new_time, previous_time)) * previous_avg
alfa(new_time, previous_time) = 1- exp(-(new_time - previous_time)/moving_period)
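Written out as Python, the two formulas above look roughly like this. The function update and the explicit moving_period argument are assumptions added for illustration; alfa keeps the spelling used in the formulas.

```python
import math

def alfa(new_time: float, previous_time: float, moving_period: float) -> float:
    """Weight of the new sample; grows with the time since the previous one."""
    return 1.0 - math.exp(-(new_time - previous_time) / moving_period)

def update(previous_avg: float, previous_time: float,
           new_value: float, new_time: float, moving_period: float) -> float:
    """One sequential step of the time-aware modified moving average."""
    a = alfa(new_time, previous_time, moving_period)
    return a * new_value + (1.0 - a) * previous_avg
```

Only the previous average and the previous time have to be remembered between calls.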
Some other algorithms are also good, but I have not found any parallel ones.
It is a hard question and I need some help with it.
Consider that I want to count events that arrive in random time order: early events can arrive later than late events. You can assume that an early event can be skipped or become obsolete after later events have been processed (or after some timeout). Do not assume that events arrive in sequential time order, or that events from the same time arrive at the same time.
I do not want to use any algorithm that requires remembering many samples (especially all of them). It should only remember the time and the previous average value, maybe some additional value, but not all or even some of the samples. The algorithm may make minor errors; it does not need to be perfect if that gives a performance boost.
It would be very nice if it used sharding, but that is not required.
A moving average where events arrive in sequence could be done like this:
newMovingAverage = ((MovingAverage * (n - 1)) + newSample) / n
where n dictates how big (or little) influence this sample should have on the moving average. The greater the n, the smaller the influence. Over time, older samples will have less and less influence on the moving average as new samples arrive.
With samples coming out of sequence, you can try to mimic that behavior by letting the age of the sample dictate how much influence it should have on the moving average. This could, for example, be done like this:
influence = (1 + sampleAge)^2 * n
newMovingAverage = ((MovingAverage * (influence - 1)) + newSample) / influence
Where I let the sampleAge dictate how much the newSample should influence the moving average.
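A direct Python transcription of the two update rules, as a minimal sketch. The function name is made up, and sample_age is assumed to be a non-negative number in whatever time unit you choose; with sample_age = 0 it reduces to the in-sequence formula.

```python
def update_moving_average(moving_average: float, new_sample: float,
                          n: float, sample_age: float = 0.0) -> float:
    """Older samples get a larger 'influence' divisor and so move the average less."""
    influence = (1 + sample_age) ** 2 * n
    return (moving_average * (influence - 1) + new_sample) / influence
```

For example, with an average of 10, a new sample of 20 and n = 5, a fresh sample moves the average to 12, while the same sample with age 1 only moves it to 10.5.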
The possibility of having a parallel algorithm would depend on the nature of the moving average that you are using.
The algorithm that you show in your question is an exponential smoother. Thus, the first value of the data has an influence on every calculated average value. The amount of influence that the first value has decreases with every new data point, but even the last average in the sequence will be slightly influenced by the first data point.
This sort of moving average can't be parallelised in a straightforward way, because you can't calculate any average without using (explicitly or implicitly) all the previous data that has been received.
However, Wikipedia's article on moving averages nicely summarises a range of moving average methods, some of which are easily implemented in parallel.
For example, a simple moving average takes the following form (for odd n)**:
n2 = int(n/2)
moving_average[i] = (data[i-n2] + data[i-n2+1] ... +
data[i] + ... + data[i+n2-1] + data[i+n2])/n
This method doesn't make use of any data earlier than int(n/2) points before i to calculate the moving average at point i. Therefore, you could calculate the moving average of a data set of m items in parallel with p threads by dividing the m items into p sub-sequences, each of which overlaps the next and previous (excepting the first and last sub-sequences) sub-sequence by int(n/2) data points, and have each thread calculate the moving averages for its sub-sequence.
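The chunking scheme described above can be sketched in Python as follows. The function names are made up, and the sketch uses a thread pool only to show the structure: in CPython, threads will not speed up this kind of CPU-bound loop, so a process pool or another language would be needed for a real gain.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence

def centered_sma(data: Sequence[float], n: int) -> List[float]:
    """Sequential centred moving average for odd n (edge points are skipped)."""
    n2 = n // 2
    return [sum(data[i - n2:i + n2 + 1]) / n for i in range(n2, len(data) - n2)]

def parallel_centered_sma(data: Sequence[float], n: int, p: int) -> List[float]:
    """Split the work into about p chunks that overlap by n2 points on each side."""
    n2 = n // 2
    m = len(data) - 2 * n2          # number of averages to produce
    if m <= 0:
        return []
    size = -(-m // p)               # ceiling division: outputs per chunk
    # A chunk producing outputs [k, end) needs the inputs data[k : end + 2 * n2].
    chunks = [data[k:min(k + size, m) + 2 * n2] for k in range(0, m, size)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        parts = pool.map(lambda chunk: centered_sma(chunk, n), chunks)
    return [value for part in parts for value in part]

data = [i * i for i in range(20)]
result = parallel_centered_sma(data, 5, 3)
```

Because every chunk carries its own overlap, the workers never need to communicate, and concatenating their outputs in order gives the same list as the sequential version.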
You can find an efficient sequential implementation of this algorithm (which would be applicable to each thread of the parallel implementation) in the question Simple Moving Average summation/offset issue and its answer. That method calculates a trailing moving average rather than the (arguably preferred) centrally-located moving average that I've shown above. That is, it puts the value that I calculated above at moving_average[i+n2] instead of at moving_average[i].
** This leaves aside the possibility that the data may be at irregular time intervals. The method you've shown addresses that issue and it can be dealt with the same way in other methods.
I am reading about MapReduce and the following thing is confusing me.
Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood to go about it is as follows:
Write a mapper function that sorts integers. The framework will divide the input file into multiple chunks and give them to different mappers. Each mapper will sort its chunk of data independently of the others. Once all the mappers are done, we pass each of their results to the reducer, which combines them and gives me the final output.
My doubt is: if we have one reducer, how does it leverage the distributed framework if, eventually, we have to combine the result in one place? The problem boils down to merging 1 million entries in one place. Is that so, or am I missing something?
Thanks,
Chander
Check out merge-sort.
It turns out that merging already-sorted lists is much more efficient in terms of operations and memory consumption than sorting the complete list.
If the reducer gets 4 sorted lists, it only needs to look at the smallest element of each of the 4 lists and pick the smallest of those. If the number of lists is constant, this reduction is an O(N) operation.
Also, the reducers are typically "distributed" in something like a tree, so the work can be parallelized too.
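A small Python illustration of that merge step, using the standard library's heapq.merge, which repeatedly picks the smallest head element among the sorted inputs. The four chunks are made-up data standing in for what four mappers might receive.

```python
import heapq

# Hypothetical unsorted chunks, one per mapper.
chunks = [[9, 3, 7], [4, 8, 1], [6, 2, 5], [0, 11, 10]]

# Each "mapper" sorts its own chunk independently.
sorted_runs = [sorted(chunk) for chunk in chunks]

# The "reducer" only merges: it never re-sorts the whole data set.
merged = list(heapq.merge(*sorted_runs))
```

With k sorted runs and N elements in total the merge costs O(N log k), which is linear in N when k is a fixed constant.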
As others have mentioned, merging is much simpler than sorting, so there's a big win there.
However, doing an O(N) serial operation on a giant dataset can be prohibitive, too. As you correctly point out, it's better to find a way to do the merge in parallel, as well.
One way to do this is to replace the default partitioning function (a hash partitioner, which is what's normally used) with something a bit smarter. What Pig does for this, for example, is sample your dataset to come up with a rough approximation of the distribution of your values, and then assign ranges of values to different reducers. Reducer 0 gets all elements < 1000, reducer 1 gets all elements >= 1000 and < 5000, and so on. Then you can do the merge in parallel, and the end result is sorted, because the number of each reducer task tells you where its output belongs in the overall order.
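The sampled range partitioning described above can be sketched in Python. This is not Pig's or Hadoop's actual code: the function names, the sample size of 100, the 4 reducers and the random input keys are all assumptions chosen for illustration.

```python
import bisect
import random
from typing import List, Sequence

def choose_split_points(keys: Sequence[int], num_reducers: int,
                        sample_size: int = 100, seed: int = 0) -> List[int]:
    """Sample the keys and pick num_reducers - 1 boundaries from the sorted sample."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(list(keys), min(sample_size, len(keys))))
    step = len(sample) / num_reducers
    return [sample[int(step * i)] for i in range(1, num_reducers)]

def partition(key: int, split_points: List[int]) -> int:
    """Reducer i receives the keys with split_points[i - 1] <= key < split_points[i]."""
    return bisect.bisect_right(split_points, key)

keys = random.Random(1).sample(range(100_000), 1000)
points = choose_split_points(keys, 4)
buckets: List[List[int]] = [[] for _ in range(4)]
for key in keys:
    buckets[partition(key, points)].append(key)

# Each reducer sorts only its own range; concatenating in reducer order is sorted.
output = [key for bucket in buckets for key in sorted(bucket)]
```

The quality of the sample only affects how evenly the work is spread across reducers; the final order is correct for any sorted list of split points.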
So the simplest way to sort using MapReduce (though not the most efficient one) is to do the following.
During the map phase:
(Input_Key, Input_Value) emits (Input_Value, Input_Key)
The reducer is an identity reducer.
So, for example, if our data is a (student, age) database, then the mapper input would be
('A', 1) ('B', 2) ('C', 10) ... and the output would be
(1, A) (2, B) (10, C)
I haven't tried this logic out, but it is a step in a homework problem I am working on. I will post an update with a link to the source code/logic.
Sorry for being late but for future readers, yes, Chander, you are missing something.
The logic is that a reducer can only handle the shuffled and sorted data of the node it is running on. I mean, a reducer that runs on one node can't look at another node's data; it applies the reduce algorithm to its own data only. So the merging procedure of merge sort can't be applied.
So for big data we use TeraSort, which is nothing but an identity mapper and reducer with a custom partitioner. You can read more about it here: Hadoop's implementation for TeraSort. It states:
"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N - 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i - 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1."
I think combining multiple sorted items is more efficient than combining multiple unsorted items. So the mappers do the task of sorting chunks and the reducer merges them. Had the mappers not done the sorting, the reducer would have a tough time doing it.
Sorting can be efficiently implemented using MapReduce. But you seem to be thinking about implementing merge sort using MapReduce to achieve this. It may not be the ideal candidate.
As you alluded to, merge sort (with MapReduce) would involve the following steps:
Partition the elements into small groups and assign each group to the mappers in a round-robin manner
Each mapper sorts its subset and returns {K, {subset}}, where K is the same for all the mappers
Since the same K is used across all mappers, there is only one reduce call and hence only one reducer. The reducer can merge the data and return the sorted result
The problem here is that, as you mentioned, there can be only one reducer, which precludes parallelism during the reduction phase. As mentioned in other replies, MapReduce-specific implementations like TeraSort can be considered for this purpose.
Found the explanation at http://www.chinacloud.cn/upload/2014-01/14010410467139.pdf
Coming back to merge sort, this would be feasible if Hadoop (or an equivalent tool) provided a hierarchy of reducers, where the output of one level of reducers goes to the next level of reducers or loops back to the same set of reducers.
I need an idea for an efficient index/search algorithm, and/or data structure, for determining whether a time-interval overlaps zero or more time-intervals in a list, keeping in mind that a complete overlap is a special case of partial overlap. So far I've not come up with anything fast or elegant...
Consider a collection of intervals with each interval having 2 dates - start, and end.
Intervals can be large or small, they can overlap each other partially, or not at all. In Java notation, something like this:
interface Period
{
    long getStart(); // millis since the epoch
    long getEnd();
    boolean intersects(Period p); // trivial intersection check with another period
}
Collection<Period> c = new ArrayList<Period>(); // assume a lot of elements
The goal is to efficiently find all intervals which partially intersect a newly-arrived input interval. For c as an ArrayList this could look like...
Collection<Period> getIntersectingPeriods(Period p)
{
    // how to implement this without full iteration?
    Collection<Period> result = new ArrayList<Period>();
    for (Period element : c)
        if (element.intersects(p))
            result.add(element);
    return result;
}
Iterating through the entire list linearly requires too many compares to meet my performance goals. Instead of ArrayList, something better is needed to direct the search, and minimize the number of comparisons.
My best solution so far involves maintaining two sorted lists internally and conducting 4 binary searches and some list iteration for every request. Any better ideas?
Editor's Note: Time-intervals are a specific case employing linear segments along a single axis, be that X, or in this case, T (for time).
Interval trees will do:
In computer science, an interval tree is a tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point. It is often used for windowing queries, for instance, to find all roads on a computerized map inside a rectangular viewport, or to find all visible elements inside a three-dimensional scene. A similar data structure is the segment tree...
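As a rough illustration of the idea, here is a static variant in Python rather than Java. It is not a full textbook interval tree: it lays a balanced tree over a start-sorted list and stores, for every subtree, the largest end time it contains, so whole subtrees can be skipped during a query. The class name, the closed-interval convention and the lack of insert/delete support are all assumptions of this sketch.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), treated as closed on both ends

class StaticIntervalIndex:
    """Implicit balanced tree over start-sorted intervals, with max end per subtree."""

    def __init__(self, intervals: List[Interval]) -> None:
        self.items = sorted(intervals)
        self.max_end: List[float] = [0] * len(self.items)
        self._build(0, len(self.items))

    def _build(self, lo: int, hi: int) -> float:
        # The midpoint of items[lo:hi] acts as the subtree root.
        if lo >= hi:
            return float("-inf")
        mid = (lo + hi) // 2
        best = max(self.items[mid][1], self._build(lo, mid), self._build(mid + 1, hi))
        self.max_end[mid] = best
        return best

    def query(self, start: int, end: int) -> List[Interval]:
        """All stored intervals that intersect [start, end], ordered by start."""
        out: List[Interval] = []
        self._query(0, len(self.items), start, end, out)
        return out

    def _query(self, lo: int, hi: int, qs: int, qe: int, out: List[Interval]) -> None:
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        if self.max_end[mid] < qs:
            return  # nothing in this subtree ends late enough to intersect
        self._query(lo, mid, qs, qe, out)
        s, e = self.items[mid]
        if s <= qe:
            if e >= qs:
                out.append((s, e))
            self._query(mid + 1, hi, qs, qe, out)
        # If s > qe, every interval to the right starts even later, so skip it.

index = StaticIntervalIndex([(1, 3), (2, 6), (8, 9), (15, 23), (16, 21)])
hits = index.query(5, 8)
```

A collection that changes over time would need a self-balancing tree that maintains the same max-end value on insert and delete.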
Seems the Wiki article solves more than was asked. Are you tied to Java?
You have a "huge collection of objects", which says "database" to me.
You asked about "built-in period indexing capabilities", and indexing says database to me.
Only you can decide whether this SQL meets your perception of "elegant":
SELECT A.Key AS One_Interval,
       B.Key AS Other_Interval
FROM Big_List_Of_Intervals AS A
JOIN Big_List_Of_Intervals AS B
  ON A.Start BETWEEN B.Start AND B.End
  OR B.Start BETWEEN A.Start AND A.End
If the Start and End columns are indexed, a relational database (according to advertising) will be quite efficient at this.