I am able to do a few preprocessing steps in data mining using Hadoop MapReduce.
One such step is normalization.
Say I need to turn
100,1:2:3
101,2:3:4
into
100 1
100 2
100 3
101 2
101 3
101 4
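For reference, a minimal mapper doing that flattening might look like this (it assumes plain text input lines such as 100,1:2:3; the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Splits "100,1:2:3" into the pairs (100,1), (100,2), (100,3).
class FlattenMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");   // "100" and "1:2:3"
        if (parts.length != 2) {
            return;                                      // skip malformed lines
        }
        for (String v : parts[1].split(":")) {
            ctx.write(new Text(parts[0]), new Text(v));
        }
    }
}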
Likewise, am I able to do binning for numerical data, say iris.csv?
I worked out the maths behind it.
Iris DataSet: http://archive.ics.uci.edu/ml/datasets/Iris
First, find the minimum and maximum values of each attribute in the data set:
        Sepal Length | Sepal Width | Petal Length | Petal Width
Min     4.3          | 2.0         | 1.0          | 0.1
Max     7.9          | 4.4         | 6.9          | 2.5
Then, we should divide the data values of each attribute into 'n' buckets.
Say, n = 5.
Bucket Width = (Max - Min) / n
Sepal Length: (7.9 - 4.3) / 5 = 0.72
So, the intervals for Sepal Length will be as follows:
4.3 - 5.02
5.02 - 5.74
5.74 - 6.46
6.46 - 7.18
7.18 - 7.9
Likewise, continue for all attributes.
Are we able to do the same in MapReduce?
Please suggest.
I am not sure if I understood your question, but what you want to do is to get the maximum and minimum for each of the attributes of that dataset, to then divide them, all in the same job, right? Ok, in order to divide the attributes, you need to feed the reducer with the max and min values instead of relying on the reducer to do the work for you. And I am guessing this is where your trouble starts.
However, there is one thing you could do: a MapReduce design pattern called the in-mapper combiner. When each mapper has finished its job, it calls a method called cleanup. You can implement the cleanup method so that it gets the max and min values of each of the attributes for each of the map nodes. This way, you give the reducer (only one reducer) only a collection with X values, X being the number of mappers in your cluster.
Then, the reducer gets the max and min values for each of the attributes; since it will be a very short collection, there won't be any problems. Finally, you divide each of the attributes into the 'n' buckets.
There is plenty of information about this pattern on the web; an example could be this tutorial. Hope it helps.
EDIT: you need to create an instance variable in the mapper where you will store each of the values in the map method, so that they will be available in the cleanup method, since it's only called once. A HashMap for example will do. You need to remember that you cannot save the values in the context variable in the map method, you need to do this in the cleanup method, after iterating through the HashMap and finding out the max and min value for each column. Then, as for the key, I don't think it really matters in this case, so yes, you could use the csv header, and as for the value you are correct, you need to store the whole column.
Once the reducer receives the output from the mappers, you can't calculate the buckets just yet. Bear in mind that you will receive one "column" for each mapper, so if you have 20 mappers, you will receive 20 max values and 20 min values for each attribute. Therefore you need to calculate the max and min again, just like you did in the cleanup method of the mappers, and once this is done, then you can finally calculate the buckets.
You may be wondering "if I still need to find the max and min values in the reducer, then I could omit the cleanup method and do everything in the reducer, after all the code would be more or less the same". However, to do what you are asking, you can only work with one reducer, so if you omit the cleanup method and leave all the work to the reducer, the throughput would be the same as if working in one machine without Hadoop.
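A rough sketch of the whole idea described above, assuming comma-separated iris.csv lines with the four numeric attributes first and a driver that configures a single reducer (class names and parsing details are illustrative, not from the original answer):

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// In-mapper combining: each mapper keeps a running min/max per attribute
// and emits only those in cleanup(), so the single reducer sees a tiny collection.
class MinMaxMapper extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
    private static final int NUM_ATTRS = 4;   // sepal length/width, petal length/width
    private double[] min;
    private double[] max;

    @Override
    protected void setup(Context ctx) {
        min = new double[NUM_ATTRS];
        max = new double[NUM_ATTRS];
        Arrays.fill(min, Double.MAX_VALUE);
        Arrays.fill(max, -Double.MAX_VALUE);
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx) {
        String[] cols = value.toString().split(",");
        if (cols.length < NUM_ATTRS) {
            return;                           // skip blank or malformed lines
        }
        double[] v = new double[NUM_ATTRS];
        try {
            for (int i = 0; i < NUM_ATTRS; i++) {
                v[i] = Double.parseDouble(cols[i].trim());
            }
        } catch (NumberFormatException e) {
            return;                           // skip a header or non-numeric line
        }
        for (int i = 0; i < NUM_ATTRS; i++) {
            min[i] = Math.min(min[i], v[i]);
            max[i] = Math.max(max[i], v[i]);
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // One local min and one local max per attribute, per mapper.
        for (int i = 0; i < NUM_ATTRS; i++) {
            ctx.write(new IntWritable(i), new DoubleWritable(min[i]));
            ctx.write(new IntWritable(i), new DoubleWritable(max[i]));
        }
    }
}

// Single reducer (job.setNumReduceTasks(1)): recompute the global min/max per
// attribute from the mappers' local values, then derive the n buckets.
class BucketReducer extends Reducer<IntWritable, DoubleWritable, Text, Text> {
    private static final int N_BUCKETS = 5;

    @Override
    protected void reduce(IntWritable attr, Iterable<DoubleWritable> values, Context ctx)
            throws IOException, InterruptedException {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (DoubleWritable v : values) {
            min = Math.min(min, v.get());
            max = Math.max(max, v.get());
        }
        double width = (max - min) / N_BUCKETS;
        for (int b = 0; b < N_BUCKETS; b++) {
            ctx.write(new Text("attribute " + attr.get()),
                      new Text((min + b * width) + " - " + (min + (b + 1) * width)));
        }
    }
}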
I am looking for some advice on storing all possible permutations for the fringe pattern database.
So the fifteen-tile problem has 16! possible permutations; however, storing the values for the fringe, i.e. tiles 0 (blank tile), 3, 7, 11, 12, 13, 14, 15, means 16!/(16-8)! = 518,918,400 permutations.
I am looking to store all of these permutations in a data structure along with the value of the heuristic function (which is just incremented on each iteration of the breadth-first search). So far I am doing so, but very slowly: it took me 5 minutes to store 60,000, which is time I don't have!
At the moment I have a structure which looks like this.
Value Pos0 Pos3 Pos7 Pos11 Pos12 Pos13 Pos14 Pos15
where I store the positions of the given numbers. I have to use these positions as the ID so that, when I am calculating the heuristic value, I can quickly trawl through to the given composition and retrieve the value.
I am pretty unsure about this. The state of the puzzle is represented by an array, for example:
int[] goalState = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
My question is: what would be the best data structure to store these values, and the best way to retrieve them?
(This question was originally about storing them in a database, but now I want to store them in some form of local data structure, as retrieving from a database is slow.)
I can't really grasp what special meaning 0,3,7,11,12,13,14,15 have in your case. Is their position unchangeable? Is their position enough to identify the whole puzzle state?
Anyway, here is a general approach, you can narrow it down anytime:
As you have at most 16 possible tile values, I would try to use hexadecimal numbers to represent your permutations. So the state {1,2,3,6,5,4,7,8,9,10,11,12,13,14,15,0} would look like 0x123654789ABCDEF0 = 1312329218393956080. The biggest number possible would be 0xFEDCBA9876543210, which can still be handled as an unsigned long (unsigned operations exist only since Java 8) or alternatively stored in a BigInteger (there are many examples, I would prefer this). Such a number would be unique for each permutation and could be used as a primary key, and if you have the whole state, retrieving it from the database would be pretty fast.
//saving your permutation (note: no "0x" prefix, BigInteger(String, 16) expects bare hex digits)
String state = "FEDCBA9876543210";
BigInteger permutationForDatabase = new BigInteger(state, 16);
//and then you can insert it into the database as a number
//reading your permutation
char searchedCharacter = 'a';//let's say you look for tile 10; toString(16) produces lower-case digits
BigInteger permutation = ...;//here you read the number from the database
int tilePosition = permutation.toString(16).indexOf(searchedCharacter);
//careful: if the leading hex digit is 0, toString(16) drops it, shifting the positions by one
There might be a more elegant/performant solution to get the tile position (maybe some bit operation magic).
Each number 0-15 is a 4-bit number. You must represent 7 such numbers, making a minimum requirement of 28 bits, which is well within the 31 signed bit space of an int. Thus all permutations may be assigned, and derived from, an int.
To calculate this number, given variables a through g:
int key = a | (b << 4) | (c << 8) | (d << 12) | (e << 16) | (f << 20) | (g << 24);
To decode (if you need to):
int a = key & 0xF;
int b = (key >> 4) & 0xF;
int c = (key >> 8) & 0xF; // etc
Storing ints in a database is very efficient and will use minimal disk space:
create table heuristics (
key_value int not null,
heuristic varchar(32) not null -- as small as you can, char(n) if all the same length
);
After inserting all the rows, create a covering index for super fast lookup:
create unique index heuristics_covering on heuristics(key_value, heuristic);
If you create this index before insertion, insertions will be very, very slow.
Creating the data and inserting it is relatively straightforward coding.
So is my understanding correct that you're calculating a heuristic value for each possible puzzle state, and you want to be able to look it up later based on a given puzzle state? So that you don't have to calculate it on the fly? Presumably because of the time it takes to calculate the heuristic value.
So you're iterating over all the possible puzzle states, calculating the heuristic, and then storing that result. And it's taking a long time to do that. It seems like your assumption is that it's taking a long time to store the value - but what if the time lag you're seeing isn't the time it's taking to store the values in the data store, but rather the time it's taking to generate the heuristic values? That seems far more likely to me.
In that case, if you want to speed up the process of generating and storing the values, I might suggest splitting up the task into sections, and using several threads at once.
The fastest data structure, I believe, is going to be an in-memory hash table, with the hash key being your puzzle state and the value being your heuristic value. Others have already suggested reasonable ways of generating puzzle-state hash keys. The same hash table structure could be accessed by each of the threads which are generating and storing heuristic values for sections of the puzzle state domain.
Once you've populated the hash table, you can simply serialize it and store it in a binary file in the filesystem. Then have your heuristic value server load that into memory (and deserialize it into the in-memory hash table) when it starts up.
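A minimal sketch of that save/load step, assuming the packed puzzle state fits in an Integer key and the heuristic value in an Integer value (the class and method names are illustrative):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

// Serialize the populated table once, then deserialize it when the lookup server starts.
class HeuristicStore {
    static void save(HashMap<Integer, Integer> table, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(table);
        }
    }

    @SuppressWarnings("unchecked")
    static HashMap<Integer, Integer> load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return (HashMap<Integer, Integer>) in.readObject();
        }
    }
}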
If my premise is incorrect that it's taking a long time to generate the heuristic values, then it seems like you're doing something grossly sub-optimal when you go to store them. For example reconnecting to a remote database each time you store a value. That could potentially explain the 5 minutes. And if you're reconnecting every time you go to look up a value, that could explain why that is taking too long, too.
Depending on how big your heuristic values are, an in memory hash table might not be practical. A random-access binary file of records (with each record simply containing the heuristic value) could accomplish the same thing, potentially, but you'd need some way of mathematically mapping the hash key domain to the record index domain (which consists of sequential integers). If you're iterating over all the possible puzzle states, it seems like you already have a way of mapping puzzle states to sequential integers; you just have to figure out the math.
Using a local database table with each row simply having a key and a value is not unreasonable. You should definitely be able to insert 518 million rows in the space of a few minutes - you just need to maintain a connection during the data loading process, and build your index after your data load is finished. Once you've built the index on your key, a look up using the (clustered primary key integer) index should be pretty quick as long as you don't have to reconnect for every look up.
Also if you're committing rows into a database, you don't want to commit after each row, you'll want to commit every 1,000 or 10,000 rows. If you're committing after each row is inserted, that will substantially degrade your data loading performance.
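For the database route, a rough sketch of batched, periodically committed inserts over a single connection (the table follows the schema shown earlier; the method and parameter names are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Load rows over one connection, committing every 10,000 inserts instead of per row.
class HeuristicLoader {
    static void load(Iterable<int[]> rows, String jdbcUrl) throws SQLException {
        // rows: each element is {keyValue, heuristicValue}
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "insert into heuristics (key_value, heuristic) values (?, ?)")) {
                int count = 0;
                for (int[] row : rows) {
                    ps.setInt(1, row[0]);
                    ps.setString(2, Integer.toString(row[1]));   // heuristic column is varchar above
                    ps.addBatch();
                    if (++count % 10_000 == 0) {
                        ps.executeBatch();
                        conn.commit();
                    }
                }
                ps.executeBatch();   // flush the final partial batch
                conn.commit();
            }
        }
    }
}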
I have already installed Hadoop MapReduce on one node and I have a top-ten problem.
Let's say I have 10k (key, value) pairs and want to find the 10 entries with the best values.
Actually, I created a simple project that iterates over the whole data set, and I need just a couple of minutes to get the answer.
Then I created a MapReduce application with the top-ten design pattern to solve the same problem, and I need more than 4 hours to get the answer. (Obviously, I use the same machine and the same algorithm to sort.)
I think that probably happens because MapReduce needs more services running, more network activity, and more effort to read from and write to HDFS.
Are there any other factors that explain why MapReduce (in that setup) is slower than not using MapReduce?
MapReduce is slower on a single-node setup because only one mapper and one reducer can work on it at any given time. The mapper has to iterate through each one of the splits, and the reducer works on two mapper outputs simultaneously, then on two such reducer outputs, and so on.
So, in terms of complexity:
for a normal project: t(n) = n => O(n)
for MapReduce: t(n) = (n/x) * t(n/2x) => O((n/x) log(n/x)), where x is the number of nodes
Which do you think is bigger, for a single node and for multiple nodes?
Explanation of the MapReduce complexity:
time for one iteration: n
number of simultaneous map functions: x, since only one can work on each node
so the time required for mapping the complete data: n/x, since n is the time 1 mapper takes for the complete data
for the reduce job, half of the time is required compared to the previous map, since it works on two mapper outputs simultaneously; therefore: time = n/2x for x reducers on x nodes
hence the recurrence that every next step takes half the time of the previous one:
t(n) = (n/x) * t(n/2x)
solving this recursion we get O((n/x) log(n/x)).
This is not supposed to be exact, but an approximation.
In Java, is there any way to store time ranges as keys in a HashMap? I have one HashMap in which I store time ranges. For example:
I enter the range 0-50 as a key, and for that key I will store some other object as the value. Now when I say 10, I should be able to get the corresponding value for that key.
Any value between 0-50 should get that object.
// pseudocode of what I would like to do
Map map = new HashMap();
map.put(0-50, "some object");
map.put(51-100, "some other object");
Now, when I say map.get(10), it should be able to get "some object". Please suggest how to do this.
I wouldn't use a map; instead I would try an R-tree. An R-tree is a tree structure created for indexing spatial data. It stores rectangles. It is often used to test whether a point (coordinate) lies within another geometry. Those geometries are approximated by rectangles, and those are stored in the tree.
To store a rectangle (or the information about it) you only need to save the lower-left and upper-right corner coordinates. In your case this would be the lower and upper bound of the time span. You can think of it as if all y values of the coordinates were 0. Then you can query the tree with your time value.
And of course you would save the value at each leaf (time span/rectangle).
A simple search on Google for r-tree java brings up some promising results. Implementing your own R-tree isn't trivial, but it is not too complicated if you understand the principle of rearranging the tree upon insertion/deletion. In your one-dimensional case it might be even simpler.
Assumptions: Non-overlapping ranges.
You can store the starting points and ending points of the ranges in a TreeSet. The starting point and ending point are objects that store the starting time and ending time respectively, plus (a reference to) the value object. You have to define a comparison function so that the objects are ordered by time.
You can obtain the object by using the floor() or ceiling() function of TreeSet.
Note that the ranges should NOT overlap, even at the endpoints (e.g. 3-6 and 6-10)
This will give you log complexity for range insertion and query.
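A minimal sketch of that idea, using a TreeMap keyed by the range start (a close relative of the TreeSet-of-endpoints approach described above; it assumes non-overlapping ranges):

import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.TreeMap;

// Non-overlapping ranges keyed by their starting point.
// get(t) finds the greatest start <= t and checks that t does not exceed that range's end.
class RangeMap<V> {
    private final TreeMap<Integer, SimpleEntry<Integer, V>> map = new TreeMap<>(); // start -> (end, value)

    void put(int start, int end, V value) {
        map.put(start, new SimpleEntry<>(end, value));
    }

    V get(int point) {
        Map.Entry<Integer, SimpleEntry<Integer, V>> e = map.floorEntry(point);
        if (e == null || point > e.getValue().getKey()) {
            return null;                     // point falls outside every stored range
        }
        return e.getValue().getValue();
    }
}

// Usage:
// RangeMap<String> r = new RangeMap<>();
// r.put(0, 50, "some object");
// r.put(51, 100, "some other object");
// r.get(10);   // -> "some object"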
If these are non-overlapping, equidistant ranges, i.e. ranges split by say 50, you can solve this problem by maintaining a hash keyed on the maximum numbers, like
50 - 'some object', 100 - 'some other object', etc.
If the input is 10, derive the next multiple of 50 and get the value for that key.
You can arrive at the next multiple of 50 as follows:
take the input mod 50, say for the input 90: 90 % 50 = 40
compute the difference between the step 1 result and 50, i.e. 50 - 40 = 10
add the step 2 result to the input, i.e. 90 + 10 = 100
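A small illustration of those three steps (how to treat an input that is already an exact multiple of 50 is my own assumption, not part of the original suggestion):

// Derive the bucket key: the next multiple of 50 at or above the input.
static int bucketKey(int input) {
    int remainder = input % 50;          // step 1: 90 % 50 = 40
    if (remainder == 0) {
        return input;                    // assumption: exact multiples map to themselves
    }
    return input + (50 - remainder);     // steps 2 and 3: 90 + (50 - 40) = 100
}

// map.get(bucketKey(90))  -> the value stored under key 100
// map.get(bucketKey(10))  -> the value stored under key 50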
You need to map the ranges to a single key. Why don't you use something like a range-manager object which, for any value between min and max, returns the key 1, for example? Alternatively, you can put the same object as the value for all keys between 1 and 50 using a for loop, but in my eyes that would be a waste.
It sounds like a simple job, but with MapReduce it doesn't seem that straight-forward.
I have N files, each of which contains only one line of text. I'd like the Mapper to output key-value pairs like <filename, score>, in which 'score' is an integer calculated from the line of text. As a side note, I am using the snippet below to do so (hope it's correct).
FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
String fileName = fileSplit.getPath().getName();
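(For reference, with the newer org.apache.hadoop.mapreduce API the equivalent inside map() should look roughly like this, using the Context instead of the Reporter, with FileSplit imported from org.apache.hadoop.mapreduce.lib.input:)

FileSplit fileSplit = (FileSplit) context.getInputSplit();
String fileName = fileSplit.getPath().getName();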
Assuming the mapper does its job correctly, it should output N key value pairs.
Now the problem is how should I program the Reducer to output the one key value pair with the maximum 'score'?
From what I know, the Reducer only works with key-value pairs that share the same key. Since the outputs in this scenario all have different keys, I am guessing something should be done before the Reduce step. Or perhaps the Reduce step should be omitted altogether?
Let's assume that
File1 has 10,123,23,233
File2 has 1,3,56,1234
File3 has 6,1,3435,678
Here is the approach for finding the maximum number from all the input files.
Let's first do some random sampling (say, every Nth record): from File1, 123 and 10; from File2, 56 and 1; from File3, 1 and 678.
Pick the maximum number from the random sampling, which is 678.
Pass the maximum number from the random sampling to the mappers, ignore any input number less than that maximum, and emit the rest from the mappers (a sketch of such a mapper appears after the observations below). The mappers will ignore anything less than 678 and emit 678, 1234 and 3435.
Configure the job to use 1 reducer and find the max of all the numbers sent to the reducer. In this scenario the reducer will receive 678, 1234 and 3435, and will calculate the maximum number to be 3435.
Some observations about the above approach:
The data has to be passed twice.
The data transferred between the mappers and reducers is decreased.
The data processed by the reducers also decreases.
The better the input sampling, the faster the job completes.
A combiner with similar functionality to the reducer will further improve the job time.
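Here is the filtering-mapper sketch referred to in step 3, assuming the driver passes the sampled maximum through the job Configuration under a made-up property name and that the input lines are comma-separated numbers as in the example:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits only the numbers that are >= the threshold found by the sampling pass;
// a single reducer then takes the maximum of whatever reaches it.
class ThresholdMapper extends Mapper<LongWritable, Text, NullWritable, LongWritable> {
    private long threshold;

    @Override
    protected void setup(Context ctx) {
        // "sample.max" is an illustrative property name set by the driver.
        threshold = ctx.getConfiguration().getLong("sample.max", Long.MIN_VALUE);
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        for (String token : value.toString().split(",")) {
            long n = Long.parseLong(token.trim());
            if (n >= threshold) {
                ctx.write(NullWritable.get(), new LongWritable(n));
            }
        }
    }
}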
You can use the setup() and cleanup() methods (the configure() and close() methods in the old API).
Declare an instance variable in the reduce class which tracks the maximum score. For each call to reduce(), you would compare the input value (score) with that variable.
setup() is called once before all reduce invocations in the same reduce task, and cleanup() is called after the last reduce invocation in the same reduce task. So, if you have multiple reducers, the setup() and cleanup() methods are called separately in each reduce task.
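A minimal sketch of such a reducer, assuming the mappers emit the filename as a Text key and the score as an IntWritable value, and that the job runs with a single reducer (names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Tracks the best (filename, score) across all reduce() calls in this task
// and emits it once in cleanup().
class MaxScoreReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private String bestFile = null;
    private int bestScore = Integer.MIN_VALUE;

    @Override
    protected void reduce(Text fileName, Iterable<IntWritable> scores, Context ctx) {
        for (IntWritable score : scores) {
            if (score.get() > bestScore) {
                bestScore = score.get();
                bestFile = fileName.toString();
            }
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        if (bestFile != null) {
            ctx.write(new Text(bestFile), new IntWritable(bestScore));
        }
    }
}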
You can return the filename and the score as the value, and just return some constant as the key, from your mapper.
Refer to slides 32 and 33 of http://www.slideshare.net/josem.alvarez/map-reduceintro
I used the same approach and got the result. The only concern is that when you have multiple fields, you need to create fieldnamemin and fieldnamemax individually.
Omit the Reducer!
Use the Configuration to set the global variable for the score and key, and then access it in the mapper to do a simple selection of the max score, using the global variable as the memory of the max score and key.
It should be simple, I guess.
I am reading about MapReduce and the following thing is confusing me.
Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood to go about it is as follows:
Write a mapper function that sorts integers. So the framework will divide the input file into multiple chunks and would give them to different mappers. Each mapper will sort their chunk of data independent of each other. Once all the mappers are done, we will pass each of their results to Reducer and it will combine the result and give me the final output.
My doubt is: if we have only one reducer, then how does this leverage the distributed framework, if, eventually, we have to combine the result in one place? The problem drills down to merging 1 million entries in one place. Is that so, or am I missing something?
Thanks,
Chander
Check out merge-sort.
It turns out that sorting partially sorted lists is much more efficient in terms of operations and memory consumption than sorting the complete list.
If the reducer gets 4 sorted lists it only needs to look for the smallest element of the 4 lists and pick that one. If the number of lists is constant this reducing is an O(N) operation.
Also, typically the reducers are "distributed" in something like a tree, so the work can be parallelized too.
As others have mentioned, merging is much simpler than sorting, so there's a big win there.
However, doing an O(N) serial operation on a giant dataset can be prohibitive, too. As you correctly point out, it's better to find a way to do the merge in parallel, as well.
One way to do this is to replace the partitioning function, from the default hash partitioner (which is what's normally used) to something a bit smarter. What Pig does for this, for example, is sample your dataset to come up with a rough approximation of the distribution of your values, and then assign ranges of values to different reducers. Reducer 0 gets all elements < 1000, reducer 1 gets all elements >= 1000 and < 5000, and so on. Then you can do the merge in parallel, and the end result is sorted, since you know the ordering of the reducer tasks.
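A toy version of such a range partitioner (the cut points are hard-coded for illustration; a real job would derive them by sampling, as described above, and this assumes integer keys):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a reducer based on which value range it falls into, so that
// concatenating the reducer outputs in order yields a fully sorted result.
class RangePartitioner extends Partitioner<IntWritable, Text> {
    private static final int[] CUTS = {1000, 5000, 20000};   // illustrative boundaries

    @Override
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        int i = 0;
        while (i < CUTS.length && key.get() >= CUTS[i]) {
            i++;
        }
        return Math.min(i, numPartitions - 1);   // clamp if fewer reducers are configured
    }
}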
So the simplest way to sort using MapReduce (though not the most efficient one) is to do the following:
During the Map phase:
for each (Input_Key, Input_Value), emit (Input_Value, Input_Key)
The Reducer is an identity reducer.
So, for example, if our data is a (student, age) database, then your mapper input would be
('A', 1) ('B', 2) ('C', 10) ... and the output would be
(1, A) (2, B) (10, C)
I haven't tried this logic out, but it is a step in a homework problem I am working on. I will put up an update with a source code / logic link.
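A minimal sketch of that approach (untested; it assumes plain text input lines like "A,1" and illustrative class names):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: parse "name,age" lines and emit (age, name) so the shuffle sorts by age.
class SwapMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(",");
        ctx.write(new IntWritable(Integer.parseInt(parts[1].trim())), new Text(parts[0].trim()));
    }
}

// Reduce: identity -- just write the sorted (age, name) pairs back out.
class IdentityAgeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable age, Iterable<Text> names, Context ctx)
            throws IOException, InterruptedException {
        for (Text name : names) {
            ctx.write(age, name);
        }
    }
}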
Sorry for being late, but for future readers: yes, Chander, you are missing something.
The logic is that a reducer can handle the shuffled and sorted data only of the node on which it is running. I mean, a reducer that runs on one node can't look at another node's data; it applies the reduce algorithm on its own data only. So the merging procedure of merge sort can't be applied directly.
So for big data we use TeraSort, which is nothing but an identity mapper and reducer with a custom partitioner. You can read more about it in Hadoop's implementation of TeraSort. It states:
"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N ā 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i ā 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1."
I think combining multiple sorted items is more efficient than combining multiple unsorted items. So the mappers do the task of sorting chunks, and the reducer merges them. Had the mappers not done the sorting, the reducer would have a tough time doing the sorting.
Sorting can be efficiently implemented using MapReduce. But you seem to be thinking about implementing merge sort using MapReduce to achieve this purpose, and it may not be the ideal candidate.
As you alluded to, merge sort (with MapReduce) would involve the following steps:
Partition the elements into small groups and assign each group to the mappers in a round-robin manner
Each mapper will sort its subset and return {K, {sorted subset}}, where K is the same for all mappers
Since the same K is used across all mappers, there is only one reduce group and hence only one reducer. The reducer can merge the data and return the sorted result
The problem here is that, as you mentioned, there can be only one reducer, which precludes parallelism during the reduce phase. As mentioned in the other replies, MapReduce-specific implementations like TeraSort can be considered for this purpose.
I found the explanation at http://www.chinacloud.cn/upload/2014-01/14010410467139.pdf
Coming back to merge sort, this would be feasible if the Hadoop (or equivalent) tool provided a hierarchy of reducers, where the output of one level of reducers goes to the next level of reducers or loops back to the same set of reducers.