Hadoop - aggregating by prefix - Java

I have words with a prefix, e.g.:
city|new york
city|London
travel|yes
...
city|new york
I want to count how many city|new york and city|London entries there are (which is classic word count). But the reducer output should be a key-value pair like city:{"new york": 2, "london": 1}. That is, for each city prefix, I want to aggregate all the strings and their counts.
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    // Instead of just the result count, I need something like {"city":{"new york":2, "london":1}}
    context.write(key, result);
}
Any ideas?

You can use the cleanup() method of the reducer to achieve this (assuming you have just one reducer). It is called once at the end of the reduce task.
I will explain this for the "city" data.
Following is the code:
package com.hadooptests;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
public class Cities {

    public static class CityMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private Text outKey = new Text();
        private IntWritable outValue = new IntWritable(1);

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            outKey.set(value);
            context.write(outKey, outValue);
        }
    }

    public static class CityReducer
            extends Reducer<Text, IntWritable, Text, Text> {

        HashMap<String, Integer> cityCount = new HashMap<String, Integer>();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            for (IntWritable val : values) {
                String keyStr = key.toString();
                if (keyStr.toLowerCase().startsWith("city|")) {
                    String[] tokens = keyStr.split("\\|");
                    if (cityCount.containsKey(tokens[1])) {
                        int count = cityCount.get(tokens[1]);
                        cityCount.put(tokens[1], count + val.get());
                    } else {
                        cityCount.put(tokens[1], val.get());
                    }
                }
            }
        }

        @Override
        public void cleanup(Context context) throws IOException, InterruptedException {
            String output = "{\"city\":{";
            Iterator<Map.Entry<String, Integer>> iterator = cityCount.entrySet().iterator();
            while (iterator.hasNext()) {
                Map.Entry<String, Integer> entry = iterator.next();
                output = output.concat("\"" + entry.getKey() + "\":" + entry.getValue() + ", ");
            }
            output = output.substring(0, output.length() - 2);
            output = output.concat("}}");
            context.write(new Text(output), new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "KeyValue");
        job.setJarByClass(Cities.class);
        job.setMapperClass(CityMapper.class);
        job.setReducerClass(CityReducer.class);
        // map output is (Text, IntWritable); the final reducer output is (Text, Text)
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/in/in.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/out/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Mapper:
It just outputs a count of 1 for each key it encounters. For example, if it encounters the record "city|new york", it will output the (key, value) pair ("city|new york", 1).
Reducer:
For each record, it checks whether the key starts with "city|". It splits the key on the pipe ("|") and stores the count for each city in a HashMap.
The reducer also overrides the cleanup() method, which is called once the reduce task is over. There, the contents of the HashMap are composed into the desired output.
In cleanup(), the contents of the HashMap are written as the key and an empty string as the value.
For example, I took the following data as input:
city|new york
city|London
city|new york
city|new york
city|Paris
city|Paris
I got the following output:
{"city":{"London":1, "new york":3, "Paris":2}}

It's simple.
Emit from the mapper using the prefix ("city" or "travel") as the output key and the whole record as the output value.
You will then get all the "city" records partitioned into a single group at one reduce call, and the "travel" records as another group.
Count the city and the travel instances with a HashMap inside the reduce call to drill down to the lower level; a rough sketch follows.
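A rough sketch of that approach (my own class names; it assumes the same imports as the full example above and that every record looks like prefix|value):

// Sketch: the mapper keys each record by its prefix ("city", "travel", ...),
// so one reduce call sees all values for that prefix and can tally them locally.
public static class PrefixMapper extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\\|", 2);   // "city|new york" -> ["city", "new york"]
        if (parts.length == 2) {
            context.write(new Text(parts[0]), new Text(parts[1]));
        }
    }
}

public static class PrefixReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (Text val : values) {
            String v = val.toString();
            Integer c = counts.get(v);
            counts.put(v, c == null ? 1 : c + 1);
        }
        // e.g. key = city, value = {new york=2, London=1}
        context.write(key, new Text(counts.toString()));
    }
}

This writes one line per prefix straight from reduce(), so it needs neither cleanup() nor a single-reducer assumption; the job just has to set the map output value class to Text.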

Related

Hadoop chaining job error expected org.apache.hadoop.io.DoubleWritable, received org.apache.hadoop.io.LongWritable

I'm learning Hadoop and I am trying to reproduce a job chaining example.
The first job sums the sales per video game.
The second mapper is only there to swap key and value so the output gets ordered by sales amount rather than by title.
I am getting this error on the second mapper:
Type mismatch in key from map: expected org.apache.hadoop.io.DoubleWritable, received org.apache.hadoop.io.LongWritable
I am confused about where the LongWritable is coming from, since it's only Text and DoubleWritable everywhere. What am I missing?
// 1st mapper
public class MonMap extends Mapper<Object, Text, Text, DoubleWritable> {
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
        String line = value.toString();
        // regex to avoid split on a comma between double quotes
        String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
        String jeux = tokens[1];
        Double sales = Double.parseDouble(tokens[10]);
        context.write(new Text(jeux), new DoubleWritable(sales));
    }
}

// reducer
public class MonReduce extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text key, Iterable<DoubleWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
        Double somme = 0.0;
        for (DoubleWritable val : values) {
            somme += val.get();
        }
        context.write(key, new DoubleWritable(somme));
    }
}

// 2nd mapper
public class KeyValueSwapper extends Mapper<Text, DoubleWritable, DoubleWritable, Text> {
    public void map(Text key, DoubleWritable value, Context context
                    ) throws IOException, InterruptedException {
        context.write(value, key);
    }
}

// main
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(MonMap.class);
job.setCombinerClass(MonReduce.class);
job.setReducerClass(MonReduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
if (!job.waitForCompletion(true)) {
    System.exit(1);
}

Job job2 = Job.getInstance(conf, "sort by sales");
job2.setJarByClass(WordCount.class);
job2.setMapperClass(KeyValueSwapper.class);
job2.setOutputKeyClass(DoubleWritable.class);
job2.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job2, new Path(args[1]));
FileOutputFormat.setOutputPath(job2, new Path(args[2]));
if (!job2.waitForCompletion(true)) {
    System.exit(1);
}
Thank you!
Your mappers are both reading their input through FileInputFormat with the default TextInputFormat, which delivers <LongWritable, Text> inputs; that is where the LongWritable comes from.
That being said, you'll need to read, split and parse the lines of text in the second mapper.
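For instance, a minimal sketch of the second mapper (my own sketch, assuming job 1 wrote its output with the default TextOutputFormat, i.e. tab-separated title and sales):

// Sketch: the second job still reads <LongWritable, Text>, so the mapper
// has to parse the "title<TAB>sales" lines produced by the first job.
public class KeyValueSwapper extends Mapper<LongWritable, Text, DoubleWritable, Text> {
    public void map(LongWritable key, Text value, Context context
                    ) throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t");
        if (parts.length == 2) {
            context.write(new DoubleWritable(Double.parseDouble(parts[1])), new Text(parts[0]));
        }
    }
}

With that signature, job2.setOutputKeyClass(DoubleWritable.class) and job2.setOutputValueClass(Text.class) match what the mapper actually emits.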
Overall I'd recommend using Pig, Hive, Spark, or Flink... or just Pandas if you don't need Hadoop. Not many people use plain MapReduce in my experience.

How to get a simple DAG to work in Hazelcast Jet?

While working on my DAG in Hazelcast Jet, I stumbled into a weird problem. To track down the error I simplified my approach completely, and it seems that the edges are not working according to the tutorial.
The code below is almost as simple as it gets. Two vertices (one source, one sink), one edge.
The source is reading from a map, the sink should put into a map.
The data.addEntryListener correctly tells me that the map is filled with 100 lists (each with 25 objects of 400 bytes) by another application ... and then nothing. The map fills up, but the DAG doesn't interact with it at all.
Any idea where to look for the problem?
package be.andersch.clusterbench;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.hazelcast.config.Config;
import com.hazelcast.config.SerializerConfig;
import com.hazelcast.core.EntryEvent;
import com.hazelcast.jet.*;
import com.hazelcast.jet.config.JetConfig;
import com.hazelcast.jet.stream.IStreamMap;
import com.hazelcast.map.listener.EntryAddedListener;
import be.andersch.anotherpackage.myObject;

import java.util.List;
import java.util.concurrent.ExecutionException;

import static com.hazelcast.jet.Edge.between;
import static com.hazelcast.jet.Processors.*;

/**
 * Created by abernard on 24.03.2017.
 */
public class Analyzer {
    private static final ObjectMapper mapper = new ObjectMapper();
    private static JetInstance jet;
    private static final IStreamMap<Long, List<String>> data;
    private static final IStreamMap<Long, List<String>> testmap;

    static {
        JetConfig config = new JetConfig();
        Config hazelConfig = config.getHazelcastConfig();
        hazelConfig.getGroupConfig().setName("name").setPassword("password");
        hazelConfig.getNetworkConfig().getInterfaces().setEnabled(true).addInterface("my_IP_range_here");
        hazelConfig.getSerializationConfig().getSerializerConfigs().add(
                new SerializerConfig().
                        setTypeClass(myObject.class).
                        setImplementation(new OsamKryoSerializer()));
        jet = Jet.newJetInstance(config);
        data = jet.getMap("data");
        testmap = jet.getMap("testmap");
    }

    public static void main(String[] args) throws ExecutionException, InterruptedException {
        DAG dag = new DAG();
        Vertex source = dag.newVertex("source", readMap("data"));
        Vertex test = dag.newVertex("test", writeMap("testmap"));
        dag.edge(between(source, test));

        jet.newJob(dag).execute().get();

        data.addEntryListener((EntryAddedListener<Long, List<String>>) (EntryEvent<Long, List<String>> entryEvent) -> {
            System.out.println("Got data: " + entryEvent.getKey() + " at " + System.currentTimeMillis() + ", Size: " + jet.getHazelcastInstance().getMap("data").size());
        }, true);
        testmap.addEntryListener((EntryAddedListener<Long, List<String>>) (EntryEvent<Long, List<String>> entryEvent) -> {
            System.out.println("Got test: " + entryEvent.getKey() + " at " + System.currentTimeMillis());
        }, true);

        Runtime.getRuntime().addShutdownHook(new Thread(() -> Jet.shutdownAll()));
    }
}
The Jet job is already finished at the line jet.newJob(dag).execute().get(), before you even create the entry listeners. This means that the job runs on an empty map. Maybe your confusion is about the nature of this job: it's a batch job, not an infinite stream processing one. Jet version 0.3 does not yet support infinite stream processing. A rough reordering is sketched below.
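A minimal sketch of what that implies, reusing only the calls already in the posted code (configuration and static fields assumed to stay as posted): make sure the data map is populated first, then submit the batch job, then look at testmap.

// Sketch: run the batch DAG only after the source map has been filled.
public static void main(String[] args) throws ExecutionException, InterruptedException {
    DAG dag = new DAG();
    Vertex source = dag.newVertex("source", readMap("data"));
    Vertex sink = dag.newVertex("sink", writeMap("testmap"));
    dag.edge(between(source, sink));

    // 1. Wait here until the other application has finished writing its 100 lists into "data".

    // 2. Only now run the batch job; it processes whatever "data" contains at this moment.
    jet.newJob(dag).execute().get();

    // 3. After get() returns, the copied entries are in "testmap".
    System.out.println("testmap size: " + testmap.size());
}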

Java Hadoop MapReduce Multiple Value

I was trying to build a movie recommendation system and have been following this website: LinkHere
def count_ratings_users_freq(self, user_id, values):
    """
    For each user, emit a row containing their "postings"
    (item,rating pairs)
    Also emit user rating sum and count for use in later steps.
    output:
    userid, number of movies rated by user, rating sum, (movieid, movie rating)
    17    1,3,(70,3)
    35    1,1,(21,1)
    49    3,7,(19,2 21,1 70,4)
    87    2,3,(19,1 21,2)
    98    1,2,(19,2)
    """
    item_count = 0
    item_sum = 0
    final = []
    for item_id, rating in values:
        item_count += 1
        item_sum += rating
        final.append((item_id, rating))
    yield user_id, (item_count, item_sum, final)
Is it possible to translate the above code to Java with Hadoop Map and Reduce?
userid as the key;
number of movies rated by the user, rating sum, and (movieid, movie rating) pairs as the value.
Thank you!
Yes, you can convert this into a MapReduce program.
The mapper logic:
Assuming that the input is of the format (user ID, movie ID, movie rating) (e.g. 17,70,3), you can split each line on the comma (,) and emit "user ID" as the key and (movie ID, movie rating) as the value. E.g. for the record (17,70,3), you emit key (17) and value (70,3).
The reducer logic:
You keep 3 variables: movieCount (integer), movieRatingCount (integer), movieValues (string).
For each value, you parse it and extract the "movie rating". E.g. for the value (70,3), you parse out the movie rating = 3.
For each valid record, you increment movieCount, add the parsed "movie rating" to movieRatingCount, and append the value to the movieValues string.
You will get the desired output.
Following is the piece of code, which achieves this:
package com.myorg.hadooptests;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class MovieRatings {

    public static class MovieRatingsMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String valueStr = value.toString();
            int index = valueStr.indexOf(',');
            if (index != -1) {
                try {
                    IntWritable keyUserID = new IntWritable(Integer.parseInt(valueStr.substring(0, index)));
                    context.write(keyUserID, new Text(valueStr.substring(index + 1)));
                } catch (Exception e) {
                    // You could get a NumberFormatException
                }
            }
        }
    }

    public static class MovieRatingsReducer
            extends Reducer<IntWritable, Text, IntWritable, Text> {

        public void reduce(IntWritable key, Iterable<Text> values,
                           Context context) throws IOException, InterruptedException {
            int movieCount = 0;
            int movieRatingCount = 0;
            String movieValues = "";

            for (Text value : values) {
                String[] tokens = value.toString().split(",");
                if (tokens.length == 2) {
                    movieRatingCount += Integer.parseInt(tokens[1].trim()); // You could get a NumberFormatException
                    movieCount++;
                    movieValues = movieValues.concat(value.toString() + " ");
                }
            }

            context.write(key, new Text(Integer.toString(movieCount) + "," + Integer.toString(movieRatingCount) + ",(" + movieValues.trim() + ")"));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "CompositeKeyExample");
        job.setJarByClass(MovieRatings.class);
        job.setMapperClass(MovieRatingsMapper.class);
        job.setReducerClass(MovieRatingsReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/in/in2.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/out/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
For the input:
17,70,3
35,21,1
49,19,2
49,21,1
49,70,4
87,19,1
87,21,2
98,19,2
I got the output:
17 1,3,(70,3)
35 1,1,(21,1)
49 3,7,(70,4 21,1 19,2)
87 2,3,(21,2 19,1)
98 1,2,(19,2)

Retrieving nth qualifier in hbase using java

This question is a bit out of the box, but I need it.
In a List (collection), we can retrieve the nth element with list.get(i).
Similarly, is there any method in HBase, using the Java API, where I can get the nth qualifier given the row id and column family name?
NOTE: I have a million qualifiers in a single row in a single column family.
Sorry for being unresponsive; I was busy with something important. Try this for now:
package org.myorg.hbasedemo;
import java.io.IOException;
import java.util.Scanner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
public class GetNthColunm {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "TEST");
        Get g = new Get(Bytes.toBytes("4"));
        Result r = table.get(g);

        System.out.println("Enter column index :");
        Scanner reader = new Scanner(System.in);
        int index = reader.nextInt();
        System.out.println("index : " + index);

        int count = 0;
        for (KeyValue kv : r.raw()) {
            if (++count != index)
                continue;
            System.out.println("Qualifier : " + Bytes.toString(kv.getQualifier()));
            System.out.println("Value : " + Bytes.toString(kv.getValue()));
        }

        table.close();
        System.out.println("Done.");
    }
}
Will let you know if I get a better way to do this.

How to sort a list of the most frequently repeated words through Hadoop mapreduce WordCount? [closed]

Hi, I am a newbie to Hadoop MapReduce.
Could anyone help me modify the code posted below to display the desired output?
I have a given input file:
Input: Hi my name is John.Im doing my engineering.My parents stay at California
I get the output as
Hi 1
my 3
name 1
is 1
is 1
John 1
doing 1
engineering 1
parents 1
stay 1
at 1
California 1
But I want the output to be sorted as
my 3
Hi 1
etc.....
then all the others. The concept is that the words that are repeated the maximum number of times should be sorted and displayed first.
I'm running this job on a single node, as:
$ hadoop jar job.jar input output
And I've started:
$ hadoop namenode -format
$ hadoop namenode
$ hadoop datanode
sbin$ ./yarn-daemon.sh start resourcemanager
sbin$ ./yarn-daemon.sh start resourcemanager
I'm running hadoop-2.0.0-cdh4.0.0
package org.apache.hadoop.examples;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
    private static final Log LOG = LogFactory.getLog(WordCount.class);

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
            int sum = 0;
            //printKeyAndValues(key, values);
            for (IntWritable val : values) {
                sum += val.get();
                LOG.info("val = " + val.get());
            }
            LOG.info("sum = " + sum + " key = " + key);
            result.set(sum);
            context.write(key, result);
            //System.err.println(String.format("[reduce] word: (%s), count: (%d)", key, result.get()));
        }

        // a little method to print debug output
        private void printKeyAndValues(Text key, Iterable<IntWritable> values) {
            StringBuilder sb = new StringBuilder();
            for (IntWritable val : values) {
                sb.append(val.get() + ", ");
            }
            System.err.println(String.format("[reduce] key: (%s), value: (%s)", key, sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
It would be great if anyone could sort this thing out.
How about negating the count for each word? Starting from 0 you will have negative counts, so the word with the highest count comes first under the default ascending sort. A rough sketch of that idea follows.
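A minimal sketch of that idea (my own class names; it assumes a second job chained after the word count that reads its tab-separated "word<TAB>count" output, with the same imports as the code above):

// Sketch: emit the negated count as the key, so the default ascending
// IntWritable sort delivers the most frequent words first.
public static class NegatedCountMapper extends Mapper<Object, Text, IntWritable, Text> {
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t");
        if (parts.length == 2) {
            context.write(new IntWritable(-Integer.parseInt(parts[1])), new Text(parts[0]));
        }
    }
}

// Sketch: flip the sign back when writing the final, sorted output.
public static class FlipSignReducer extends Reducer<IntWritable, Text, Text, IntWritable> {
    public void reduce(IntWritable negCount, Iterable<Text> words, Context context)
            throws IOException, InterruptedException {
        for (Text word : words) {
            context.write(word, new IntWritable(-negCount.get()));
        }
    }
}

The second job would set the map output key/value classes to IntWritable/Text and the final output classes to Text/IntWritable.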
