Java Hadoop MapReduce Multiple Value

I was trying to build a movie recommendation system and have been following this website: LinkHere
def count_ratings_users_freq(self, user_id, values):
    """
    For each user, emit a row containing their "postings"
    (item, rating pairs).
    Also emit the user's rating sum and count for use in later steps.

    Output:
    userid, number of movies rated by user, sum of ratings, (movieid, movie rating)
    17    1,3,(70,3)
    35    1,1,(21,1)
    49    3,7,(19,2 21,1 70,4)
    87    2,3,(19,1 21,2)
    98    1,2,(19,2)
    """
    item_count = 0
    item_sum = 0
    final = []
    for item_id, rating in values:
        item_count += 1
        item_sum += rating
        final.append((item_id, rating))
    yield user_id, (item_count, item_sum, final)
Is it possible to translate the above code to Java with Hadoop MapReduce?
userid as the key;
the number of movies rated by the user, the rating sum, and the (movieid, movie rating) pairs as the value.
Thank you!

Yes, you can convert this into a MapReduce program.
The mapper logic:
Assuming the input is of the format (user ID, movie ID, movie rating) (e.g. 17,70,3), you can split each line on the comma (,) and emit "user ID" as the key and (movie ID, movie rating) as the value. E.g. for the record (17,70,3), you emit key (17) and value (70,3).
The reducer logic:
You keep 3 variables: movieCount (integer), movieRatingCount (integer), and movieValues (string).
For each value, you parse out the "movie rating". E.g. for the value (70,3), you parse out movie rating = 3.
For each valid record, you increment movieCount, add the parsed "movie rating" to movieRatingCount, and append the value to the movieValues string.
This gives you the desired output.
Following is the code that achieves this:
package com.myorg.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MovieRatings {

    public static class MovieRatingsMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String valueStr = value.toString();
            int index = valueStr.indexOf(',');
            if (index != -1) {
                try {
                    // Key: the user ID before the first comma
                    IntWritable keyUserID =
                            new IntWritable(Integer.parseInt(valueStr.substring(0, index)));
                    // Value: the rest of the line, i.e. "movieID,rating"
                    context.write(keyUserID, new Text(valueStr.substring(index + 1)));
                } catch (Exception e) {
                    // You could get a NumberFormatException for malformed records; skip them
                }
            }
        }
    }

    public static class MovieRatingsReducer
            extends Reducer<IntWritable, Text, IntWritable, Text> {

        public void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int movieCount = 0;
            int movieRatingCount = 0;
            String movieValues = "";

            for (Text value : values) {
                String[] tokens = value.toString().split(",");
                if (tokens.length == 2) {
                    movieRatingCount += Integer.parseInt(tokens[1].trim()); // could throw NumberFormatException
                    movieCount++;
                    movieValues = movieValues.concat(value.toString() + " ");
                }
            }

            context.write(key, new Text(movieCount + "," + movieRatingCount
                    + ",(" + movieValues.trim() + ")"));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "CompositeKeyExample");
        job.setJarByClass(MovieRatings.class);
        job.setMapperClass(MovieRatingsMapper.class);
        job.setReducerClass(MovieRatingsReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/in/in2.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/out/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
For the input:
17,70,3
35,21,1
49,19,2
49,21,1
49,70,4
87,19,1
87,21,2
98,19,2
I got the output:
17 1,3,(70,3)
35 1,1,(21,1)
49 3,7,(70,4 21,1 19,2)
87 2,3,(21,2 19,1)
98 1,2,(19,2)

Related

Validate ArrayList contents against a specific set of data

I want to check and verify that all of the contents in the ArrayList match the value of a String variable. If any value does not match, its index number should be printed with an error message like (value at index 2 didn't match the value of the expectedName variable).
When I run the code below, it prints the error message for all three indexes, instead of only for index number 1.
Please note that I'm reading the data from a CSV file, putting it into an ArrayList, and then validating it against the expected data in a String variable.
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;

public class ValidateVideoDuration {

    private static final String CSV_FILE_PATH = "C:\\Users\\videologs.csv";

    public static void main(String[] args) throws IOException {
        String expectedVideo1Duration = "00:00:30";
        String expectedVideo2Duration = "00:00:10";
        String expectedVideo3Duration = "00:00:16";
        String actualVideo1Duration = "";
        String actualVideo2Duration = "";
        String actualVideo3Duration = "";

        ArrayList<String> actualVideo1DurationList = new ArrayList<String>();
        ArrayList<String> actualVideo2DurationList = new ArrayList<String>();
        ArrayList<String> actualVideo3DurationList = new ArrayList<String>();

        try (Reader reader = Files.newBufferedReader(Paths.get(CSV_FILE_PATH));
             CSVParser csvParser = new CSVParser(reader,
                     CSVFormat.DEFAULT.withFirstRecordAsHeader().withIgnoreHeaderCase().withTrim())) {
            for (CSVRecord csvRecord : csvParser) {
                // Accessing values by header names
                actualVideo1Duration = csvRecord.get("Video 1 Duration");
                actualVideo1DurationList.add(actualVideo1Duration);
                actualVideo2Duration = csvRecord.get("Video 2 Duration");
                actualVideo2DurationList.add(actualVideo2Duration);
                actualVideo3Duration = csvRecord.get("Video 3 Duration");
                actualVideo3DurationList.add(actualVideo3Duration);
            }
        }

        for (int i = 0; i < actualVideo2DurationList.size(); i++) {
            if (actualVideo2DurationList.get(i) != expectedVideo2Duration) {
                System.out.println("Duration of Video 1 at index number " + Integer.toString(i)
                        + " didn't match the expected duration");
            }
        }
    }
}
The data inside my CSV file look like the following:
video 1 duration, video 2 duration, video 3 duration
00:00:30, 00:00:10, 00:00:16
00:00:30, 00:00:15, 00:00:15
00:00:25, 00:00:10, 00:00:16
Don't use == or != for string comparison. == checks the referential equality of two Strings, not the equality of their values. Use the .equals() method instead.
Change your if condition to: if (!actualVideo2DurationList.get(i).equals(expectedVideo2Duration))
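For reference, a corrected version of the loop might look like this (note the message now says "Video 2", since the loop iterates over the video 2 list):

for (int i = 0; i < actualVideo2DurationList.size(); i++) {
    if (!actualVideo2DurationList.get(i).equals(expectedVideo2Duration)) {
        System.out.println("Duration of Video 2 at index number " + i
                + " didn't match the expected duration");
    }
}

With the sample data above, only index 1 (00:00:15) is reported.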

Java HashMap / ArrayList: count distinct values

I am pretty new to programming and I have an assignment to make, but I got stuck.
I have to implement a program which reads a CSV file (1 million+ lines) and counts how many clients ordered "x" distinct products on a specific day.
The CSV looks like this:
Product Name | Product ID | Client ID | Date
Name         | 544        | 86        | 10/12/2017
Name         | 545        | 86        | 10/12/2017
Name         | 644        | 87        | 10/12/2017
Name         | 644        | 87        | 10/12/2017
Name         | 9857       | 801       | 10/12/2017
Name         | 3022       | 801       | 10/12/2017
Name         | 3021       | 801       | 10/12/2017
The result from my code is:
801: 2 - incorrect
86: 2 - correct
87: 2 - incorrect
Desired output is:
Client 1 (801): 3 distinct products
Client 2 (86): 2 distinct products
Client 3 (87): 1 distinct product
Additionally,
If I want to know how many clients ordered 2 distinct products I would like a result to look like this:
Total: 1 client ordered 2 distinct products
If I want to know the maximum number of distinct products ordered in a day, I would like the result to look like this:
The maximum number of distinct products ordered is: 3
I tried to use a HashMap and Google Guava's Multimap (my best guess here), but I couldn't wrap my head around it.
My code looks like this:
package Test;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.HashMultimap;

public class Test {

    public static void main(String[] args) {
        //HashMultimap<String, String> myMultimap = HashMultimap.create();
        Map<String, MutableInteger> map = new HashMap<String, MutableInteger>();
        ArrayList<String> linesList = new ArrayList<>();

        // Input of file which needs to be parsed
        String csvFile = "file.csv";
        BufferedReader csvReader;

        // Data split by 'TAB' in CSV file
        String csvSplitBy = "\t";

        try {
            // Read the CSV file into an ArrayList for easy processing.
            String line;
            csvReader = new BufferedReader(new FileReader(csvFile));
            while ((line = csvReader.readLine()) != null) {
                linesList.add(line);
            }
            csvReader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Process each CSV file line which is now contained within the linesList list
        for (int i = 0; i < linesList.size(); i++) {
            String[] data = linesList.get(i).split(csvSplitBy);
            String col2 = data[1];
            String col3 = data[2];
            String col4 = data[3];

            // Determine if column 4 has the desired date and count the values
            if (col4.contains("10/12/2017")) {
                String key = col3;
                if (map.containsKey(key)) {
                    MutableInteger count = map.get(key);
                    count.set(count.get() + 1);
                } else {
                    map.put(key, new MutableInteger(1));
                }
            }
        }

        for (final String k : map.keySet()) {
            if (map.get(k).get() == 2) {
                System.out.println(k + ": " + map.get(k).get());
            }
        }
    }
}
Any advice or suggestions on how this can be implemented would be greatly appreciated.
Thank you in advance, guys.
You could store a Set of productIds per clientId and just take the size of that.
As a Set does not allow duplicate values, this effectively gives you the number of distinct productIds.
Also, I recommend that you give your variables meaningful names instead of col2, k, map... This will make your code more readable.
Map<String, Set<String>> distinctProductsPerClient = new HashMap<String, Set<String>>();

// Process each CSV file line which is now contained within the linesList list.
// Start from 1 to skip the header line.
for (int i = 1; i < linesList.size(); i++) {
    String line = linesList.get(i);
    String[] data = line.split(csvSplitBy);
    String productId = data[1];
    String clientId = data[2];
    String date = data[3];

    // Determine if column 4 has the desired date and collect the product IDs
    if (date.contains("10/12/2017")) {
        if (!distinctProductsPerClient.containsKey(clientId)) {
            distinctProductsPerClient.put(clientId, new HashSet<>());
        }
        distinctProductsPerClient.get(clientId).add(productId);
    }
}

for (final String clientId : distinctProductsPerClient.keySet()) {
    System.out.println(clientId + ": " + distinctProductsPerClient.get(clientId).size());
}
More advanced solution using the Stream API (requires Java 8)
If you introduce a class OrderData (that represents a single line in the CSV) like this:
private static class OrderData {

    private final String productName;
    private final String productId;
    private final String clientId;
    private final String date;

    public OrderData(String csvLine) {
        String[] data = csvLine.split("\t");
        this.productName = data[0];
        this.productId = data[1];
        this.clientId = data[2];
        this.date = data[3];
    }

    public String getProductName() {
        return productName;
    }

    public String getProductId() {
        return productId;
    }

    public String getClientId() {
        return clientId;
    }

    public String getDate() {
        return date;
    }
}
you can replace the for loop with this:
// Requires: import static java.util.stream.Collectors.*;
Map<String, Set<String>> distinctProductsPerClient2 = linesList.stream()
        .skip(1)  // skip the header line
        .map(OrderData::new)
        .collect(groupingBy(OrderData::getClientId,
                mapping(OrderData::getProductId, toSet())));
But I reckon this might be a little too complex if you're new to programming (although it could be a good exercise to try to understand what the above code does).
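As for the two follow-up questions in the post: once distinctProductsPerClient is filled, both numbers can be derived from the sizes of the sets. A minimal sketch in plain Java 8 (the variable names here are illustrative, not from the original code):

// How many clients ordered exactly 2 distinct products?
long clientsWithTwo = distinctProductsPerClient.values().stream()
        .filter(products -> products.size() == 2)
        .count();
System.out.println("Total: " + clientsWithTwo + " client(s) ordered 2 distinct products");

// The maximum number of distinct products ordered by any client that day
int maxDistinct = distinctProductsPerClient.values().stream()
        .mapToInt(Set::size)
        .max()
        .orElse(0);
System.out.println("The maximum number of distinct products ordered is: " + maxDistinct);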

Hadoop - aggregating by prefix

I have words with prefixes, e.g.:
city|new york
city|London
travel|yes
...
city|new york
I want to count how many of city|new york and city|London there are (which is classic word count). But the reducer output should be a key-value pair like city:{"new york":2, "london":1}. Meaning: for each city prefix, I want to aggregate all the strings and their counts.
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    // Instead of just the result count, I need something like {"city":{"new york":2, "london":1}}
    context.write(key, result);
}
Any ideas?
You can use the reducer's cleanup() method to achieve this (assuming you have just one reducer). It is called once at the end of the reduce task.
I will explain this for "city" data.
Following is the code:
package com.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class Cities {

    public static class CityMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private Text outKey = new Text();
        private IntWritable outValue = new IntWritable(1);

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            outKey.set(value);
            context.write(outKey, outValue);
        }
    }

    public static class CityReducer
            extends Reducer<Text, IntWritable, Text, Text> {

        HashMap<String, Integer> cityCount = new HashMap<String, Integer>();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            for (IntWritable val : values) {
                String keyStr = key.toString();
                if (keyStr.toLowerCase().startsWith("city|")) {
                    String[] tokens = keyStr.split("\\|");
                    if (cityCount.containsKey(tokens[1])) {
                        int count = cityCount.get(tokens[1]);
                        cityCount.put(tokens[1], count + val.get());
                    } else {
                        cityCount.put(tokens[1], val.get());
                    }
                }
            }
        }

        @Override
        public void cleanup(Context context) throws IOException, InterruptedException {
            // Compose the contents of the HashMap into the desired output
            String output = "{\"city\":{";
            for (Map.Entry<String, Integer> entry : cityCount.entrySet()) {
                output = output.concat("\"" + entry.getKey() + "\":" + entry.getValue() + ", ");
            }
            output = output.substring(0, output.length() - 2);
            output = output.concat("}}");
            context.write(new Text(output), new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "KeyValue");
        job.setJarByClass(Cities.class);
        job.setMapperClass(CityMapper.class);
        job.setReducerClass(CityReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/in/in.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/out/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Mapper:
It just outputs a count for each key it encounters. E.g. if it encounters the record "city|new york", it will output the (key, value) pair ("city|new york", 1).
Reducer:
For each record, it checks whether the key starts with "city|". It splits the key on the pipe ("|") and stores the count for each city in a HashMap.
The reducer also overrides the cleanup() method, which gets called once the reduce task is over. There, the contents of the HashMap are composed into the desired output.
In cleanup(), the composed HashMap contents are written as the key and an empty string as the value.
E.g. I took the following data as input:
city|new york
city|London
city|new york
city|new york
city|Paris
city|Paris
I got the following output:
{"city":{"London":1, "new york":3, "Paris":2}}
It's simple.
Emit from the mapper using "city" as the output key and the whole record as the output value.
You will get all "city" records partitioned into a single group in one reducer call, and "travel" as another group.
Count the city and travel instances using a hash map to drill down to the lower levels, as in the sketch below.
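A minimal mapper sketch of that idea, assuming input lines of the form prefix|word (the class and field names here are illustrative):

public static class PrefixMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private final Text prefix = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split "city|new york" into the prefix and the rest
        String[] tokens = value.toString().split("\\|", 2);
        if (tokens.length == 2) {
            prefix.set(tokens[0]);        // e.g. "city"
            context.write(prefix, value); // the whole record as the value
        }
    }
}

Each reduce() call then receives every record sharing one prefix, and a HashMap inside it can count the individual words, much like the CityReducer above.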

Retrieving the nth qualifier in HBase using Java

This question is quite out of the box, but I need it.
In a List (collection), we can retrieve the nth element with list.get(i);
Similarly, is there any method in HBase, using the Java API, where I can get the nth qualifier given the row ID and column family name?
NOTE: I have a million qualifiers in a single row in a single column family.
Sorry for being unresponsive. Busy with something important. Try this for now:
package org.myorg.hbasedemo;

import java.io.IOException;
import java.util.Scanner;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetNthColumn {

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "TEST");
        Get g = new Get(Bytes.toBytes("4"));
        Result r = table.get(g);

        System.out.println("Enter column index :");
        Scanner reader = new Scanner(System.in);
        int index = reader.nextInt();
        System.out.println("index : " + index);

        int count = 0;
        for (KeyValue kv : r.raw()) {
            // Skip until we reach the nth (1-based) column of the row
            if (++count != index)
                continue;
            System.out.println("Qualifier : " + Bytes.toString(kv.getQualifier()));
            System.out.println("Value : " + Bytes.toString(kv.getValue()));
        }
        table.close();
        System.out.println("Done.");
    }
}
Will let you know if I get a better way to do this.
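If scanning a million KeyValues client-side proves too slow, a possibly better way is HBase's ColumnPaginationFilter, which limits how many columns are returned and at which offset they start, so the nth qualifier can be selected server-side. A sketch under that assumption, reusing the table and row key from above:

import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;

// Fetch exactly one column at offset (index - 1), i.e. the nth qualifier
Get g = new Get(Bytes.toBytes("4"));
g.setFilter(new ColumnPaginationFilter(1, index - 1));
Result r = table.get(g);
for (KeyValue kv : r.raw()) {
    System.out.println("Qualifier : " + Bytes.toString(kv.getQualifier()));
    System.out.println("Value : " + Bytes.toString(kv.getValue()));
}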

How to sort a list of the most frequently repeated words with Hadoop MapReduce WordCount? [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
Hi, I am a newbie to Hadoop MapReduce.
Could any of you help me modify the code posted below to display the desired output?
I have a given input file:
Input: Hi my name is John.Im doing my engineering.My parents stay at California
I get the output as
Hi 1
my 3
name 1
is 1
John 1
doing 1
engineering 1
parents 1
stay 1
at 1
California 1
But I want the output to be sorted as
my 3
Hi 1
etc.....
then all the others to be displayed. The concept is that the words repeated the maximum number of times should be sorted and displayed first.
I'm running this job on a single node. And I'm running this job as
$ hadoop jar job.jar input output
And I've started:
$ hadoop namenode -format
$ hadoop namenode
$ hadoop datanode
sbin$ ./yarn-daemon.sh start resourcemanager
sbin$ ./yarn-daemon.sh start nodemanager
I'm running hadoop-2.0.0-cdh4.0.0
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    private static final Log LOG = LogFactory.getLog(WordCount.class);

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            //printKeyAndValues(key, values);
            for (IntWritable val : values) {
                sum += val.get();
                LOG.info("val = " + val.get());
            }
            LOG.info("sum = " + sum + " key = " + key);
            result.set(sum);
            context.write(key, result);
            //System.err.println(String.format("[reduce] word: (%s), count: (%d)", key, result.get()));
        }

        // a little method to print debug output
        private void printKeyAndValues(Text key, Iterable<IntWritable> values) {
            StringBuilder sb = new StringBuilder();
            for (IntWritable val : values) {
                sb.append(val.get() + ", ");
            }
            System.err.println(String.format("[reduce] key: (%s), value: (%s)", key, sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
It would be great if anyone could sort this thing out.
How about decreasing the count each time you find a word? Starting from 0, you will end up with negative counts, so the highest count comes first when the keys are sorted, as in the sketch below.
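One way to make that concrete is a second job over the word-count output: emit the negated count as the key so the shuffle's ascending sort on IntWritable puts the most frequent words first. A minimal mapper sketch, assuming the first job's word<TAB>count output as input (the class name is illustrative):

public static class SortMapper
        extends Mapper<LongWritable, Text, IntWritable, Text> {

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input line: "word<TAB>count" produced by the word-count job
        String[] parts = value.toString().split("\t");
        if (parts.length == 2) {
            int count = Integer.parseInt(parts[1]);
            // Negate the count so the ascending key sort yields the highest counts first
            context.write(new IntWritable(-count), new Text(parts[0]));
        }
    }
}

A reducer (even a pass-through one) can then write Math.abs(key.get()) back out next to each word.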
