Get Specific data from MapReduce - java

I have the following file as input, which consists of 10,000 lines like the following:
250788965731,20090906,200937,200909,621,SUNDAY,WEEKEND,ON-NET,MORNING,OUTGOING,VOICE,25078,PAY_AS_YOU_GO_PER_SECOND_PSB,SUCCESSFUL-RELEASEDBYSERVICE,5,0,1,6.25,635-10-104-40163.
I need to print the first column if the 18th column is less than 10 and the 9th column is MORNING. I wrote the following code, but I'm not getting any output - the output file is empty.
public static class MyMap extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] day = line.split(",");
        double day1 = Double.parseDouble(day[17]);
        if (day[8] == "MORNING" && day1 < 10.0) {
            context.write(new Text(day[0]), new DoubleWritable(day1));
        }
    }
}
public static class MyReduce extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text key, Iterator<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        String no = values.toString();
        double no1 = Double.parseDouble(no);
        if (no1 > 10.0) {
            context.write(key, new DoubleWritable(no1));
        }
    }
}
Please tell me what I did wrong. Is the flow correct?

I can see a few problems.
First, in your Mapper, you should use .equals() instead of == when comparing Strings. Otherwise you are only comparing references, and the comparison will fail even if the String objects' content is the same. It might happen to succeed because of Java String interning, but I would avoid relying on that if comparing content is what you actually intend.
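For example, the condition in your map method could be rewritten like this (a minimal sketch of just the comparison fix):
    if ("MORNING".equals(day[8]) && day1 < 10.0) {
        context.write(new Text(day[0]), new DoubleWritable(day1));
    }
Putting the literal first also guards against a NullPointerException if the field happens to be missing.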
In your Reducer, I am not sure what you want to achieve, but there are a few things that are clearly wrong. The values parameter is an Iterable<DoubleWritable>, so you should iterate over it and apply whatever condition you need to each individual value. Here is how I would rewrite your Reducer:
public static class MyReduce extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        for (DoubleWritable val : values) {
            if (val.get() > 10.0) {
                context.write(key, val);
            }
        }
    }
}
But the overall logic doesn't make much sense. If all you want to do is print the first column when the 18th column is less than 10 and the 9th column is MORNING, then you could use a NullWritable as the output key of your mapper and write column 1 (day[0]) as your output value. You probably don't even need a Reducer in this case, which you can tell Hadoop with job.setNumReduceTasks(0);.
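A minimal sketch of that mapper-only approach (the class name is just a placeholder, the column indices come from your description, and the usual org.apache.hadoop.io / mapreduce imports are assumed):
    public static class FilterMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            // 9th column (index 8) must be MORNING and 18th column (index 17) must be below 10
            if ("MORNING".equals(fields[8]) && Double.parseDouble(fields[17]) < 10.0) {
                context.write(NullWritable.get(), new Text(fields[0]));
            }
        }
    }
In the driver you would then call job.setNumReduceTasks(0); so the mapper output is written directly.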
One thing that got me thinking: if your input is only 10k lines, do you really need a Hadoop job for this? A simple shell script (for example with awk) would probably be enough for such a small dataset.
Hope that helps!

I believe this is a mapper-only job, as your data already has the values you want to check.
Your mapper emits values with day1 < 10.0, while your reducer only emits values with day1 > 10.0, hence none of the values are output by your reducer.
So I think your reducer should look like this:
for (DoubleWritable val : values) {
    if (val.get() < 10.0) {
        context.write(key, val);
    }
}
I think that should get your desired output.

Related

Why is List<String> always empty when using MapReduce and HDFS?

I have a program that uses a Mapper, Combiner and Reducer to extract some fields of the IMDB repository, and it works fine when I run it on my machine.
When I run the code inside Docker using Hadoop HDFS, it doesn't get some of the values I need. To be precise, the Combiner, which puts some values into a List that is a public class variable, doesn't seem to work: when I try to use that List in the Reducer, it is always empty. When I was running on my machine (without Docker and Hadoop HDFS) the values were put into the List, but when running on Docker it is always empty. I have also printed the size of the List in main and it returns 0. Any suggestions?
public class FromParquetToParquetFile {
    public static List<String> top10 = new ArrayList<>();
    // ...
}
The Combiner looks like:
public static class FromParquetQueriesCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        long total = 0;
        long maior = -1;
        String tconst = "";
        String title = "";
        for (Text value : values) {
            total++; // number of movies
            String[] fields = value.toString().split("\t");
            top10.add(key.toString() + "\t" + fields[2] + "\t" + fields[3] + "\t" + fields[0] + "\t" + fields[1]);
            int x = Integer.parseInt(fields[3]);
            if (x >= maior) {
                tconst = fields[0];
                title = fields[1];
                maior = x;
            }
        }
        StringBuilder result = new StringBuilder();
        result.append(total);
        result.append("\t");
        result.append(tconst);
        result.append("\t");
        result.append(title);
        result.append("\t");
        context.write(key, new Text(result.toString()));
    }
}
And the Reducer looks like this (it has a setup method that sorts the List):
public static class FromParquetQueriesReducer extends Reducer<Text, Text, Void, GenericRecord> {
    private Schema schema;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Collections.sort(top10, new Comparator<String>() {
            @Override
            public int compare(String o1, String o2) {
                String[] aux = o1.split("\t");
                String[] aux2 = o2.split("\t");
                ...
                return -result;
            }
        });
        schema = getSchema("hdfs:///schema.alinea2");
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        ...
        for (String s : top10)
            ...
    }
}
As explained here, "public" variables (in the Java sense) don't really "translate" into a parallel computing model meant to run on a distributed system. This is why your application worked locally but "broke" when you ran it against HDFS.
Mapper and Reducer instances are isolated and more or less "independent" of whatever is put "around" the functions that describe them. That means they don't have access to variables declared on the enclosing class (i.e. FromParquetToParquetFile here) or in the driver/main function of the program. So if you want to preserve the way your job currently works, you need a somewhat risky workaround to make a list "globally" accessible within these constraints.
The workaround is to store user-named values in the job's Configuration object. This means using the Configuration object you probably already create in your driver to set top10 as such a value. Since the elements of your List are relatively short Strings (just a few sentences each), all you have to do is use some delimiter to pack all of the elements into one String (since String is the datatype used for these Configuration values), like element1#element2#element3#... (but be careful with this: you must always be sure there is enough memory for that String to exist in the first place, which is why this is merely a workaround).
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("top10", " "); // initialize `top10` as an empty String
    // the description of the job(s), etc. ...
}
In order to read top10, you first need to retrieve it inside the setup method, which you need in both your combiner and your reducer, like this (the snippet below shows how it would look for the reducer):
public static class FromParquetQueriesReducer extends Reducer<Text, Text, Void, GenericRecord> {
    private Schema schema;
    private String top10;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        top10 = context.getConfiguration().get("top10");
        // everything else inside the setup function...
    }
    // ...
}
With those two adjustments, you can use top10 inside your reduce function just fine, after splitting the elements out of the top10 String like this:
String[] data = top10.split("#"); // split the elements from the String
List<String> top10List = new ArrayList<>(); // create ArrayList
Collections.addAll(top10List, data); // put all the elements to the list
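Going the other way, if you need to pack a list back into a single delimited String for the Configuration, you could join it with the same delimiter. A small sketch of that, keeping in mind that Configuration values have to be set in the driver before the job is submitted (tasks cannot write values back to the driver's Configuration):
    String joined = String.join("#", top10List); // pack the elements with the '#' delimiter
    conf.set("top10", joined);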
With all that said, this type of functionality goes well beyond what vanilla Hadoop MapReduce is designed for. If this is anything more than a CS class assignment, you should reevaluate using Hadoop's MapReduce engine here and look at "extensions" like Apache Hive or Apache Spark, which are far more flexible and SQL-like and are a better match for this kind of application.

Hadoop - Problem with Text to float conversion

I have a CSV file containing key-value pairs; it can have multiple records for the same key. I am writing a MapReduce program to aggregate this data: for each key, it should give the frequency of the key and the sum of its values.
My mapper reads the CSV file and emits both key and value as Text, even though they are numeric (I do it this way because I ran into problems using FloatWritable for the value).
In the reducer, when I try to convert the Text value to a float, I get a NumberFormatException, and the value shown in the error is not even in my input.
Here's my code:
public static class AggReducer extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();

    public void reduce(Text key, Iterable<FloatWritable> values, Context context)
            throws IOException, InterruptedException {
        int numTrips = 0;
        int totalFare = 0;
        for (Text val : values) {
            totalFare += Float.parseFloat(val.toString());
            numTrips++;
        }
        String resultStr = String.format("%1s,%2s", numTrips, totalFare);
        result.set(resultStr);
        context.write(key, result);
    }
}
Note: I made the reducer emit the mapper's output without any changes, and that gave the expected output.
"running into NumberFormatException and the value shown in the error is not even in my input"
Well, that's practically impossible. The value has to exist somewhere in your input or in the generated mapper output. A try/catch works just as well in a reducer as anywhere else, though.
FWIW, use DoubleWritable
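A minimal sketch of the try/catch idea, assuming the values parameter is declared as Iterable<Text> to match the Reducer<Text, Text, Text, Text> generics (the error message text is just illustrative):
    for (Text val : values) {
        try {
            totalFare += Float.parseFloat(val.toString());
            numTrips++;
        } catch (NumberFormatException e) {
            // log the offending value and key so you can trace where it actually comes from
            System.err.println("Unparseable fare for key " + key + ": '" + val + "'");
        }
    }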

"Shortcut" to determine maximum element in Iterator<IntWritable> in reduce() method

I have written the below reduce() method to determine the maximum recorded temperature for a given year. (map()'s output gives a list of temperatures recorded in a year.)
public void reduce(IntWritable year, Iterator<IntWritable> temps,
        OutputCollector<IntWritable, IntWritable> output, Reporter reporter)
        throws IOException {
    int maxValue = Integer.MIN_VALUE;
    while (temps.hasNext()) {
        int next = temps.next().get();
        if (next > maxValue) {
            maxValue = next;
        }
    }
    output.collect(year, new IntWritable(maxValue));
}
I am curious to know if there's a "shortcut", such as a pre-defined method, to eliminate the while loop and obtain the maximum value directly. I am looking for something similar to C++'s std::max(). I found this (Convert std::max to Java) by searching here, but I couldn't figure out how to convert my Iterator<IntWritable> into something I could use with Collections.
I am a beginner in Java, but proficient in C++, so I am trying to learn various techniques used in Java as well.
Unfortunately, there is no built-in way to convert an Iterator into a Collection unless you use another library such as Guava or Apache Commons Collections. With the latter, for example, you can convert it to a List and then call Collections.max:
List<IntWritable> list = IteratorUtils.toList(temps);
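For completeness, the max call would then look something like this (a small sketch; IntWritable implements Comparable, so Collections.max can use its natural ordering):
    IntWritable max = Collections.max(list);
    output.collect(year, max);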
If you don't want to use external libraries, you can't avoid the loop entirely (an Iterator can't be used in a for-each loop), but you can shorten it a little with Math.max, although the result is not that different:
while (temps.hasNext()) {
    maxValue = Math.max(maxValue, temps.next().get());
}

Sending mapper output to different reducer

I am new to Hadoop and am working on Java mapper/reducer code. I have run into a problem: I need to pass the output of a mapper class to two different reducer classes. Is that possible? Also, can we send two different outputs from the same mapper class? Can anyone tell me?
I've been trying to do the same. Based on what I found, we cannot send mapper output to two reducers. However, you can perform the work you wanted to split across two reducers in a single reducer by differentiating the tasks inside it; the reducer can select the task based on some key criteria. I must warn you that I'm new to Hadoop, so this may not be the best answer.
The mapper will generate keys of the form <your original key>-TASK_XXXX. The reducer will then invoke a different method to process each TASK_XXXX.
I think it is better to have the task name at the end of the key to keep the partitioning effective.
As for your second question, I believe you can send multiple outputs from the same mapper class to the reducer. This post may be of interest to you: Can Hadoop mapper produce multiple keys in output?
The map method would look like this:
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context) throws IOException, InterruptedException {
    // do stuff 1
    Text outKey1 = new Text(<Your_Original_Key> + "-TASK1");
    context.write(outKey1, task1OutValues);
    // do stuff 2
    Text outKey2 = new Text(<Your_Original_Key> + "-TASK2");
    context.write(outKey2, task2OutValues);
}
and the reduce method would look like this:
@Override
protected void reduce(Text inKey, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context) throws IOException, InterruptedException {
    String key = inKey.toString();
    if (key.matches(".*-TASK1$")) {
        processTask1(values);
    } else if (key.matches(".*-TASK2$")) {
        processTask2(values);
    }
}

Hadoop sorting issue (Alternate title: 1175 is not less than 119!)

I'm new to Hadoop and have finished a typical "count the IP addresses in a log" exercise. Now I'm trying to sort the output by running a second MapReduce job immediately after the first. Almost everything works, except that the output collector isn't sorting the way I'd like. Here's a snippet of my output:
-101 71.59.196.132
-115 59.103.11.163
-1175 59.93.51.231
-119 127.0.0.1
-1193 115.186.128.19
-1242 59.93.64.161
-146 192.35.79.70
I can't figure out why, for example, 1175 is considered a lower value than 119. I've tried playing around with Comparators, but it hasn't had any positive effect.
The Map and Reduce jobs for the data collection are both standard and non-problematic. They output a list much like the snippet above, but completely unsorted. The SortMap, SortReduce, and Runner classes are a little different. Here's my Runner class:
public class Runner {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(Runner.class);
        JobConf sortStage = new JobConf(Runner.class);

        conf.setJobName("ip-count");
        conf.setMapperClass(IpMapper.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setReducerClass(IpReducer.class);
        conf.setOutputValueGroupingComparator(IntWritable.Comparator.class);
        // Input and output from command line...
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        sortStage.setJobName("sort-stage");
        sortStage.setMapperClass(SortMapper.class);
        sortStage.setMapOutputKeyClass(Text.class);
        sortStage.setMapOutputValueClass(IntWritable.class);
        sortStage.setReducerClass(SortReducer.class);
        sortStage.setOutputKeyClass(IntWritable.class);
        sortStage.setOutputValueClass(IntWritable.class);
        // Input and output from command line...
        FileInputFormat.setInputPaths(sortStage, new Path(args[2]));
        FileOutputFormat.setOutputPath(sortStage, new Path(args[3]));

        JobClient.runJob(conf);
        JobClient.runJob(sortStage);
    }
}
The "SortMapper":
public class SortMapper extends MapReduceBase
implements Mapper<LongWritable, Text, Text, IntWritable>
{
private static final IntWritable one = new IntWritable(1);
public void map(LongWritable fileOffset, Text lineContents,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
{
//Grab the whole string, formatted as (Count /t IP), e.g., 101 128.10.3.40
String ip = lineContents.toString();
//Output it with a count of 1
output.collect(new Text(ip), one);
}
}
}
The "SortReducer":
public class SortReducer extends MapReduceBase implements Reducer<Text, IntWritable,
IntWritable, Text>
{
public void reduce(Text ip, Iterator<IntWritable> counts,
OutputCollector<IntWritable, Text> output, Reporter reporter)
throws IOException{
String delimiter = "[\t]";
String[] splitString = ip.toString().split(delimiter);
//Count represented as 0-count to easily sort in descending order vs. ascending
int sortCount = 0-Integer.parseInt(splitString[0]);
output.collect(new IntWritable(sortCount), new Text(splitString[1]));
}
}
This is just a single-node job, so I don't think partitioning is a factor. Sorry if this is a trivial matter - I've spent an embarrassing amount of time on the problem and couldn't find anything that dealt with this particular sorting issue. Any advice would be greatly appreciated!
Your numbers are being compared "alphabetically" (lexicographically). That is because they are strings. With alphabetical sorting, aabc comes before aac; apply the same rule to digits and 1123 comes before 113, which is why -1175 sorts ahead of -119.
If you want numeric comparison, you are going to have to convert the keys to integers.
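One way to do that (a sketch, not the only option) is to have the sort-stage mapper parse the count and emit it as an IntWritable key, so the shuffle sorts numerically. You would also need to change sortStage.setMapOutputKeyClass(...) and setMapOutputValueClass(...) to match:
    public class SortMapper extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, Text> {
        public void map(LongWritable fileOffset, Text lineContents,
                OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
            // Input lines from the first job look like "101\t128.10.3.40" (count, tab, IP), as in your comment
            String[] parts = lineContents.toString().split("\t");
            int count = Integer.parseInt(parts[0].trim());
            // Negate the count so larger counts come first when Hadoop sorts keys in ascending order
            output.collect(new IntWritable(-count), new Text(parts[1]));
        }
    }
The reducer then only has to echo the key/value pairs it receives, since the sorting is done by the framework between map and reduce.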
