Hadoop - Problem with Text to float conversion - java

I have a CSV file containing key-value pairs; it can have multiple records for the same key. I am writing a MapReduce program to aggregate this data: for each key, it is supposed to give the frequency of the key and the sum of its values.
My mapper reads the CSV file and emits both key and value as the Text type even though they are numeric (I'm doing it this way because I ran into problems using FloatWritable for the value).
In the reducer, when I try to convert the Text value to a float, I am running into a NumberFormatException, and the value shown in the error is not even in my input.
Here's my code:
public static class AggReducer
        extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();

    public void reduce(Text key, Iterable<Text> values,
                       Context context
                       ) throws IOException, InterruptedException {
        int numTrips = 0;
        int totalFare = 0;
        for (Text val : values) {
            totalFare += Float.parseFloat(val.toString());
            numTrips++;
        }
        String resultStr = String.format("%1s,%2s", numTrips, totalFare);
        result.set(resultStr);
        context.write(key, result);
    }
}
Note: when I made the reducer simply pass the mapper's output through unchanged, it produced the expected output.

"running into NumberFormatException and the value shown in the error is not even in my input"
Well, that's quite impossible: the value has to come from either your input or the output your mapper generated. A try/catch works just as well in a reducer as anywhere else, though, so you can catch the exception and log the offending value to see where it actually comes from.
FWIW, use DoubleWritable.
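For illustration, a rough sketch of how that reducer might look with a try/catch around the parse and double arithmetic; the field handling is an assumption based on the code in the question:

public static class AggReducer extends Reducer<Text, Text, Text, Text> {
    private final Text result = new Text();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int numTrips = 0;
        double totalFare = 0.0;
        for (Text val : values) {
            try {
                totalFare += Double.parseDouble(val.toString().trim());
                numTrips++;
            } catch (NumberFormatException e) {
                // don't fail the task; log the offending record so you can see
                // what the unexpected value actually is and where it comes from
                System.err.println("Bad value for key " + key + ": '" + val + "'");
            }
        }
        result.set(numTrips + "," + totalFare);
        context.write(key, result);
    }
}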


Why List<String> is always empty when using MapReduce and HDFS?

So I have a program that uses a Mapper, a Combiner and a Reducer to extract some fields from the IMDB dataset, and this program works fine when I run it on my machine.
When I run the code inside Docker using Hadoop HDFS, it doesn't produce some of the values I need. To be precise, the Combiner puts some values into a List that is a public class variable, but that doesn't seem to work: when I try to use that List in the Reducer, it is always empty. When running on my machine (without Docker and Hadoop HDFS) it would put the values into the List, but when running on Docker the List is always empty. I have also printed the size of the List in main and it returns 0. Any suggestions?
public class FromParquetToParquetFile {
    public static List<String> top10 = new ArrayList<>();
    ....
}
The Combiner looks like:
public static class FromParquetQueriesCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        long total = 0;
        long maior = -1;
        String tconst = "";
        String title = "";
        for (Text value : values) {
            total++; // number of films
            String[] fields = value.toString().split("\t");
            top10.add(key.toString() + "\t" + fields[2] + "\t" + fields[3] + "\t" + fields[0] + "\t" + fields[1]);
            int x = Integer.parseInt(fields[3]);
            if (x >= maior) {
                tconst = fields[0];
                title = fields[1];
                maior = x;
            }
        }
        StringBuilder result = new StringBuilder();
        result.append(total);
        result.append("\t");
        result.append(tconst);
        result.append("\t");
        result.append(title);
        result.append("\t");
        context.write(key, new Text(result.toString()));
    }
}
And the Reducer looks like this (it has a setup method that sorts the List):
public static class FromParquetQueriesReducer extends Reducer<Text, Text, Void, GenericRecord> {
    private Schema schema;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Collections.sort(top10, new Comparator<String>() {
            @Override
            public int compare(String o1, String o2) {
                String[] aux = o1.split("\t");
                String[] aux2 = o2.split("\t");
                ...
                return -result;
            }
        });
        schema = getSchema("hdfs:///schema.alinea2");
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        ...
        for (String s : top10)
            ...
    }
}
As explained here, "public" variables (in the Java sense) don't really "translate" to a parallel computing model that runs on a distributed system, which is why your application worked locally but "broke" when you ran it against HDFS.
Mapper and Reducer instances are isolated and more or less independent of whatever is placed around the functions that describe them. That means they have no access to variables declared on the enclosing class (i.e. FromParquetToParquetFile here) or in the driver/main function of the program. So if you want to keep your job's current structure, you need a somewhat risky workaround (really a hack) to make a list publicly accessible and "static" within these constraints.
The workaround is to store user-named values in the job's Configuration object. This means you have to use the Configuration object you probably created in your driver to set top10 as such a value. Since the elements of your List are relatively short Strings (just a few fields each), all you have to do is join them with a delimiter into a single String (which is the datatype these Configuration values use), e.g. element1#element2#element3#.... Be careful, though: you must always make sure that this single String comfortably fits in memory, which is why this is merely a workaround.
public static void main(String[] args) throws Exception
{
    Configuration conf = new Configuration();
    conf.set("top10", " "); // initialize `top10` as an empty String

    // the description of the job(s), etc, ...
}
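As a sketch of the packing step (the element values here are made up for illustration), the driver would join whatever elements it knows about into that one delimited String before submitting the job:

// hypothetical elements, joined with '#' as the delimiter described above
List<String> elements = Arrays.asList("element1", "element2", "element3");
conf.set("top10", String.join("#", elements));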
In order to read top10, you first need to retrieve it in the setup method that you need in both your combiner and your reducer, like this (the snippet below shows how it would look for the reducer):
public static class FromParquetQueriesReducer extends Reducer<Text, Text, Void, GenericRecord>
{
    private Schema schema;
    private String top10;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException
    {
        top10 = context.getConfiguration().get("top10");
        // everything else inside the setup function...
    }

    // ...
}
With those two adjustments, you can use top10 inside your reduce function just fine, after splitting the elements back out of the top10 String like this:
String[] data = top10.split("#"); // split the elements from the String
List<String> top10List = new ArrayList<>(); // create ArrayList
Collections.addAll(top10List, data); // put all the elements to the list
With all that being said, I must say that this kind of functionality goes well beyond what vanilla Hadoop MapReduce is built for. If this is anything more than a CS class assignment, you should reconsider using Hadoop's MapReduce engine here and look at "extensions" like Apache Hive or Apache Spark, which are far more flexible and SQL-like and are a much better match for this aspect of your application.

"Shortcut" to determine maximum element in Iterator<IntWritable> in reduce() method

I have written the below reduce() method to determine the maximum recorded temperature for a given year. (map()'s output gives a list of temperatures recorded in a year.)
public void reduce(IntWritable year,
                   Iterator<IntWritable> temps,
                   OutputCollector<IntWritable, IntWritable> output,
                   Reporter reporter) throws IOException {
    int maxValue = Integer.MIN_VALUE;
    while (temps.hasNext()) {
        int next = temps.next().get();
        if (next > maxValue) {
            maxValue = next;
        }
    }
    output.collect(year, new IntWritable(maxValue));
}
I am curious to know if there's a "shortcut", such as a pre-defined method, to eliminate the while loop and obtain the maximum value directly. I am looking for something similar to C++'s std::max(). I found this (Convert std::max to Java) by searching here, but I couldn't figure out how to convert my Iterator<IntWritable> to a Collection.
I am a beginner in Java, but proficient in C++, so I am trying to learn various techniques used in Java as well.
Unfortunately, there is no way to convert an Iterator into a Collection unless you use another library such as Guava or Apache Commons Collections. With the latter, for example, you can convert it to a List and then call Collections.max:
List<IntWritable> list = IteratorUtils.toList(temps);
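Since IntWritable has a natural ordering (it implements Comparable via WritableComparable), you could then get the maximum in a single call, for example:

int maxValue = Collections.max(list).get(); // largest temperature in the list
output.collect(year, new IntWritable(maxValue));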
If you don't want to use external libraries, then there is no other option. You could shorten your code a little with a for-each loop, although the result is not that different:
// note: a for-each needs an Iterable, which is what the newer
// org.apache.hadoop.mapreduce API passes to reduce()
for (IntWritable intWritable : temps) {
    if (intWritable.get() > maxValue) {
        maxValue = intWritable.get();
    }
}

Sending mapper output to different reducer

I am new to Hadoop and am working with Java mapper/reducer code. While working on this, I ran into a problem: I have to pass the output of a mapper class to two different reducer classes. Is that possible? Also, can we send two different outputs from the same mapper class? Can anyone tell me?
I've been trying to do the same. Based on what I found, we cannot have mapper output sent to two reducers. But you could perform the work you wanted split across two reducers in a single reducer by differentiating the tasks inside it; the reducer can select the task based on some key criterion. I must warn you I'm new to Hadoop, so this may not be the best answer.
The mapper will generate keys of the form <your original key>-TASK_XXXX. The reducer will then invoke different methods to process each TASK_XXXX.
I think it is better to have the TASK_NAME at the end to ensure effective partitioning.
As for your second question, I believe you can send multiple outputs from the same mapper class to the reducer. This post may be of interest to you: Can Hadoop mapper produce multiple keys in output?
The map method would look like
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context) throws IOException, InterruptedException {
    // do stuff 1
    Text outKey1 = new Text(<Your_Original_Key> + "-TASK1");
    context.write(outKey1, task1OutValues);
    // do stuff 2
    Text outKey2 = new Text(<Your_Original_Key> + "-TASK2");
    context.write(outKey2, task2OutValues);
}
and reduce method
@Override
protected void reduce(Text inKey, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context) throws IOException, InterruptedException {
    String key = inKey.toString();
    if (key.matches(".*-TASK1$")) {
        processTask1(values);
    } else if (key.matches(".*-TASK2$")) {
        processTask2(values);
    }
}

Advantages of using NullWritable in Hadoop

What are the advantages of using NullWritable for null keys/values over using null texts (i.e. new Text(null))? I see the following in the «Hadoop: The Definitive Guide» book:
NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes
are written to, or read from, the stream. It is used as a placeholder; for example, in
MapReduce, a key or a value can be declared as a NullWritable when you don’t need
to use that position—it effectively stores a constant empty value. NullWritable can also
be useful as a key in SequenceFile when you want to store a list of values, as opposed
to key-value pairs. It is an immutable singleton: the instance can be retrieved by calling
NullWritable.get()
I do not clearly understand how the output is written out using NullWritable. Will there be a single constant value at the beginning of the output file indicating that the keys or values of this file are null, so that the MapReduce framework can skip reading the null keys/values (whichever is null)? Also, how are null Texts actually serialized?
The key/value types must be given at runtime, so anything writing or reading NullWritables will know ahead of time that it will be dealing with that type; there is no marker or anything in the file. And technically the NullWritables are "read", it's just that "reading" a NullWritable is actually a no-op. You can see for yourself that there's nothing at all written or read:
NullWritable nw = NullWritable.get();
ByteArrayOutputStream out = new ByteArrayOutputStream();
nw.write(new DataOutputStream(out));
System.out.println(Arrays.toString(out.toByteArray())); // prints "[]"
ByteArrayInputStream in = new ByteArrayInputStream(new byte[0]);
nw.readFields(new DataInputStream(in)); // works just fine
And as for your question about new Text(null), again, you can try it out:
Text text = new Text((String)null);
ByteArrayOutputStream out = new ByteArrayOutputStream();
text.write(new DataOutputStream(out)); // throws NullPointerException
System.out.println(Arrays.toString(out.toByteArray()));
Text will not work at all with a null String.
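To illustrate the first part of the question, here is a minimal sketch of how NullWritable is typically used in the value position (the job and key names are just placeholders): only the keys end up in the output file, because serializing the NullWritable writes zero bytes.

// in the driver
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);

// in the mapper or reducer
context.write(new Text("some-key"), NullWritable.get()); // nothing is written for the value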
I changed the run method, and it worked:
@Override
public int run(String[] strings) throws Exception {
    Configuration config = HBaseConfiguration.create();
    // set job name
    Job job = new Job(config, "Import from file ");
    job.setJarByClass(LogRun.class);
    // set map class
    job.setMapperClass(LogMapper.class);
    // set output format and output table name
    //job.setOutputFormatClass(TableOutputFormat.class);
    //job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "crm_data");
    //job.setOutputKeyClass(ImmutableBytesWritable.class);
    //job.setOutputValueClass(Put.class);
    TableMapReduceUtil.initTableReducerJob("crm_data", null, job);
    job.setNumReduceTasks(0);
    TableMapReduceUtil.addDependencyJars(job);
    FileInputFormat.addInputPath(job, new Path(strings[0]));
    int ret = job.waitForCompletion(true) ? 0 : 1;
    return ret;
}
You can always wrap your string in your own Writable class and have a boolean indicating whether it holds a blank string or not:
@Override
public void readFields(DataInput in) throws IOException {
    ...
    boolean hasWord = in.readBoolean();
    if (hasWord) {
        word = in.readUTF();
    }
    ...
}
and
@Override
public void write(DataOutput out) throws IOException {
    ...
    boolean hasWord = StringUtils.isNotBlank(word);
    out.writeBoolean(hasWord);
    if (hasWord) {
        out.writeUTF(word);
    }
    ...
}
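Putting those two halves together, a minimal self-contained version of that idea could look like this (the class and field names are made up for illustration):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.Writable;

public class OptionalStringWritable implements Writable {
    private String word; // may be null or blank

    public OptionalStringWritable() { }

    public OptionalStringWritable(String word) { this.word = word; }

    @Override
    public void write(DataOutput out) throws IOException {
        // write a presence flag first, then the payload only if it is non-blank
        boolean hasWord = StringUtils.isNotBlank(word);
        out.writeBoolean(hasWord);
        if (hasWord) {
            out.writeUTF(word);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // read the flag back and only then the string, mirroring write()
        word = in.readBoolean() ? in.readUTF() : null;
    }

    public String getWord() {
        return word;
    }
}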

Get Specific data from MapReduce

I have the following file as input, which consists of 10,000 lines like this one:
250788965731,20090906,200937,200909,621,SUNDAY,WEEKEND,ON-NET,MORNING,OUTGOING,VOICE,25078,PAY_AS_YOU_GO_PER_SECOND_PSB,SUCCESSFUL-RELEASEDBYSERVICE,5,0,1,6.25,635-10-104-40163.
I have to print the first column if the 18th column is less than 10 and the 9th column is MORNING. I wrote the following code, but I'm not getting any output; the output file is empty.
public static class MyMap extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] day = line.split(",");
        double day1 = Double.parseDouble(day[17]);
        if (day[8] == "MORNING" && day1 < 10.0) {
            context.write(new Text(day[0]), new DoubleWritable(day1));
        }
    }
}
public static class MyReduce extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text key, Iterator<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        String no = values.toString();
        double no1 = Double.parseDouble(no);
        if (no1 > 10.0) {
            context.write(key, new DoubleWritable(no1));
        }
    }
}
Please tell me what I did wrong. Is the flow correct?
I can see a few problems.
First, in your Mapper, you should use .equals() instead of == when comparing Strings. Otherwise you're just comparing references, and the comparison will fail even if the String objects' contents are the same. It might happen to succeed because of Java String interning, but I would avoid relying on that if that was the original intent.
In your Reducer, I am not sure what you want to achieve, but there are a few clearly wrong things that I can spot anyway. The reduce method receives an Iterable<DoubleWritable> of values (not an Iterator), so you should iterate over it and apply whatever condition you need to each individual value. Here is how I would rewrite your Reducer:
public static class MyReduce extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        for (DoubleWritable val : values) {
            if (val.get() > 10.0) {
                context.write(key, val);
            }
        }
    }
}
But the overall logic doesn't make much sense. If all you want to do is print the first column when the 18th column is less than 10 and the 9th column is MORNING, then you could use a NullWritable as the output key of your mapper and write column 1 (day[0]) as your output value. You probably don't even need a Reducer in this case, which you can tell Hadoop with job.setNumReduceTasks(0);.
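For what it's worth, a rough sketch of that mapper-only variant (column indexes taken from the question; the class name is a placeholder):

public static class MyMap extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] day = value.toString().split(",");
        // emit only the first column when column 9 is MORNING and column 18 is below 10
        if ("MORNING".equals(day[8]) && Double.parseDouble(day[17]) < 10.0) {
            context.write(NullWritable.get(), new Text(day[0]));
        }
    }
}

// and in the driver:
// job.setNumReduceTasks(0);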
One thing that got me thinking: if your input is only 10k lines, do you really need a Hadoop job for this? It seems to me a simple shell script (for example with awk) would be enough for this small dataset.
Hope that helps!
I believe this is a mapper-only job, as your data already has the values you want to check.
Your mapper emits values with day1 < 10.0, while your reducer emits only values with day1 > 10.0, hence none of the values would be output by your reducer.
So I think your reducer should look like this:
String no = values.toString();
double no1 = Double.parseDouble(no);
if (no1 < 10.0) {
    context.write(key, new DoubleWritable(no1));
}
I think that should get your desired output.
