I'm getting an ArrayIndexOutOfBoundsException at String temp = word[5]; in my mapper.
I've researched this and I know where the error comes from: it happens when the input data is empty, or when a line has fewer fields than the index used in the code (my data has some empty cell values).
I've tried to catch the array index error using the following code, but it still gives me the error.
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class AvgMaxTempMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, DoubleWritable> {
public void map(LongWritable key, Text value, OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
if(line != null && !line.isEmpty() && str.matches(".*\\d+.*"));
String [] word = line.split(",");
String month = word[3];
String temp = word[5];
if (temp.length() > 1 && temp.length() < 5){
Double avgtemp = Double.parseDouble(temp);
output.collect(new Text(month), new DoubleWritable(avgtemp));
}
}
}
If you could please give me any hints or tips to whether the error is in this code or I should look somewhere else, that would save a lot of stress!
By letting the exception propagate out of the map() method, you're basically causing the entire task to stop whenever it encounters a single "bad" line of data. What you actually want to do is have the mapper ignore that line of data but keep processing other lines.
You should check the length of word[] immediately after split(). If it's not long enough, stop processing that line. You'll also want to check that month and temp are valid after you've extracted them. How about:
String [] word = line.split(",");
if (word == null || word.length < 6) {
return; // we're not inside a loop, so use return to skip this record rather than break
}
String month = word[3];
if (month == null || month.isEmpty()) {
return;
}
String temp = word[5];
if (temp != null && temp.length() > 1 && temp.length() < 5) {
try {
double avgtemp = Double.parseDouble(temp);
output.collect(new Text(month), new DoubleWritable(avgtemp));
} catch (NumberFormatException ex) {
//Log that you've seen a dodgy temperature
}
}
It's very important to validate data in MapReduce jobs, as you can never guarantee what you'll get as input.
You might also want to look at the Apache Commons StringUtils and ArrayUtils classes; they provide methods such as StringUtils.isEmpty(temp) and ArrayUtils.isEmpty(word) that will neaten up the above.
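For illustration, a rough sketch of how those guards might look with Commons Lang on the classpath (here using the commons-lang3 package names):
import org.apache.commons.lang3.ArrayUtils;
import org.apache.commons.lang3.StringUtils;

// Skip this record if the split produced too few fields or the fields we need are empty.
String[] word = line.split(",");
if (ArrayUtils.isEmpty(word) || word.length < 6
        || StringUtils.isEmpty(word[3]) || StringUtils.isEmpty(word[5])) {
    return;
}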
I would recommend using a custom counter instead, which you increment every time you find an empty cell. This will give you a picture of how many such lines exist in your data.
Along with some other efficiency modifications, my suggestion is the following:
import java.io.IOException; //do you still need this?
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class AvgMaxTempMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, DoubleWritable> {
public static enum STATS {MISSING_VALUE};
private Text outKey = new Text();
private DoubleWritable outValue = new DoubleWritable();
public void map(LongWritable key, Text value, OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
if (line.matches(".*\\d+.*")) { //note: no trailing ';' here, otherwise the check does nothing
String [] word = line.split(",");
if (word.length < 6) { //or whatever else you consider expected
reporter.incrCounter(STATS.MISSING_VALUE,1); //you can also print/log an error message if you like
return;
}
String month = word[3];
String temp = word[5];
if (temp.length() > 1 && temp.length() < 5){
Double avgtemp = Double.parseDouble(temp);
outKey.set(month);
outValue.set(avgtemp);
output.collect(outKey, outValue);
} //you were missing this '}'
}
}
}
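As a rough sketch of how you could read that counter back after the job finishes (using the old mapred API this mapper is written against; conf here is assumed to be the job's JobConf):
RunningJob runningJob = JobClient.runJob(conf);
long missing = runningJob.getCounters().getCounter(AvgMaxTempMapper.STATS.MISSING_VALUE);
System.out.println("Lines with missing values: " + missing);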
Related
So my class's lab assignment is to read in a text file with a city's name and some temperatures on one line (with any number of lines), e.g.:
Montgomery 15 9.5 17
and print the name and the average of the temperatures. When printing, the name has to be left-aligned on the line, and the average should be printed with two digits after the decimal point, right-aligned on the line. I can assume no name is 30+ characters, and that the average can be printed in a field 10 characters wide (including the decimal point).
Here's what I have so far
import java.util.Scanner;
import java.io.*;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Formatter;
public class readFile2
{
public static void main(String[] args) throws IOException
{
FileReader a = new FileReader("places.txt");
StreamTokenizer b = new StreamTokenizer(a);
ArrayList<Double> c = new ArrayList<Double>();
/*System.out.println(
String.format("%-10s:%10s", "ABCD", "ZYXW"));
**I found this to format an output. */
double count = 0;
while(b.nextToken() != b.TT_EOF);
{
if(b.ttype == b.TT_WORD)
{
System.out.print(b.sval);
}
if(b.ttype == b.TT_NUMBER)
{
c.add(b.nval);
count ++;
}
double totaltemp = 0; // I need to figure out how to put
// this inside of an if statement
for(int i = 0; i < c.size(); i++) // that can tell when it reaches
{ // the last number/the end of the line
totaltemp = c.get(i) + totaltemp; //
count++; //
} //
//
System.out.print(totaltemp/count); //
}
}
}
The second part is to modify the program so that the name of a city may consist of more than one word (e.g., 'New York').
I really appreciate any and all help and advice :)
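As a side note on the formatting requirement described above, here is a minimal sketch (not a full solution to the assignment) of printing a name left-aligned in a 30-character field and an average right-aligned in a 10-character field with two decimals:
// Montgomery's average from the example line: (15 + 9.5 + 17) / 3
System.out.printf("%-30s%10.2f%n", "Montgomery", (15 + 9.5 + 17) / 3);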
I have a tab-delimited text file like this:
20001204X00000 Accident 10 9 6 Hyd
20001204X00001 Accident 8 7 vzg 2
20001204X00002 Accident 10 7 sec 1
20001204X00003 Accident 23 9 kkd 23
I want the output to be flight id, total number of passengers; here I have to sum all the numerical column values to get the total number of passengers, like this:
20001204X00000 25
20001204X00001 17
20001204X00002 18
20001204X00003 55
When I try to add the four numerical columns I get a NullPointerException. Please help me understand how to avoid the NullPointerException and how to replace the null or whitespace values with zero.
This is the Hadoop MapReduce Java code:
package com.flightsdamage.mr;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class FlightsDamage {
public static class FlightsMaper extends Mapper<LongWritable, Text, Text, LongWritable> {
LongWritable pass2;
@Override
protected void map(LongWritable key, Text value,
org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException, InterruptedException,NumberFormatException,NullPointerException {
String line = value.toString();
String[] column=line.split("|");
Text word=new Text();
word.set(column[0]);
String str = "n";
try {
long a = Long.parseLong(str);
long a1=Long.parseLong("col1[]");
long a2=Long.parseLong("col2[]");
long a3=Long.parseLong("col3[]");
long a4=Long.parseLong("col4[]");
long sum = a1+a2+a3+a4;
LongWritable pass0 = new LongWritable(a1);
LongWritable pass = new LongWritable(a2);
LongWritable pass1 = new LongWritable(a3);
LongWritable pass3 = new LongWritable(a4);
pass2 = new LongWritable(sum);
} catch (Exception e) {
// TODO: handle exception
}finally{
context.write(word,pass2);
}
}
}
public static void main(String[] args)throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "Flights MR");
job.setJarByClass(FlightsDamage.class);
job.setMapperClass(FlightsMaper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
//FileInputFormat.addInputPath(job, new Path("/home/node1/data-AviationData.txt"));
FileInputFormat.addInputPath(job, new Path("/home/node1/Filghtdamage.txt"));
FileOutputFormat.setOutputPath(job, new Path("/home/node1/output"));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
You need to check if the string is of numeric type before parsing it. Like:
int value = 0;
if (StringUtils.isNumeric(str)) {
value = Integer.parseInt(str);
}
If the input string is non-numeric (be it null or any other non-numeric value), StringUtils.isNumeric() will return false and the variable will keep its default value of 0.
Here is a simple program which demonstrates the usage of StringUtils.isNumeric().
Test Class:
import org.apache.commons.lang3.StringUtils;
public class LineParse {
public static void main(String[] args) {
String[] input = {
"20001204X00000\tAccident\t10\t9\t6\tHyd",
"20001204X00001\tAccident\t\t8\t7\tvzg\t2",
"20001204X00002\tAccident\t10\t7\t\tsec\t1",
"20001204X00003\tAccident\t23\t\t9\tkkd\t23"
};
StringBuilder output = new StringBuilder();
for (String line : input) {
int sum = 0;
String[] tokens = line.split("\t");
if (tokens.length > 0) {
output.append(tokens[0]);
output.append("\t");
for (int i = 1;i < tokens.length;i++) {
// Check if String is of type numeric.
if (StringUtils.isNumeric(tokens[i])) {
sum += Integer.parseInt(tokens[i]);
}
}
}
output.append(sum);
output.append("\n");
}
System.out.println(output.toString());
}
}
Output:
20001204X00000 25
20001204X00001 17
20001204X00002 18
20001204X00003 55
I have assumed that all the numbers will be integers. Otherwise, use Double.parseDouble().
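One caveat on that: StringUtils.isNumeric() also returns false for strings containing a decimal point, so if any column can hold a decimal value, a try/catch around the parse may be the simpler guard. A minimal sketch:
double sum = 0.0;
for (int i = 1; i < tokens.length; i++) {
    try {
        sum += Double.parseDouble(tokens[i]);
    } catch (NumberFormatException e) {
        // empty or non-numeric column - treat it as zero
    }
}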
Here is the background. I have the following input for my MapReduce job (example):
Apache Hadoop
Apache Lucene
StackOverflow
....
(Actually each line represents a user query. Not important here.) And I want my RecordReader class to read one line and then pass several key-value pairs to the mappers. For example, if the RecordReader gets Apache Hadoop, then I want it to generate the following key-value pairs and pass them to the mappers:
Apache Hadoop - 1
Apache Hadoop - 2
Apache Hadoop - 3
("-" is the separator here.) And I found RecordReader pass key-values in next() method:
next(key, value);
Every time RecordReader.next() is called, only one key and one value are passed as arguments. So how should I get my work done?
I believe you can simply use this:
public static class MultiMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
private int n = 3; // how many key-value pairs to emit per input line; see the note below
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
for (int i = 1; i <= n; i++) {
context.write(value, new IntWritable(i));
}
}
}
Here n is the number of values you want to pass. For example for the key-value pairs you specified:
Apache Hadoop - 1
Apache Hadoop - 2
Apache Hadoop - 3
n would be 3.
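If n shouldn't be hard-coded, one option (a sketch; the property name "multimapper.n" is just an illustrative choice, not a standard Hadoop property) is to set it in the driver's Configuration and read it back in the mapper's setup():
// In the driver, before submitting the job:
conf.setInt("multimapper.n", 3);

// In MultiMapper, overriding the field's default:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    n = context.getConfiguration().getInt("multimapper.n", 3); // falls back to 3 if unset
}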
I think if you want to send the same key to the mapper several times, you must implement your own RecordReader; for example, you can write a MultiRecordReader that extends LineRecordReader, and there you must change the nextKeyValue() method.
This is the original code from LineRecordReader:
public boolean nextKeyValue() throws IOException {
if (key == null) {
key = new LongWritable();
}
key.set(pos);
if (value == null) {
value = new Text();
}
int newSize = 0;
// We always read one extra line, which lies outside the upper
// split limit i.e. (end - 1)
while (getFilePosition() <= end) {
newSize = in.readLine(value, maxLineLength,
Math.max(maxBytesToConsume(pos), maxLineLength));
pos += newSize;
if (newSize < maxLineLength) {
break;
}
// line too long. try again
LOG.info("Skipped line of size " + newSize + " at pos " +
(pos - newSize));
}
if (newSize == 0) {
key = null;
value = null;
return false;
} else {
return true;
}
}
and you can change it like this:
public boolean nextKeyValue() throws IOException {
if (key == null) {
key = new Text();
}
if (value == null) {
value = new Text();
}
int newSize = 0;
while (getFilePosition() <= end && n <= 3) { // n is an int field of the reader, starting at 1
newSize = in.readLine(key, maxLineLength,
Math.max(maxBytesToConsume(pos), maxLineLength)); // change: read the line into the key instead of the value
value.set(Integer.toString(n)); // the value carries the counter 1, 2, 3
n++;
if (n == 3) { // we don't advance to the next line until n reaches three
pos += newSize;
}
if (newSize < maxLineLength) {
break;
}
// line too long. try again
LOG.info("Skipped line of size " + newSize + " at pos " +
(pos - newSize));
}
if (newSize == 0) {
key = null;
value = null;
return false;
} else {
return true;
}
}
I think this can work for you.
Try not giving a key:
context.write(NullWritable.get(), new Text("Apache Hadoop - 1"));
context.write(NullWritable.get(), new Text("Apache Hadoop - 2"));
context.write(NullWritable.get(), new Text("Apache Hadoop - 3"));
There is this sample record,
100,1:2:3
Which I want to normalize as,
100,1
100,2
100,3
A colleague of mine wrote a Pig script to achieve this, and my MapReduce code took more time. I was using the default TextInputFormat before, but to improve performance I decided to write a custom InputFormat class with a custom RecordReader. Taking the LineRecordReader class as a reference, I tried to write the following code.
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;
import com.normalize.util.Splitter;
public class NormalRecordReader extends RecordReader<Text, Text> {
private long start;
private long pos;
private long end;
private LineReader in;
private int maxLineLength;
private Text key = null;
private Text value = null;
private Text line = null;
public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());
in = new LineReader(fileIn, job);
this.pos = start;
}
public boolean nextKeyValue() throws IOException {
int newSize = 0;
if (line == null) {
line = new Text();
}
while (pos < end) {
newSize = in.readLine(line);
if (newSize == 0) {
break;
}
pos += newSize;
if (newSize < maxLineLength) {
break;
}
// line too long. try again
System.out.println("Skipped line of size " + newSize + " at pos " + (pos - newSize));
}
Splitter splitter = new Splitter(line.toString(), ",");
List<String> split = splitter.split();
if (key == null) {
key = new Text();
}
key.set(split.get(0));
if (value == null) {
value = new Text();
}
value.set(split.get(1));
if (newSize == 0) {
key = null;
value = null;
return false;
} else {
return true;
}
}
@Override
public Text getCurrentKey() {
return key;
}
@Override
public Text getCurrentValue() {
return value;
}
/**
* Get the progress within the split
*/
public float getProgress() {
if (start == end) {
return 0.0f;
} else {
return Math.min(1.0f, (pos - start) / (float)(end - start));
}
}
public synchronized void close() throws IOException {
if (in != null) {
in.close();
}
}
}
Though this works, I haven't seen any performance improvement. Here I am breaking the record at "," and setting 100 as the key and 1:2:3 as the value. I then only call the mapper, which does the following:
public void map(Text key, Text value, Context context)
throws IOException, InterruptedException {
try {
Splitter splitter = new Splitter(value.toString(), ":");
List<String> splits = splitter.split();
for (String split : splits) {
context.write(key, new Text(split));
}
} catch (IndexOutOfBoundsException ibe) {
System.err.println(value + " is malformed.");
}
}
The Splitter class is used to split the data, as I found String's split() to be slower. The method is:
public List<String> split() {
List<String> splitData = new ArrayList<String>();
int beginIndex = 0, endIndex = 0;
while(true) {
endIndex = dataToSplit.indexOf(delim, beginIndex);
if(endIndex == -1) {
splitData.add(dataToSplit.substring(beginIndex));
break;
}
splitData.add(dataToSplit.substring(beginIndex, endIndex));
beginIndex = endIndex + delimLength;
}
return splitData;
}
Can the code be improved in any way?
Let me summarize here what I think you can improve instead of in the comments:
As explained, currently you are creating a Text object several times per record (number of times will be equal to your number of tokens). While it may not matter too much for small input, this can be a big deal for decently sized jobs. To fix that, do the following:
private final Text text = new Text();
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
....
for (String split : splits) {
text.set(split);
context.write(key, text);
}
}
For your splitting, what you're doing right now is for every record allocating a new array, populating this array, and then iterating over this array to write your output. Effectively you don't really need an array in this case since you're not maintaining any state. Using the implementation of the split method you provided, you only need to make one pass on the data:
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
String dataToSplit = value.toString();
String delim = ":";
int beginIndex = 0;
int endIndex = 0;
while(true) {
endIndex = dataToSplit.indexOf(delim, beginIndex);
if(endIndex == -1) {
text.set(dataToSplit.substring(beginIndex));
context.write(key, text);
break;
}
text.set(dataToSplit.substring(beginIndex, endIndex));
context.write(key, text);
beginIndex = endIndex + delim.length();
}
}
I don't really see why you wrote your own InputFormat; it seems that KeyValueTextInputFormat is exactly what you need and has probably already been optimized. Here is how you use it:
conf.set("key.value.separator.in.input.line", ",");
job.setInputFormatClass(KeyValueTextInputFormat.class);
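As a side note worth checking against your Hadoop version: with the newer org.apache.hadoop.mapreduce API that this job already uses, the separator property has a different name, so the equivalent configuration line would be:
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");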
Based on your example, the key for each record seems to be an integer. If that's always the case, then using a Text as your mapper input key is not optimal and it should be an IntWritable or maybe even a ByteWritable depending on what's in your data.
Similarly, you may want to use an IntWritable or ByteWritable as your mapper output key and output value.
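For instance, here is a hypothetical sketch of what emitting numeric writables could look like, assuming the key and every token always parse as ints (the field names are illustrative, not from your code):
private final IntWritable outKey = new IntWritable();
private final IntWritable outValue = new IntWritable();

// inside map():
outKey.set(Integer.parseInt(key.toString()));
for (String split : splits) {
    outValue.set(Integer.parseInt(split));
    context.write(outKey, outValue);
}

// and in the driver:
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);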
Also, if you want a meaningful benchmark, you should test on a bigger dataset, like a few GBs if possible. One-minute tests are not really meaningful, especially in the context of distributed systems. One job may run quicker than another on a small input, but the trend may be reversed for bigger inputs.
That being said, you should also know that Pig does a lot of optimizations under the hood when translating to Map/Reduce, so I'm not too surprised that it runs faster than your Java Map/Reduce code; I've seen that in the past. Try the optimizations I suggested; if it's still not fast enough, here is a link on profiling your Map/Reduce jobs with a few more useful tricks (especially tip 7 on profiling is something I've found useful).
I have a huge text file and I want to split the file so that each chunk has 5 lines. I implemented my own GWASInputFormat and GWASRecordReader classes. However, my question is about the following code (which I copied from http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/); inside the initialize() method I have the following lines:
FileSplit split = (FileSplit) genericSplit;
final Path file = split.getPath();
Configuration conf = context.getConfiguration();
My question is: is the file already split by the time the initialize() method is called in my GWASRecordReader class? I thought that I was doing the split in the GWASRecordReader class. Let me know if my thought process is right here.
package com.test;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;
public class GWASRecordReader extends RecordReader<LongWritable, Text> {
private final int NLINESTOPROCESS = 5;
private LineReader in;
private LongWritable key;
private Text value = new Text();
private long start = 0;
private long pos = 0;
private long end = 0;
private int maxLineLength;
public void close() throws IOException {
if(in != null) {
in.close();
}
}
public LongWritable getCurrentKey() throws IOException, InterruptedException {
return key;
}
public Text getCurrentValue() throws IOException, InterruptedException {
return value;
}
public float getProgress() throws IOException, InterruptedException {
if(start == end) {
return 0.0f;
}
else {
return Math.min(1.0f, (pos - start)/(float) (end - start));
}
}
public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
FileSplit split = (FileSplit) genericSplit;
final Path file = split.getPath();
Configuration conf = context.getConfiguration();
this.maxLineLength = conf.getInt("mapred.linerecordreader.maxlength",Integer.MAX_VALUE);
FileSystem fs = file.getFileSystem(conf);
start = split.getStart();
end = start + split.getLength();
System.out.println("---------------SPLIT LENGTH---------------------" + split.getLength());
boolean skipFirstLine = false;
FSDataInputStream filein = fs.open(split.getPath());
if(start != 0) {
skipFirstLine = true;
--start;
filein.seek(start);
}
in = new LineReader(filein, conf);
if(skipFirstLine) {
start += in.readLine(new Text(),0,(int)Math.min((long)Integer.MAX_VALUE, end - start));
}
this.pos = start;
}
public boolean nextKeyValue() throws IOException, InterruptedException {
if (key == null) {
key = new LongWritable();
}
key.set(pos);
if (value == null) {
value = new Text();
}
value.clear();
final Text endline = new Text("\n");
int newSize = 0;
for(int i=0; i<NLINESTOPROCESS;i++) {
Text v = new Text();
while( pos < end) {
newSize = in.readLine(v ,maxLineLength, Math.max((int)Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
value.append(v.getBytes(), 0, v.getLength());
value.append(endline.getBytes(),0,endline.getLength());
if(newSize == 0) {
break;
}
pos += newSize;
if(newSize < maxLineLength) {
break;
}
}
}
if(newSize == 0) {
key = null;
value = null;
return false;
} else {
return true;
}
}
}
Yes, the input file will already be split. It basically goes like this:
your input file(s) -> InputSplit -> RecordReader -> Mapper...
Basically, InputSplit breaks the input into chunks, RecordReader breaks these chunks into key/value pairs. Note that InputSplit and RecordReader will be determined by the InputFormat you use. For example, TextInputFormat uses FileSplit to break apart the input, then LineRecordReader which processes each individual line with the position as the key, and the line itself as the value.
So in your GWASInputFormat you'll need to look into what kind of FileSplit you use to see what it's passing to GWASRecordReader.
I would suggest looking into NLineInputFormat which "splits N lines of input as one split". It may be able to do exactly what you are trying to do yourself.
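For reference, a rough sketch of configuring it in the driver (new mapreduce API) so that each split contains 5 lines; note that its bundled record reader still hands the mapper one line per call, so this controls the split size rather than what each value contains:
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 5);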
If you're trying to get 5 lines at a time as the value, and the line number of the first line as the key, I would say you could do this with a customized NLineInputFormat and a custom LineRecordReader. You don't need to worry as much about the input split, I think, since the input format can split the input into those 5-line chunks. Your RecordReader would be very similar to LineRecordReader, but instead of getting the byte position of the start of the chunk, you would get the line number. So you could essentially copy and paste NLineInputFormat and LineRecordReader, then have the input format use your record reader that returns the line number; the code would be almost identical except for that small change.