I have a tab-delimited text file like this:
20001204X00000 Accident 10 9 6 Hyd
20001204X00001 Accident 8 7 vzg 2
20001204X00002 Accident 10 7 sec 1
20001204X00003 Accident 23 9 kkd 23
I want to get as output the flight id and the total number of passengers, where the total is the sum of all the numeric column values in that line, like this:
20001204X00000 25
20001204X00001 17
20001204X00002 18
20001204X00003 55
When I try to add the four numerical columns I get a NullPointerException. How can I avoid the NullPointerException, and how can I replace null or whitespace values with zero?
This is Hadoop MapReduce Java code:
package com.flightsdamage.mr;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class FlightsDamage {
public static class FlightsMaper extends Mapper<LongWritable, Text, Text, LongWritable> {
LongWritable pass2;
@Override
protected void map(LongWritable key, Text value,
org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException, InterruptedException,NumberFormatException,NullPointerException {
String line = value.toString();
String[] column=line.split("|");
Text word=new Text();
word.set(column[0]);
String str = "n";
try {
long a = Long.parseLong(str);
long a1=Long.parseLong("col1[]");
long a2=Long.parseLong("col2[]");
long a3=Long.parseLong("col3[]");
long a4=Long.parseLong("col4[]");
long sum = a1+a2+a3+a4;
LongWritable pass0 = new LongWritable(a1);
LongWritable pass = new LongWritable(a2);
LongWritable pass1 = new LongWritable(a3);
LongWritable pass3 = new LongWritable(a4);
pass2 = new LongWritable(sum);
} catch (Exception e) {
// TODO: handle exception
}finally{
context.write(word,pass2);
}
}
}
public static void main(String[] args)throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "Flights MR");
job.setJarByClass(FlightsDamage.class);
job.setMapperClass(FlightsMaper.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
//FileInputFormat.addInputPath(job, new Path("/home/node1/data-AviationData.txt"));
FileInputFormat.addInputPath(job, new Path("/home/node1/Filghtdamage.txt"));
FileOutputFormat.setOutputPath(job, new Path("/home/node1/output"));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
You need to check if the string is of numeric type before parsing it. Like:
int value = 0;
if (StringUtils.isNumeric(str)) {
value = Integer.parseInt(str);
}
If the input string is non-numeric (whether null, empty, or otherwise non-numeric), StringUtils.isNumeric() returns false and the variable keeps its default value of 0.
Here is a simple program which demonstrates the usage of StringUtils.isNumeric():
Test Class:
import org.apache.commons.lang3.StringUtils;
public class LineParse {
public static void main(String[] args) {
String[] input = {
"20001204X00000\tAccident\t10\t9\t6\tHyd",
"20001204X00001\tAccident\t\t8\t7\tvzg\t2",
"20001204X00002\tAccident\t10\t7\t\tsec\t1",
"20001204X00003\tAccident\t23\t\t9\tkkd\t23"
};
StringBuilder output = new StringBuilder();
for (String line : input) {
int sum = 0;
String[] tokens = line.split("\t");
if (tokens.length > 0) {
output.append(tokens[0]);
output.append("\t");
for (int i = 1;i < tokens.length;i++) {
// Check if String is of type numeric.
if (StringUtils.isNumeric(tokens[i])) {
sum += Integer.parseInt(tokens[i]);
}
}
}
output.append(sum);
output.append("\n");
}
System.out.println(output.toString());
}
}
Output:
20001204X00000 25
20001204X00001 17
20001204X00002 18
20001204X00003 55
I have assumed that all the numbers are integers; otherwise use Double.parseDouble(). Note that StringUtils.isNumeric() only accepts digits, so a different check would be needed for decimal values.
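Applied to the mapper from the question, a rough sketch of the same idea (assuming the input really is tab-delimited and commons-lang3 is on the job's classpath; field positions are illustrative, not confirmed by the question) might look like this:
@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String[] column = value.toString().split("\t");
    if (column.length == 0) {
        return; // nothing on this line
    }
    long sum = 0;
    // Sum every field that is purely numeric; null, blank, or non-numeric cells contribute 0.
    for (int i = 1; i < column.length; i++) {
        String cell = column[i].trim();
        if (StringUtils.isNumeric(cell)) {
            sum += Long.parseLong(cell);
        }
    }
    context.write(new Text(column[0]), new LongWritable(sum));
}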
Related
I have the following code in Hadoop, and when it runs it produces the output of the mapper as the output of the reducer. The reducer basically does nothing. The two input files are in the following form:
File A: Jan-1 #starwars,17115 (Each line is like this one.) VALUE is the number 17115.
File B: #starwars,2017/1/1 5696 (Each line is like this one.) VALUE is the number 5696.
The mapper class processes these files and outputs key/value pairs like:
JAN #STARWARS 17115/A where KEY: JAN #STARWARS
JAN #STARWARS 5696/B where KEY: JAN #STARWARS
The reducer is supposed to do the following:
All the same keys go to one reducer (correct me if I'm wrong, I'm new to Hadoop), and each reduce call splits every value into two parts, the key and the value:
KEY: A, VALUE 17115
KEY: B, VALUE 5696
For the moment it should just add all the values, without caring whether they come from A or B, and write:
JAN #STARWARS 22811 (22811 = 17115 + 5696)
So why does it write the mapper's output without the reducer doing what it is supposed to do?
I didn't set the number of reducers to zero.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Partitioner;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, Text>{
//private final static IntWritable result = new IntWritable();
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString(),"\n");
while (itr.hasMoreTokens()) {
String nextWord = itr.nextToken().toUpperCase();
//System.out.println("'"+nextWord+"'");
if(isFromPlatformB(nextWord)){
//Procedure for words of Platform B.
String[] split1 = nextWord.split("(,)|(/)|(\\s)");
String seriesTitle = split1[0];
String numOfMonth = split1[2];
String numOfDay = split1[3];
String number = split1[4];//VALUE
int monthInt = Integer.parseInt(numOfMonth);
String monthString;
switch (monthInt) {
case 1: monthString = "JAN";
break;
case 2: monthString = "FEB";
break;
case 3: monthString = "MAR";
break;
case 4: monthString = "APR";
break;
case 5: monthString = "MAY";
break;
case 6: monthString = "JUN";
break;
case 7: monthString = "JUL";
break;
case 8: monthString = "AUG";
break;
case 9: monthString = "SEP";
break;
case 10: monthString = "OCT";
break;
case 11: monthString = "NOV";
break;
case 12: monthString = "DEC";
break;
default: monthString = "ERROR";
break;
}
//result.set(numberInt);
word.set(monthString + " " + seriesTitle);
System.out.println("key: "+monthString + " " + seriesTitle + ", value: "+number+"/B");
context.write(word, new Text(number + "/B"));
//FORMAT : <KEY,VALUE/B>
}
else{
//Procedure for words of Platform A.
String[] split5 = nextWord.split("(-)|( )|(,)");
String month = split5[0];
String seriesTitle = split5[2];
String value2 = split5[3];//OUTVALUE
String finalWord = month + " " + seriesTitle;//OUTKEY KEY: <APR #WESTWORLD>
word.set(finalWord);
//result.set(valueInt);
System.out.println("key: "+finalWord + ", value: "+value2+"/A");
context.write(word, new Text(value2 + "/A"));
//FORMAT : <KEY,VALUE/A>
}
}
}
/*
*This method takes the next token and returns true if the token is taken from platform B file,
*Or it returns false if the token comes from platform A file.
*
*/
public boolean isFromPlatformB(String nextToken){
// B platform lines have the form: "#WestWorld,2017/1/2"
if(nextToken.charAt(0) == '#'){
return true;
}
return false;
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,Text> {
//private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<Text> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (Text val : values) {
String valToString = val.toString();
String[] split = valToString.split("/");
//String keyOfValue;
String valueOfValue;
int intValueOfValue = 0;
// FORMAT : <KEY,VALUE/platform> [<KEY,VALUE>,VALUE = <key,value>]
// [0] [1]
if(split.length>1){
//keyOfValue = split[1];
valueOfValue = split[0];
//System.out.println(key);
//System.out.println(valueOfValue);
//System.out.println(keyOfValue);
intValueOfValue = Integer.parseInt(valueOfValue);
/*if(keyOfValue.equals("A")){//If value is from platform A
counterForPlatformA += intValueOfValue;
System.out.println("KEY = 'A' " + "VALUE :" +intValueOfValue);
System.out.println("counter A: "+ counterForPlatformA +"|| counter B: "+ counterForPlatformB + "||----||");
}
else if(keyOfValue.equals("B")){//If value is from platform B
counterForPlatformB += intValueOfValue;
System.out.println("KEY = 'B' " + "VALUE :" +intValueOfValue);
System.out.println("counter A: "+ counterForPlatformA +"|| counter B: "+ counterForPlatformB + "||----||");
}
else{
//ERROR
System.out.println("Not equal to A or B");
}*/
}
sum += intValueOfValue;
}
context.write(key, new Text(String.valueOf(sum)));
}
}
public static void main(String[] args) throws Exception{
if (args.length != 3 ){
System.err.println ("Usage :<inputlocation1> <inputlocation2> <outputlocation> >");
System.exit(0);
}
Configuration conf = new Configuration();
String[] files=new GenericOptionsParser(conf,args).getRemainingArgs();
Path input1=new Path(files[0]);
Path input2=new Path(files[1]);
Path output=new Path(files[2]);
//If OUTPUT already exists -> Delete it
FileSystem fs = FileSystem.get(conf);
if(fs.exists(output)){
fs.delete(output, true);
}
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
MultipleInputs.addInputPath(job, input1, TextInputFormat.class);
MultipleInputs.addInputPath(job, input2, TextInputFormat.class);
FileOutputFormat.setOutputPath(job, output);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
It looks like your reducer takes in a Text key and Text values and outputs Text. If that is the case, you have a few problems:
In your main you have:
job.setOutputValueClass(IntWritable.class) which should probably be job.setOutputValueClass(Text.class)
You are also defining your reducer as:
public static class IntSumReducer extends Reducer<Text,IntWritable,Text,Text> it should be public static class IntSumReducer extends Reducer<Text,Text,Text,Text>
The reducer is receiving Text values, not IntWritables.
It was the combiner in the end. If you also use your reducer as the combiner, you can't have different types between your mapper output and your reducer input/output.
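Putting both points together, a hedged sketch of the aligned declarations and job wiring (renamed to SumReducer for clarity; not a drop-in replacement, and the combiner line is simply dropped here, since this reducer emits a different value format than the mapper):
public static class SumReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (Text val : values) {
            String[] parts = val.toString().split("/"); // "17115/A" -> ["17115", "A"]
            if (parts.length > 1) {
                sum += Integer.parseInt(parts[0]);
            }
        }
        context.write(key, new Text(String.valueOf(sum)));
    }
}
// In main(): no setCombinerClass(...), and the final output value class is Text.
job.setReducerClass(SumReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);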
I'm getting an ArrayIndexOutOfBoundsException at the line String temp = word[5]; in my mapper.
I've researched this and I know where the error comes from: it happens when the input data is empty or a row has fewer fields than the index used in the code, and my data has some empty cell values.
I've tried to catch the array index error using the following code, but it still gives me the error.
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class AvgMaxTempMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, DoubleWritable> {
public void map(LongWritable key, Text value, OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
if(line != null && !line.isEmpty() && str.matches(".*\\d+.*"));
String [] word = line.split(",");
String month = word[3];
String temp = word[5];
if (temp.length() > 1 && temp.length() < 5){
Double avgtemp = Double.parseDouble(temp);
output.collect(new Text(month), new DoubleWritable(avgtemp));
}
}
}
If you could please give me any hints or tips to whether the error is in this code or I should look somewhere else, that would save a lot of stress!
By throwing the exception in the method signature, you're basically causing the entire mapper to stop whenever it encounters a single "bad" line of data. What you actually want to do is have the mapper ignore that line of data but keep processing other lines.
You should check the length of word[] immediately after split(). If it's not long enough, stop processing that line. You'll also want to check that month and temp are valid after you've extracted them. How about:
String[] word = line.split(",");
if (word == null || word.length < 6) {
    return; // not enough fields in this line, skip it
}
String month = word[3];
if (month == null || month.isEmpty()) {
    return; // no month value, skip this line
}
String temp = word[5];
if (temp != null && temp.length() > 1 && temp.length() < 5) {
    double avgtemp;
    try {
        avgtemp = Double.parseDouble(temp);
    } catch (NumberFormatException ex) {
        // Log that you've seen a dodgy temperature
        return;
    }
    output.collect(new Text(month), new DoubleWritable(avgtemp));
}
It's very important to validate data in MapReduce jobs, as you can never guarantee what you'll get as input.
You might also want to look at the Apache Commons StringUtils and ArrayUtils classes - they provide methods such as StringUtils.isEmpty(temp) and ArrayUtils.isEmpty(word) that will neaten up the above.
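For instance, the guards above might be condensed along these lines (a sketch only, assuming commons-lang3 is available on the classpath):
String[] word = line.split(",");
if (ArrayUtils.isEmpty(word) || word.length < 6) {
    return; // nothing useful on this line
}
String month = word[3];
String temp = word[5];
if (StringUtils.isEmpty(month) || StringUtils.isEmpty(temp)) {
    return; // missing cell values, skip the line
}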
I would recommend using a custom counter instead, which you increment every time you find an empty cell. This will give you a picture of how many such lines exist in your data.
Along with some other efficiency modifications, my suggestion is the following:
import java.io.IOException; //do you still need this?
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class AvgMaxTempMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, DoubleWritable> {
public static enum STATS {MISSING_VALUE};
private Text outKey = new Text();
private DoubleWritable outValue = new DoubleWritable();
public void map(LongWritable key, Text value, OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
if (!line.matches(".*\\d+.*")) { // skip lines that contain no digits at all
    return;
}
String [] word = line.split(",");
if (word.length < 6) { //or whatever else you consider expected
reporter.incrCounter(STATS.MISSING_VALUE,1); //you can also print/log an error message if you like
return;
}
String month = word[3];
String temp = word[5];
if (temp.length() > 1 && temp.length() < 5){
Double avgtemp = Double.parseDouble(temp);
outKey.set(month);
outValue.set(avgtemp);
output.collect(outKey, outValue);
} //you were missing this '}'
}
}
So my class's lab assignment is to read in a text file with a city's name and some temperatures on each line (the file can have any number of lines), e.g.:
Montgomery 15 9.5 17
and print the name and the average of the temperatures. When printing, the name has to be left-aligned on the line, and the average should be printed with two digits after the decimal point, right-aligned on the line. I can assume no name is 30+ characters, and that the average can be printed in a field 10 characters wide (including the decimal point).
Here's what I have so far
import java.util.Scanner;
import java.io.*;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Formatter;
public class readFile2
{
public static void main(String[] args) throws IOException
{
FileReader a = new FileReader("places.txt");
StreamTokenizer b = new StreamTokenizer(a);
ArrayList<Double> c = new ArrayList<Double>();
/*System.out.println(
String.format("%-10s:%10s", "ABCD", "ZYXW"));
**I found this to format an output. */
double count = 0;
while(b.nextToken() != b.TT_EOF);
{
if(b.ttype == b.TT_WORD)
{
System.out.print(b.sval);
}
if(b.ttype == b.TT_NUMBER)
{
c.add(b.nval);
count ++;
}
double totaltemp = 0; // I need to figure out how to put
// this inside of an if statement
for(int i = 0; i < c.size(); i++) // that can tell when it reaches
{ // the last number/the end of the line
totaltemp = c.get(i) + totaltemp; //
count++; //
} //
//
System.out.print(totaltemp/count); //
}
}
}
The second part is to modify the program so that the name of a city may consist of more than one word (e.g., "New York").
I really appreciate any and all help and advice :)
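As a rough illustration of the line-by-line averaging and the formatting described above (a sketch only: it uses Scanner instead of StreamTokenizer, covers only the single-word city name case, and the class name is arbitrary):
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
public class PlacesAverage {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner file = new Scanner(new File("places.txt"));
        while (file.hasNextLine()) {
            Scanner line = new Scanner(file.nextLine());
            String name = line.next();        // single-word city name
            double total = 0;
            int count = 0;
            while (line.hasNextDouble()) {    // read the temperatures on this line
                total += line.nextDouble();
                count++;
            }
            if (count > 0) {
                // name left-justified in 30 chars, average right-justified in 10 with 2 decimals
                System.out.printf("%-30s%10.2f%n", name, total / count);
            }
        }
    }
}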
There is this sample record:
100,1:2:3
which I want to normalize as:
100,1
100,2
100,3
A colleague of mine wrote a Pig script to achieve this, and my MapReduce code took more time. I was using the default TextInputFormat before, but to improve performance I decided to write a custom InputFormat class with a custom RecordReader. Taking the LineRecordReader class as a reference, I tried to write the following code.
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;
import com.normalize.util.Splitter;
public class NormalRecordReader extends RecordReader<Text, Text> {
private long start;
private long pos;
private long end;
private LineReader in;
private int maxLineLength;
private Text key = null;
private Text value = null;
private Text line = null;
public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());
in = new LineReader(fileIn, job);
this.pos = start;
}
public boolean nextKeyValue() throws IOException {
int newSize = 0;
if (line == null) {
line = new Text();
}
while (pos < end) {
newSize = in.readLine(line);
if (newSize == 0) {
break;
}
pos += newSize;
if (newSize < maxLineLength) {
break;
}
// line too long. try again
System.out.println("Skipped line of size " + newSize + " at pos " + (pos - newSize));
}
Splitter splitter = new Splitter(line.toString(), ",");
List<String> split = splitter.split();
if (key == null) {
key = new Text();
}
key.set(split.get(0));
if (value == null) {
value = new Text();
}
value.set(split.get(1));
if (newSize == 0) {
key = null;
value = null;
return false;
} else {
return true;
}
}
@Override
public Text getCurrentKey() {
return key;
}
@Override
public Text getCurrentValue() {
return value;
}
/**
* Get the progress within the split
*/
public float getProgress() {
if (start == end) {
return 0.0f;
} else {
return Math.min(1.0f, (pos - start) / (float)(end - start));
}
}
public synchronized void close() throws IOException {
if (in != null) {
in.close();
}
}
}
Though this works, I haven't seen any performance improvement. Here I am breaking the record at "," and setting 100 as the key and 1:2:3 as the value. I only call the mapper, which does the following:
public void map(Text key, Text value, Context context)
throws IOException, InterruptedException {
try {
Splitter splitter = new Splitter(value.toString(), ":");
List<String> splits = splitter.split();
for (String split : splits) {
context.write(key, new Text(split));
}
} catch (IndexOutOfBoundsException ibe) {
System.err.println(value + " is malformed.");
}
}
The Splitter class is used to split the data, as I found String's split() to be slower. The method is:
public List<String> split() {
List<String> splitData = new ArrayList<String>();
int beginIndex = 0, endIndex = 0;
while(true) {
endIndex = dataToSplit.indexOf(delim, beginIndex);
if(endIndex == -1) {
splitData.add(dataToSplit.substring(beginIndex));
break;
}
splitData.add(dataToSplit.substring(beginIndex, endIndex));
beginIndex = endIndex + delimLength;
}
return splitData;
}
Can the code be improved in any way?
Let me summarize here what I think you can improve instead of in the comments:
As explained, currently you are creating a Text object several times per record (number of times will be equal to your number of tokens). While it may not matter too much for small input, this can be a big deal for decently sized jobs. To fix that, do the following:
private final Text text = new Text();
public void map(Text key, Text value, Context context) {
....
for (String split : splits) {
text.set(split);
context.write(key, text);
}
}
For your splitting, what you're doing right now is for every record allocating a new array, populating this array, and then iterating over this array to write your output. Effectively you don't really need an array in this case since you're not maintaining any state. Using the implementation of the split method you provided, you only need to make one pass on the data:
public void map(Text key, Text value, Context context) {
String dataToSplit = value.toString();
String delim = ":";
int beginIndex = 0;
int endIndex = 0;
while(true) {
endIndex = dataToSplit.indexOf(delim, beginIndex);
if(endIndex == -1) {
text.set(dataToSplit.substring(beginIndex));
context.write(key, text);
break;
}
text.set(dataToSplit.substring(beginIndex, endIndex));
context.write(key, text);
beginIndex = endIndex + delim.length();
}
}
I don't really see why you wrote your own InputFormat; it seems that KeyValueTextInputFormat is exactly what you need and has probably already been optimized. Here is how you use it:
conf.set("key.value.separator.in.input.line", ",");
job.setInputFormatClass(KeyValueTextInputFormat.class);
Based on your example, the key for each record seems to be an integer. If that's always the case, then using a Text as your mapper input key is not optimal and it should be an IntWritable or maybe even a ByteWritable depending on what's in your data.
Similarly, you may want to use an IntWritable or ByteWritable as your mapper output key and output value.
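For illustration, a hedged sketch of what the mapper might look like with IntWritable output types (assuming the keys and values always fit in an int; splitting is shown with String.split for brevity, though the single-pass loop above applies equally). The job would then also need setMapOutputKeyClass(IntWritable.class) and setMapOutputValueClass(IntWritable.class):
public static class NormalizeMapper extends Mapper<Text, Text, IntWritable, IntWritable> {
    private final IntWritable outKey = new IntWritable();
    private final IntWritable outValue = new IntWritable();
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        outKey.set(Integer.parseInt(key.toString()));
        // One output record per ':'-separated token, reusing the writables.
        for (String token : value.toString().split(":")) {
            outValue.set(Integer.parseInt(token));
            context.write(outKey, outValue);
        }
    }
}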
Also, if you want a meaningful benchmark, you should test on a bigger dataset, like a few GBs if possible. One-minute tests are not really meaningful, especially in the context of distributed systems. One job may run quicker than another on a small input, but the trend may be reversed for bigger inputs.
That being said, you should also know that Pig does a lot of optimization under the hood when translating to Map/Reduce, so I'm not too surprised that it runs faster than your Java Map/Reduce code; I've seen that in the past. Try the optimizations I suggested; if it's still not fast enough, here is a link on profiling your Map/Reduce jobs with a few more useful tricks (especially tip 7 on profiling is something I've found useful).
I have a huge text file and I wanted to split the file so that each chunk has 5 lines. I implemented my own GWASInputFormat and GWASRecordReader classes. However, my question is about the following code (which I copied from http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/): inside the initialize() method I have the following lines:
FileSplit split = (FileSplit) genericSplit;
final Path file = split.getPath();
Configuration conf = context.getConfiguration();
My question is: is the file already split by the time the initialize() method is called in my GWASRecordReader class? I thought that I was doing the split in the GWASRecordReader class. Let me know if my thought process is right here.
package com.test;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;
public class GWASRecordReader extends RecordReader<LongWritable, Text> {
private final int NLINESTOPROCESS = 5;
private LineReader in;
private LongWritable key;
private Text value = new Text();
private long start = 0;
private long pos = 0;
private long end = 0;
private int maxLineLength;
public void close() throws IOException {
if(in != null) {
in.close();
}
}
public LongWritable getCurrentKey() throws IOException, InterruptedException {
return key;
}
public Text getCurrentValue() throws IOException, InterruptedException {
return value;
}
public float getProgress() throws IOException, InterruptedException {
if(start == end) {
return 0.0f;
}
else {
return Math.min(1.0f, (pos - start)/(float) (end - start));
}
}
public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
FileSplit split = (FileSplit) genericSplit;
final Path file = split.getPath();
Configuration conf = context.getConfiguration();
this.maxLineLength = conf.getInt("mapred.linerecordreader.maxlength",Integer.MAX_VALUE);
FileSystem fs = file.getFileSystem(conf);
start = split.getStart();
end = start + split.getLength();
System.out.println("---------------SPLIT LENGTH---------------------" + split.getLength());
boolean skipFirstLine = false;
FSDataInputStream filein = fs.open(split.getPath());
if(start != 0) {
skipFirstLine = true;
--start;
filein.seek(start);
}
in = new LineReader(filein, conf);
if(skipFirstLine) {
start += in.readLine(new Text(),0,(int)Math.min((long)Integer.MAX_VALUE, end - start));
}
this.pos = start;
}
public boolean nextKeyValue() throws IOException, InterruptedException {
if (key == null) {
key = new LongWritable();
}
key.set(pos);
if (value == null) {
value = new Text();
}
value.clear();
final Text endline = new Text("\n");
int newSize = 0;
for(int i=0; i<NLINESTOPROCESS;i++) {
Text v = new Text();
while( pos < end) {
newSize = in.readLine(v ,maxLineLength, Math.max((int)Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
value.append(v.getBytes(), 0, v.getLength());
value.append(endline.getBytes(),0,endline.getLength());
if(newSize == 0) {
break;
}
pos += newSize;
if(newSize < maxLineLength) {
break;
}
}
}
if(newSize == 0) {
key = null;
value = null;
return false;
} else {
return true;
}
}
}
Yes, the input file will already be split. It basically goes like this:
your input file(s) -> InputSplit -> RecordReader -> Mapper...
Basically, InputSplit breaks the input into chunks, RecordReader breaks these chunks into key/value pairs. Note that InputSplit and RecordReader will be determined by the InputFormat you use. For example, TextInputFormat uses FileSplit to break apart the input, then LineRecordReader which processes each individual line with the position as the key, and the line itself as the value.
So in your GWASInputFormat you'll need to look into what kind of FileSplit you use to see what it's passing to GWASRecordReader.
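For reference, here is a minimal sketch of what such a GWASInputFormat might look like if it simply delegates splitting to FileInputFormat and plugs in the GWASRecordReader above (an assumption, since the question doesn't show that class):
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
public class GWASInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        // FileInputFormat computes the FileSplits; we only supply the record reader.
        return new GWASRecordReader();
    }
}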
I would suggest looking into NLineInputFormat which "splits N lines of input as one split". It may be able to do exactly what you are trying to do yourself.
If you're trying to get 5 lines at a time as the value, and the line number of the first line as the key, I would say you could do this with a customized NLineInputFormat and a custom LineRecordReader. You don't need to worry as much about the input split, since the input format can split the file into those 5-line chunks. Your RecordReader would be very similar to LineRecordReader, but instead of using the byte position of the start of the chunk as the key, you would use the line number. So you could essentially copy NLineInputFormat and LineRecordReader and have the input format use your record reader that tracks the line number; the code would be almost identical except for that small change.
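For the split side, configuring the stock org.apache.hadoop.mapreduce.lib.input.NLineInputFormat is only a couple of lines. A hedged sketch (the underlying property name has varied between Hadoop versions, so the setNumLinesPerSplit helper is the safer route):
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 5);
// Note: this only controls split boundaries; each map() call still receives one
// line unless it is paired with a custom record reader like the one described above.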