Hadoop Custom InputFileFormat producing empty results - java

So I'm trying to import baseball event files from Retrosheet.org into Hadoop. Each game entry has the following format, with each file containing a season's worth of game entries (this is an incomplete record shown only as an example; some items were removed for redundancy and to save space):
id,BOS192704230
version,1
info,inputprogvers,"version 7RS(19) of 07/07/92"
info,visteam,WS1
start,judgj101,"Joe Judge",0,5,3
start,myerb103,"Buddy Myer",0,6,6
start,blueo102,"Ossie Bluege",0,7,5
play,4,0,myerb103,??,,S/BG
play,4,0,blueo102,??,,CS2(2E4)
play,4,0,blueo102,??,,7/FL
play,4,0,ruelm101,??,,63
play,4,0,crowg102,??,,NP
sub,wests101,"Sam West",0,9,11
play,4,0,wests101,??,,K/C
play,4,1,wannp101,??,,NP
sub,liseh101,"Hod Lisenbee",0,9,1
play,4,1,wannp101,??,,W
play,4,1,rothj101,??,,CS2(26)
play,4,1,rothj101,??,,7/F
play,4,1,tobij101,??,,5/P
play,5,0,rices101,??,,6/P
data,er,crowg102,4
data,er,liseh101,0
data,er,braxg101,1
data,er,marbf101,0
data,er,harrs101,3
I'm making my first pass at importing this into Hadoop and am having trouble implementing a custom InputFormat that successfully reads such a record. I've been attempting to split the files on the first line of each game record (indicated by "id," followed by a team, season, date, and game code) using the regex "id,[A-Z]{3}[0-9]{9}". When I output this (I'm using a SequenceFile output, but both the SequenceFile and regular Text file outputs give the same result), I get an empty result file. Any pointer in the right direction would be extremely helpful. The code I've got so far is based on a template found here: http://dronamk.blogspot.com/2013/03/regex-custom-input-format-for-hadoop.html. I'm using essentially the same code posted there, just compiling the above regex instead of the included expression.
The class in question:
package project.baseball;

import java.io.IOException;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class RegexInputFormat extends InputFormat<LongWritable, TextArrayWritable> {

    public Pattern pattern = Pattern.compile("id,[A-Z]{3}[0-9]{9}");
    private TextInputFormat textIF = new TextInputFormat();

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException {
        return textIF.getSplits(context);
    }

    @Override
    public RecordReader<LongWritable, TextArrayWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        RegexRecordReader reader = new RegexRecordReader();
        if (pattern == null) {
            throw new IllegalStateException("No pattern specified - unable to create record reader");
        }
        reader.setPattern(pattern);
        return reader;
    }

    public static class RegexRecordReader extends RecordReader<LongWritable, TextArrayWritable> {

        private LineRecordReader lineRecordReader = new LineRecordReader();
        private Pattern pattern;
        TextArrayWritable value = new TextArrayWritable();

        public void setPattern(Pattern pattern2) {
            pattern = pattern2;
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            lineRecordReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            while (lineRecordReader.nextKeyValue()) {
                Matcher matcher = pattern.matcher(lineRecordReader.getCurrentValue().toString());
                if (matcher.find()) {
                    int fieldCount = matcher.groupCount();
                    Text[] fields = new Text[fieldCount];
                    for (int i = 0; i < fieldCount; i++) {
                        fields[i] = new Text(matcher.group(i + 1));
                    }
                    value.setFields(fields);
                    return true;
                }
            }
            return false;
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return lineRecordReader.getCurrentKey();
        }

        @Override
        public TextArrayWritable getCurrentValue() throws IOException, InterruptedException {
            return value;
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return lineRecordReader.getProgress();
        }

        @Override
        public void close() throws IOException {
            lineRecordReader.close();
        }
    }
}

Your regex may be missing the context, i.e. whatever surrounds the line you want to split on.
Try this instead:
(.*)(id,([A-Z]{3}[0-9]{9}))(.*)
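Part of why the grouped pattern matters for the posted reader: nextKeyValue() builds its fields from matcher.groupCount() and matcher.group(i + 1), and the original pattern has no capturing groups, so groupCount() is 0 and the value ends up with zero fields. A minimal standalone sketch illustrating the difference (the sample line is just one of the id lines from the question):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupCountDemo {
    public static void main(String[] args) {
        String line = "id,BOS192704230";

        // Original pattern: it matches, but defines no capturing groups,
        // so groupCount() is 0 and no fields would be extracted.
        Matcher m1 = Pattern.compile("id,[A-Z]{3}[0-9]{9}").matcher(line);
        System.out.println(m1.find() + " groups=" + m1.groupCount()); // true groups=0

        // Suggested pattern: four capturing groups, so group(1)..group(4)
        // (including the team/date/game code in group(3)) are available.
        Matcher m2 = Pattern.compile("(.*)(id,([A-Z]{3}[0-9]{9}))(.*)").matcher(line);
        System.out.println(m2.find() + " groups=" + m2.groupCount()); // true groups=4
    }
}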

Related

Argument passing using Cucumber between two methods in Java

I am a newcomer to Cucumber and I am having a hard time understanding how to pass data between two methods. I keep reading about data tables, but I only see examples that use data already listed in a table in the feature file. When I run my code I get the error:
Step [^Send Results$] is defined with 3 parameters at 'cucumebr.test.addResult(int,int,String,Integer>>) in file:/Users/lcren1026/eclipse-workspace/cucumebr/target/classes/'. However, the gherkin step has 0 arguments.
What I am trying to do is pass a Map inside an ArrayList between two methods, using data gathered with Selenium. Below is my code:
package cucumebr;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

import testrail.APIException;
import org.json.simple.parser.ParseException;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

import cucumber.api.java.en.Given;
import cucumber.api.java.en.Then;
import cucumber.api.java.en.When;

public class test {

    public static WebDriver driver;
    String FFDriverDirectory = "directory";
    String FFDriverEXE = "webdriver.gecko.driver";
    ArrayList Results = new ArrayList<>();

    @Given("^Open the browser$")
    public void openBrowser() throws IOException, APIException, InterruptedException {
        System.setProperty(FFDriverEXE, FFDriverDirectory);
        driver = new FirefoxDriver();
        driver.navigate().to("https://www.google.com/");
        Thread.sleep(5000);
    }

    @When("^verify logo$")
    public void verifyLogo() throws IOException, APIException, ParseException {
        if (driver.findElement(By.xpath("//*[@id=\"hplogo\"]")).isDisplayed()) {
            addResult(120254, 1, Results);
        } else {
            addResult(120254, 5, Results);
        }
    }

    @Then("^verify btn$")
    public void verifyBtn() throws IOException, APIException, ParseException {
        if (driver.findElement(By.xpath("//*[@id=\"tsf\"]/div[2]/div/div[3]/center/input[1]")).isDisplayed()) {
            addResult(120255, 1, Results);
        } else {
            addResult(120255, 5, Results);
        }
    }

    @Then("^Send Results$")
    public void addResult(int testCaseId, int status, ArrayList<Map<String, Integer>> newResults) throws IOException, ParseException {
        int count = testCaseId;
        if (testCaseId == count) {
            Map myTestResults = new HashMap<String, Integer>() {{
                put("suite", testCaseId);
                put("milestone", 179);
                put("status", status);
            }};
            System.out.println(myTestResults);
            newResults.add(myTestResults);
            count++;
        }
        System.out.println(newResults);
    }

    @Then("^Print Results$")
    public static void PrintResultForTestCase(ArrayList<Map<String, Integer>> newResults) throws IOException, APIException {
        System.out.println("This is the final result " + newResults);
    }
}
Here is the feature:
Feature: google
Scenario: Driver works
Given Open the browser
When verify logo
Then verify btn
Then Send Results
Then Print Results
The data is ArrayList<Map<String, Integer>> newResults and the methods are "addResult" and "PrintResultForTestCase".
Thanks in advance!
Your code:
public void addResult(int testCaseId, int status, ArrayList<Map<String, Integer>> newResults)
requires three parameters, but your feature file does not pass any.
Here is an example of how to use parameters (number, text, or table).
Sample scenario:
Scenario: Test Scenario
Given Number parameter $1
And Text parameter "text"
And Table example
| Team name | Number of members |
| team1 | 2 |
| team2 | 5 |
Steps:
@Given("^Number parameter \\$(\\d+)$")
public void numberParameter(int number) {
    System.out.println("Number from step is: " + number);
}

@And("^Text parameter \"([^\"]*)\"$")
public void textParameter(String text) {
    System.out.println("Text from step is: " + text);
}

@And("^Table example$")
public void tableExample(DataTable table) {
    List<List<String>> data = table.raw();
    // iterate through 'data' here to access the rows of the table
}
Read more about DataTable here.
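Applied to the original question, one way to make the Send Results and Print Results steps match their zero-argument gherkin lines is to keep the results list as a field of the step class and call a plain (non-step) helper from the other steps, rather than passing the list as a step argument. A rough sketch under that assumption (the class, field and method names here are illustrative, not from the original code):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

import cucumber.api.java.en.Then;

public class ResultSteps {

    // Shared state lives on the step-definition instance instead of being
    // passed between steps as Cucumber arguments.
    private final ArrayList<Map<String, Integer>> results = new ArrayList<>();

    // Plain helper, not a step definition: call this from verifyLogo()/verifyBtn().
    void addResult(int testCaseId, int status) {
        Map<String, Integer> testResult = new HashMap<>();
        testResult.put("suite", testCaseId);
        testResult.put("milestone", 179);
        testResult.put("status", status);
        results.add(testResult);
    }

    @Then("^Send Results$")
    public void sendResults() {
        // Zero-argument step, matching the zero-argument gherkin line.
        System.out.println(results);
    }

    @Then("^Print Results$")
    public void printResults() {
        System.out.println("This is the final result " + results);
    }
}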

How to add "text/plain" MIME type to DataHandler

I have been struggling to get this test to work for a while. The relevant code executes fine in production, so my assumption is that production has some additional configuration. Most of what I find while searching relates specifically to email handling and additional libraries, and I don't want to include anything else. What am I missing to link DataHandler to a relevant way of handling "text/plain"?
Expected result: DataHandler allows me to stream the input "Value" back into a result.
Reproduce issue with this code:
import java.io.IOException;
import java.io.InputStream;

import javax.activation.CommandInfo;
import javax.activation.CommandMap;
import javax.activation.DataHandler;

import org.apache.commons.io.IOUtils;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class DataHandlerTest {

    @Before
    public void setUp() throws Exception {
    }

    @After
    public void tearDown() throws Exception {
    }

    @Test
    public void test() throws IOException {
        printDefaultCommandMap();
        DataHandler dh = new DataHandler("Value", "text/plain");
        System.out.println("DataHandler commands:");
        printDataHandlerCommands(dh);
        dh.setCommandMap(CommandMap.getDefaultCommandMap());
        System.out.println("DataHandler commands:");
        printDataHandlerCommands(dh);
        final InputStream in = dh.getInputStream();
        String result = new String(IOUtils.toByteArray(in));
        System.out.println("Returned String: " + result);
    }

    private void printDataHandlerCommands(DataHandler dh) {
        CommandInfo[] infos = dh.getAllCommands();
        printCommands(infos);
    }

    private void printDefaultCommandMap() {
        CommandMap currentMap = CommandMap.getDefaultCommandMap();
        String[] mimeTypes = currentMap.getMimeTypes();
        System.out.println("Found " + mimeTypes.length + " MIME types.");
        for (String mimeType : mimeTypes) {
            System.out.println("Commands for: " + mimeType);
            printCommands(currentMap.getAllCommands(mimeType));
        }
    }

    private void printCommands(CommandInfo[] infos) {
        for (CommandInfo info : infos) {
            System.out.println("  Command Class: " + info.getCommandClass());
            System.out.println("  Command Name: " + info.getCommandName());
        }
    }
}
Exception:
javax.activation.UnsupportedDataTypeException: no object DCH for MIME type text/plain
    at javax.activation.DataHandler.getInputStream(DataHandler.java:249)
Help much appreciated, I hope this is a well formed question!
========================
Update 25th February
I have found that if I know I stored a String in the DataHandler, I can cast the result to String and get back the object that was stored, for example:
@Test
public void testGetWithoutStream() throws IOException {
    String inputString = "Value";
    DataHandler dh = new DataHandler(inputString, "text/plain");
    String rawResult = (String) dh.getContent();
    assertEquals(inputString, rawResult);
}
But the code under test uses an InputStream, so my 'real' tests still fail when executed locally.
Continuing my investigation and still hoping for someone's assistance/guidance on this one...
Answering my own question for anyone's future reference.
All credit goes to: https://community.oracle.com/thread/1675030?start=0
The principle here is that you need to give DataHandler a factory containing a DataContentHandler that behaves the way you want for your MIME type. This is set via a static method that appears to affect all DataHandler instances.
I declared a new class (SystemDataHandlerConfigurator) with a single public method that creates my factory and passes it to the static DataHandler.setDataContentHandlerFactory() function.
My tests now work correctly if I do this before they run:
SystemDataHandlerConfigurator configurator = new SystemDataHandlerConfigurator();
configurator.setupCustomDataContentHandlers();
SystemDataHandlerConfigurator
import java.io.IOException;
import javax.activation.*;

public class SystemDataHandlerConfigurator {

    public void setupCustomDataContentHandlers() {
        DataHandler.setDataContentHandlerFactory(new CustomDCHFactory());
    }

    private class CustomDCHFactory implements DataContentHandlerFactory {
        @Override
        public DataContentHandler createDataContentHandler(String mimeType) {
            return new BinaryDataHandler();
        }
    }

    private class BinaryDataHandler implements DataContentHandler {

        /** Creates a new instance of BinaryDataHandler */
        public BinaryDataHandler() {
        }

        /** This is the key: it just returns the data uninterpreted. */
        public Object getContent(javax.activation.DataSource dataSource) throws java.io.IOException {
            return dataSource.getInputStream();
        }

        public Object getTransferData(java.awt.datatransfer.DataFlavor dataFlavor,
                javax.activation.DataSource dataSource)
                throws java.awt.datatransfer.UnsupportedFlavorException, java.io.IOException {
            return null;
        }

        public java.awt.datatransfer.DataFlavor[] getTransferDataFlavors() {
            return new java.awt.datatransfer.DataFlavor[0];
        }

        public void writeTo(Object obj, String mimeType, java.io.OutputStream outputStream)
                throws java.io.IOException {
            if ("text/plain".equals(mimeType)) { // compare string contents, not references
                byte[] stringByte = ((String) obj).getBytes("UTF-8");
                outputStream.write(stringByte);
            } else {
                throw new IOException("Unsupported Data Type: " + mimeType);
            }
        }
    }
}
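For reference, a minimal sketch of a test that exercises the original getInputStream() path with the configurator registered first (assuming the SystemDataHandlerConfigurator above is on the test classpath; the test class name is just illustrative):

import static org.junit.Assert.assertEquals;

import java.io.InputStream;

import javax.activation.DataHandler;

import org.apache.commons.io.IOUtils;
import org.junit.BeforeClass;
import org.junit.Test;

public class DataHandlerWithConfiguratorTest {

    @BeforeClass
    public static void registerHandlers() {
        // Register the custom DataContentHandlerFactory once, before any DataHandler is used.
        new SystemDataHandlerConfigurator().setupCustomDataContentHandlers();
    }

    @Test
    public void streamsTextPlainBackOut() throws Exception {
        DataHandler dh = new DataHandler("Value", "text/plain");
        InputStream in = dh.getInputStream();
        assertEquals("Value", new String(IOUtils.toByteArray(in), "UTF-8"));
    }
}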

Class cast exception in map reduce job when writing data to MySQL database

I am trying a MapReduce job to load data into a MySQL database, but I am hitting a ClassCastException. Here is the procedure I use:
I first create a DBOutputWritable class that implements the Writable and DBWritable interfaces.
I then use my reduce job to write the data to the database. However, when I run the job, it fails with the following error:
java.lang.ClassCastException: com.amalwa.hadoop.DataBaseLoadMapReduce.DBOutputWritable cannot be cast to org.apache.hadoop.mapreduce.lib.db.DBWritable
at org.apache.hadoop.mapreduce.lib.db.DBOutputFormat$DBRecordWriter.write(DBOutputFormat.java:66)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:601)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at com.amalwa.hadoop.DataBaseLoadMapReduce.DBMapReduce$DBReducer.reduce(DBMapReduce.java:58)
at com.amalwa.hadoop.DataBaseLoadMapReduce.DBMapReduce$DBReducer.reduce(DBMapReduce.java:53)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:663)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:426)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
I am having a hard time figuring out why there is a class cast exception when my class implements the interfaces that are required to write to a DB from a MapReduce job. I am implementing all the required functions.
Thanks.
DBOutputWritable
package com.amalwa.hadoop.DataBaseLoadMapReduce;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class DBOutputWritable implements Writable, DBWritable {

    private String keyValue;
    private String response;

    public DBOutputWritable(String keyValue, String response) {
        this.keyValue = keyValue;
        this.response = response;
    }

    public void readFields(DataInput resultSet) throws IOException {
    }

    public void readFields(ResultSet resultSet) throws SQLException {
        keyValue = resultSet.getString(1);
        response = resultSet.getString(2);
    }

    public void write(PreparedStatement preparedStatement) throws SQLException {
        preparedStatement.setString(1, keyValue);
        preparedStatement.setString(2, response);
    }

    public void write(DataOutput dataOutput) throws IOException {
    }
}
Reducer:
public static class DBReducer extends Reducer<Text, Text, DBOutputWritable, NullWritable> {
    public void reduce(Text requestKey, Iterable<Text> response, Context context) {
        for (Text responseSet : response) {
            try {
                context.write(new DBOutputWritable(requestKey.toString(), responseSet.toString()), NullWritable.get());
            } catch (IOException e) {
                System.err.println(e.getMessage());
            } catch (InterruptedException e) {
                System.err.println(e.getMessage());
            }
        }
    }
}
Mapper:
public static class DBMapper extends Mapper {
    public void map(LongWritable key, Text value, Context context) throws IOException {
        String tweetInfo = value.toString();
        String[] myTweetData = tweetInfo.split(",", 2);
        String requestKey = myTweetData[0];
        String response = myTweetData[1];
        try {
            context.write(new Text(requestKey), new Text(response));
        } catch (InterruptedException e) {
            System.err.println(e.getMessage());
        }
    }
}
Main class:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver", "jdbc:mysql://ec2-54-152-254-194.compute-1.amazonaws.com/TWEETS", "user", "password");
    Job job = new Job(conf);
    job.setJarByClass(DBMapReduce.class);
    job.setMapperClass(DBMapper.class);
    job.setReducerClass(DBReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(DBOutputWritable.class);
    job.setOutputValueClass(NullWritable.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(DBOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[1]));
    DBOutputFormat.setOutput(job, "TWEET_INFO", new String[] { "REQUESTKEY", "TWEET_DETAILS" });
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
It seems that you are mixing the old (org.apache.hadoop.mapred.*) and new (org.apache.hadoop.mapreduce.*) MapReduce APIs, and that is causing the conflict. My suspicion is that your DBReducer extends the Reducer class from the new API, while your DBOutputWritable implements DBWritable from the old API.
You should stick to just one of those APIs across your implementation, which means all imported MapReduce types should begin with the same package prefix.
Note that, typically, you implement MapReduce interfaces when using the old API and extend MapReduce base classes when using the new API.
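Concretely, applying that here would mean keeping everything on the new API, i.e. having DBOutputWritable implement org.apache.hadoop.mapreduce.lib.db.DBWritable instead of org.apache.hadoop.mapred.lib.db.DBWritable, which is what the new-API DBOutputFormat in the stack trace expects. A sketch of the adjusted class (same fields and logic as posted; only the DBWritable import changes):

// New-API imports only (org.apache.hadoop.mapreduce.*), matching the Job/Reducer used in the driver.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable; // new API, not org.apache.hadoop.mapred.lib.db

public class DBOutputWritable implements Writable, DBWritable {

    private String keyValue;
    private String response;

    public DBOutputWritable(String keyValue, String response) {
        this.keyValue = keyValue;
        this.response = response;
    }

    public void readFields(DataInput in) throws IOException {
        // not needed for DBOutputFormat
    }

    public void write(DataOutput out) throws IOException {
        // not needed for DBOutputFormat
    }

    public void readFields(ResultSet resultSet) throws SQLException {
        keyValue = resultSet.getString(1);
        response = resultSet.getString(2);
    }

    public void write(PreparedStatement preparedStatement) throws SQLException {
        preparedStatement.setString(1, keyValue);
        preparedStatement.setString(2, response);
    }
}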

Wordcount on (Key,Value) outputs from a Map Reduce

I have several (title, text) ordered pairs obtained as output from a MapReduce application in Hadoop using Java.
Now I would like to implement word count on the text field of these ordered pairs.
So my final output should look like:
(title-a , word-a-1 , count-a-1 , word-a-2 , count-a-2 ....)
(title-b , word-b-1, count-b-1 , word-b-2 , count-b-2 ....)
.
.
.
.
(title-x , word-x-1, count-x-1 , word-x-2 , count-x-2 ....)
To summarize, I want to run word count separately on the output records of the first MapReduce job. Can someone suggest a good way to do this, or how I can chain a second MapReduce job to create the above output or format it better?
The following is the code; I borrowed it from GitHub and made some changes.
package com.org;

import java.io.*;
import java.io.IOException;

import javax.xml.stream.*;
import javax.xml.stream.XMLStreamConstants;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class XmlParser11 {

    public static class XmlInputFormat1 extends TextInputFormat {

        public static final String START_TAG_KEY = "xmlinput.start";
        public static final String END_TAG_KEY = "xmlinput.end";

        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new XmlRecordReader();
        }

        /**
         * XMLRecordReader class to read through a given xml document to output
         * xml blocks as records as specified by the start tag and end tag
         */
        public static class XmlRecordReader extends RecordReader<LongWritable, Text> {

            private byte[] startTag;
            private byte[] endTag;
            private long start;
            private long end;
            private FSDataInputStream fsin;
            private DataOutputBuffer buffer = new DataOutputBuffer();
            private LongWritable key = new LongWritable();
            private Text value = new Text();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException, InterruptedException {
                Configuration conf = context.getConfiguration();
                startTag = conf.get(START_TAG_KEY).getBytes("utf-8");
                endTag = conf.get(END_TAG_KEY).getBytes("utf-8");
                FileSplit fileSplit = (FileSplit) split;
                // open the file and seek to the start of the split
                start = fileSplit.getStart();
                end = start + fileSplit.getLength();
                Path file = fileSplit.getPath();
                FileSystem fs = file.getFileSystem(conf);
                fsin = fs.open(fileSplit.getPath());
                fsin.seek(start);
            }

            @Override
            public boolean nextKeyValue() throws IOException, InterruptedException {
                if (fsin.getPos() < end) {
                    if (readUntilMatch(startTag, false)) {
                        try {
                            buffer.write(startTag);
                            if (readUntilMatch(endTag, true)) {
                                key.set(fsin.getPos());
                                value.set(buffer.getData(), 0, buffer.getLength());
                                return true;
                            }
                        } finally {
                            buffer.reset();
                        }
                    }
                }
                return false;
            }

            @Override
            public LongWritable getCurrentKey() throws IOException, InterruptedException {
                return key;
            }

            @Override
            public Text getCurrentValue() throws IOException, InterruptedException {
                return value;
            }

            @Override
            public void close() throws IOException {
                fsin.close();
            }

            @Override
            public float getProgress() throws IOException {
                return (fsin.getPos() - start) / (float) (end - start);
            }

            private boolean readUntilMatch(byte[] match, boolean withinBlock) throws IOException {
                int i = 0;
                while (true) {
                    int b = fsin.read();
                    // end of file:
                    if (b == -1)
                        return false;
                    // save to buffer:
                    if (withinBlock)
                        buffer.write(b);
                    // check if we're matching:
                    if (b == match[i]) {
                        i++;
                        if (i >= match.length)
                            return true;
                    } else
                        i = 0;
                    // see if we've passed the stop point:
                    if (!withinBlock && i == 0 && fsin.getPos() >= end)
                        return false;
                }
            }
        }
    }

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Mapper.Context context)
                throws IOException, InterruptedException {
            String document = value.toString();
            System.out.println("'" + document + "'");
            try {
                XMLStreamReader reader = XMLInputFactory.newInstance()
                        .createXMLStreamReader(new ByteArrayInputStream(document.getBytes()));
                String propertyName = "";
                String propertyValue = "";
                String currentElement = "";
                while (reader.hasNext()) {
                    int code = reader.next();
                    switch (code) {
                    case XMLStreamConstants.START_ELEMENT:
                        currentElement = reader.getLocalName();
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (currentElement.equalsIgnoreCase("title")) {
                            propertyName += reader.getText();
                        } else if (currentElement.equalsIgnoreCase("text")) {
                            propertyValue += reader.getText();
                        }
                        break;
                    }
                }
                reader.close();
                context.write(new Text(propertyName.trim()), new Text(propertyValue.trim()));
            } catch (Exception e) {
                throw new IOException(e);
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            context.write(new Text("<Start>"), null);
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            context.write(new Text("</Start>"), null);
        }

        private Text outputKey = new Text();

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                outputKey.set(constructPropertyXml(key, value));
                context.write(outputKey, null);
            }
        }

        public static String constructPropertyXml(Text name, Text value) {
            StringBuilder sb = new StringBuilder();
            sb.append("<property><name>").append(name)
              .append("</name><value>").append(value)
              .append("</value></property>");
            return sb.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("xmlinput.start", "<page>");
        conf.set("xmlinput.end", "</page>");
        Job job = new Job(conf);
        job.setJarByClass(XmlParser11.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(XmlParser11.Map.class);
        job.setReducerClass(XmlParser11.Reduce.class);
        job.setInputFormatClass(XmlInputFormat1.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
The wordcount code we find online counts words across all input files and gives one combined output. I want to do the word count for each text field separately. The above mapper is used to pull the title and text from an XML document. Is there any way I can do the word count in the same mapper? If I do that, my next doubt is how I pass it along with the already existing (title, text) key-value pairs to the reducer. Sorry, I am not able to phrase my question properly, but I hope the reader gets the idea.
I'm not sure if I have understood it properly, so I have a few questions along with my answer.
First of all, whoever wrote this code is probably trying to show how to write a custom InputFormat to process XML data with MapReduce. I don't know how that relates to your problem.
To summarize , I want to implement wordcount separately on the output records from first mapreduce. Can someone suggest me a good way to do it
Read the output file generated by the first MR job and do it.
or how I can chain a second map reduce job to create the above output or format it better ?
You can definitely chain jobs together in this fashion by writing multiple driver methods, one for each job. See this for more details and this for an example.
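As a rough illustration of that kind of chaining, a driver could run the two jobs back to back, with the second job reading the first job's output directory (everything here, including class names and paths, is just a sketch, not your actual job configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path xmlInput = new Path(args[0]);
        Path titleTextOutput = new Path(args[1]); // (title, text) pairs from job 1
        Path wordCountOutput = new Path(args[2]); // per-title word counts from job 2

        // Job 1: extract (title, text) pairs, e.g. with the XmlParser11 classes above.
        Job first = new Job(conf, "extract title/text");
        first.setJarByClass(ChainDriver.class);
        // ... set mapper/reducer/input format as in XmlParser11 ...
        FileInputFormat.addInputPath(first, xmlInput);
        FileOutputFormat.setOutputPath(first, titleTextOutput);
        if (!first.waitForCompletion(true)) {
            System.exit(1); // stop the chain if job 1 fails
        }

        // Job 2: word count per title, reading job 1's output as its input.
        Job second = new Job(conf, "per-title word count");
        second.setJarByClass(ChainDriver.class);
        // ... set mapper/reducer for the per-title word count ...
        FileInputFormat.addInputPath(second, titleTextOutput);
        FileOutputFormat.setOutputPath(second, wordCountOutput);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}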
I want to do the wordcount for each of text fields separately.
What do you mean by separately? In the traditional wordcount program the count of each word is calculated independently of the others.
Is there any way I can do the wordcount in the same mapper.
I hope you have understood the wordcount program properly. In the traditional wordcount program you read the input file one line at a time, split the line into words, and then emit each word as the key with 1 as the value. All of this happens inside the Mapper, which is essentially the same Mapper. The total count for each word is then determined in the Reducer part of your job. If you wish to emit the words with their total counts from the mapper itself, you have to read the whole file in the Mapper and do the counting there. For that you need to set isSplitable in your InputFormat to false, so that your input file is read as a whole and goes to just one Mapper.
When you emit something from the Mapper, and it is not a map-only job, the output of your Mapper automatically goes to the Reducer. Do you need something else?
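If you take the second-job route, the counting step itself is simple once job 1 has written its (title, text) pairs as key<TAB>value lines (the default TextOutputFormat layout). A sketch of what that second job's mapper and reducer could look like (class names are illustrative):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PerTitleWordCount {

    // Emits (title, word) once per word occurrence in the text field.
    public static class TitleWordMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2); // title <TAB> text
            if (parts.length < 2) {
                return; // skip malformed lines
            }
            Text title = new Text(parts[0]);
            for (String word : parts[1].split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(title, new Text(word));
                }
            }
        }
    }

    // Counts the words for each title and writes one line per title:
    // title <TAB> word-1 count-1 word-2 count-2 ...
    public static class TitleWordReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text title, Iterable<Text> words, Context context)
                throws IOException, InterruptedException {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (Text word : words) {
                String w = word.toString();
                Integer c = counts.get(w);
                counts.put(w, c == null ? 1 : c + 1);
            }
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                sb.append(e.getKey()).append(' ').append(e.getValue()).append(' ');
            }
            context.write(title, new Text(sb.toString().trim()));
        }
    }
}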
I suggest you go with regular expressions and perform the mapping and grouping that way.
The Hadoop examples jar provides a Grep class; with it you can map your HDFS data using a regular expression and then group the mapped data.

Implementation for CombineFileInputFormat Hadoop 0.20.205

Can someone please point out where I could find an implementation of CombineFileInputFormat (org. for Hadoop 0.20.205? This is to create large splits out of very small log files (text, one record per line) using EMR.
It is surprising that Hadoop does not ship a default implementation of this class for exactly this purpose, and judging from a Google search I'm not the only one confused by this. I need to compile the class and bundle it in a jar for hadoop-streaming; with my limited knowledge of Java this is quite a challenge.
Edit:
I already tried the yetitrails example with the necessary imports, but I get a compiler error for the next method.
Here is an implementation I have for you:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

@SuppressWarnings("deprecation")
public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @SuppressWarnings({ "unchecked", "rawtypes" })
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter) throws IOException {
        return new CombineFileRecordReader(conf, (CombineFileSplit) split, reporter, (Class) myCombineFileRecordReader.class);
    }

    public static class myCombineFileRecordReader implements RecordReader<LongWritable, Text> {

        private final LineRecordReader linerecord;

        public myCombineFileRecordReader(CombineFileSplit split, Configuration conf, Reporter reporter, Integer index) throws IOException {
            FileSplit filesplit = new FileSplit(split.getPath(index), split.getOffset(index), split.getLength(index), split.getLocations());
            linerecord = new LineRecordReader(conf, filesplit);
        }

        @Override
        public void close() throws IOException {
            linerecord.close();
        }

        @Override
        public LongWritable createKey() {
            return linerecord.createKey();
        }

        @Override
        public Text createValue() {
            return linerecord.createValue();
        }

        @Override
        public long getPos() throws IOException {
            return linerecord.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return linerecord.getProgress();
        }

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            return linerecord.next(key, value);
        }
    }
}
In your job, first set the parameter mapred.max.split.size according to the size you would like the input files to be combined into. Do something like the following in your run():
...
if (argument != null) {
    conf.set("mapred.max.split.size", argument);
} else {
    conf.set("mapred.max.split.size", "134217728"); // 128 MB
}
...
conf.setInputFormat(CombinedInputFormat.class);
...
