Send multiple arguments to the reducer - MapReduce - Java

I've written code that does something similar to SQL GROUP BY.
The dataset I'm using looks like this:
250788681419,20090906,200937,200909,619,SUNDAY,WEEKEND,ON-NET,MORNING,OUTGOING,VOICE,25078,PAY_AS_YOU_GO_PER_SECOND_PSB,SUCCESSFUL-RELEASEDBYSERVICE,17,0,1,21.25,635-10-112-30455
public class MyMap extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] attribute = line.split(",");
        double rs = Double.parseDouble(attribute[17]);
        String comb = attribute[5].concat(attribute[8].concat(attribute[10]));
        context.write(new Text(comb), new DoubleWritable(rs));
    }
}
public class MyReduce extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        for (DoubleWritable val : values) {
            sum += val.get();
        }
        context.write(key, new DoubleWritable(sum));
    }
}
The mapper sends the 17th field as its value to the reducer, which sums it. Now I also want to sum the 14th field. How do I send it to the reducer as well?

If your data types are the same, then creating an ArrayWritable subclass should work for this. The class should resemble:
public class DblArrayWritable extends ArrayWritable {
    public DblArrayWritable() {
        super(DoubleWritable.class);
    }
}
Your mapper class then looks like:
public class MyMap extends Mapper<LongWritable, Text, Text, DblArrayWritable> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] attribute = line.split(",");
        DoubleWritable[] values = new DoubleWritable[2];
        values[0] = new DoubleWritable(Double.parseDouble(attribute[14]));
        values[1] = new DoubleWritable(Double.parseDouble(attribute[17]));
        String comb = attribute[5].concat(attribute[8].concat(attribute[10]));
        DblArrayWritable arr = new DblArrayWritable();
        arr.set(values);
        context.write(new Text(comb), arr);
    }
}
In your reducer you should now be able to iterate over the values of the DblArrayWritable.
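For example, a reducer along those lines could look like the sketch below (my illustration, assuming values[0] carries field 14 and values[1] carries field 17 as in the mapper above):
public class MyReduce extends Reducer<Text, DblArrayWritable, Text, DoubleWritable> {
    protected void reduce(Text key, Iterable<DblArrayWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum14 = 0;
        double sum17 = 0;
        for (DblArrayWritable arr : values) {
            // ArrayWritable.get() returns Writable[], so cast the elements back
            Writable[] pair = arr.get();
            sum14 += ((DoubleWritable) pair[0]).get();
            sum17 += ((DoubleWritable) pair[1]).get();
        }
        // emit the two sums under distinct keys so both totals show up in the output
        context.write(new Text(key.toString() + "_14"), new DoubleWritable(sum14));
        context.write(new Text(key.toString() + "_17"), new DoubleWritable(sum17));
    }
}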
Based on your sample data, however, it looks like they may be separate types. You may be able to implement an ObjectArrayWritable class that would do the trick, but I'm not certain of this and I can't see much to support it. If it works, the class would be:
public class ObjArrayWritable extends ArrayWritable {
    public ObjArrayWritable() {
        super(Object.class);
    }
}
You could handle this by simply concatenating the values and passing them as Text to the reducer which would then split them again.
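As a rough sketch of that approach (my own illustration, not code from the original answer; the class names are made up):
public class ConcatMap extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] attribute = value.toString().split(",");
        String comb = attribute[5] + attribute[8] + attribute[10];
        // pack both numeric fields into a single delimited Text value
        context.write(new Text(comb), new Text(attribute[14] + "," + attribute[17]));
    }
}

public class ConcatReduce extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double sum14 = 0;
        double sum17 = 0;
        for (Text v : values) {
            // split the packed value back into its two fields
            String[] parts = v.toString().split(",");
            sum14 += Double.parseDouble(parts[0]);
            sum17 += Double.parseDouble(parts[1]);
        }
        context.write(key, new Text(sum14 + "\t" + sum17));
    }
}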
Another option is to implement your own Writable class. Here's a sample of how that could work:
public static class PairWritable implements Writable {
    private Double myDouble;
    private String myString;

    // Hadoop serialization (Writable interface) methods
    @Override
    public void readFields(DataInput in) throws IOException {
        myDouble = in.readDouble();
        myString = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeDouble(myDouble);
        out.writeUTF(myString);
    }
    // End of implementation

    // Getter and setter methods for the myDouble and myString variables
    public void set(Double d, String s) {
        myDouble = d;
        myString = s;
    }

    public Double getDouble() {
        return myDouble;
    }

    public String getString() {
        return myString;
    }
}
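A mapper would then populate the writable with set() and emit it; a hypothetical usage sketch (here field 17 is carried as the double and field 14 as the string, purely for illustration):
PairWritable pair = new PairWritable();
pair.set(Double.parseDouble(attribute[17]), attribute[14]);
context.write(new Text(comb), pair);

// and in the driver, declare it as the map output value class:
job.setMapOutputValueClass(PairWritable.class);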

Related

Java: overriding a function where one of the parameters is constant in the child class

I have a service that reads and writes data based on key-value pairs. The service is generic.
I want to implement a similar class that extends the class described above, but in the subclass the key is constant.
The problem is that if I try to override the read and write functions, they still have the key parameter, even though it is constant.
How can I implement and override in this case? Is it possible, or only without inheritance?
My BaseService.java
class BaseService {
    private HashMap<String, String> storage = new HashMap<String, String>();

    void write(String key, String value) {
        storage.put(key, value);
    }

    String read(String key) {
        return storage.get(key);
    }
}
and ChildService.java
class ChildService extends BaseService {
    static final String KEY = "const-key";

    @Override
    void write(String value) {
        storage.put(KEY, value);
    }

    @Override
    String read() {
        return storage.get(KEY);
    }
}
It isn't possible to override this way since the signature is now different.
Try something like this:
public abstract class BaseService {
    HashMap<String, String> storage = new HashMap<String, String>();

    abstract String getKey();

    void write(String key, String value) {
        storage.put(key, value);
    }

    String read(String key) {
        return storage.get(key);
    }

    String read() {
        return storage.get(getKey());
    }

    void write(String value) {
        storage.put(getKey(), value);
    }
}

public class ChildService extends BaseService {
    static final String KEY = "const-key";

    @Override
    String getKey() {
        return KEY;
    }
}
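A short usage sketch of that design (illustrative only): the key-less overloads go through getKey(), while the inherited keyed overloads still work:
ChildService service = new ChildService();
service.write("hello");                        // stored under "const-key" via getKey()
System.out.println(service.read());            // prints "hello"
service.write("other-key", "world");           // keyed overload inherited from BaseService
System.out.println(service.read("other-key")); // prints "world"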

Why does a PairFunction that extends an abstract class lose its field values when it's called?

How come when NoParentFunction.call() is called, it retains the field values, whereas when HasParentFunction.call() is called, the field values are null/0?
public class Driver {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> textFile = sc.textFile("some.txt");
        textFile.mapToPair(new NoParentFunction(123, "blah")).collect();
        textFile.mapToPair(new HasParentFunction(123, "blah")).collect();
    }
}

public class NoParentFunction implements PairFunction<String, String, String> {
    private int myInt;
    private String myString;

    public NoParentFunction(int myInt, String myString) {
        this.myInt = myInt;
        this.myString = myString;
    }

    public Tuple2<String, String> call(String s) throws Exception {
        System.out.println("prints myInt correctly:" + myInt);
        return null;
    }
}

public class HasParentFunction extends AbstractFunction implements PairFunction<String, String, String> {
    public HasParentFunction(int myInt, String myString) {
        super(myInt, myString);
    }

    public Tuple2<String, String> call(String s) throws Exception {
        System.out.println("myInt: " + myInt + " is null");
        return null;
    }
}

public abstract class AbstractFunction {
    protected int myInt;
    protected String myString;

    public AbstractFunction(int myInt, String myString) {
        this.myInt = myInt;
        this.myString = myString;
    }

    abstract Tuple2<String, String> call(String s) throws Exception;
}
I can see why HasParentFunction loses its field values, due to AbstractFunction's no-args constructor being called. But how come this does not happen in the class that does not extend the abstract class?

Hadoop MapReduce custom Writable in a static context

I'm working on a university homework assignment and we have to use Hadoop MapReduce for it. I'm trying to create a new custom Writable, as I want to output key-value pairs like (key, (doc_name, 1)).
public class Detector {
    private static final Path TEMP_PATH = new Path("temp");
    private static final String LENGTH = "gramLength";
    private static final String THRESHOLD = "threshold";

    public class Custom implements Writable {
        private Text document;
        private IntWritable count;

        public Custom() {
            setDocument("");
            setCount(0);
        }

        public Custom(String document, int count) {
            setDocument(document);
            setCount(count);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            document.readFields(in);
            count.readFields(in);
        }

        @Override
        public void write(DataOutput out) throws IOException {
            document.write(out);
            count.write(out);
        }

        public int getCount() {
            return count.get();
        }

        public void setCount(int count) {
            this.count = new IntWritable(count);
        }

        public String getDocument() {
            return document.toString();
        }

        public void setDocument(String document) {
            this.document = new Text(document);
        }
    }

    public static class NGramMapper extends Mapper<Text, Text, Text, Text> {
        private int gramLength;
        private Pattern space_pattern = Pattern.compile("[ ]");
        private StringBuilder gramBuilder = new StringBuilder();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            gramLength = context.getConfiguration().getInt(LENGTH, 0);
        }

        public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
            String[] tokens = space_pattern.split(value.toString());
            for (int i = 0; i < tokens.length; i++) {
                gramBuilder.setLength(0);
                if (i + gramLength <= tokens.length) {
                    for (int j = i; j < i + gramLength; j++) {
                        gramBuilder.append(tokens[j]);
                        gramBuilder.append(" ");
                    }
                    context.write(new Text(gramBuilder.toString()), key);
                }
            }
        }
    }

    public static class OutputReducer extends Reducer<Text, Text, Text, Custom> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text val : values) {
                context.write(key, new Custom(val.toString(), 1));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        conf.setInt(LENGTH, Integer.parseInt(args[0]));
        conf.setInt(THRESHOLD, Integer.parseInt(args[1]));

        // Setup first MapReduce phase
        Job job1 = Job.getInstance(conf, "WordOrder-first");
        job1.setJarByClass(Detector.class);
        job1.setMapperClass(NGramMapper.class);
        job1.setReducerClass(OutputReducer.class);
        job1.setMapOutputKeyClass(Text.class);
        job1.setMapOutputValueClass(Text.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(Custom.class);
        job1.setInputFormatClass(WholeFileInputFormat.class);
        FileInputFormat.addInputPath(job1, new Path(args[2]));
        FileOutputFormat.setOutputPath(job1, new Path(args[3]));

        boolean status1 = job1.waitForCompletion(true);
        if (!status1) {
            System.exit(1);
        }
    }
}
When I compile the code to a class file I get this error:
Detector.java:147: error: non-static variable this cannot be referenced from a static context
context.write(key,new Custom(val.toString(),1));
I followed different tutorials about custom Writables and my solution is the same as the others. Any suggestions?
Static fields and methods are shared by all instances. They are for values that are specific to the class rather than to a specific instance, so stay away from them as much as possible.
To solve your problem, you either need to instantiate an instance (create an object) of your class so the runtime can reserve memory for the instance, or you change the member you are accessing so that it has static access (not recommended!).
The keyword this refers to something that is indeed an instance, not something static; static members should be referenced through the class name instead. You are using it in a static context, which is not allowed.
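In this code the complaint comes from new Custom(...) inside the static OutputReducer: Custom is a non-static inner class, so creating it needs an enclosing Detector instance (an implicit this). One way to make it compile, shown only as a sketch, is to declare the nested class static so it no longer needs that enclosing instance:
// only the class header changes; the fields, constructors and methods stay as above
public static class Custom implements Writable {
    private Text document;
    private IntWritable count;
    // ... rest of the class exactly as in the question ...
}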

Serialize a generic field from a Java object to JSON

I have a generic field in User.java. I want to use the value of T in the JSON output.
public class User<T> {
    public enum Gender { MALE, FEMALE };

    private T field;
    private Gender _gender;
    private boolean _isVerified;
    private byte[] _userImage;

    public T getField() { return field; }
    public boolean isVerified() { return _isVerified; }
    public Gender getGender() { return _gender; }
    public byte[] getUserImage() { return _userImage; }

    public void setField(T f) { field = f; }
    public void setVerified(boolean b) { _isVerified = b; }
    public void setGender(Gender g) { _gender = g; }
    public void setUserImage(byte[] b) { _userImage = b; }
}
and the mapper class is:
public class App {
    public static void main(String[] args) throws JsonParseException, JsonMappingException, IOException {
        ObjectMapper mapper = new ObjectMapper();

        Name n = new Name();
        n.setFirst("Harry");
        n.setLast("Potter");

        User<Name> user = new User<Name>();
        user.setField(n);
        user.setGender(Gender.MALE);
        user.setVerified(false);

        mapper.writeValue(new File("user1.json"), user);
    }
}
and the JSON output is:
{"field":{"first":"Harry","last":"Potter"},"gender":"MALE","verified":false,"userImage":null}
In the output, I want name to appear in place of field. How do I do that? Any help?
I think what you are asking for is not the default behavior: the JSON key is taken from the field/property name, not from the type of the variable. You should rename the field, or do some string processing on the output. Change
private T field;
to this:
private T name;
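If you would rather not rename the Java field itself, one common Jackson alternative (my addition, not part of the original answer) is to rename only the JSON property with an annotation:
import com.fasterxml.jackson.annotation.JsonProperty;

public class User<T> {
    // serialized as "name" while the Java identifier stays "field"
    @JsonProperty("name")
    private T field;
    // ... rest of the class unchanged ...
}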
You need a custom serializer to do that. That's a runtime data transformation and Jackson has no support for data transformation other than with a custom serializer (well, there's wrapping/unwrapping of value, but let's not go there). Also, you will need to know in advance every type of transformation you want to apply inside your serializer. The following works:
public class UserSerializer extends JsonSerializer<User<?>> {
    private static final String USER_IMAGE_FIELD = "userImage";
    private static final String VERIFIED_FIELD = "verified";
    private static final String FIELD_FIELD = "field";
    private static final String NAME_FIELD = "name";

    @Override
    public void serialize(User<?> value, JsonGenerator jgen, SerializerProvider provider)
            throws IOException, JsonProcessingException {
        jgen.writeStartObject();
        if (value.getField() instanceof Name) {
            jgen.writeFieldName(NAME_FIELD);
        } else {
            jgen.writeFieldName(FIELD_FIELD);
        }
        jgen.writeObject(value.getField());
        jgen.writeStringField("gender", value.getGender().name());
        jgen.writeBooleanField(VERIFIED_FIELD, value.isVerified());
        if (value.getUserImage() == null) {
            jgen.writeNullField(USER_IMAGE_FIELD);
        } else {
            jgen.writeBinaryField(USER_IMAGE_FIELD, value.getUserImage());
        }
        jgen.writeEndObject();
    }
}
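For that serializer to be picked up, it still has to be attached to the type or registered with the ObjectMapper; a minimal sketch of the annotation route (my addition, using standard Jackson API):
import com.fasterxml.jackson.databind.annotation.JsonSerialize;

// tells Jackson to use the custom serializer whenever a User is written
@JsonSerialize(using = UserSerializer.class)
public class User<T> {
    // ... class body as in the question ...
}
Alternatively, the serializer can be registered on the ObjectMapper through a SimpleModule.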

Enum value implementing Writable interface of Hadoop

Suppose I have an enumeration:
public enum SomeEnumType implements Writable {
    A(0), B(1);

    private int value;

    private SomeEnumType(int value) {
        this.value = value;
    }

    @Override
    public void write(final DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(this.value);
    }

    @Override
    public void readFields(final DataInput dataInput) throws IOException {
        this.value = dataInput.readInt();
    }
}
I want to pass an instance of it as part of some other class instance.
Equals would not work, because it does not consider the inner variable of the enum; plus, all enum instances are fixed at compile time and cannot be created elsewhere.
Does that mean I cannot send enums over the wire in Hadoop, or is there a solution?
My normal and preferred solution for enums in Hadoop is serializing the enums through their ordinal value.
public class EnumWritable implements Writable {

    static enum EnumName {
        ENUM_1, ENUM_2, ENUM_3
    }

    private int enumOrdinal;

    // never forget your default constructor in Hadoop Writables
    public EnumWritable() {
    }

    public EnumWritable(Enum<?> arbitraryEnum) {
        this.enumOrdinal = arbitraryEnum.ordinal();
    }

    public int getEnumOrdinal() {
        return enumOrdinal;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        enumOrdinal = in.readInt();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(enumOrdinal);
    }

    public static void main(String[] args) {
        // use it like this:
        EnumWritable enumWritable = new EnumWritable(EnumName.ENUM_1);
        // let Hadoop do the write and read stuff
        EnumName yourDeserializedEnum = EnumName.values()[enumWritable.getEnumOrdinal()];
    }
}
Obviously this has drawbacks: ordinals can change, so if you swap ENUM_2 with ENUM_3 and then read a previously serialized file, you will get back the wrong enum.
So if you know the enum class beforehand, you can write the name of your enum instead and read it back like this:
enumInstance = EnumName.valueOf(in.readUTF());
This uses slightly more space, but it is safer against changes in the order of your enum constants.
The full example would look like this:
public class EnumWritable implements Writable {

    static enum EnumName {
        ENUM_1, ENUM_2, ENUM_3
    }

    private EnumName enumInstance;

    // never forget your default constructor in Hadoop Writables
    public EnumWritable() {
    }

    public EnumWritable(EnumName e) {
        this.enumInstance = e;
    }

    public EnumName getEnum() {
        return enumInstance;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(enumInstance.name());
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        enumInstance = EnumName.valueOf(in.readUTF());
    }

    public static void main(String[] args) {
        // use it like this:
        EnumWritable enumWritable = new EnumWritable(EnumName.ENUM_1);
        // let Hadoop do the write and read stuff
        EnumName yourDeserializedEnum = enumWritable.getEnum();
    }
}
WritableUtils has convenience methods that make this easier:
WritableUtils.writeEnum(dataOutput, enumData);
enumData = WritableUtils.readEnum(dataInput, MyEnum.class);
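Put together, a Writable wrapper built on those helpers might look roughly like this (a sketch; MyEnum is a hypothetical enum, and WritableUtils serializes the enum by its name):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;

public class MyEnumWritable implements Writable {

    // hypothetical enum used only for illustration
    public enum MyEnum { FIRST, SECOND, THIRD }

    private MyEnum enumData;

    // default constructor required by Hadoop
    public MyEnumWritable() {
    }

    public MyEnumWritable(MyEnum enumData) {
        this.enumData = enumData;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        WritableUtils.writeEnum(out, enumData);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        enumData = WritableUtils.readEnum(in, MyEnum.class);
    }

    public MyEnum get() {
        return enumData;
    }
}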
I don't know anything about Hadoop, but based on the documentation of the interface, you could probably do it like this:
public void readFields(DataInput in) throws IOException {
    // do nothing
}

public static SomeEnumType read(DataInput in) throws IOException {
    int value = in.readInt();
    if (value == 0) {
        return SomeEnumType.A;
    } else if (value == 1) {
        return SomeEnumType.B;
    } else {
        throw new IOException("Invalid value " + value);
    }
}
