How to display current accumulator value updated in DStream? - java

I am running a Java jar. The accumulator adds up the stream values. The problem is that I want to display the value in my UI every time it increments, or at a specific periodic interval.
But since the accumulator's value can only be obtained from the driver program, I am not able to access this value until the process finishes its execution. Any idea how I can access this value periodically?
My code is given below:
package com.spark;
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.Accumulator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;
public class KafkaSpark {
/**
* @param args
*/
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("Simple Application");
conf.setMaster("local");
JavaStreamingContext jssc = new JavaStreamingContext(conf,
new Duration(5000));
final Accumulator<Integer> accum = jssc.sparkContext().accumulator(0);
Map<String, Integer> topicMap = new HashMap<String, Integer>();
topicMap.put("test", 1);
JavaPairDStream<String, String> lines = KafkaUtils.createStream(jssc,
"localhost:2181", "group1", topicMap);
JavaDStream<Integer> map = lines
.map(new Function<Tuple2<String, String>, Integer>() {
public Integer call(Tuple2<String, String> v1)
throws Exception {
if (v1._2.contains("the")) {
accum.add(1);
return 1;
}
return 0;
}
});
map.print();
jssc.start();
jssc.awaitTermination();
System.out.println("*************" + accum.value());
System.out.println("done");
}
}
I am streaming data using Kafka.

In Spark, the actual code starts to execute only when jssc.start() is called. At that point control is with Spark: it runs the streaming loop, and your System.out.println calls after awaitTermination() will be reached only once, not on every batch.
For output operations, check the documentation. You can use:
print()
foreachRDD()
saveAsObjectFiles() / saveAsTextFiles() / saveAsHadoopFiles()
Hope this helps
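For example, here is a rough sketch against the code above (same Spark 1.x API; it assumes one extra import, org.apache.spark.api.java.JavaRDD, and must be registered before jssc.start()). foreachRDD runs its function on the driver once per batch, so the accumulator can be read there and pushed to a UI after the earlier output operations (map.print() here) have updated it:
map.foreachRDD(new Function<JavaRDD<Integer>, Void>() {
    public Void call(JavaRDD<Integer> rdd) throws Exception {
        // rdd itself is not used; this is only a per-batch hook on the driver
        System.out.println("accumulator after this batch: " + accum.value());
        return null;
    }
});
On Spark 1.6+ the VoidFunction<JavaRDD<Integer>> overload of foreachRDD can be used instead of Function<..., Void>.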

jssc.start();
// Instead of blocking in awaitTermination(), poll the accumulator on the driver:
while (true) {
    System.out.println("current: " + accum.value());
    Thread.sleep(1000); // main() must declare or handle InterruptedException
}

Related

Kafka Streams Twitter Wordcount - Count Value not Long after Serialization

I am running a Kafka Cluster Docker Compose on an AWS EC2 instance.
I want to receive all the tweets of a specific keyword and push them to Kafka. This works fine.
But I also want to count the most used words of those tweets.
This is the WordCount code:
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.StreamsBuilder;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import java.util.concurrent.CountDownLatch;
import static org.apache.kafka.streams.StreamsConfig.APPLICATION_ID_CONFIG;
import static org.apache.kafka.streams.StreamsConfig.BOOTSTRAP_SERVERS_CONFIG;
import static org.apache.kafka.streams.StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG;
import static org.apache.kafka.streams.StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG;
public class WordCount {
public static void main(String[] args) {
final StreamsBuilder builder = new StreamsBuilder();
final KStream<String, String> textLines = builder
.stream("test-topic");
textLines
.flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
.groupBy((key, value) -> value)
.count(Materialized.as("WordCount"))
.toStream()
.to("test-output", Produced.with(Serdes.String(), Serdes.Long()));
final Topology topology = builder.build();
Properties props = new Properties();
props.put(APPLICATION_ID_CONFIG, "streams-word-count");
props.put(BOOTSTRAP_SERVERS_CONFIG, "ec2-ip:9092");
props.put(DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
final KafkaStreams streams = new KafkaStreams(topology, props);
final CountDownLatch latch = new CountDownLatch(1);
Runtime.getRuntime().addShutdownHook(
new Thread("streams-shutdown-hook") {
@Override
public void run() {
streams.close();
latch.countDown();
}
});
try {
streams.start();
latch.await();
} catch (Throwable e) {
System.exit(1);
}
System.exit(0);
}
}
When I check the output topic in the Control Center, it shows the Key/Value pairs. It looks like the splitting of the tweets into single words is working, but the count value isn't rendered as a Long, although that is what is specified in the code.
When I use the kafka-console-consumer to consume from this topic, it says:
"Size of data received by LongDeserializer is not 8"
The Control Center UI and the console consumer can only render UTF-8 data by default.
You'll need to explicitly pass LongDeserializer to the console consumer as the value deserializer.
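For example (broker address and topic name taken from the question):
kafka-console-consumer --bootstrap-server ec2-ip:9092 --topic test-output --from-beginning \
  --property print.key=true \
  --value-deserializer org.apache.kafka.common.serialization.LongDeserializer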
Try a KTable instead (note that .to() returns void, so the count result must be assigned before the final write; this also needs imports for Consumed, KTable and Serde):
Serde<String> stringSerde = Serdes.String();
Serde<Long> longSerde = Serdes.Long();
KStream<String, String> textLines = builder.stream("test-topic", Consumed.with(stringSerde, stringSerde));
KTable<String, Long> wordCounts = textLines
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
    .groupBy((key, value) -> value)
    .count();
wordCounts.toStream().to("test-output", Produced.with(stringSerde, longSerde));

Kafka Streams - Fields in the Custom object changing to null while doing Aggregation

I've written a simple Kafka Streams processor to
Read messages as a stream from a topic with <K, V> as <String, String>
Convert the value in the message from String to a custom object, <String, Object>, using the mapValues() method
Use a window function to aggregate the statistics of the objects for a particular time interval
Sample Message
{"coiRequestGuid":"xxxx","accountId":1122132,"companyName":"xxxx","existingPolicyCoverageLimit":1000000,"isChangeRequested":true,"newlyRequestedPolicyCoverageLimit":200000,"isNewRecipient":false,"newRecipientGuid":null,"existingRecipientId":11111,"recipientName":"xxxx","recipientEmail":"xxxxx"}
Here is my code
import com.da.app.data.model.PolicyChangeRequest;
import com.da.app.data.model.PolicyChangeRequestStats;
import com.da.app.data.serde.JsonDeserializer;
import com.da.app.data.serde.JsonSerializer;
import com.da.app.data.serde.WrapperSerde;
import com.da.app.system.util.ConfigUtil;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.*;
import org.apache.kafka.streams.state.WindowStore;
import org.apache.log4j.Logger;
import org.json.JSONObject;
import java.time.Duration;
import java.util.Properties;
public class PolicyChangeReqStreamProcessor {
private static final Logger logger = Logger.getLogger(PolicyChangeReqStreamProcessor.class);
private static final String TOPIC_NAME = "stream-window-play1";
private static ObjectMapper mapper = new ObjectMapper();
private static Properties properties = ConfigUtil.loadProperty();
public static void main(String[] args) {
logger.info("Policy Limit Change Stats Generator");
Properties streamProperties = getStreamProperties();
StreamsBuilder streamsBuilder = new StreamsBuilder();
KStream<String, String> source = streamsBuilder.stream(TOPIC_NAME,
Consumed.with(Serdes.String(), Serdes.String()));
source
.filter((key, value) -> isValidEvent(value))
//Converting the request json to PolicyChangeRequest object
.mapValues(PolicyChangeReqStreamProcessor::convertPolicyChangeReqJsonToObj)
//Mapping all events to a single key in order to group all the events
.map((key, value) -> new KeyValue<>("key", value))
// Grouping by key
.groupByKey(Grouped.with(Serdes.String(), new PolicyChangeRequestSerde()))
//Creating a Tumbling window of 5 secs (for Testing)
.windowedBy(TimeWindows.of(Duration.ofSeconds(5)).advanceBy(Duration.ofSeconds(5)))
// Aggregating the PolicyChangeRequest events to a
// PolicyChangeRequestStats object
.<PolicyChangeRequestStats>aggregate(PolicyChangeRequestStats::new,
(k, v, policyStats) -> policyStats.add(v),
Materialized.<String, PolicyChangeRequestStats, WindowStore<Bytes, byte[]>>as
("policy-change-aggregates")
.withValueSerde(new PolicyChangeRequestStatsSerde()))
//Converting KTable to KStream
.toStream()
.foreach((key, value) -> logger.info(key.window().startTime() + "----" + key.window().endTime() + " :: " + value));
KafkaStreams kafkaStreams = new KafkaStreams(streamsBuilder.build(), streamProperties);
logger.info("Started the stream");
kafkaStreams.start();
Runtime.getRuntime().addShutdownHook(new Thread(kafkaStreams::close));
}
private static PolicyChangeRequest convertPolicyChangeReqJsonToObj(String policyChangeReq) {
JSONObject policyChangeReqJson = new JSONObject(policyChangeReq);
PolicyChangeRequest policyChangeRequest = new PolicyChangeRequest(policyChangeReqJson);
// return mapper.readValue(value, PolicyChangeRequest.class);
return policyChangeRequest;
}
private static boolean isValidEvent(String value) {
//TODO: Message Validation
return true;
}
private static Properties getStreamProperties() {
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "policy-change-stats-gen");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, properties.getProperty("kafka.bootstrap.servers"));
props.put(StreamsConfig.CLIENT_ID_CONFIG, "stream-window-play1");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
return props;
}
public static final class PolicyChangeRequestStatsSerde extends WrapperSerde<PolicyChangeRequestStats> {
PolicyChangeRequestStatsSerde() {
super(new JsonSerializer<>(), new JsonDeserializer<>(PolicyChangeRequestStats.class));
}
}
public static final class PolicyChangeRequestSerde extends WrapperSerde<PolicyChangeRequest> {
PolicyChangeRequestSerde() {
super(new JsonSerializer<>(), new JsonDeserializer<>(PolicyChangeRequest.class));
}
}
}
isValidEvent - returns true.
.mapValues(PolicyChangeReqStreamProcessor::convertPolicyChangeReqJsonToObj) - this converts the incoming JSON string to a PolicyChangeRequest object.
Up to the operation map((key, value) -> new KeyValue<>("key", value)), the custom PolicyChangeRequest object matches the incoming message (I've tested this by printing the stream there).
But after going into the groupBy and aggregate operations, the custom object changes to
PolicyChangeRequest{coiRequestGuid='null', accountId='null', companyName='null', existingPolicyCoverageLimit=null, isChangeRequested=null, newlyRequestedPolicyCoverageLimit=null, isNewRecipient=null, newRecipientGuid='null', existingRecipientId='null', recipientName='null', recipientEmail='null'}
I found this value by putting a log statement inside the policyStats.add(v) method I call inside the aggregate method.
The add method is in the PolicyChangeRequestStats class
public PolicyChangeRequestStats add(PolicyChangeRequest policyChangeRequest) {
System.out.println("Incoming req: " + policyChangeRequest);
//Incrementing the Policy limit change request count
this.policyLimitChangeRequests++;
//Adding the Increased policy limit coverage to the existing increasedPolicyLimitCoverage
this.increasedPolicyLimitCoverage +=
(policyChangeRequest.getNewlyRequestedPolicyCoverageLimit() -
policyChangeRequest.getExistingPolicyCoverageLimit());
return this;
}
I'm getting a NullPointerException on the line where I compute policyChangeRequest.getNewlyRequestedPolicyCoverageLimit() - policyChangeRequest.getExistingPolicyCoverageLimit(), because those values are null in the PolicyChangeRequest object.
I've provided valid Serde classes for the key and value in the groupBy: .groupByKey(Grouped.with(Serdes.String(), new PolicyChangeRequestSerde())).
For serialization and deserialization I used Gson.
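For reference, a minimal sketch of what such Gson-backed serde classes typically look like (this is an assumption; the actual com.da.app.data.serde classes are not shown, and Gson matches JSON keys to field names exactly, silently leaving unmatched fields null):
import com.google.gson.Gson;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serializer;
import java.nio.charset.StandardCharsets;
class JsonSerializer<T> implements Serializer<T> {
    private final Gson gson = new Gson();
    @Override
    public byte[] serialize(String topic, T data) {
        return data == null ? null : gson.toJson(data).getBytes(StandardCharsets.UTF_8);
    }
}
class JsonDeserializer<T> implements Deserializer<T> {
    private final Gson gson = new Gson();
    private final Class<T> type;
    JsonDeserializer(Class<T> type) { this.type = type; }
    @Override
    public T deserialize(String topic, byte[] bytes) {
        return bytes == null ? null : gson.fromJson(new String(bytes, StandardCharsets.UTF_8), type);
    }
}
class WrapperSerde<T> implements Serde<T> {
    private final Serializer<T> serializer;
    private final Deserializer<T> deserializer;
    WrapperSerde(Serializer<T> serializer, Deserializer<T> deserializer) {
        this.serializer = serializer;
        this.deserializer = deserializer;
    }
    @Override public Serializer<T> serializer() { return serializer; }
    @Override public Deserializer<T> deserializer() { return deserializer; }
}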
But I'm not able to get the PolicyChangeRequest object back as it was before it was sent to the grouping operation.
I'm new to Kafka Streams and I'm not sure whether I missed anything or whether my approach is correct.
Can anyone guide me here?

How to write data from Kafka topic to file using KStreams?

I am trying to create a Kafka Streams application in Eclipse using Java. Right now I am referring to the word count program available on the internet for KStreams and modifying it.
What I want is for the data that I read from the input topic to be written to a file instead of to another output topic.
But when I try to print the KStream/KTable to a local file, I get the following entry in the output file:
org.apache.kafka.streams.kstream.internals.KStreamImpl@4c203ea1
How do I redirect the output from the KStream to a file?
Below is the code:
package KStreamDemo.kafkatest;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.KeyValueMapper;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.ValueMapper;
import java.util.Arrays;
import java.util.Locale;
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
public class TemperatureDemo {
public static void main(String[] args) throws Exception {
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-wordcount");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "34.73.184.104:9092");
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
System.out.println("#1###################################################################################################################################################################################");
// setting offset reset to earliest so that we can re-run the demo code with the same pre-loaded data
// Note: To re-run the demo, you need to use the offset reset tool:
// https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Application+Reset+Tool
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
StreamsBuilder builder = new StreamsBuilder();
System.out.println("#2###################################################################################################################################################################################");
KStream<String, String> source = builder.stream("iot-temperature");
System.out.println("#5###################################################################################################################################################################################");
KTable<String, Long> counts = source
.flatMapValues(new ValueMapper<String, Iterable<String>>() {
@Override
public Iterable<String> apply(String value) {
return Arrays.asList(value.toLowerCase(Locale.getDefault()).split(" "));
}
})
.groupBy(new KeyValueMapper<String, String, String>() {
@Override
public String apply(String key, String value) {
return value;
}
})
.count();
System.out.println("#3###################################################################################################################################################################################");
System.out.println("OUTPUT:"+ counts);
System.out.println("#4###################################################################################################################################################################################");
// need to override value serde to Long type
counts.toStream().to("iot-temperature-max", Produced.with(Serdes.String(), Serdes.Long()));
final KafkaStreams streams = new KafkaStreams(builder.build(), props);
final CountDownLatch latch = new CountDownLatch(1);
// attach shutdown handler to catch control-c
Runtime.getRuntime().addShutdownHook(new Thread("streams-wordcount-shutdown-hook") {
@Override
public void run() {
streams.close();
latch.countDown();
}
});
try {
streams.start();
latch.await();
} catch (Throwable e) {
System.exit(1);
}
System.exit(0);
}
}
This is not correct:
System.out.println("OUTPUT:"+ counts);
That only prints the object's toString(). You would need to do counts.toStream().foreach(...) and then write the messages out to a file.
See: Print Kafka Stream Input out to console? (just update it to write to a file instead)
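A rough sketch of that foreach approach (the file path here is just an example; add this before builder.build() is called):
counts.toStream().foreach((word, count) -> {
    // appends one line per updated count; opening the file per record is simple but slow
    try (java.io.FileWriter fw = new java.io.FileWriter("/tmp/word-counts.txt", true)) {
        fw.write(word + "," + count + System.lineSeparator());
    } catch (java.io.IOException e) {
        throw new RuntimeException(e);
    }
});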
However, it is probably better to write the stream out to a topic, and then use Kafka Connect to write it out to a file. This is a more industry-standard pattern: Kafka Streams is intended to move data between topics within Kafka, not to integrate with external systems (or filesystems).
Edit connect-file-sink.properties with the topic and output file you want, then run it with the standalone worker:
bin/connect-standalone config/connect-standalone.properties config/connect-file-sink.properties
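The connector properties file shipped with Kafka looks roughly like this (names and paths below are the stock placeholders; change file= to your output path and topics= to the topic your Streams application writes to):
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=test.sink.txt
topics=connect-test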

Spark - Extract line after String match and save it in ArrayList

I am new to Spark and am trying to extract lines which contain "Subject:" and save them in an ArrayList. I am not facing any error, but the ArrayList is empty. Can you please guide me where I am going wrong, or what the best way to do this is?
import java.util.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;
public final class extractSubject {
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf().setMaster("local[1]").setAppName("JavaBookExample");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
JavaRDD<String> sample = sc.textFile("/Users/Desktop/sample.txt");
final ArrayList<String> list = new ArrayList<>();
sample.foreach(new VoidFunction<String>(){
public void call(String line) {
if (line.contains("Subject:")) {
System.out.println(line);
list.add(line);
}
}}
);
System.out.println(list);
sc.stop();
}
}
Please keep in mind that Spark applications run distributed and in parallel. Therefore you cannot modify variables outside of functions that are executed by Spark.
Instead you need to return a result from these functions. In your case you need flatMap (instead of foreach, which has no result), which concatenates the collections returned by your function.
If a line matches, a list containing the matching line is returned; otherwise you return an empty list.
To print the data in the main function, you first have to gather the possibly distributed data on your master node by calling collect().
Here an example:
import java.util.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
public final class extractSubject {
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf().setMaster("local[1]").setAppName("JavaBookExample");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
//JavaRDD<String> sample = sc.textFile("/Users/Desktop/sample.txt");
JavaRDD<String> sample = sc.parallelize(Arrays.asList("Subject: first",
"nothing here",
"Subject: second",
"dummy"));
JavaRDD<String> subjectLinesRdd = sample.flatMap(new FlatMapFunction<String, String>() {
public Iterable<String> call(String line) {
if (line.contains("Subject:")) {
return Collections.singletonList(line); // line matches → return list with the line as its only element
} else {
return Collections.emptyList(); // ignore line → return empty list
}
}
});
List<String> subjectLines = subjectLinesRdd.collect(); // collect values from Spark workers
System.out.println(subjectLines); // → "[Subject: first, Subject: second]"
sc.stop();
}
}
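A side note if you are on Spark 2.x: there, FlatMapFunction.call returns an Iterator instead of an Iterable, so with Java 8 lambdas the same logic would look roughly like this (a sketch, assuming the same sample RDD as above):
JavaRDD<String> subjectLinesRdd2x = sample.flatMap(line ->
        line.contains("Subject:")
                ? Collections.singletonList(line).iterator()
                : Collections.<String>emptyIterator());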

MapReduce - Reduce is not called

I've been trying to run this project, which I found on the internet and altered for my purposes.
The map function is called and works properly; I checked the results in the console. But reduce is not getting called.
The first two digits are the key and the rest is the value.
I've checked the match between the map output and reduce input key/value pairs; I've changed them many times and tried different things, but couldn't find a solution.
Since I'm a beginner in this topic, there is probably a small mistake. I wrote another project and hit the same problem again: "reduce is not called".
I also tried to change the output value class of the reducer to IntWritable or Text instead of MedianStdDevTuple and configured the job accordingly, but nothing changed.
I don't only need the solution, I want to know the reason as well. Thanks.
Here is the code:
package usercommend;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import org.apache.commons.logging.Log;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.htrace.commons.logging.LogFactory;
import usercommend.starter.map;
public class starter extends Configured implements Tool {
public static void main (String[] args) throws Exception{
int res =ToolRunner.run(new starter(), args);
System.exit(res);
}
@Override
public int run(String[] args) throws Exception {
Job job=Job.getInstance(getConf(),"starter");
job.setJarByClass(this.getClass());
job.setMapperClass(map.class);
job.setReducerClass(reduces.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(MedianStdDevTuple.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
public static class map extends Mapper<LongWritable, Text,IntWritable, IntWritable> {
private IntWritable outHour = new IntWritable();
private IntWritable outCommentLength = new IntWritable();
private final static SimpleDateFormat frmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
@SuppressWarnings("deprecation")
@Override
public void map(LongWritable key , Text value,Context context) throws IOException, InterruptedException
{
//System.err.println(value.toString()+"vv");
Map<String, String> parsed = transforXmlToMap1(value.toString());
//System.err.println("1");
String strDate = parsed.get("CreationDate");
//System.err.println(strDate);
String text = parsed.get("Text");
//System.err.println(text);
Date creationDate=new Date();
try {
// System.err.println("basla");
creationDate = frmt.parse(strDate);
outHour.set(creationDate.getHours());
outCommentLength.set(text.length());
System.err.println(outHour+""+outCommentLength);
context.write(outHour, outCommentLength);
} catch (ParseException e) {
// TODO Auto-generated catch block
System.err.println("catch");
e.printStackTrace();
return;
}
//context.write(new IntWritable(2), new IntWritable(12));
}
public static Map<String,String> transforXmlToMap1(String xml) {
Map<String, String> map = new HashMap<String, String>();
try {
String[] tokens = xml.trim().substring(5, xml.trim().length()-3).split("\"");
for(int i = 0; i < tokens.length-1 ; i+=2) {
String key = tokens[i].trim();
String val = tokens[i+1];
map.put(key.substring(0, key.length()-1),val);
//System.err.println(val.toString());
}
} catch (StringIndexOutOfBoundsException e) {
System.err.println(xml);
}
return map;
}
}
public static class reduces extends Reducer<IntWritable, IntWritable, IntWritable, MedianStdDevTuple> {
private MedianStdDevTuple result = new MedianStdDevTuple();
private ArrayList<Float> commentLengths = new ArrayList<Float>();
Log log=(Log) LogFactory.getLog(this.getClass());
@Override
public void reduce(IntWritable key, Iterable<IntWritable> values,Context context) throws IOException, InterruptedException{
System.out.println("1");
log.info("aa");
float sum = 0;
float count = 0;
commentLengths.clear();
result.setStdDev(0);
for(IntWritable val : values) {
commentLengths.add((float)val.get());
sum+=val.get();
++count;
}
Collections.sort(commentLengths);
if(count % 2 ==0) {
result.setMedian((commentLengths.get((int)count / 2 -1)+
commentLengths.get((int) count / 2)) / 2.0f);
} else {
result.setMedian(commentLengths.get((int)count / 2));
}
double avg = sum/commentLengths.size();
double totalSquare = 0;
for(int i =0 ;i<commentLengths.size();i++) {
double diff = commentLengths.get(i)-avg;
totalSquare += (diff*diff);
}
double stdSapma= Math.sqrt(totalSquare/(commentLengths.size()));
result.setStdDev(stdSapma);
context.write(key, result);
}
}
}
sample input
<row Id="2" PostId="7" Score="0" Text="I see what you mean, but I've had Linux systems set up so that if the mouse stayed on a window for a certain time period (greater than zero), then that window became active. That would be one solution. Another would be to simply let clicks pass to whatever control they are over, whether it is in the currently active window or not. Is that doable?" CreationDate="2010-08-17T19:38:20.410" UserId="115" />
<row Id="3" PostId="13" Score="1" Text="I am using Iwork and OpenOffice right now But I need some features that just MS has it." CreationDate="2010-08-17T19:42:04.487" UserId="135" />
<row Id="4" PostId="17" Score="0" Text="I've been using that on my MacBook Pro since I got it, with no issues. Last week I got an iMac and immediately installed StartSound.PrefPane but it doesn't work -- any ideas why? The settings on the two machines are identical (except the iMac has v1.1b3 instead of v1.1b2), but one is silent at startup and the other isn't...." CreationDate="2010-08-17T19:42:15.097" UserId="115" />
<row Id="5" PostId="6" Score="0" Text="+agreed. I would add that I think you can choose to not clone everything so it takes less time to make a bootable volume" CreationDate="2010-08-17T19:44:00.270" UserId="2" />
<row Id="6" PostId="22" Score="2" Text="Applications are removed from memory by the OS at it's discretion. Just because they are in the 'task manager' does not imply they are running and in memory. I have confirmed this with my own apps.
After a reboot, these applications are not reloaded until launched by a user." CreationDate="2010-08-17T19:46:01.950" UserId="589" />
<row Id="7" PostId="7" Score="0" Text="Honestly, I don't know. It's definitely interesting though. I'm currently scouring Google, since it would save on input clicks. I'm just concerned that any solution might get a little "hack-y" and not behave consistently in all UI elements or applications. The last thing I'd want is to not know if I'm focusing a window or pressing a button :(" CreationDate="2010-08-17T19:50:00.723" UserId="421" />
<row Id="8" PostId="21" Score="3" Text="Could you expand on the features for those not familiar with ShakesPeer?" CreationDate="2010-08-17T19:51:11.953" UserId="581" />
<row Id="9" PostId="23" Score="1" Text="Apple's vernacular is Safe Sleep." CreationDate="2010-08-17T19:51:35.557" UserId="171" />
You took this code? I'm guessing the problem is that you did not set the correct inputs and outputs for the Job.
Here is what you are trying to do based on your class definitions.
Map Input: (Object, Text)
Map Output: (IntWritable, IntWritable)
Reduce Input: (IntWritable, IntWritable)
Reduce Output: (IntWritable, MedianStdDevTuple)
But, based on your Job configuration
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(MedianStdDevTuple.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
It thinks you want to do this
Map Input: (Object, Text) -- I think it's actually LongWritable instead of Object, though, for file-split locations
Map Output: (IntWritable,MedianStdDevTuple)
Reduce Input: (IntWritable, IntWritable)
Reduce Output: (Text,IntWritable)
Notice how those are different? Your reducer is expecting to read IntWritable values, not MedianStdDevTuple, and the output classes are also incorrect, therefore it doesn't run.
To fix, change your job configuration like so
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(MedianStdDevTuple.class);
Edit: Got it to run fine and the only thing I really changed outside of the code in the link above was the mapper class with this method.
public static Map<String, String> transforXmlToMap1(String xml) {
Map<String, String> map = new HashMap<String, String>();
try {
String[] tokens = xml.trim().substring(5, xml.trim().length() - 3)
.split("\"");
for (int i = 0; i < tokens.length - 1; i += 2) {
String key = tokens[i].replaceAll("[= ]", "");
String val = tokens[i + 1];
map.put(key, val);
// System.err.println(val.toString());
}
} catch (StringIndexOutOfBoundsException e) {
System.err.println(xml);
}
return map;
}
