Finding the largest line and its number using Spark Java

I am facing a problem in which I have to find the largest line and its index. Here is my approach:
SparkConf conf = new SparkConf().setMaster("local").setAppName("basicavg");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> rdd = sc.textFile("/home/impadmin/ravi.txt");
JavaRDD<Tuple2<Integer,String>> words = rdd.map(new Function<String, Tuple2<Integer,String>>() {
@Override
public Tuple2<Integer,String> call(String v1) throws Exception {
// TODO Auto-generated method stub
return new Tuple2<Integer, String>(v1.split(" ").length, v1);
}
});
JavaPairRDD<Integer, String> linNoToWord = JavaPairRDD.fromJavaRDD(words).sortByKey(false);
System.out.println(linNoToWord.first()._1+" ********************* "+linNoToWord.first()._2);

In this way the tupleRDD will get sorted by key, and the first element in the new RDD after sorting has the highest length:
JavaRDD<String> rdd = sc.textFile("/home/impadmin/ravi.txt");
JavaRDD<Tuple2<Integer,String>> tupleRDD = rdd.map(new Function<String, Tuple2<Integer,String>>() {
@Override
public Tuple2<Integer,String> call(String v1) throws Exception {
// TODO Auto-generated method stub
return new Tuple2<Integer, String>(v1.split(" ").length, v1);
}
});
JavaRDD<Tuple2<Integer,String>> tupleRDD1= tupleRDD.sortBy(new Function<Tuple2<Integer,String>, Integer>() {
@Override
public Integer call(Tuple2<Integer, String> v1) throws Exception {
// TODO Auto-generated method stub
return v1._1;
}
}, false, 1);
System.out.println(tupleRDD1.first());

Since you are concerned with both the line number and the text, please try this.
First, create a serializable class Line:
public static class Line implements Serializable {
public Line(Long lineNo, String text) {
lineNo_ = lineNo;
text_ = text;
}
public Long lineNo_;
public String text_;
}
Then do the following operations:
SparkConf conf = new SparkConf().setMaster("local[1]").setAppName("basicavg");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> rdd = sc.textFile("/home/impadmin/words.txt");
JavaPairRDD<Long, Line> linNoToWord2 = rdd.zipWithIndex().mapToPair(new PairFunction<Tuple2<String,Long>, Long, Line>() {
public Tuple2<Long, Line> call(Tuple2<String, Long> t){
return new Tuple2<Long, Line>(Long.valueOf(t._1.split(" ").length), new Line(t._2, t._1));
}
}).sortByKey(false);
System.out.println(linNoToWord2.first()._1+" ********************* "+linNoToWord2.first()._2.text_);
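As a design note (not part of the original posts): both snippets above sort the entire RDD just to read its first element. If only the single longest line is needed, a one-pass max() with a serializable comparator is usually cheaper. A minimal sketch, assuming Spark's JavaRDD.max(Comparator) API and reusing the tupleRDD of (word count, line) pairs from the question:
import java.io.Serializable;
import java.util.Comparator;
import scala.Tuple2;

//Illustrative helper, not from the original post: orders tuples by word count.
public static class LongestLineComparator implements Comparator<Tuple2<Integer, String>>, Serializable {
    @Override
    public int compare(Tuple2<Integer, String> a, Tuple2<Integer, String> b) {
        return Integer.compare(a._1, b._1);
    }
}

//Usage: a single pass with max() instead of a full sort followed by first().
Tuple2<Integer, String> longest = tupleRDD.max(new LongestLineComparator());
System.out.println(longest._1 + " ********************* " + longest._2);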

Related

SparkSQL: Generate new column with UUID

I have to add a new column with a UUID value. I did this in Spark 1.4 with Java using the following code.
StructType objStructType = inputDataFrame.schema();
StructField []arrStructField=objStructType.fields();
List<StructField> fields = new ArrayList<StructField>();
List<StructField> newfields = new ArrayList<StructField>();
List <StructField> listFields = Arrays.asList(arrStructField);
StructField a = DataTypes.createStructField(leftCol,DataTypes.StringType, true);
fields.add(a);
newfields.addAll(listFields);
newfields.addAll(fields);
final int size = objStructType.size();
JavaRDD<Row> rowRDD = inputDataFrame.javaRDD().map(new Function<Row, Row>() {
private static final long serialVersionUID = 3280804931696581264L;
public Row call(Row tblRow) throws Exception {
Object[] newRow = new Object[size+1];
int rowSize= tblRow.length();
for (int itr = 0; itr < rowSize; itr++)
{
if(tblRow.apply(itr)!=null)
{
newRow[itr] = tblRow.apply(itr);
}
}
newRow[size] = UUID.randomUUID().toString();
return RowFactory.create(newRow);
}
});
inputDataFrame = objsqlContext.createDataFrame(rowRDD, DataTypes.createStructType(newfields));
I'm wondering if there is a neater way of doing this in Spark 2. Please advise.
You can register a UDF that generates a UUID and use the callUDF function to add a new column to your inputDataFrame. Please see the sample code below, using Spark 2.0.
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;
public class SparkUUIDSample {
public static void main(String[] args) {
SparkSession spark = SparkSession.builder().appName("SparkUUIDSample").master("local[*]").getOrCreate();
//sample input data
List<Tuple2<String, String>> inputList = new ArrayList<Tuple2<String, String>>();
inputList.add(new Tuple2<String, String>("A", "v1"));
inputList.add(new Tuple2<String, String>("B", "v2"));
//dataset
Dataset<Row> df = spark.createDataset(inputList, Encoders.tuple(Encoders.STRING(), Encoders.STRING())).toDF("key", "value");
df.show();
//register udf
UDF1<String, String> uuid = str -> UUID.randomUUID().toString();
spark.udf().register("uuid", uuid, DataTypes.StringType);
//call udf
df.select(col("*"), callUDF("uuid", col("value"))).show();
//stop
spark.stop();
}
}
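As a small follow-up (a sketch, not from the original answer): if you want the generated column appended to the DataFrame rather than projected with select, withColumn does the same job; "uuid_col" is just an illustrative column name.
//Assumes the "uuid" UDF registered above and the static imports of col/callUDF
//from org.apache.spark.sql.functions.
Dataset<Row> withUuid = df.withColumn("uuid_col", callUDF("uuid", col("value")));
withUuid.show();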

How to get a JavaDStream of an Object in Spark Kafka Connector?

I am using the Spark Kafka connector to fetch data from a Kafka cluster. From it, I am getting the data as a JavaDStream<String>. How do I get the data as a JavaDStream<EventLog> instead, where EventLog is a Java bean?
public static JavaDStream<EventLog> fetchAndValidateData(String zkQuorum, String group, Map<String, Integer> topicMap) {
SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc, zkQuorum, group, topicMap);
JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
@Override
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
});
jssc.start();
jssc.awaitTermination();
return lines;
}
My goal is to save this data into Cassandra, in a table with the same specification as EventLog. The Spark Cassandra connector accepts JavaRDD<EventLog> in its insert call, like this: javaFunctions(rdd).writerBuilder("ks", "event", mapToRow(EventLog.class)).saveToCassandra();. I want to get these JavaRDD<EventLog> from Kafka.
Use the overloaded createStream method where you can pass the key/value type and decoder classes.
Example:
createStream(jssc, String.class, EventLog.class, StringDecoder.class, EventLogDecoder.class,
kafkaParams, topicsMap, StorageLevel.MEMORY_AND_DISK_SER_2());
The above should give you a JavaPairDStream<String, EventLog>:
JavaDStream<EventLog> lines = messages.map(new Function<Tuple2<String, EventLog>, EventLog>() {
@Override
public EventLog call(Tuple2<String, EventLog> tuple2) {
return tuple2._2();
}
});
EventLogDecoder should implement kafka.serializer.Decoder. Below is an example of a JSON decoder.
public class EventLogDecoder implements Decoder<EventLog> {
public EventLogDecoder(VerifiableProperties verifiableProperties) {
}
@Override
public EventLog fromBytes(byte[] bytes) {
ObjectMapper objectMapper = new ObjectMapper();
try {
return objectMapper.readValue(bytes, EventLog.class);
} catch (IOException e) {
//do something
}
return null;
}
}
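To close the loop on the goal stated in the question (writing to Cassandra), here is a minimal sketch, not from the original answer, that applies the connector call quoted in the question to each micro-batch. It assumes the Spark Cassandra connector's javaFunctions/mapToRow helpers (statically imported from CassandraJavaUtil) and a Spark version whose foreachRDD takes a VoidFunction; the keyspace "ks" and table "event" are taken from the question's own snippet.
//Static imports assumed (from the Spark Cassandra connector's Java API):
//import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
//import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
lines.foreachRDD(new VoidFunction<JavaRDD<EventLog>>() {
    @Override
    public void call(JavaRDD<EventLog> rdd) {
        //Keyspace "ks" and table "event" come from the question's own snippet.
        javaFunctions(rdd).writerBuilder("ks", "event", mapToRow(EventLog.class)).saveToCassandra();
    }
});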

Apache Spark Kafka Streaming getting unlimited jobs

I am running the following Spark program in Java. I am getting data from a Kafka cluster. The problem is that this program runs forever: it keeps creating RDDs even when there are only three lines of data in Kafka. How do I stop it from taking in more RDDs after the data has been consumed?
The print statements are also not being displayed, and I don't know why.
public final class SparkKafkaConsumer {
private static final Pattern COMMA = Pattern.compile(","); //referenced in the flatMap below
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));
int numThreads = 3;
Map<String, Integer> topicMap = new HashMap<String, Integer>();
String[] topics = "dddccceeffg".split(",");
for (String topic: topics) {
topicMap.put(topic, numThreads);
}
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc, "localhost:2181", "NameConsumer", topicMap);
JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
});
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
public Iterable<String> call(String x) {
return Lists.newArrayList(COMMA.split(x));
}
});
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s, 1);
}
}).reduceByKey(new Function2<Integer, Integer, Integer>() {
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
});
wordCounts.print();
jssc.start();
jssc.awaitTermination();
}
}
My terminal command is: C:\spark-1.6.2-bin-hadoop2.6\bin\spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.2 --class "SparkKafkaConsumer" --master local[4] target\simple-project-1.0.jar

WordCount with guaranteed message processing

I am trying to run the WordCount example with guaranteed message processing.
There is one spout:
WSpout - emits random sentences with a msgID.
And two bolts:
SplitSentence - splits sentences into words and emits them with anchoring.
WordCount - prints word counts.
What I want to achieve with the code below is that when all the word counting for a sentence is done, the spout tuple corresponding to that sentence should be acknowledged.
I acknowledge with _collector.ack(tuple) only at the last bolt, WordCount. What I find strange is that despite ack() being called in WordCount.execute(), the corresponding WSpout.ack() is never called; the tuple always fails after the default timeout.
I really don't understand what's wrong with the code. Please help me understand the problem.
Any help is appreciated.
Below is the complete code.
public class TestTopology {
public static class WSpout implements IRichSpout {
SpoutOutputCollector _collector;
Integer msgID = 0;
@Override
public void nextTuple() {
Random _rand = new Random();
String[] sentences = new String[] { "There two things benefit",
" from Storms reliability capabilities",
"Specifying a link in the",
" tuple tree is " + "called anchoring",
" Anchoring is done at ",
"the same time you emit a " + "new tuple" };
String message = sentences[_rand.nextInt(sentences.length)];
_collector.emit(new Values(message), msgID);
System.out.println(msgID + " " + message);
msgID++;
}
@Override
public void open(Map conf, TopologyContext context,
SpoutOutputCollector collector) {
System.out.println("open");
_collector = collector;
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("LINE"));
}
@Override
public void ack(Object msgID) {
System.out.println("ack ------------------- " + msgID);
}
@Override
public void fail(Object msgID) {
System.out.println("fail ----------------- " + msgID);
}
@Override
public void activate() {
// TODO Auto-generated method stub
}
@Override
public void close() {
}
@Override
public void deactivate() {
// TODO Auto-generated method stub
}
@Override
public Map<String, Object> getComponentConfiguration() {
// TODO Auto-generated method stub
return null;
}
}
public static class SplitSentence extends BaseRichBolt {
OutputCollector _collector;
public void prepare(Map conf, TopologyContext context,
OutputCollector collector) {
_collector = collector;
}
public void execute(Tuple tuple) {
String sentence = tuple.getString(0);
for (String word : sentence.split(" ")) {
System.out.println(word);
_collector.emit(tuple, new Values(word));
}
//_collector.ack(tuple);
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
}
public static class WordCount extends BaseBasicBolt {
Map<String, Integer> counts = new HashMap<String, Integer>();
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
System.out.println("WordCount MSGID : " + tuple.getMessageId());
String word = tuple.getString(0);
Integer count = counts.get(word);
if (count == null)
count = 0;
count++;
System.out.println(word + " ===> " + count);
counts.put(word, count);
collector.emit(new Values(word, count));
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word", "count"));
}
}
public static void main(String[] args) throws Exception {
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new WSpout(), 2);
builder.setBolt("split", new SplitSentence(), 2).shuffleGrouping(
"spout");
builder.setBolt("count", new WordCount(), 2).fieldsGrouping("split",
new Fields("word"));
Config conf = new Config();
conf.setDebug(true);
if (args != null && args.length > 0) {
conf.setNumWorkers(1);
StormSubmitter.submitTopology(args[0], conf,
builder.createTopology());
} else {
conf.setMaxTaskParallelism(3);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("word-count", conf, builder.createTopology());
Thread.sleep(10000);
cluster.shutdown();
}
}
}
WordCount extends BaseBasicBolt, which ensures the tuples are acked automatically in that bolt, as you stated in your comment. However, SplitSentence extends BaseRichBolt, which requires you to ack tuples manually. You are not acking, so the tuples time out.
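For concreteness, a minimal sketch of the fix in SplitSentence (essentially re-enabling the ack the question already has commented out): keep the anchored emit, then ack the input tuple so the tuple tree can complete. Alternatively, make SplitSentence extend BaseBasicBolt, which anchors and acks for you.
public void execute(Tuple tuple) {
    String sentence = tuple.getString(0);
    for (String word : sentence.split(" ")) {
        _collector.emit(tuple, new Values(word)); //anchored emit keeps the tuple tree linked
    }
    _collector.ack(tuple); //ack the input tuple; without this it fails on the default timeout
}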

ClassCastException while writing to Cassandra from a Hadoop job

I am running a Hadoop job and trying to write the output to Cassandra. I am getting the following exception:
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to java.nio.ByteBuffer
at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter.write(ColumnFamilyRecordWriter.java:60)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:514)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at org.apache.hadoop.mapreduce.Reducer.reduce(Reducer.java:156)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:572)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
I modeled my MapReduce code on the WordCount example given at https://wso2.org/repos/wso2/trunk/carbon/dependencies/cassandra/contrib/word_count/src/WordCount.java
Here's my MR code:
public class SentimentAnalysis extends Configured implements Tool {
static final String KEYSPACE = "Travel";
static final String OUTPUT_COLUMN_FAMILY = "Keyword_PtitleId";
public static class Map extends Mapper<LongWritable, Text, Text, LongWritable> {
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
Sentiment sentiment = null;
try {
sentiment = (Sentiment) PojoMapper.fromJson(line, Sentiment.class);
} catch(Exception e) {
return;
}
if(sentiment != null && sentiment.isLike()) {
word.set(sentiment.getNormKeyword());
context.write(word, new LongWritable(sentiment.getPtitleId()));
}
}
}
public static class Reduce extends Reducer<Text, LongWritable, ByteBuffer, List<Mutation>> {
private ByteBuffer outputKey;
public void reduce(Text key, Iterator<LongWritable> values, Context context) throws IOException, InterruptedException {
List<Long> ptitles = new ArrayList<Long>();
java.util.Map<Long, Integer> ptitleToFrequency = new HashMap<Long, Integer>();
while (values.hasNext()) {
Long value = values.next().get();
ptitles.add(value);
}
for(Long ptitle : ptitles) {
if(ptitleToFrequency.containsKey(ptitle)) {
ptitleToFrequency.put(ptitle, ptitleToFrequency.get(ptitle) + 1);
}
else {
ptitleToFrequency.put(ptitle, 1);
}
}
byte[] keyBytes = key.getBytes();
outputKey = ByteBuffer.wrap(Arrays.copyOf(keyBytes, keyBytes.length));
for(Long ptitle : ptitleToFrequency.keySet()) {
context.write(outputKey, Collections.singletonList(getMutation(new Text(ptitle.toString()), ptitleToFrequency.get(ptitle))));
}
}
private static Mutation getMutation(Text word, int sum)
{
Column c = new Column();
byte[] wordBytes = word.getBytes();
c.name = ByteBuffer.wrap(Arrays.copyOf(wordBytes, wordBytes.length));
c.value = ByteBuffer.wrap(String.valueOf(sum).getBytes());
c.timestamp = System.currentTimeMillis() * 1000;
Mutation m = new Mutation();
m.column_or_supercolumn = new ColumnOrSuperColumn();
m.column_or_supercolumn.column = c;
return m;
}
}
public static void main(String[] args) throws Exception {
int ret = ToolRunner.run(new SentimentAnalysis(), args);
System.exit(ret);
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "SentimentAnalysis");
job.setJarByClass(SentimentAnalysis.class);
String inputFile = args[0];
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(LongWritable.class);
job.setOutputKeyClass(ByteBuffer.class);
job.setOutputValueClass(List.class);
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
job.setInputFormatClass(TextInputFormat.class);
ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE, OUTPUT_COLUMN_FAMILY);
FileInputFormat.setInputPaths(job, inputFile);
ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
ConfigHelper.setPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
}
If you look at the Reduce class, I am converting the Text key to a ByteBuffer properly.
Would appreciate some pointers on how to fix this.
After some trial and error, I was able to figure out how to solve this particular issue. Basically, in my reduce method signature I was using Iterator instead of Iterable, so the reducer was never called. Hadoop was therefore trying to write my mapper output (Text, LongWritable) to Cassandra using the output key/value classes declared for the reducer (ByteBuffer, List), which caused the ClassCastException.
Changing the reduce method signature to use Iterable solved the issue.
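For reference, a minimal sketch of the corrected Reduce class with the Iterable signature (the body is condensed from the original and reuses its getMutation helper):
public static class Reduce extends Reducer<Text, LongWritable, ByteBuffer, List<Mutation>> {
    @Override //now genuinely overrides Reducer.reduce, so the framework calls it
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        //Count how often each ptitle id occurs for this keyword.
        java.util.Map<Long, Integer> ptitleToFrequency = new HashMap<Long, Integer>();
        for (LongWritable value : values) { //Iterable, not Iterator
            Long ptitle = value.get();
            Integer current = ptitleToFrequency.get(ptitle);
            ptitleToFrequency.put(ptitle, current == null ? 1 : current + 1);
        }
        //Text.getBytes() may contain stale bytes past getLength(), so copy only the valid part.
        ByteBuffer outputKey = ByteBuffer.wrap(Arrays.copyOf(key.getBytes(), key.getLength()));
        for (Long ptitle : ptitleToFrequency.keySet()) {
            context.write(outputKey, Collections.singletonList(
                    getMutation(new Text(ptitle.toString()), ptitleToFrequency.get(ptitle))));
        }
    }
}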
