I am running the following Spark Streaming program in Java, getting data from a Kafka cluster. The problem is that the program runs forever: it keeps pulling in RDDs even though there are only three lines of data in Kafka. How do I stop it from taking in more RDDs once the data has been consumed?
The print statements are also not showing any output, and I don't know why.
public final class SparkKafkaConsumer {
private static final Pattern COMMA = Pattern.compile(",");
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));
int numThreads = 3;
Map<String, Integer> topicMap = new HashMap<String, Integer>();
String[] topics = "dddccceeffg".split(",");
for (String topic: topics) {
topicMap.put(topic, numThreads);
}
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc, "localhost:2181", "NameConsumer", topicMap);
JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
});
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
public Iterable<String> call(String x) {
return Lists.newArrayList(COMMA.split(x));
}
});
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s, 1);
}
}).reduceByKey(new Function2<Integer, Integer, Integer>() {
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
});
wordCounts.print();
jssc.start();
jssc.awaitTermination();
}
}
My terminal command is: C:\spark-1.6.2-bin-hadoop2.6\bin\spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.2 --class "SparkKafkaConsumer" --master local[4] target\simple-project-1.0.jar
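For context, a Spark Streaming context keeps scheduling micro-batches until it is stopped explicitly, so running "forever" is the expected behaviour. A minimal, untested sketch of one way to end the job after the data has been consumed is to replace jssc.awaitTermination() with a bounded wait and stop the context yourself; the 60-second timeout below is an arbitrary illustrative value:
jssc.start();
// Wait at most 60 seconds; if the context has not terminated by then, stop it
// (and the underlying SparkContext) gracefully so in-flight batches can finish.
if (!jssc.awaitTerminationOrTimeout(60000)) {
    jssc.stop(true, true);
}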
First of all, let me describe the scenario.
Step 1. I have to read a file line by line. The file is a .json, and each line has the following format:
{
"schema":{Several keys that are to be deleted},
"payload":{"key1":20001,"key2":"aaaa","key3":"bbbb","key4":"USD","key5":"100"}
}
Step 2. Delete the schema object and end up with the following (I added more examples for the sake of the next steps):
{"key1":20001,"key2":"aaaa","key3":"bbbb","key4":"USD","key5":"100"}
{"key1":20001,"key2":"aaaa","key3":"bbbb","key4":"US","key5":"90"}
{"key1":2002,"key2":"cccc","key3":"hhhh","key4":"CN","key5":"80"}
Step 3. Split these values into a key and a value by building them as JSON objects in memory and using the resulting strings as keys and values of a map:
{"key1":20001,"key2":"aaaa","key3":"bbbb"} = {"key4":"USD","key5":"100"}
{"key1":20001,"key2":"aaaa","key3":"bbbb"} = {"key4":"US","key5":"90"}
{"key1":2002,"key2":"cccc","key3":"hhhh"} = {"key4":"CN","key5":"80"}
Step 4, and the one I can't work out due to my lack of knowledge of PCollections: I need to take all the lines read and do a GroupByKey so that it ends up like the following (a standalone sketch of this step is shown right after this example):
{"key1":20001,"key2":"aaaa","key3":"bbbb"} = [
{"key4":"USD","key5":"100"},
{"key4":"US","key5":"90"} ]
{"key1":2002,"key2":"cccc","key3":"hhhh"} = {"key4":"CN","key5":"80"}
Right now my code looks like this:
static void runSimplePipeline(PipelineOptionsCustom options) {
Pipeline p = Pipeline.create(options);
p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
.apply("TransformData", ParDo.of(new DoFn<String, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
Gson gson = new GsonBuilder().create();
ObjectMapper oMapper = new ObjectMapper();
JSONObject obj_key = new JSONObject();
JSONObject obj_value = new JSONObject();
List<String> listMainKeys = Arrays.asList(new String[]{"Key1", "Key2", "Key3"});
HashMap<String, Object> parsedMap = gson.fromJson(c.element().toString(), HashMap.class);
parsedMap.remove("schema");
Map<String, String> map = oMapper.convertValue(parsedMap.get("payload"), Map.class);
for (Map.Entry<String,String> entry : map.entrySet()) {
if (listMainKeys.contains(entry.getKey())) {
obj_key.put(entry.getKey(),entry.getValue());
} else {
obj_value.put(entry.getKey(),entry.getValue());
}
}
KV objectKV = KV.of(obj_key.toJSONString(), obj_value.toJSONString());
System.out.print(obj_key.toString() + " : " + obj_value.toString() +"\n");
}
})); <------- RIGHT HERE
p.run().waitUntilFinish();
}
Now, obviously, where it says "RIGHT HERE" I should have another apply with CountByKey; however, that requires a full PCollection, and that is the part I do not really understand.
Here's the code, thanks to Guillem Xercavins's linked GitHub:
static void runSimplePipeline(PipelineOptionsCustom options) {
Pipeline p = Pipeline.create(options);
PCollection<Void> results = p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
.apply("TransformData", ParDo.of(new DoFn<String, KV<String, String>>() {
@ProcessElement
public void processElement(ProcessContext c) {
Gson gson = new GsonBuilder().create();
ObjectMapper oMapper = new ObjectMapper();
JSONObject obj_key = new JSONObject();
JSONObject obj_value = new JSONObject();
List<String> listMainKeys = Arrays
.asList(new String[] { "EBELN", "AEDAT", "BATXT", "EKOTX", "Land1", "WAERS" });
HashMap<String, Object> parsedMap = gson.fromJson(c.element().toString(), HashMap.class);
parsedMap.remove("schema");
Map<String, String> map = oMapper.convertValue(parsedMap.get("payload"), Map.class);
for (Map.Entry<String, String> entry : map.entrySet()) {
if (listMainKeys.contains(entry.getKey())) {
obj_key.put(entry.getKey(), entry.getValue());
} else {
obj_value.put(entry.getKey(), entry.getValue());
}
}
KV objectKV = KV.of(obj_key.toJSONString(), obj_value.toJSONString());
c.output(objectKV);
}
})).apply("Group By Key", GroupByKey.<String, String>create())
.apply("Continue Processing", ParDo.of(new DoFn<KV<String, Iterable<String>>, Void>() {
@ProcessElement
public void processElement(ProcessContext c) {
System.out.print(c.element());
}
}));
p.run().waitUntilFinish();
}
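As a side note, since the question mentions CountByKey: if only the number of payloads per key is needed rather than the grouped values themselves, Beam's built-in Count transform could presumably be used instead of GroupByKey. In the untested sketch below, keyedPairs is a placeholder for the PCollection<KV<String, String>> produced by the "TransformData" step:
// Hypothetical variant: counts per key instead of the grouped values.
PCollection<KV<String, Long>> countsPerKey =
        keyedPairs.apply("Count Per Key", Count.<String, String>perKey());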
The Spark consumer has to read topics with the same name from different bootstrap servers, so I need to create two JavaDStreams, perform a union, process the stream, and commit the offsets.
JavaInputDStream<ConsumerRecord<String, GenericRecord>> dStream = KafkaUtils.createDirectStream(...);
The problem is that JavaInputDStream doesn't support dStream.union(stream2);
If I instead use
JavaDStream<ConsumerRecord<String, GenericRecord>> dStream = KafkaUtils.createDirectStream(...);
then JavaDStream doesn't support
((CanCommitOffsets) dStream.inputDStream()).commitAsync(os);
Please bear with the long answer.
There is no direct way to do this that I am aware of, so I would first convert the DStreams to Datasets/DataFrames and then perform a union on the two DataFrames/Datasets.
The code below is not tested, but it should work. Please feel free to validate it and make the necessary changes.
JavaPairInputDStream<String, String> pairDstream1 = KafkaUtils.createDirectStream(ssc,kafkaParams, topics);
JavaPairInputDStream<String, String> pairDstream2 = KafkaUtils.createDirectStream(ssc,kafkaParams, topics);
//Create JavaDStream<String>
JavaDStream<String> dstream1 = pairDstream1.map(new Function<Tuple2<String, String>, String>() {
@Override
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
});
//Create JavaDStream<String>
JavaDStream<String> dstream2 = pairDstream2.map(new Function<Tuple2<String, String>, String>() {
@Override
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
});
//Create JavaRDD<Row> from the first stream
dstream1.foreachRDD(new VoidFunction<JavaRDD<String>>() {
@Override
public void call(JavaRDD<String> rdd1) {
JavaRDD<Row> rowRDD1 = rdd1.map(new Function<String, Row>() {
@Override
public Row call(String msg) {
return RowFactory.create(msg);
}
});
//Create JavaRDD<Row> from the second stream
dstream2.foreachRDD(new VoidFunction<JavaRDD<String>>() {
@Override
public void call(JavaRDD<String> rdd2) {
JavaRDD<Row> rowRDD2 = rdd2.map(new Function<String, Row>() {
@Override
public Row call(String msg) {
return RowFactory.create(msg);
}
});
//Create schema
StructType schema = DataTypes.createStructType(new StructField[] {DataTypes.createStructField("Message", DataTypes.StringType, true)});
//Get Spark 2.0 session
SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd2.context().getConf());
Dataset<Row> df1 = spark.createDataFrame(rowRDD1, schema);
Dataset<Row> df2 = spark.createDataFrame(rowRDD2, schema);
//Union the two dataframes (union returns a new Dataset, so keep the result)
Dataset<Row> unionDf = df1.union(df2);
}
});
}
});
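If the main goal is just to keep access to CanCommitOffsets, another option is to keep both JavaInputDStream references, union them only for processing, and commit offsets against each original input stream. The following is an untested sketch, assuming the spark-streaming-kafka-0-10 API; jssc stands for the JavaStreamingContext, and topics, kafkaParams1 and kafkaParams2 are placeholders that differ only in bootstrap.servers:
// Untested sketch: two direct streams for the same topic on different clusters.
JavaInputDStream<ConsumerRecord<String, GenericRecord>> stream1 =
        KafkaUtils.createDirectStream(jssc, LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, GenericRecord>Subscribe(topics, kafkaParams1));
JavaInputDStream<ConsumerRecord<String, GenericRecord>> stream2 =
        KafkaUtils.createDirectStream(jssc, LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, GenericRecord>Subscribe(topics, kafkaParams2));

// JavaInputDStream extends JavaDStream, so union works on the stream objects themselves.
JavaDStream<ConsumerRecord<String, GenericRecord>> union = stream1.union(stream2);
union.foreachRDD(rdd -> {
    // process the combined micro-batch here
});

// The unioned RDDs no longer expose HasOffsetRanges, so offsets are committed per
// original input stream. Note this commits independently of the processing above.
stream1.foreachRDD(rdd -> {
    OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    ((CanCommitOffsets) stream1.inputDStream()).commitAsync(offsets);
});
stream2.foreachRDD(rdd -> {
    OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    ((CanCommitOffsets) stream2.inputDStream()).commitAsync(offsets);
});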
I'm looking for a way to write a DStream to an output Kafka topic, but only when the micro-batch RDDs actually produce something.
I'm using Spark Streaming and the spark-streaming-kafka connector in Java 8 (both latest versions).
I cannot figure out how to do it.
Thanks for the help.
If dStream contains data that you want to send to Kafka:
dStream.foreachRDD(rdd -> {
rdd.foreachPartition(iter -> {
Producer producer = createKafkaProducer();
while (iter.hasNext()) {
sendToKafka(producer, iter.next());
}
producer.close();
});
});
This way you create one producer per RDD partition.
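createKafkaProducer and sendToKafka are placeholders in the snippet above; one possible shape for them (an untested sketch, assuming the DStream elements are Strings and using a hypothetical "outputTopic") is:
// Hypothetical helper: build a producer with the usual String serializers.
static Producer<String, String> createKafkaProducer() {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // adjust to your brokers
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    return new KafkaProducer<>(props);
}

// Hypothetical helper: fire-and-forget send to a placeholder topic name.
static void sendToKafka(Producer<String, String> producer, String value) {
    producer.send(new ProducerRecord<>("outputTopic", value));
}
Closing the producer once the partition iterator is drained, as in the snippet above, avoids leaking connections.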
In my example I want to send events taken from a specific Kafka topic to another one. I do a simple word count: I take data from the Kafka input topic, count the words, and output them to an output Kafka topic. Don't forget that the goal is to write the results of a JavaPairDStream into an output Kafka topic using Spark Streaming.
//Spark Configuration
SparkConf sparkConf = new SparkConf().setAppName("SendEventsToKafka");
String brokerUrl = "localhost:9092";
String inputTopic = "receiverTopic";
String outputTopic = "producerTopic";
//Create the java streaming context
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
//Prepare the list of topics we listen for
Set<String> topicList = new TreeSet<>();
topicList.add(inputTopic);
//Kafka direct stream parameters
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", brokerUrl);
kafkaParams.put("group.id", "kafka-cassandra" + new SecureRandom().nextInt(100));
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
//Kafka output topic specific properties
Properties props = new Properties();
props.put("bootstrap.servers", brokerUrl);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("acks", "1");
props.put("retries", "3");
props.put("linger.ms", 5);
//Here we create a direct stream for kafka input data.
final JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(jssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topicList, kafkaParams));
JavaPairDStream<String, String> results = messages
.mapToPair(new PairFunction<ConsumerRecord<String, String>, String, String>() {
@Override
public Tuple2<String, String> call(ConsumerRecord<String, String> record) {
return new Tuple2<>(record.key(), record.value());
}
});
JavaDStream<String> lines = results.map(new Function<Tuple2<String, String>, String>() {
@Override
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
});
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
@Override
public Iterator<String> call(String x) {
log.info("Line retrieved {}", x);
return Arrays.asList(SPACE.split(x)).iterator();
}
});
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String s) {
log.info("Word to count {}", s);
return new Tuple2<>(s, 1);
}
}).reduceByKey(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer i1, Integer i2) {
log.info("Count with reduceByKey {}", i1 + i2);
return i1 + i2;
}
});
//Here we iterate over the JavaPairDStream to write words and their counts into kafka
wordCounts.foreachRDD(new VoidFunction<JavaPairRDD<String, Integer>>() {
@Override
public void call(JavaPairRDD<String, Integer> arg0) throws Exception {
Map<String, Integer> wordCountMap = arg0.collectAsMap();
List<WordOccurence> topicList = new ArrayList<>();
for (String key : wordCountMap.keySet()) {
//Here we send each word and its count to the kafka output topic
publishToKafka(key, wordCountMap.get(key).longValue(), outputTopic, props);
}
JavaRDD<WordOccurence> WordOccurenceRDD = jssc.sparkContext().parallelize(topicList);
CassandraJavaUtil.javaFunctions(WordOccurenceRDD)
.writerBuilder(keyspace, table, CassandraJavaUtil.mapToRow(WordOccurence.class))
.saveToCassandra();
log.info("Words successfully added : {}, keyspace {}, table {}", words, keyspace, table);
}
});
jssc.start();
jssc.awaitTermination();
The wordCounts variable is of type JavaPairDStream<String, Integer>; I just iterate over it using foreachRDD and write to Kafka using a specific function:
public static void publishToKafka(String word, Long count, String topic, Properties props) {
KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);
try {
ObjectMapper mapper = new ObjectMapper();
String jsonInString = mapper.writeValueAsString(word + " " + count);
String event = "{\"word_stats\":" + jsonInString + "}";
log.info("Message to send to kafka : {}", event);
producer.send(new ProducerRecord<String, String>(topic, event));
log.info("Event : " + event + " published successfully to kafka!!");
} catch (Exception e) {
log.error("Problem while publishing the event to kafka : " + e.getMessage());
}
producer.close();
}
Hope that helps!
I am facing a problem in which I have to find the largest line and its index. Here is my approach:
SparkConf conf = new SparkConf().setMaster("local").setAppName("basicavg");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> rdd = sc.textFile("/home/impadmin/ravi.txt");
JavaRDD<Tuple2<Integer,String>> words = rdd.map(new Function<String, Tuple2<Integer,String>>() {
@Override
public Tuple2<Integer,String> call(String v1) throws Exception {
return new Tuple2<Integer, String>(v1.split(" ").length, v1);
}
});
JavaPairRDD<Integer, String> linNoToWord = JavaPairRDD.fromJavaRDD(words).sortByKey(false);
System.out.println(linNoToWord.first()._1+" ********************* "+linNoToWord.first()._2);
This way the tuple RDD gets sorted by key, and the first element in the new RDD after sorting is the one with the greatest length:
JavaRDD<String> rdd = sc.textFile("/home/impadmin/ravi.txt");
JavaRDD<Tuple2<Integer,String>> words = rdd.map(new Function<String, Tuple2<Integer,String>>() {
@Override
public Tuple2<Integer,String> call(String v1) throws Exception {
return new Tuple2<Integer, String>(v1.split(" ").length, v1);
}
});
JavaRDD<Tuple2<Integer,String>> tupleRDD1 = words.sortBy(new Function<Tuple2<Integer,String>, Integer>() {
@Override
public Integer call(Tuple2<Integer, String> v1) throws Exception {
return v1._1;
}
}, false, 1);
System.out.println(tupleRDD1.first());
}
Since you are concerned with both the line number and the text, please try this.
First create a serializable class Line :
public static class Line implements Serializable {
public Line(Long lineNo, String text) {
lineNo_ = lineNo;
text_ = text;
}
public Long lineNo_;
public String text_;
}
Then do the following operations:
SparkConf conf = new SparkConf().setMaster("local[1]").setAppName("basicavg");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> rdd = sc.textFile("/home/impadmin/words.txt");
JavaPairRDD<Long, Line> linNoToWord2 = rdd.zipWithIndex().mapToPair(new PairFunction<Tuple2<String,Long>, Long, Line>() {
public Tuple2<Long, Line> call(Tuple2<String, Long> t){
return new Tuple2<Long, Line>(Long.valueOf(t._1.split(" ").length), new Line(t._2, t._1));
}
}).sortByKey(false);
System.out.println(linNoToWord2.first()._1+" ********************* "+linNoToWord2.first()._2.text_);
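As a side note, an untested variant that avoids the full sort would be to take the maximum directly with a serializable comparator (imports omitted; rdd is the same JavaRDD<String> as above, and "largest" is measured by word count, as in the question):
// Serializable comparator on word count for the (line, index) pairs from zipWithIndex().
static class ByWordCount implements Comparator<Tuple2<String, Long>>, Serializable {
    @Override
    public int compare(Tuple2<String, Long> a, Tuple2<String, Long> b) {
        return Integer.compare(a._1().split(" ").length, b._1().split(" ").length);
    }
}

// max() scans the RDD once instead of sorting it.
Tuple2<String, Long> longest = rdd.zipWithIndex().max(new ByWordCount());
System.out.println(longest._2() + " ********************* " + longest._1());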
I am using the Spark Kafka connector to fetch data from a Kafka cluster, and I am getting the data as a JavaDStream<String>. How do I get the data as a JavaDStream<EventLog>, where EventLog is a Java bean?
public static JavaDStream<EventLog> fetchAndValidateData(String zkQuorum, String group, Map<String, Integer> topicMap) {
SparkConf sparkConf = new SparkConf().setAppName("JavaKafkaWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc, zkQuorum, group, topicMap);
JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
@Override
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
});
jssc.start();
jssc.awaitTermination();
return lines;
}
My goal is to save this data into Cassandra, where a table with the same specification as EventLog exists. The Spark Cassandra connector accepts JavaRDD<EventLog> in the insert statement, like this: javaFunctions(rdd).writerBuilder("ks", "event", mapToRow(EventLog.class)).saveToCassandra();. I want to get these JavaRDD<EventLog> from Kafka.
Use the overloaded createStream method where you can pass the key/value type and decoder classes.
Example:
createStream(jssc, String.class, EventLog.class, StringDecoder.class, EventLogDecoder.class,
kafkaParams, topicsMap, StorageLevel.MEMORY_AND_DISK_SER_2());
The above should give you a JavaPairDStream<String, EventLog>:
JavaDStream<EventLog> lines = messages.map(new Function<Tuple2<String, EventLog>, EventLog>() {
@Override
public EventLog call(Tuple2<String, EventLog> tuple2) {
return tuple2._2();
}
});
EventLogDecoder should implement kafka.serializer.Decoder. Below is an example of a JSON decoder:
public class EventLogDecoder implements Decoder<EventLog> {
public EventLogDecoder(VerifiableProperties verifiableProperties) {
}
@Override
public EventLog fromBytes(byte[] bytes) {
ObjectMapper objectMapper = new ObjectMapper();
try {
return objectMapper.readValue(bytes, EventLog.class);
} catch (IOException e) {
//do something
}
return null;
}
}
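To then get the JavaRDD<EventLog> mentioned in the question for the Cassandra writes, each micro-batch of the resulting JavaDStream<EventLog> can be written out from foreachRDD. An untested sketch, reusing the "ks"/"event" keyspace and table names from the question:
lines.foreachRDD(new VoidFunction<JavaRDD<EventLog>>() {
@Override
public void call(JavaRDD<EventLog> rdd) throws Exception {
// Each micro-batch is a JavaRDD<EventLog>, which is what the connector expects.
CassandraJavaUtil.javaFunctions(rdd)
        .writerBuilder("ks", "event", CassandraJavaUtil.mapToRow(EventLog.class))
        .saveToCassandra();
}
});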