Spark save Kafka InputDStream as Json file - java

I was just wondering whether there is a method in Spark that lets me save a JavaInputDStream as a JSON file, or generally as any file.
And if not, whether there is another way to save the content of
a Kafka topic as a file in Spark.
Thank you very much!

Once you have mapped your JavaInputDStream to a stream, you can do the following:
stream.foreachRDD(rdd -> {
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
rdd.mapToPair(new PairFunction<ConsumerRecord<String, String>, String, String>() {
@Override
public Tuple2<String, String> call(ConsumerRecord<String, String> record) {
return new Tuple2<>(record.key(), record.value());
}
}).foreachPartition(partition -> {
OffsetRange o = offsetRanges[TaskContext.get().partitionId()];
System.out.println(o.topic() + " " + o.partition() + " " + o.fromOffset() + " " + o.untilOffset());
if (partition.hasNext()) {
try (PrintWriter out = new PrintWriter("filename.txt")) {
while (partition.hasNext()) {
Tuple2<String, String> message = partition.next();
out.println(message);
}
} catch (Exception e) {
e.printStackTrace();
}
}
});
});
ssc.start();
ssc.awaitTermination();
Just don't forget that if your Kafka topic has multiple partitions, the approach above writes one file per partition.
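Since each executor task writes its own file, it also helps to derive the file name from the partition's offset range instead of hard-coding "filename.txt", otherwise concurrent tasks overwrite each other. A small sketch that reuses the variables from the snippet above (the naming scheme is just an example):
OffsetRange o = offsetRanges[TaskContext.get().partitionId()];
// e.g. mytopic-0-1500-1620.json : topic, partition, fromOffset, untilOffset
String fileName = String.format("%s-%d-%d-%d.json", o.topic(), o.partition(), o.fromOffset(), o.untilOffset());
try (PrintWriter out = new PrintWriter(fileName)) {
    while (partition.hasNext()) {
        // write the record value only; convert it here if the payload is not already JSON
        out.println(partition.next()._2());
    }
}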

Related

Calculate delta Offsets Kafka Java

In a Spring project I use Kafka, and now I want to write a method that takes a topic name and a group id as parameters
and calculates the difference between the last offsets of the topic partitions and the offsets consumed by that group.
I can already get the last offsets;
now I need to get the consumed offsets to calculate the difference.
public ResponseEntity<Offsets> deltaoffsets(@RequestParam(name = "groupId") String groupId, @RequestParam(name = "topic") String topic) {
Map<String,Object> properties = (Map) kafkaLocalConsumerConfig.get("kafkaLocalConsumerConfig");
properties.put("group.id", groupId);
properties.put("enable.auto.commit", "true");
List<TopicPartition> partition=new ArrayList<>();
KafkaConsumer<String, RefentialToReload> kafkaLocalConsumer = new KafkaConsumer<>(properties);
Map<String, List<PartitionInfo>> topics = kafkaLocalConsumer.listTopics();
List<PartitionInfo> partitionInfos = topics.get(topic);
if (partitionInfos == null) {
log.warn("Partition information was not found for topic");
}
else {
for (PartitionInfo partitionInfo : partitionInfos) {
TopicPartition topicPartition = new TopicPartition(topic, partitionInfo.partition());
partition.add(topicPartition);
log.info("partition assigned to kafkaLocalConsumer");
}
}
// assign the collected partitions so assignment()/endOffsets() have something to work on
kafkaLocalConsumer.assign(partition);
// get the last offsets of the topic partitions
Map<TopicPartition, Long> offsetsTopicPartition = kafkaLocalConsumer.endOffsets(kafkaLocalConsumer.assignment());
// here I need to get the consumed offsets
}
beginningOffsets() returns the first offsets, not the last.
You can use an AdminClient - here is an example that displays the current and end offsets:
@Bean
public ApplicationRunner runner(KafkaAdmin admin, ConsumerFactory<String, String> cf) throws Exception {
return args -> {
try (
AdminClient client = AdminClient.create(admin.getConfig());
Consumer<String, String> consumer = cf.createConsumer("group", "clientId", "");
) {
Collection<ConsumerGroupListing> groups = client.listConsumerGroups()
.all()
.get(10, TimeUnit.SECONDS);
groups.forEach(group -> {
Map<TopicPartition, OffsetAndMetadata> map = null;
try {
map = client.listConsumerGroupOffsets(group.groupId())
.partitionsToOffsetAndMetadata()
.get(10, TimeUnit.SECONDS);
}
catch (InterruptedException e) {
e.printStackTrace();
Thread.currentThread().interrupt();
}
catch (ExecutionException e) {
e.printStackTrace();
}
catch (TimeoutException e) {
e.printStackTrace();
}
Map<TopicPartition, Long> endOffsets = consumer.endOffsets(map.keySet());
map.forEach((tp, off) -> {
System.out.println("group: " + group + " tp: " + tp
+ " current offset: " + off.offset()
+ " end offset: " + endOffsets.get(tp));
});
});
}
};
}
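To get the delta the question asks about, subtract the committed offset from the end offset for each partition; for example, inside the same forEach from the example above (reusing its map and endOffsets variables):
map.forEach((tp, off) -> {
    // records not yet consumed by this group on this partition
    long lag = endOffsets.get(tp) - off.offset();
    System.out.println("group: " + group.groupId() + " tp: " + tp + " lag: " + lag);
});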

How can I get another HashMap from entry.getValue()?

In a for-each loop, entry.getValue() contains another map coming from Firestore. How can I get it?
Code:
@Override
public void onComplete(@NonNull Task<QuerySnapshot> task) {
if (task.isSuccessful()) {
// binding.contentMain.noData.setVisibility(View.GONE);
for (QueryDocumentSnapshot document : Objects.requireNonNull(task.getResult())) {
showLog("Data: " + document.getData());
postMap.putAll(document.getData());
}
try {
for (Map.Entry<String, Object> entry : postMap.entrySet()) {
if (entry.getKey().equals(CONTENT)) {
showLog("value: " + entry.getValue().toString());
contentMap = new HashMap<>();
contentMap.putAll(entry.getValue());
}
}
} catch (Exception e) {
e.printStackTrace();
}
} else {
// binding.contentMain.noData.setVisibility(View.VISIBLE);
showLog("Error getting documents: " + task.getException());
}
}
I tried it like the code below, but I get a compiler error and no suggestion:
Map map = new HashMap();
((Map)map.get( "keyname" )).get( "nestedkeyname" );
Try this:
contentMap.putAll((Map<? extends String, ? extends Object>) entry.getValue());
Or, using your variable naming, try
entrySet.iterator().next()
So if you have HashMap A:
{"key1":"value1",
"key2":"value2",
"key3":"value3"}
then, given the entry set of this HashMap ("entrySet"), you can call
entrySet.iterator().next()
which, when used in a while loop, iterates through all the keys and values in the HashMap.
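For completeness, a small sketch of the cast-based approach with a type check, assuming the nested value really is a Map<String, Object> (contentMap and CONTENT follow the question's naming):
Object value = entry.getValue();
if (value instanceof Map) {
    // Firestore returns nested objects as Map<String, Object>, so the unchecked cast is safe here
    @SuppressWarnings("unchecked")
    Map<String, Object> nested = (Map<String, Object>) value;
    contentMap = new HashMap<>(nested);
}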

Read from Splunk source and write to topic - same record written, not pulling latest records

The same record is being written to the topic; the connector is not pulling the latest records from Splunk. The time parameters are set in the start() method to pull the last one minute of data. Any inputs?
Currently I don't set an offset on the source. When poll() runs each time, does it look up the source offset and then poll? Can we use a time as the offset?
@Override
public List<SourceRecord> poll() throws InterruptedException {
List<SourceRecord> results = new ArrayList<>();
Map<String, String> recordProperties = new HashMap<String, String>();
while (true) {
try {
String line = null;
InputStream stream = job.getResults(previewArgs);
String earlierKey = null;
String value = null;
ResultsReaderCsv csv = new ResultsReaderCsv(stream);
HashMap<String, String> event;
while ((event = csv.getNextEvent()) != null) {
for (String key: event.keySet()) {
if(key.equals("rawlogs")){
recordProperties.put("rawlogs", event.get(key)); results.add(extractRecord(Splunklog.SplunkLogSchema(), line, recordProperties));
return results;}}}
csv.close();
stream.close();
Thread.sleep(500);
} catch(Exception ex) {
System.out.println("Exception occurred : " + ex);
}
}
}
private SourceRecord extractRecord(Schema schema, String line, Map<String, String> recordProperties) {
Map<String, String> sourcePartition = Collections.singletonMap(FILENAME_FIELD, FILENAME);
Map<String, String> sourceOffset = Collections.singletonMap(POSITION_FIELD, recordProperties.get(OFFSET_KEY));
return new SourceRecord(sourcePartition, sourceOffset, TOPIC_NAME, schema, recordProperties);
}
@Override
public void start(Map<String, String> properties) {
try {
config = new SplunkSourceTaskConfig(properties);
} catch (ConfigException e) {
throw new ConnectException("Couldn't start SplunkSourceTask due to configuration error", e);
}
HttpService.setSslSecurityProtocol(SSLSecurityProtocol.TLSv1_2);
Service service = new Service("splnkip", port);
String credentials = "user:pwd";
String basicAuthHeader = Base64.encode(credentials.getBytes());
service.setToken("Basic " + basicAuthHeader);
String startOffset = readOffset();
JobArgs jobArgs = new JobArgs();
if (startOffset != null) {
log.info("-------------------------------task OFFSET!NULL ");
jobArgs.setExecutionMode(JobArgs.ExecutionMode.BLOCKING);
jobArgs.setSearchMode(JobArgs.SearchMode.NORMAL);
jobArgs.setEarliestTime(startOffset);
jobArgs.setLatestTime("now");
jobArgs.setStatusBuckets(300);
} else {
log.info("-------------------------------task OFFSET=NULL ");
jobArgs.setExecutionMode(JobArgs.ExecutionMode.BLOCKING);
jobArgs.setSearchMode(JobArgs.SearchMode.NORMAL);
jobArgs.setEarliestTime("+419m");
jobArgs.setLatestTime("+420m");
jobArgs.setStatusBuckets(300);
}
String mySearch = "search host=search query";
job = service.search(mySearch, jobArgs);
while (!job.isReady()) {
try {
Thread.sleep(500);
} catch (InterruptedException ex) {
log.error("Exception occurred while waiting for job to start: " + ex);
}
}
previewArgs = new JobResultsPreviewArgs();
previewArgs.put("output_mode", "csv");
stop = new AtomicBoolean(false);
}
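On the offset question: Kafka Connect does not look up source offsets for you on each poll(); the sourceOffset map you attach to every SourceRecord is persisted by the framework, and you read it back yourself (typically in start()) through the offset storage reader, so a time string works fine as the offset value. A minimal sketch of what readOffset() could look like, reusing the FILENAME_FIELD/POSITION_FIELD keys from extractRecord() (this is my sketch, not the original implementation):
private String readOffset() {
    // the framework persists the sourceOffset of previously committed SourceRecords for this partition
    Map<String, Object> offset = context.offsetStorageReader()
            .offset(Collections.singletonMap(FILENAME_FIELD, FILENAME));
    if (offset == null) {
        return null; // first run: nothing stored yet
    }
    // POSITION_FIELD holds whatever was stored, e.g. a Splunk "earliest time" timestamp string
    return (String) offset.get(POSITION_FIELD);
}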

DataFlow Apache Beam Java JdbcIO Read arguments issue

I am totally new to Apache Beam and Java.
I have been working with PHP for around five years, but I haven't touched Java in the last five years :), and the Apache Beam SDK for Java is also new to me, so bear with me.
I would like to implement a pipeline where I get data from Google Pub/Sub, map the relevant fields into an array, and then check against a MySQL DB to see whether the message belongs to a certain table; after that I need to call our API, which will update some data in our app DB. Another pipeline will enrich the data from Elasticsearch and insert it into BigQuery.
At the moment, though, I am stuck on reading data from MySQL: I simply cannot adapt the argument in the PCollection using JdbcIO.
My plan is to check whether the value I get from Pub/Sub (the listid value) is present in a MySQL table.
Here is my code so far; any help will be appreciated.
Pipeline p = Pipeline.create(options);
org.apache.beam.sdk.values.PCollection<PubsubMessage> messages = p.apply(PubsubIO.readMessagesWithAttributes()
.fromSubscription("*******"));
org.apache.beam.sdk.values.PCollection<String> messages2 = messages.apply("GetPubSubEvent",
ParDo.of(new DoFn<PubsubMessage, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
Map<String, String> Map = new HashMap<String, String>();
PubsubMessage message = c.element();
String messageText = new String(message.getPayload(), StandardCharsets.UTF_8);
JSONObject jsonObj = new JSONObject(messageText);
String requestURL = jsonObj.getJSONObject("httpRequest").getString("requestUrl");
String query = requestURL.split("\\?")[1];
final Map<String, String> querymap = Splitter.on('&').trimResults().withKeyValueSeparator("=")
.split(query);
JSONObject querymapJson = new JSONObject(querymap);
int subscriberid = 0;
int listid = 0;
int statid = 0;
int points = 0;
String stattype = "";
String requesttype = "";
try {
subscriberid = querymapJson.getInt("emp_uid");
} catch (Exception e) {
}
try {
listid = querymapJson.getInt("emp_lid");
} catch (Exception e) {
}
try {
statid = querymapJson.getInt("emp_statid");
} catch (Exception e) {
}
try {
stattype = querymapJson.getString("emp_stattype");
Map.put("stattype", stattype);
} catch (Exception e) {
}
try {
requesttype = querymapJson.getString("type");
} catch (Exception e) {
}
try {
statid = querymapJson.getInt("leadscore");
} catch (Exception e) {
}
Map.put("subscriberid", String.valueOf(subscriberid));
Map.put("listid", String.valueOf(listid));
Map.put("statid", String.valueOf(statid));
Map.put("requesttype", requesttype);
Map.put("leadscore", String.valueOf(points));
Map.put("requestip", jsonObj.getJSONObject("httpRequest").getString("remoteIp"));
System.out.print("Hello from message 1");
c.output(Map.toString());
}
}));
org.apache.beam.sdk.values.PCollection<String> messages3 = messages2.apply("Test",
ParDo.of(new DoFn<String, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
System.out.println(c.element());
System.out.print("Hello from message 2");
}
}));
org.apache.beam.sdk.values.PCollection<KV<String, String>> messages23 = messages2.apply(JdbcIO.<KV<String, String>>read()
.withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create("org.apache.derby.jdbc.ClientDriver",
"jdbc:derby://localhost:1527/beam"))
.withQuery("select * from artist").withRowMapper(new JdbcIO.RowMapper<KV<String, String>>() {
public KV<String, String> mapRow(ResultSet resultSet) throws Exception {
KV<String, String> kv = KV.of(resultSet.getString("label"), resultSet.getString("name"));
return kv;
}
#Override
public KV<String, String> mapRow(java.sql.ResultSet resultSet) throws Exception {
KV<String, String> kv = KV.of(resultSet.getString("label"), resultSet.getString("name"));
return kv;
}
}).withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));
p.run().waitUntilFinish();
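No answer is attached above, but one hedged observation on the compile problem: JdbcIO.read() is a transform that starts from the pipeline itself, so it cannot be applied to an existing PCollection such as messages2; JdbcIO.readAll() is the variant that takes an input PCollection of query parameters. A rough sketch under the assumption that an upstream step emits just the listid as a String, with a hypothetical MySQL driver class and JDBC URL:
// listIds: a PCollection<String> holding only the listid extracted from each Pub/Sub message (assumed)
org.apache.beam.sdk.values.PCollection<KV<String, String>> rows = listIds.apply(
    JdbcIO.<String, KV<String, String>>readAll()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "com.mysql.cj.jdbc.Driver", "jdbc:mysql://localhost:3306/mydb")) // hypothetical driver/URL
        .withQuery("select label, name from artist where listid = ?")
        .withParameterSetter((element, preparedStatement) -> preparedStatement.setString(1, element))
        .withRowMapper((JdbcIO.RowMapper<KV<String, String>>) resultSet ->
            KV.of(resultSet.getString("label"), resultSet.getString("name")))
        .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));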

How can I write results of JavaPairDStream into output kafka topic on Spark Streaming?

I'm looking for a way to write a DStream to an output Kafka topic, but only when the micro-batch RDDs actually produce something.
I'm using Spark Streaming and the spark-streaming-kafka connector in Java 8 (both latest versions), and I cannot figure it out.
Thanks for the help.
If dStream contains the data that you want to send to Kafka:
dStream.foreachRDD(rdd -> {
rdd.foreachPartition(iter -> {
Producer producer = createKafkaProducer();
while (iter.hasNext()) {
sendToKafka(producer, iter.next());
}
producer.close();
});
});
So, you create one producer per RDD partition.
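Note that this creates (and should close) a producer for every partition of every micro-batch. A common variation, not part of the original answer, is to keep one lazily created producer per executor JVM and reuse it across batches; a minimal sketch (class and method names are mine):
// Hypothetical helper: one KafkaProducer per executor JVM, created on first use and reused afterwards
public class ProducerHolder {
    private static KafkaProducer<String, String> producer;

    public static synchronized KafkaProducer<String, String> get(Properties props) {
        if (producer == null) {
            producer = new KafkaProducer<>(props);
            // flush and close buffered records when the executor JVM shuts down
            Runtime.getRuntime().addShutdownHook(new Thread(() -> producer.close()));
        }
        return producer;
    }
}
Inside foreachPartition you would then call ProducerHolder.get(props) instead of createKafkaProducer(), and skip the per-partition close.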
In my example I want to send events taken from a specific Kafka topic to another one. I do a simple word count: I take data from the Kafka input topic, count the words, and output them to an output Kafka topic. Don't forget the goal is to write the results of a JavaPairDStream into an output Kafka topic using Spark Streaming.
//Spark Configuration
SparkConf sparkConf = new SparkConf().setAppName("SendEventsToKafka");
String brokerUrl = "localhost:9092";
String inputTopic = "receiverTopic";
String outputTopic = "producerTopic";
//Create the java streaming context
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
//Prepare the list of topics we listen for
Set<String> topicList = new TreeSet<>();
topicList.add(inputTopic);
//Kafka direct stream parameters
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", brokerUrl);
kafkaParams.put("group.id", "kafka-cassandra" + new SecureRandom().nextInt(100));
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
//Kafka output topic specific properties
Properties props = new Properties();
props.put("bootstrap.servers", brokerUrl);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("acks", "1");
props.put("retries", "3");
props.put("linger.ms", 5);
//Here we create a direct stream for kafka input data.
final JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(jssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topicList, kafkaParams));
JavaPairDStream<String, String> results = messages
.mapToPair(new PairFunction<ConsumerRecord<String, String>, String, String>() {
@Override
public Tuple2<String, String> call(ConsumerRecord<String, String> record) {
return new Tuple2<>(record.key(), record.value());
}
});
JavaDStream<String> lines = results.map(new Function<Tuple2<String, String>, String>() {
@Override
public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
}
});
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
@Override
public Iterator<String> call(String x) {
log.info("Line retrieved {}", x);
return Arrays.asList(SPACE.split(x)).iterator();
}
});
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String s) {
log.info("Word to count {}", s);
return new Tuple2<>(s, 1);
}
}).reduceByKey(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer i1, Integer i2) {
log.info("Count with reduceByKey {}", i1 + i2);
return i1 + i2;
}
});
//Here we iterate over the JavaPairDStream to write words and their count into kafka
wordCounts.foreachRDD(new VoidFunction<JavaPairRDD<String, Integer>>() {
@Override
public void call(JavaPairRDD<String, Integer> arg0) throws Exception {
Map<String, Integer> wordCountMap = arg0.collectAsMap();
List<WordOccurence> topicList = new ArrayList<>();
for (String key : wordCountMap.keySet()) {
//Here we send event to kafka output topic
publishToKafka(key, wordCountMap.get(key), outputTopic, props);
}
JavaRDD<WordOccurence> WordOccurenceRDD = jssc.sparkContext().parallelize(topicList);
CassandraJavaUtil.javaFunctions(WordOccurenceRDD)
.writerBuilder(keyspace, table, CassandraJavaUtil.mapToRow(WordOccurence.class))
.saveToCassandra();
log.info("Words successfully added : {}, keyspace {}, table {}", words, keyspace, table);
}
});
jssc.start();
jssc.awaitTermination();
The wordCounts variable is of type JavaPairDStream<String, Integer>; I just iterate over it using foreachRDD and write to Kafka using a specific function:
public static void publishToKafka(String word, long count, String topic, Properties props) {
KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);
try {
ObjectMapper mapper = new ObjectMapper();
String jsonInString = mapper.writeValueAsString(word + " " + count);
String event = "{\"word_stats\":" + jsonInString + "}";
log.info("Message to send to kafka : {}", event);
producer.send(new ProducerRecord<String, String>(topic, event));
log.info("Event : " + event + " published successfully to kafka!!");
} catch (Exception e) {
log.error("Problem while publishing the event to kafka : " + e.getMessage());
}
producer.close();
}
Hope that helps!
