Count number of message/event in Kafka stream at a periodic level - java

I have created a Kafka stream by consuming messages from a Kafka topic. I want to count how many messages I receive in each 1-minute interval.
So let's say, I have got the message in the following way:
t1 -> message1
t1 -> message2
t1 -> message3
One minute later I receive these messages:
t2 -> message4
t2 -> message5
Let's say I have an integer variable count in my Java application. From the start of the application until the end of the first minute, count should be 3. At the end of the second minute, count should be 2. This is because I received 3 messages in the first minute and 2 messages in the second minute.
My code so far
import lombok.SneakyThrows;
import org.apache.commons.lang3.StringUtils;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.ForeachAction;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;
public class CountMessage {
private static KafkaStreams kafkaStreams;
public static void main(String[] args) {
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my_first_count_2");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "10.0.0.43:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, MyTimestampExtractor.class);
final StreamsBuilder streamsBuilder = new StreamsBuilder();
// consuming stream
String kafkaTopic = "my_kafka_topic_2";
System.out.println("Starting the application");
KStream<String, String> myStream = streamsBuilder
.stream(kafkaTopic);
myStream.foreach(new ForeachAction<String, String>() {
@SneakyThrows
@Override
public void apply(String key, String value) {
System.out.println("key received = " + key + "---<<<" + value);
}
});
final Topology topology = streamsBuilder.build();
kafkaStreams = new KafkaStreams(topology, props);
kafkaStreams.start();
}
}

Not sure if you're tied to using Kafka Streams, but for what it's worth you can do this with ksqlDB:
SELECT TIMESTAMPTOSTRING(WINDOWSTART,'yyyy-MM-dd HH:mm:ss') AS TS,
COUNT(*) AS MSG_COUNT
FROM SRC_STREAM
WINDOW TUMBLING (SIZE 1 MINUTE)
GROUP BY 'X'
EMIT CHANGES;
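If you stay with Kafka Streams, the equivalent is a windowed count, roughly groupBy(...).windowedBy(TimeWindows.of(Duration.ofMinutes(1))).count(). The plain-Java sketch below is not Kafka Streams code; it only illustrates what a 1-minute tumbling window does with the timestamps from the question. The class and method names are made up for illustration, and timestamps are assumed to be non-negative epoch milliseconds.

```java
import java.util.Map;
import java.util.TreeMap;

class TumblingWindowCount {
    // Window size: one minute, in milliseconds.
    static final long WINDOW_MS = 60_000L;

    // Start of the tumbling window that contains the given timestamp
    // (assumes a non-negative timestamp).
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_MS);
    }

    // Counts events per window; keys are window start times in milliseconds.
    static Map<Long, Long> countPerWindow(long[] timestampsMs) {
        Map<Long, Long> counts = new TreeMap<>();
        for (long ts : timestampsMs) {
            counts.merge(windowStart(ts), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three messages in the first minute, two in the second.
        long[] timestamps = {1_000L, 20_000L, 59_999L, 60_000L, 90_000L};
        System.out.println(countPerWindow(timestamps)); // prints {0=3, 60000=2}
    }
}
```

Each window gets its own counter, so nothing has to be reset by hand at the minute boundary; a single shared count variable would need that reset logic.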

Kafka Streams Twitter Wordcount - Count Value not Long after Serialization

I am running a Kafka Cluster Docker Compose on an AWS EC2 instance.
I want to receive all the tweets of a specific keyword and push them to Kafka. This works fine.
But I also want to count the most used words of those tweets.
This is the WordCount code:
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.StreamsBuilder;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import java.util.concurrent.CountDownLatch;
import static org.apache.kafka.streams.StreamsConfig.APPLICATION_ID_CONFIG;
import static org.apache.kafka.streams.StreamsConfig.BOOTSTRAP_SERVERS_CONFIG;
import static org.apache.kafka.streams.StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG;
import static org.apache.kafka.streams.StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG;
public class WordCount {
public static void main(String[] args) {
final StreamsBuilder builder = new StreamsBuilder();
final KStream<String, String> textLines = builder
.stream("test-topic");
textLines
.flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
.groupBy((key, value) -> value)
.count(Materialized.as("WordCount"))
.toStream()
.to("test-output", Produced.with(Serdes.String(), Serdes.Long()));
final Topology topology = builder.build();
Properties props = new Properties();
props.put(APPLICATION_ID_CONFIG, "streams-word-count");
props.put(BOOTSTRAP_SERVERS_CONFIG, "ec2-ip:9092");
props.put(DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
final KafkaStreams streams = new KafkaStreams(topology, props);
final CountDownLatch latch = new CountDownLatch(1);
Runtime.getRuntime().addShutdownHook(
new Thread("streams-shutdown-hook") {
@Override
public void run() {
streams.close();
latch.countDown();
}
});
try {
streams.start();
latch.await();
} catch (Throwable e) {
System.exit(1);
}
System.exit(0);
}
}
When I check the output topic in the Control Center, it looks like this:
[Screenshot of the output topic: Key and Value columns]
Looks like it's working as far as splitting the tweets into single words. But the count value isn't in Long format, although it is specified in the code.
When I use the kafka-console-consumer to consume from this topic, it says:
"Size of data received by LongDeserializer is not 8"
By default, the Control Center UI and the console consumer can only render UTF-8 data.
You'll need to pass LongDeserializer to the console consumer explicitly, and only as the value deserializer (the keys are still strings).
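To see why the counts look wrong as text and why the size check fails, it helps to look at the bytes. A Long value is written as 8 big-endian bytes, which is not readable UTF-8, while a word used as the key is not 8 bytes long. The sketch below uses only the JDK to mimic that layout; the class and method names are illustrative and it is not the Kafka serializer itself.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class LongBytesDemo {
    // Encodes a long as 8 big-endian bytes, the same layout a long serializer writes.
    static byte[] encode(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Decodes 8 bytes back into a long and rejects any other length,
    // similar in spirit to the "is not 8" error from the question.
    static long decode(byte[] bytes) {
        if (bytes.length != Long.BYTES) {
            throw new IllegalArgumentException("Size of data is not 8, got " + bytes.length);
        }
        return ByteBuffer.wrap(bytes).getLong();
    }

    public static void main(String[] args) {
        byte[] raw = encode(42L);
        System.out.println("value bytes: " + raw.length);
        System.out.println("decoded as long: " + decode(raw));

        // A word key is plain UTF-8 text, so decoding it as a long fails.
        byte[] key = "kafka".getBytes(StandardCharsets.UTF_8);
        try {
            decode(key);
        } catch (IllegalArgumentException e) {
            System.out.println("key is not a long: " + e.getMessage());
        }
    }
}
```

This is why the Long deserializer should be applied to the value only, with the key left as a string.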
Try a KTable instead:
KStream<String, String> textLines = builder.stream("test-topic", Consumed.with(Serdes.String(), Serdes.String()));
// count() returns the KTable; keep it in a variable before writing it out
KTable<String, Long> wordCounts = textLines
.flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
.groupBy((key, value) -> value)
.count();
wordCounts
.toStream()
.to("test-output", Produced.with(Serdes.String(), Serdes.Long()));

TestOutputTopic.readKeyValuesToMap() removes messages from tested topic. How to do intermediate assertions during the test?

While using TopologyTestDriver, I want to test my stream and make assertions on the intermediate state between incoming messages. But after calling TestOutputTopic.readKeyValuesToMap(), the tested topic is cleared. How can I "peek" and make assertions between messages?
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.test.TestRecord;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;
import java.util.Properties;
public class AggregationTest {
private static TestInputTopic<String, String> inputTopic;
private static TestOutputTopic<String, String> outputTopic;
private static TopologyTestDriver testDriver;
@BeforeAll
public static void setup() {
StreamsBuilder builder = new StreamsBuilder();
builder
.stream("inputTopic", Consumed.with(Serdes.String(), Serdes.String()))
.toTable(Materialized.with(Serdes.String(), Serdes.String()))
.groupBy(
KeyValue::pair,
Grouped.with("group-by-internal", Serdes.String(), Serdes.String()))
.aggregate(
() -> "",
(key, incomingMessage, existingMessage) -> incomingMessage + " " + existingMessage,
(key, incomingMessage, existingMessage) -> existingMessage
).toStream().to("outputTopic", Produced.with(Serdes.String(), Serdes.String()));
testDriver = new TopologyTestDriver(builder.build(), new Properties() {{
put(StreamsConfig.APPLICATION_ID_CONFIG, "test");
put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, Serdes.String().getClass().getName());
put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
put(StreamsConfig.STATE_DIR_CONFIG, "/tmp/kafka-streams");
}});
inputTopic = testDriver.createInputTopic("inputTopic", Serdes.String().serializer(), Serdes.String().serializer());
outputTopic = testDriver.createOutputTopic("outputTopic", Serdes.String().deserializer(), Serdes.String().deserializer());
}
@Test
public void testAggregation() {
TestRecord<String, String> message1 = new TestRecord<>( "Key1", "Value1");
TestRecord<String, String> message2 = new TestRecord<>( "Key2", "Value2");
TestRecord<String, String> message3 = new TestRecord<>( "Key1", "Value3");
inputTopic.pipeInput(message1);
inputTopic.pipeInput(message2);
var outputMap = outputTopic.readKeyValuesToMap();
System.out.println(outputMap); // {Key2=Value2 , Key1=Value1 }
// Assert that message1 and message2 were not affected
inputTopic.pipeInput(message3);
var outputMap2 = outputTopic.readKeyValuesToMap();
System.out.println(outputMap2); // {Key1=Value3 Value1 } // where did the message with Key2 go?
// How to assert that message3 was merged with message1, but message2 was not affected?
}
@AfterAll
public static void tearDown() {
testDriver.close();
}
}
The JavaDoc indicates it should return the full, latest state of the topic (are you sure a tombstone event wasn't introduced somehow?), so I am not sure why it would disappear.
If you want to aggregate the state of both maps, you can merge them rather than reassigning the previous reference, but that would fix the test, not necessarily the actual runtime behavior.
You may want to revisit your aggregate function. Key2 has no existingMessage when it first arrives, so you've returned null there and it would not exist in the map output. The only value you'd have is therefore Value3 Value1.
Try this for cases where you expect only one value:
(key, incomingMessage, existingMessage) -> existingMessage == null ? incomingMessage : existingMessage
From the docs:
readKeyValuesToMap:
"Read output to map. If the result is considered a stream, you can use readRecordsToList() instead."
The map depicts the table, containing only the latest value per key. What you want is the stream of data records.
Link: https://kafka.apache.org/24/javadoc/org/apache/kafka/streams/TestOutputTopic.html#readKeyValuesToMap--
(PS sorry for the incomplete answer, I can not comment yet.)
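A minimal sketch of the "merge the maps" suggestion above. It uses a hypothetical in-memory stand-in (DrainingTopicDemo) instead of the real TestOutputTopic, on the assumption, consistent with what the question observes, that each read consumes the records it returns. In an actual test the same pattern would be a cumulative map that is updated with state.putAll(outputTopic.readKeyValuesToMap()) after each pipeInput call.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

class DrainingTopicDemo {
    // Hypothetical stand-in for an output topic whose reads consume the records.
    private final Deque<Map.Entry<String, String>> pending = new ArrayDeque<>();

    void write(String key, String value) {
        pending.add(Map.entry(key, value));
    }

    // Drains every pending record into a map, keeping the latest value per key.
    Map<String, String> readKeyValuesToMap() {
        Map<String, String> result = new LinkedHashMap<>();
        while (!pending.isEmpty()) {
            Map.Entry<String, String> record = pending.poll();
            result.put(record.getKey(), record.getValue());
        }
        return result;
    }

    // Accumulates every read into one map so earlier keys stay visible.
    static Map<String, String> demo() {
        DrainingTopicDemo topic = new DrainingTopicDemo();
        Map<String, String> state = new LinkedHashMap<>();

        topic.write("Key1", "Value1");
        topic.write("Key2", "Value2");
        state.putAll(topic.readKeyValuesToMap()); // intermediate assertions go here

        topic.write("Key1", "Value3 Value1");
        state.putAll(topic.readKeyValuesToMap()); // Key2 is still in state
        return state;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {Key1=Value3 Value1, Key2=Value2}
    }
}
```

The second read on its own only contains Key1, which matches the behavior in the question; the cumulative map is what lets you assert that Key2 was left untouched.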

Kafka Streams - Fields in the Custom object changing to null while doing Aggregation

I've written a simple Kafka Streams processor to:
1. Read messages as a stream from a topic with <K, V> as <String, String>
2. Convert the value in the message from String to a custom object <String, Object> using the mapValues() method
3. Use a window function to aggregate the statistics of the objects for a particular time interval
Sample Message
{"coiRequestGuid":"xxxx","accountId":1122132,"companyName":"xxxx","existingPolicyCoverageLimit":1000000,"isChangeRequested":true,"newlyRequestedPolicyCoverageLimit":200000,"isNewRecipient":false,"newRecipientGuid":null,"existingRecipientId":11111,"recipientName":"xxxx","recipientEmail":"xxxxx"}
Here is my code
import com.da.app.data.model.PolicyChangeRequest;
import com.da.app.data.model.PolicyChangeRequestStats;
import com.da.app.data.serde.JsonDeserializer;
import com.da.app.data.serde.JsonSerializer;
import com.da.app.data.serde.WrapperSerde;
import com.da.app.system.util.ConfigUtil;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.*;
import org.apache.kafka.streams.state.WindowStore;
import org.apache.log4j.Logger;
import org.json.JSONObject;
import java.time.Duration;
import java.util.Properties;
public class PolicyChangeReqStreamProcessor {
private static final Logger logger = Logger.getLogger(PolicyChangeReqStreamProcessor.class);
private static final String TOPIC_NAME = "stream-window-play1";
private static ObjectMapper mapper = new ObjectMapper();
private static Properties properties = ConfigUtil.loadProperty();
public static void main(String[] args) {
logger.info("Policy Limit Change Stats Generator");
Properties streamProperties = getStreamProperties();
StreamsBuilder streamsBuilder = new StreamsBuilder();
KStream<String, String> source = streamsBuilder.stream(TOPIC_NAME,
Consumed.with(Serdes.String(), Serdes.String()));
source
.filter((key, value) -> isValidEvent(value))
//Converting the request json to PolicyChangeRequest object
.mapValues(PolicyChangeReqStreamProcessor::convertPolicyChangeReqJsonToObj)
//Mapping all events to a single key in order to group all the events
.map((key, value) -> new KeyValue<>("key", value))
// Grouping by key
.groupByKey(Grouped.with(Serdes.String(), new PolicyChangeRequestSerde()))
//Creating a Tumbling window of 5 secs (for Testing)
.windowedBy(TimeWindows.of(Duration.ofSeconds(5)).advanceBy(Duration.ofSeconds(5)))
// Aggregating the PolicyChangeRequest events to a
// PolicyChangeRequestStats object
.<PolicyChangeRequestStats>aggregate(PolicyChangeRequestStats::new,
(k, v, policyStats) -> policyStats.add(v),
Materialized.<String, PolicyChangeRequestStats, WindowStore<Bytes, byte[]>>as
("policy-change-aggregates")
.withValueSerde(new PolicyChangeRequestStatsSerde()))
//Converting KTable to KStream
.toStream()
.foreach((key, value) -> logger.info(key.window().startTime() + "----" + key.window().endTime() + " :: " + value));
KafkaStreams kafkaStreams = new KafkaStreams(streamsBuilder.build(), streamProperties);
logger.info("Started the stream");
kafkaStreams.start();
Runtime.getRuntime().addShutdownHook(new Thread(kafkaStreams::close));
}
private static PolicyChangeRequest convertPolicyChangeReqJsonToObj(String policyChangeReq) {
JSONObject policyChangeReqJson = new JSONObject(policyChangeReq);
PolicyChangeRequest policyChangeRequest = new PolicyChangeRequest(policyChangeReqJson);
// return mapper.readValue(value, PolicyChangeRequest.class);
return policyChangeRequest;
}
private static boolean isValidEvent(String value) {
//TODO: Message Validation
return true;
}
private static Properties getStreamProperties() {
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "policy-change-stats-gen");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, properties.getProperty("kafka.bootstrap.servers"));
props.put(StreamsConfig.CLIENT_ID_CONFIG, "stream-window-play1");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
return props;
}
public static final class PolicyChangeRequestStatsSerde extends WrapperSerde<PolicyChangeRequestStats> {
PolicyChangeRequestStatsSerde() {
super(new JsonSerializer<>(), new JsonDeserializer<>(PolicyChangeRequestStats.class));
}
}
public static final class PolicyChangeRequestSerde extends WrapperSerde<PolicyChangeRequest> {
PolicyChangeRequestSerde() {
super(new JsonSerializer<>(), new JsonDeserializer<>(PolicyChangeRequest.class));
}
}
}
isValidEvent - returns true
.mapValues(PolicyChangeReqStreamProcessor::convertPolicyChangeReqJsonToObj) - This will convert the incoming json string to a PolicyChangeRequest object
Till the operation map((key, value) -> new KeyValue<>("key", value)), the Custom Object - PolicyChangeRequest is fine as per the incoming message (I've tested by printing the stream there).
But after going through the groupBy and aggregate operations, the custom object changes to
PolicyChangeRequest{coiRequestGuid='null', accountId='null', companyName='null', existingPolicyCoverageLimit=null, isChangeRequested=null, newlyRequestedPolicyCoverageLimit=null, isNewRecipient=null, newRecipientGuid='null', existingRecipientId='null', recipientName='null', recipientEmail='null'}
I found the above value by putting a log statement inside the policyStats.add(v) method i've called inside the aggregate method.
The add method is in the PolicyChangeRequestStats class
public PolicyChangeRequestStats add(PolicyChangeRequest policyChangeRequest) {
System.out.println("Incoming req: " + policyChangeRequest);
//Incrementing the Policy limit change request count
this.policyLimitChangeRequests++;
//Adding the Increased policy limit coverage to the existing increasedPolicyLimitCoverage
this.increasedPolicyLimitCoverage +=
(policyChangeRequest.getNewlyRequestedPolicyCoverageLimit() -
policyChangeRequest.getExistingPolicyCoverageLimit());
return this;
}
I'm getting a NullPointerException on the line that computes policyChangeRequest.getNewlyRequestedPolicyCoverageLimit() - policyChangeRequest.getExistingPolicyCoverageLimit(), because those values are null in the PolicyChangeRequest object.
I've provided Serde classes for the key and value in the grouping step: .groupByKey(Grouped.with(Serdes.String(), new PolicyChangeRequestSerde())).
For serialization and deserialization I used Gson.
But I am unable to get the PolicyChangeRequest object as it was before it was sent to the grouping operation.
I'm new to Kafka Streams and I'm not sure whether I missed anything or whether my approach is correct.
Can anyone guide me here?

Concatenate logs by ID and time using Kafka Streams - Failed to flush state store

I want to concatenate logs by ID within a window of time using Kafka Streams.
For now, I can successfully count the number of logs with the same ID (the commented code).
However, when I replace the .count method with .aggregate, I get the following error:
"Failed to flush state store time-windowed-aggregation-stream-store"
Caused by: java.lang.ClassCastException: org.apache.kafka.streams.kstream.Windowed cannot be cast to java.lang.String
I'm new to this and can't figure out the cause of this error. I thought that .withValueSerde(Serdes.String()) was supposed to prevent it.
Below my code:
package myapps;
import java.time.Duration;
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.*;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Suppressed.*;
import org.apache.kafka.streams.state.WindowStore;
public class MyCode {
public static void main(String[] args) throws Exception {
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-mycode");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
final StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("streams-plaintext-input");
KStream<String, String> changedKeyStream = source.selectKey((k, v)
-> v.substring(v.indexOf("mid="),v.indexOf("mid=")+8));
/* // Working code for count
changedKeyStream
.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofSeconds(3))
.grace(Duration.ofSeconds(2)))
.count(Materialized.with(Serdes.String(), Serdes.Long())) // could be replaced with an aggregator (reducer?) ?
.suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()))
.toStream()
.print(Printed.toSysOut());
*/
changedKeyStream
.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofSeconds(3)))
.aggregate(
String::new, (String k, String v, String Result) -> { return Result+"\n"+v; },
Materialized.<String, String, WindowStore<Bytes, byte[]>>as("time-windowed-aggregated-stream-store") /* state store name */
.withValueSerde(Serdes.String())) /* serde for aggregate value */
.suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()))
.toStream()
.print(Printed.toSysOut());
changedKeyStream.to("streams-mycode-output", Produced.with(Serdes.String(), Serdes.String()));
final Topology topology = builder.build();
final KafkaStreams streams = new KafkaStreams(topology, props);
final CountDownLatch latch = new CountDownLatch(1);
// attach shutdown handler to catch control-c
Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
@Override
public void run() {
streams.close();
latch.countDown();
}
});
// launch until control+c
try {
streams.start();
latch.await();
} catch (Throwable e) {
System.out.print("Something went wrong!");
System.exit(1);
}
System.exit(0);
}
}
Thank you in advance for your help.
There are two options to fix it:
1. Pass org.apache.kafka.streams.kstream.Grouped to KStream::groupByKey.
2. Set the key org.apache.kafka.common.serialization.Serde on Materialized with Materialized::withKeySerde(...).
Sample code below:
Ad 1.
changedKeyStream
.groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
.windowedBy(TimeWindows.of(Duration.ofSeconds(3)))
Ad 2.
changedKeyStream
.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofSeconds(3)))
.aggregate(
String::new, (String k, String v, String Result) -> { return Result+"_"+v; },
Materialized.<String, String, WindowStore<Bytes, byte[]>>as("time-windowed-aggregated-stream-store") /* state store name */
.withValueSerde(Serdes.String())
.withKeySerde(Serdes.String())
)
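For reference, the result the question is after (values concatenated per ID and per 3-second window) can be modelled in plain Java. This is only an illustration of the aggregation logic, not Kafka Streams code: the class name, the "id@windowStart" key format, and the sample data are made up, and timestamps are assumed to be non-negative milliseconds.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WindowedConcatDemo {
    // Window size of 3 seconds, matching the question.
    static final long WINDOW_MS = 3_000L;

    // Mirrors the aggregator from the question: start from "" and append "\n" + value,
    // separately for every (id, window) pair.
    static Map<String, String> concatPerIdAndWindow(String[] ids, long[] timestampsMs, String[] lines) {
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < ids.length; i++) {
            long windowStart = timestampsMs[i] - (timestampsMs[i] % WINDOW_MS);
            String windowedKey = ids[i] + "@" + windowStart; // illustrative key format
            result.merge(windowedKey, "\n" + lines[i], String::concat);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] ids = {"mid=0001", "mid=0001", "mid=0002", "mid=0001"};
        long[] timestamps = {0L, 1_000L, 1_500L, 3_500L};
        String[] lines = {"a", "b", "c", "d"};
        concatPerIdAndWindow(ids, timestamps, lines)
                .forEach((key, value) -> System.out.println(key + " -> " + value.replace("\n", "\\n")));
    }
}
```

Note that starting from an empty string and prepending the separator leaves a leading newline in every aggregate; if that matters, the aggregator can skip the separator when the current result is empty.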

Kafka topic partition and Spark executor mapping

I am using Spark Streaming with a Kafka topic. The topic was created with 5 partitions, and all my messages are published to it using the table name as the key.
Given this, I assume all messages for a given table should go to the same partition.
But I notice in the Spark logs that messages for the same table sometimes go to the executor on node-1 and sometimes to the executor on node-2.
I am running code in yarn-cluster mode using following command:
spark-submit --name DataProcessor --master yarn-cluster --files /opt/ETL_JAR/executor-log4j-spark.xml,/opt/ETL_JAR/driver-log4j-spark.xml,/opt/ETL_JAR/application.properties --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=driver-log4j-spark.xml" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=executor-log4j-spark.xml" --class com.test.DataProcessor /opt/ETL_JAR/etl-all-1.0.jar
This submission creates 1 driver, say on node-1, and 2 executors, on node-1 and node-2.
I don't want the node-1 and node-2 executors to read the same partition, but this is happening.
I also tried the following configuration to specify a consumer group, but it made no difference.
kafkaParams.put("group.id", "app1");
This is how we are creating the stream using createDirectStream method
*Not through zookeeper.
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", brokers);
kafkaParams.put("auto.offset.reset", "largest");
kafkaParams.put("group.id", "app1");
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicsSet
);
Complete Code:
import java.io.Serializable;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import kafka.serializer.StringDecoder;
import scala.Tuple2;
public class DataProcessor2 implements Serializable {
private static final long serialVersionUID = 3071125481526170241L;
private static Logger log = LoggerFactory.getLogger("DataProcessor");
public static void main(String[] args) {
final String sparkCheckPointDir = ApplicationProperties.getProperty(Consts.SPARK_CHECKPOINTING_DIR);
DataProcessorContextFactory3 factory = new DataProcessorContextFactory3();
JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(sparkCheckPointDir, factory);
// Start the process
jssc.start();
jssc.awaitTermination();
}
}
class DataProcessorContextFactory3 implements JavaStreamingContextFactory, Serializable {
private static final long serialVersionUID = 6070911284191531450L;
private static Logger logger = LoggerFactory.getLogger(DataProcessorContextFactory3.class);
DataProcessorContextFactory3() {
}
@Override
public JavaStreamingContext create() {
logger.debug("creating new context..!");
final String brokers = ApplicationProperties.getProperty(Consts.KAFKA_BROKERS_NAME);
final String topic = ApplicationProperties.getProperty(Consts.KAFKA_TOPIC_NAME);
final String app = "app1";
final String offset = ApplicationProperties.getProperty(Consts.KAFKA_CONSUMER_OFFSET, "largest");
logger.debug("Data processing configuration. brokers={}, topic={}, app={}, offset={}", brokers, topic, app,
offset);
if (StringUtils.isBlank(brokers) || StringUtils.isBlank(topic) || StringUtils.isBlank(app)) {
System.err.println("Usage: DataProcessor <brokers> <topic>\n" + Consts.KAFKA_BROKERS_NAME
+ " is a list of one or more Kafka brokers separated by comma\n" + Consts.KAFKA_TOPIC_NAME
+ " is a kafka topic to consume from \n\n\n");
System.exit(1);
}
final String majorVersion = "1.0";
final String minorVersion = "3";
final String version = majorVersion + "." + minorVersion;
final String applicationName = "DataProcessor-" + topic + "-" + version;
// for dev environment
SparkConf sparkConf = new SparkConf().setMaster("local[*]").setAppName(applicationName);
// for cluster environment
//SparkConf sparkConf = new SparkConf().setAppName(applicationName);
final long sparkBatchDuration = Long
.valueOf(ApplicationProperties.getProperty(Consts.SPARK_BATCH_DURATION, "10"));
final String sparkCheckPointDir = ApplicationProperties.getProperty(Consts.SPARK_CHECKPOINTING_DIR);
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(sparkBatchDuration));
logger.debug("setting checkpoint directory={}", sparkCheckPointDir);
jssc.checkpoint(sparkCheckPointDir);
HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(topic.split(",")));
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", brokers);
kafkaParams.put("auto.offset.reset", offset);
kafkaParams.put("group.id", "app1");
// @formatter:off
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicsSet
);
// @formatter:on
processRDD(messages, app);
return jssc;
}
private void processRDD(JavaPairInputDStream<String, String> messages, final String app) {
JavaDStream<MsgStruct> rdd = messages.map(new MessageProcessFunction());
rdd.foreachRDD(new Function<JavaRDD<MsgStruct>, Void>() {
private static final long serialVersionUID = 250647626267731218L;
@Override
public Void call(JavaRDD<MsgStruct> currentRdd) throws Exception {
if (!currentRdd.isEmpty()) {
logger.debug("Receive RDD. Create JobDispatcherFunction at HOST={}", FunctionUtil.getHostName());
currentRdd.foreachPartition(new VoidFunction<Iterator<MsgStruct>>() {
@Override
public void call(Iterator<MsgStruct> arg0) throws Exception {
while(arg0.hasNext()){
System.out.println(arg0.next().toString());
}
}
});
} else {
logger.debug("Current RDD is empty.");
}
return null;
}
});
}
public static class MessageProcessFunction implements Function<Tuple2<String, String>, MsgStruct> {
@Override
public MsgStruct call(Tuple2<String, String> data) throws Exception {
String message = data._2();
System.out.println("message:"+message);
return MsgStruct.parse(message);
}
}
public static class MsgStruct implements Serializable{
private String message;
public static MsgStruct parse(String msg){
MsgStruct m = new MsgStruct();
m.message = msg;
return m;
}
public String toString(){
return "content inside="+message;
}
}
}
According to the Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher), you can specify an explicit mapping of partitions to hosts.
Assume you have two hosts (h1 and h2), and the Kafka topic topic-name has three partitions. The following code shows how to map a specified partition to a host in Java.
Map<TopicPartition, String> partitionMapToHost = new HashMap<>();
// partition 0 -> h1, partition 1 and 2 -> h2
partitionMapToHost.put(new TopicPartition("topic-name", 0), "h1");
partitionMapToHost.put(new TopicPartition("topic-name", 1), "h2");
partitionMapToHost.put(new TopicPartition("topic-name", 2), "h2");
List<String> topicCollection = Arrays.asList("topic-name");
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "10.0.0.2:9092,10.0.0.3:9092");
kafkaParams.put("group.id", "group-id-name");
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
JavaInputDStream<ConsumerRecord<String, String>> records = KafkaUtils.createDirectStream(jssc,
LocationStrategies.PreferFixed(partitionMapToHost), // PreferFixed is the key
ConsumerStrategies.Subscribe(topicCollection, kafkaParams));
You can also use LocationStrategies.PreferConsistent(), which distributes partitions evenly across the available executors and ensures that a given partition is only consumed by a given executor.
Using the DirectStream approach it's a correct assumption that messages sent to a Kafka partition will land in the same Spark partition.
What we cannot assume is that each Spark partition will be processed by the same Spark worker each time. On each batch interval, Spark tasks are created for each OffsetRange of each partition and sent to the cluster for processing, landing on some available worker.
What you are looking for is partition locality. The only partition locality that the direct Kafka consumer supports is the Kafka host containing the offset range being processed, in the case that your Spark and Kafka deployments are colocated; but that's a deployment topology that I don't see very often.
If your requirements dictate host locality, you should look into Apache Samza or Kafka Streams.
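The first half of this explanation, that records with the same key always land in the same Kafka partition, follows from how a keyed record is assigned: the key is hashed and the result is taken modulo the partition count. The sketch below shows that idea only. It uses String.hashCode() as a simplified stand-in, whereas Kafka's default partitioner applies a murmur2 hash to the serialized key bytes, so the actual partition numbers will differ. The class name and table names are made up.

```java
class KeyToPartitionDemo {
    // Simplified stand-in for key-based partitioning: hash the key, then take it
    // modulo the number of partitions. Kafka itself hashes the key bytes with murmur2.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 5; // the topic in the question has 5 partitions
        String[] tableNames = {"customers", "orders", "customers", "invoices", "orders"};
        for (String table : tableNames) {
            // The same table name maps to the same partition every time.
            System.out.println(table + " -> partition " + partitionFor(table, numPartitions));
        }
    }
}
```

The mapping from key to partition is deterministic, but nothing in it mentions an executor; which worker processes a given partition in a given batch is decided separately by Spark's scheduler.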
