How to acknowledge current offset in spring kafka for manual commit - java

I am using Spring Kafka first time and I am not able to use Acknowledgement.acknowledge() method for manual commit in my consumer code. please let me know if anything missing in my consumer configuration or listener code. or else is there other way to handle acknowledge offset based on condition.
Here i'm looking solution like if the offset is not committed/ acknowledge manually, it should pick same message/offset by consumer.
Configuration
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.annotation.EnableKafka;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.listener.AbstractMessageListenerContainer.AckMode;
#EnableKafka
#Configuration
public class ConsumerConfig {
#Value(value = "${kafka.bootstrapAddress}")
private String bootstrapAddress;
#Value(value = "${kafka.groupId}")
private String groupId;
#Bean
public ConcurrentKafkaListenerContainerFactory<String, String> containerFactory() {
Map<String, Object> props = new HashMap<String, Object>();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapAddress);
props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 100);
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
StringDeserializer.class);
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
ConcurrentKafkaListenerContainerFactory<String, String> factory = new ConcurrentKafkaListenerContainerFactory<String, String>();
factory.setConsumerFactory(new DefaultKafkaConsumerFactory<String, String>(
props));
factory.getContainerProperties().setAckMode(AckMode.MANUAL_IMMEDIATE);
factory.getContainerProperties().setSyncCommits(true);
return factory;
}
}
Listener
private static int value = 1;
#KafkaListener(id = "baz", topics = "${message.topic.name}", containerFactory = "containerFactory")
public void listenPEN_RE(#Payload String message,
#Header(KafkaHeaders.RECEIVED_PARTITION_ID) int partition,
#Header(KafkaHeaders.OFFSET) int offsets,
Acknowledgment acknowledgment) {
if (value%2==0){
acknowledgment.acknowledge();
}
value++;
}

Set the enable-auto-commit property to false:
propsMap.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
Set the ack-mode to MANUAL_IMMEDIATE:
factory.getContainerProperties().setAckMode(AbstractMessageListenerContainer.AckMode.MANUAL_IMMEDIATE);
Then, in your consumer/listener code, you can commit the offset manually, like this:
#KafkaListener(topics = "testKafka")
public void receive(ConsumerRecord<?, ?> consumerRecord,
Acknowledgment acknowledgment) {
System.out.println("Received message: ");
System.out.println(consumerRecord.value().toString());
acknowledgment.acknowledge();
}
Update: I created a small POC for this. Check it out here, might help you.

You need to do the following
1) Set enable-auto-commit property to false
consumerConfigProperties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
2) Set the ACK Mode to MANUL_IMMEDIATE
factory.getContainerProperties().setAckMode(AckMode.MANUAL_IMMEDIATE);
3) For processed records you need to call acknowledgment.acknowledge();
4) for failed records call acknowledgment.nack(10);
Note: the nack method takes a long parameter which is the sleep time and it should be less than max.poll.interval.ms
Below is a sample code
#KafkaListener(id = "baz", topics = "${message.topic.name}", containerFactory = "containerFactory")
public void listenPEN_RE(#Payload String message,
#Header(KafkaHeaders.RECEIVED_PARTITION_ID) int partition,
#Header(KafkaHeaders.OFFSET) int offsets,
Acknowledgment acknowledgment) {
if (value%2==0){
acknowledgment.acknowledge();
} else {
acknowledgment.nack(10); //sleep time should be less than max.poll.interval.ms
}
value++;
}

You can do following:
1. store the current record offset to file or DB.
2. Implement your kafka listener class with ConsumerAware.
3. Call registerSeekCallback as given below:
(registerSeekCallback(ConsumerSeekCallback callback)
{
callback.seek(topic, partition, offset)
}
So when the consumer goes down or new consumer is assigned , it start reading fomr the offset stored in your DB.

That doesn't work that way in Apache Kafka.
For the currently running consumer we may never worry about committing offsets. We need them persisted only for new consumers in the same consumer group. The current one track its offset in the memory. I guess somewhere on Broker.
If you need to refetch the same message in the same consumer maybe the next poll round, you should consider to use seek() functionality: https://docs.spring.io/spring-kafka/docs/2.0.1.RELEASE/reference/html/_reference.html#seek

I have a small nitpick with OP's configuration. If using #KafkaListener with ConcurrentKafkaListenerContainerFactory, dont forget to make the state thread-safe. Using private static int value = 1; isnt going to help. Use AtomicInteger

Related

Not getting avro messages while reading data from topic in java

I am writing code first time in java to consume AVRO data from kafka topic. I am using kafka-avro-console-producer to produce record.I am using leneseio/fast-data-dev image on Docker to UP kafka stack.
Producing record :
root#fast-data-dev / $ kafka-avro-console-producer --broker-list localhost:9092 --topic payengine --property schema.registry.url=http://localhost:8081 --property value.schema='{"type":"record", "name":"payengine", "fields":[{"name":"tin", "type":"string"},{"name":"ach","type":"string"}] }'
{"tin":"61582","ach":"I"}
{"tin":"97820","ach":"I"}
Now, to read this record, I have written below code. Also, it seems like I don't have to refer the schema while consuming records (as mentioned in the below reference link). I had also gone through one example, where SpecificAvroRecord was being used in place of GenericRecord, but that requires building class based on schema. I am not sure how GenericRecord points to correct schema, but don't see any schema reference in the example.
package com.github.psingh.Kafka;
import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import io.confluent.kafka.serializers.KafkaAvroDeserializerConfig;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringSerializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
public class SimpleConsumer_AvroSchema {
public static void main(String[] args) {
// System.out.println("Hello Kafka ");
// setting properties
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class.getName());
props.put(ConsumerConfig.GROUP_ID_CONFIG, "group1");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
props.put(KafkaAvroDeserializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
//name topic
String topic = "payengine";
// create the consumer
KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<String, GenericRecord>(props);
//subscribe to topic
consumer.subscribe(Collections.singleton(topic));
System.out.println("Waiting for the data...");
while (true) {
ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofMillis(5000));
for(ConsumerRecord<String,GenericRecord> record: records) {
System.out.print(record.value());
}
// consumer.commitSync();
}
}
}
The code built was successful.I was hoping to get console produced record here, but I am not getting anything :
Please suggest.
I have taken reference from here :
https://docs.confluent.io/current/schema-registry/serdes-develop/serdes-avro.html

Java Consumer for Kafka in Cloudera Quickstart not working

I have a cloudera Quickstart VM. i have installed Kafka parcels using Cloudera Manager and its working fine inside the VM using console based consumer and producer.
But when i try to use java based consumer it does not produce or consume messages. I can list the topics.
But i cannot consume messages.
following is my code.
package kafka_consumer;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
public class mclass {
public static void main(String[] args) {
Properties props = new Properties();
props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.0.75.1:9092");
// Just a user-defined string to identify the consumer group
props.put(ConsumerConfig.GROUP_ID_CONFIG, "test");
// Enable auto offset commit
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000");
props.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
// List of topics to subscribe to
consumer.subscribe(Arrays.asList("second_topic"));
for (String k_topic : consumer.listTopics().keySet()) {
System.out.println(k_topic);
}
while (true) {
try {
ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records) {
System.out.printf("Offset = %d\n", record.offset());
System.out.printf("Key = %s\n", record.key());
System.out.printf("Value = %s\n", record.value());
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
}
}
And following is the output of the code. while the console producer is producing messages but consumer is not able to receive it.
PS: i can telnet the port and ip of the Kafka broker. I can even list the topics. Consumer is constantly running without crashing but no messages is being consumed.

kafka java consumer not reading data

I am trying to write a simple java kafka consumer to read data using similar code as in https://github.com/bkimminich/apache-kafka-book-examples/blob/master/src/test/kafka/consumer/SimpleHLConsumer.java.
Looks like my app is able to connect, but its not fetching any data. Please suggest.
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
//import scala.util.parsing.json.JSONObject
import scala.util.parsing.json.JSONObject;
public class SimpleHLConsumer {
private final ConsumerConnector consumer;
private final String topic;
public SimpleHLConsumer(String zookeeper, String groupId, String topic) {
Properties props = new Properties();
props.put("zookeeper.connect", zookeeper);
props.put("group.id", groupId);
// props.put("zookeeper.session.timeout.ms", "5000");
// props.put("zookeeper.sync.time.ms", "250");
// props.put("auto.commit.interval.ms", "1000");
consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
this.topic = topic;
}
public void testConsumer() {
Map<String, Integer> topicCount = new HashMap<>();
topicCount.put(topic, 1);
Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreams = consumer.createMessageStreams(topicCount);
System.out.println(consumerStreams);
List<KafkaStream<byte[], byte[]>> streams = consumerStreams.get(topic);
System.out.println(streams);
System.out.println(consumer);
for (final KafkaStream stream : streams) {
ConsumerIterator<byte[], byte[]> it = stream.iterator();
System.out.println("for loop");
System.out.println(it);
System.out.println("Message from Single Topic: " + new String(it.next().message()));
//System.out.println("Message from Single Topic: " + new String(it.message()));
while (it.hasNext()) {
System.out.println("in While");
System.out.println("Message from Single Topic: " + new String(it.next().message()));
}
}
// if (consumer != null) {
// consumer.shutdown();
// }
}
public static void main(String[] args) {
String topic = "test";
SimpleHLConsumer simpleHLConsumer = new SimpleHLConsumer("localhost:2181", "testgroup", topic);
simpleHLConsumer.testConsumer();
}
}
Here is the output i see in eclipse. It does seem to connect to my zookeeper , but it just hangs there, it does not display any message at all.
log4j:WARN No appenders could be found for logger (kafka.utils.VerifiableProperties).
log4j:WARN Please initialize the log4j system properly.
SLF4J: The requested version 1.6 by your slf4j binding is not compatible with [1.5.5, 1.5.6]
SLF4J: See http://www.slf4j.org/codes.html#version_mismatch for further details.
{test=[testgroup kafka stream]}
[testgroup kafka stream]
kafka.javaapi.consumer.ZookeeperConsumerConnector#6200f9cb
for loop
Consumer iterator hasNext is blocking call. It will block indefinitely if no new message is available for consumption.
To verify this, change your code to
// Comment 2 lines below
// System.out.println(it);
// System.out.println("Message from Single Topic: " + new String(it.next().message()));
// Line below is blocking. Your code will hang till next message in topic.
// Add new message in topic using producer, message will appear in console
while (it.hasNext()) {
Better way is to execute code in separate thread. Use consumer.timeout.ms to specify time in ms, after which consumer will throw timeout exception
// keepRunningThread is flag to control when to exit consumer loop
while(keepRunningThread)
{
try
{
if(it.hasNext())
{
System.out.println(new String(it.next().message()));
}
}
catch(ConsumerTimeoutException ex)
{
// Timeout exception waiting for kafka message
// Wait for 5 (or t) seconds before checking for message again
Thread.sleep(5000);
}
}‍‍‍‍‍‍‍‍‍‍

Going back in time in Kafka using offset

Is there a way to start a consumer from a specific offset using the initial properties that we pass
I know there is props.put("auto.offset.reset", "earliest") but that gets me to the beginning.
However I want to go back and my scenarios are as follows
Specify an offset where I want to start at
Specify the time where I want to start
And I want to do that using the initial properties as a preferred option.
If that is not possible then using some other mechanism
Attaching my Simple Consumer code for reference
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
public class SimpleConsumer {
public static void main(String[] args) throws Exception {
String topicName = "test3";
Properties props = new Properties();
String groupId = "single";
// Kafka consumer configuration settings
props.put("bootstrap.servers", "mymachine:9092");
props.put("group.id", groupId);
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("session.timeout.ms", "30000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("auto.offset.reset", "earliest");
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Arrays.asList(topicName));
System.out.println("Starting the _NON-BATCH_ consumer ::: Topic=" + topicName+" GroupId="+groupId);
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records) {
System.out.printf("%s (offset:%d, key:%s, partition = %s, topic = %s)", record.value(), record.offset(), record.key(), record.partition(), record.topic());
System.out.println();
}
}
}
}
For scenario 1, you can use KafkaConsumer.seek(TopicPartition, offset) to specify the offset from which you read.
For scenario 2, Kafka 0.10.1.0 offers KafkaConsumer.offsetsForTimes method, allowing you to lookup the offsets for the given partitions by timestamp, then invoking seek() method to retrieve the desired messages you want.

Kafka topic partition and Spark executor mapping

I am using spark streaming with kafka topic. topic is created with 5 partitions. My all messages are published to the kafka topic using tablename as key.
Given this i assume all messages for that table should goto the same partition.
But i notice in the spark log messages for same table sometimes goes to executor's node-1 and sometime goes to executor's node-2.
I am running code in yarn-cluster mode using following command:
spark-submit --name DataProcessor --master yarn-cluster --files /opt/ETL_JAR/executor-log4j-spark.xml,/opt/ETL_JAR/driver-log4j-spark.xml,/opt/ETL_JAR/application.properties --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=driver-log4j-spark.xml" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=executor-log4j-spark.xml" --class com.test.DataProcessor /opt/ETL_JAR/etl-all-1.0.jar
and this submission creates 1 driver lets say on node-1 and 2 executors on node-1 and node-2.
I don't want node-1 and node-2 executors to read the same partition. but this is happening
Also tried following configuration to specify consumer group but no difference.
kafkaParams.put("group.id", "app1");
This is how we are creating the stream using createDirectStream method
*Not through zookeeper.
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", brokers);
kafkaParams.put("auto.offset.reset", "largest");
kafkaParams.put("group.id", "app1");
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicsSet
);
Complete Code:
import java.io.Serializable;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import kafka.serializer.StringDecoder;
import scala.Tuple2;
public class DataProcessor2 implements Serializable {
private static final long serialVersionUID = 3071125481526170241L;
private static Logger log = LoggerFactory.getLogger("DataProcessor");
public static void main(String[] args) {
final String sparkCheckPointDir = ApplicationProperties.getProperty(Consts.SPARK_CHECKPOINTING_DIR);
DataProcessorContextFactory3 factory = new DataProcessorContextFactory3();
JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(sparkCheckPointDir, factory);
// Start the process
jssc.start();
jssc.awaitTermination();
}
}
class DataProcessorContextFactory3 implements JavaStreamingContextFactory, Serializable {
private static final long serialVersionUID = 6070911284191531450L;
private static Logger logger = LoggerFactory.getLogger(DataProcessorContextFactory.class);
DataProcessorContextFactory3() {
}
#Override
public JavaStreamingContext create() {
logger.debug("creating new context..!");
final String brokers = ApplicationProperties.getProperty(Consts.KAFKA_BROKERS_NAME);
final String topic = ApplicationProperties.getProperty(Consts.KAFKA_TOPIC_NAME);
final String app = "app1";
final String offset = ApplicationProperties.getProperty(Consts.KAFKA_CONSUMER_OFFSET, "largest");
logger.debug("Data processing configuration. brokers={}, topic={}, app={}, offset={}", brokers, topic, app,
offset);
if (StringUtils.isBlank(brokers) || StringUtils.isBlank(topic) || StringUtils.isBlank(app)) {
System.err.println("Usage: DataProcessor <brokers> <topic>\n" + Consts.KAFKA_BROKERS_NAME
+ " is a list of one or more Kafka brokers separated by comma\n" + Consts.KAFKA_TOPIC_NAME
+ " is a kafka topic to consume from \n\n\n");
System.exit(1);
}
final String majorVersion = "1.0";
final String minorVersion = "3";
final String version = majorVersion + "." + minorVersion;
final String applicationName = "DataProcessor-" + topic + "-" + version;
// for dev environment
SparkConf sparkConf = new SparkConf().setMaster("local[*]").setAppName(applicationName);
// for cluster environment
//SparkConf sparkConf = new SparkConf().setAppName(applicationName);
final long sparkBatchDuration = Long
.valueOf(ApplicationProperties.getProperty(Consts.SPARK_BATCH_DURATION, "10"));
final String sparkCheckPointDir = ApplicationProperties.getProperty(Consts.SPARK_CHECKPOINTING_DIR);
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(sparkBatchDuration));
logger.debug("setting checkpoint directory={}", sparkCheckPointDir);
jssc.checkpoint(sparkCheckPointDir);
HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(topic.split(",")));
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", brokers);
kafkaParams.put("auto.offset.reset", offset);
kafkaParams.put("group.id", "app1");
// #formatter:off
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicsSet
);
// #formatter:on
processRDD(messages, app);
return jssc;
}
private void processRDD(JavaPairInputDStream<String, String> messages, final String app) {
JavaDStream<MsgStruct> rdd = messages.map(new MessageProcessFunction());
rdd.foreachRDD(new Function<JavaRDD<MsgStruct>, Void>() {
private static final long serialVersionUID = 250647626267731218L;
#Override
public Void call(JavaRDD<MsgStruct> currentRdd) throws Exception {
if (!currentRdd.isEmpty()) {
logger.debug("Receive RDD. Create JobDispatcherFunction at HOST={}", FunctionUtil.getHostName());
currentRdd.foreachPartition(new VoidFunction<Iterator<MsgStruct>>() {
#Override
public void call(Iterator<MsgStruct> arg0) throws Exception {
while(arg0.hasNext()){
System.out.println(arg0.next().toString());
}
}
});
} else {
logger.debug("Current RDD is empty.");
}
return null;
}
});
}
public static class MessageProcessFunction implements Function<Tuple2<String, String>, MsgStruct> {
#Override
public MsgStruct call(Tuple2<String, String> data) throws Exception {
String message = data._2();
System.out.println("message:"+message);
return MsgStruct.parse(message);
}
}
public static class MsgStruct implements Serializable{
private String message;
public static MsgStruct parse(String msg){
MsgStruct m = new MsgStruct();
m.message = msg;
return m;
}
public String toString(){
return "content inside="+message;
}
}
}
According to Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher), you can specify an explicit mapping of partitions to hosts.
Assume you have two hosts(h1 and h2), and the Kafka topic topic-name has three partitions. The following critical code will show you how to map a specified partition to a host in Java.
Map<TopicPartition, String> partitionMapToHost = new HashMap<>();
// partition 0 -> h1, partition 1 and 2 -> h2
partitionMapToHost.put(new TopicPartition("topic-name", 0), "h1");
partitionMapToHost.put(new TopicPartition("topic-name", 1), "h2");
partitionMapToHost.put(new TopicPartition("topic-name", 2), "h2");
List<String> topicCollection = Arrays.asList("topic-name");
Map<String, Object> kafkaParams = new HasMap<>();
kafkaParams.put("bootstrap.servers", "10.0.0.2:9092,10.0.0.3:9092");
kafkaParams.put("group.id", "group-id-name");
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
JavaInputDStream<ConsumerRecord<String, String>> records = KafkaUtils.createDirectStream(jssc,
LocationStrategies.PreferFixed(partitionMapToHost), // PreferFixed is the key
ConsumerStrategies.Subscribe(topicCollection, kafkaParams));
You can also use LocationStrategies.PreferConsistent(), which distribute partitions evenly across available executors, and assure that a specified partition is only consumed by a specified executor.
Using the DirectStream approach it's a correct assumption that messages sent to a Kafka partition will land in the same Spark partition.
What we cannot assume is that each Spark partition will be processed by the same Spark worker each time. On each batch interval, Spark task are created for each OffsetRange for each partition and sent to the cluster for processing, landing on some available worker.
What you are looking for partition locality. The only partition locality that the direct kafka consumer supports is the kafka host containing the offset range being processed in the case that you Spark and Kafka deployements are colocated; but that's a deployment topology that I don't see very often.
In case that your requirements dictate the need to have host locality, you should look into Apache Samza or Kafka Streams.

Categories

Resources