Why does kafka streams reprocess the messages produced after broker restart - java

I have a single-node Kafka broker and a simple Streams application. I created 2 topics (topic1 and topic2).
Produce to topic1 - process the message - write to topic2
Note: For each message produced, only one message is written to the destination topic.
I produced a single message. After it was written to topic2, I stopped the Kafka broker. After some time I restarted the broker and produced another message on topic1. Now the Streams app processed that message 3 times. Then, without stopping the broker, I produced messages to topic1 and waited for the Streams app to write to topic2 before producing again.
The Streams app is behaving strangely. Sometimes for one produced message there are 2 messages written to the destination topic, and sometimes 3. I don't understand why this is happening. Even the messages produced after the broker restart are being duplicated.
Update 1:
I am using Kafka version 1.0.0 and Kafka-Streams version 1.1.0
Below is the code.
Main.java
String credentials = env.get("CREDENTIALS");

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "activity-collection");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.RECONNECT_BACKOFF_MS_CONFIG, 100000);
props.put(StreamsConfig.RECONNECT_BACKOFF_MAX_MS_CONFIG, 200000);
props.put(StreamsConfig.REQUEST_TIMEOUT_MS_CONFIG, 60000);
props.put(StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 60000);
props.put(StreamsConfig.producerPrefix(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG), true);
props.put(StreamsConfig.producerPrefix(ProducerConfig.ACKS_CONFIG), "all");

final StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> activityStream = builder.stream("activity_contenturl");
KStream<String, String> activityResultStream = AppUtil.hitContentUrls(credentials, activityStream);
activityResultStream.to("o365_user_activity");

// build and start the topology
final KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
AppUtil.java
public static KStream<String, String> hitContentUrls(String credentials, KStream<String, String> activityStream) {
    KStream<String, String> activityResultStream = activityStream
        .flatMapValues(new ValueMapper<String, Iterable<String>>() {
            @Override
            public Iterable<String> apply(String value) {
                ArrayList<String> log = new ArrayList<String>();
                JSONObject received = new JSONObject(value);
                String url = received.get("url").toString();
                String accessToken = ServiceUtil.getAccessToken(credentials);
                JSONObject activityLog = ServiceUtil.getActivityLogs(url, accessToken);
                log.add(activityLog.toString());
                return log;
            }
        });
    return activityResultStream;
}
Update 2:
In a single-broker, single-partition environment with the above config, I started the Kafka broker and the Streams app. I produced 6 messages on the source topic, and when I started a consumer on the destination topic there were 36 messages and counting; they keep on coming.
So I ran this to see the consumer groups:
kafka_2.11-1.1.0/bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --list
Output:
streams-collection-app-0
Next I ran this:
kafka_2.11-1.1.0/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group streams-collection-app-0
Output:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
o365_activity_contenturl 0 1 1 0 streams-collection-app-0-244b6f55-b6be-40c4-9160-00ea45bba645-StreamThread-1-consumer-3a2940c2-47ab-49a0-ba72-4e49d341daee /127.0.0.1 streams-collection-app-0-244b6f55-b6be-40c4-9160-00ea45bba645-StreamThread-1-consumer
After a while the output showed this:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
o365_activity_contenturl 0 1 6 5 - - -
And then:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
o365_activity_contenturl 0 1 7 6 - - -

It seems you are facing a known limitation. A Kafka topic stores messages for at least 7 days by default, but committed offsets are stored for only 1 day (default config value offsets.retention.minutes = 1440). So if no messages were produced to your source topic for more than 1 day, after an app restart all messages from the topic will be reprocessed again (actually multiple times, depending on the number of restarts, at most once per day for such a topic with rare incoming messages).
You can find a description of committed-offset expiration in How does an offset expire for consumer group.
In Kafka version 2.0 the retention for committed offsets was increased, see KIP-186: Increase offsets retention default to 7 days.
To prevent reprocessing, you could set the consumer property auto.offset.reset: latest (Kafka Streams defaults this to earliest).
There is a small risk with latest: if nothing was produced into the source topic for longer than a day and you then restarted the app, you could lose some messages (only messages that arrived exactly during the restart).
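For reference, a minimal sketch of how that override could be added to the Streams configuration above (the consumerPrefix helper and the property constant are standard Kafka Streams / consumer APIs; "latest" trades reprocessing for possibly skipping records produced while the app is down):
// Route the reset policy to the Streams app's internal consumer.
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), "latest");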

Related

When Kafka connection is idle?

I faced an issue where my Kafka consumer reads some topic, and when that topic has no messages for some time, Kafka considers the connection idle, as I understand it. I saw these trace logs:
About to close the idle connection from 2147482646 due to being idle for 540005 millis
Node 2147482646 disconnected.
But why is it idle? I have the default config connections.max.idle.ms: 9 min, heartbeats are sent every 5 sec, and the session timeout is 30 sec. As I understand it, maybe Kafka does not count heartbeats as "actions". Okay, but then there are the regular poll() calls, and to me that is an action even if I got 0 messages. But then I found the idea that a connection is idle if nothing was written to the TCP socket for some period of time. So if my Kafka receives nothing (no messages) within 9 minutes, then it's idle? Is that true? Or how else does Kafka decide that a connection is idle?
UPD:
Poll is an action no matter how many messages you got. I ran Kafka locally and started debugging with CONNECTIONS_MAX_IDLE_MS_CONFIG=5000, and I found out that my Kafka consumer created 2 different connections; after some time one of them was not used for 5 sec, which caused the connection to be closed. So why does it happen that one connection is not used for that long? I also tried 10 sec with the same result. On prod the config is the default 9 min, but it still happens.
What's also interesting: after it happens once, Kafka recreates the connection and everything is fine. But if it is the group coordinator connection, it may trigger a rebalance and recreate all connections from scratch, and the issue appears again.
kafka-clients version 3.2.0. My code for testing:
Logger logger = LogManager.getLogger(Main.class);
logger.info("test");
String bootstrapServers = "127.0.0.1:9092";
String grp_id = "test";
String topic = "test";
Properties properties = new Properties();
properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
properties.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, grp_id);
properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
properties.setProperty(ConsumerConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG, "5000");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
consumer.subscribe(Collections.singletonList(topic));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        System.out.println("Key: " + record.key() + ", Value:" + record.value());
        System.out.println("Partition:" + record.partition() + ",Offset:" + record.offset());
    }
}

kafka does not have messages that producer sent

I use the Spring framework and a Kafka cluster with 3 brokers. I found out that the consumer did not consume some messages (say 0.01 percent of all sent messages), so in the producer code I log the message offset returned by the API:
ListenableFuture<SendResult<String, Message>> future = messageTemplate.sendDefault(id, message);
SendResult<String, Message> sendResult = future.get();
long offset = sendResult.getRecordMetadata().offset();
I use the returned offset to query the Kafka topic across all partitions, but it does not find the message (I tested other offsets related to messages that consumers did consume, and those are in Kafka). What is the problem, and how can I ensure that the message was sent to Kafka?
I also used messageTemplate.flush(); in the producer.
I found out that when the topic leader broker goes down, Kafka elects another broker as leader of that partition, and if the acks config is not set to all there is a chance of losing some data during this procedure. So change the config to
acks=all
Also, there is a chance of losing data if the number of in-sync replicas drops below 2, so set min.insync.replicas to at least 2:
min.insync.replicas=2
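As a rough sketch of the relevant settings (plain producer Properties shown; in Spring Kafka these map onto the producer configuration, and the broker addresses are placeholders):
// Producer side: wait for all in-sync replicas and retry transient send failures.
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092"); // placeholder addresses
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
// Broker/topic side (server.properties or per-topic config, not a producer setting):
// min.insync.replicas=2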

Why is there a lag when consuming from Apache Kafka using Java KafkaConsumer

I hope someone can shed some light on my issue.
I'm making a series of REST calls to a micro-service I have built. When the service receives the calls, it persists some data to a database and then publishes a message to a Kafka topic.
I'm trying to write a test that makes the REST calls and then consumes the messages from the Kafka topic.
When my test tries to consume from Kafka, it doesn't appear to be consuming the latest messages.
I've used the kafka-consumer-groups.sh shell script which ships with Kafka to describe the state of my consumer. Here is what it looks like:
bash-4.3# ./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group test
Note: This will only show information about consumers that use the Java consumer API (non-ZooKeeper-based consumers).
Consumer group 'test' has no active members.
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
test_topic 0 42 43 1 - - -
Note the lag of 1. This is what appears to be my issue.
Here is my Kafka consumer code:
public void consumeMessages() {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "test");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    try (KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(props)) {
        kafkaConsumer.subscribe(Collections.singletonList("test_topic"));
        ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(5000);
        for (ConsumerRecord<String, String> consumerRecord : consumerRecords) {
            System.out.printf("offset = %d, message = %s%n", consumerRecord.offset(), consumerRecord.value());
        }
    }
}
Any help will be greatly received.
Thanks,
Ben :)

Can a Kafka broker retain messages while there are no consumers connected?

I am trying to build a pub/sub application and I am exploring the best tools out there. I am currently looking at Kafka and have a little demo app already running. However, I am running into a conceptual issue.
I have a producer (Java code):
String topicName = "MyTopic";
String key = "MyKey";
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092,localhost:9093");
props.put("acks", "all");
props.put("key.serializer","org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
Producer<String, byte[]> producer = new KafkaProducer <String, byte[]>(props);
byte[] data = <FROM ELSEWHERE>;
ProducerRecord<String, byte[]> record = new ProducerRecord<String, byte[]>(topicName, key, data);
try {
    RecordMetadata result = producer.send(record).get();
} catch (Exception e) {
    // Nothing for now
}
producer.close();
When I start a consumer via the Kafka command line tools:
kafka-console-consumer --bootstrap-server localhost:9092 --topic MyTopic
and then I execute the producer code, I see the data message show up on my consumer terminal.
However, if I do not run the consumer prior to executing the producer, the message appears "lost". When I start the consumer (after executing the producer), nothing appears in the consumer terminal.
Does anyone know if it's possible to have the Kafka broker retain messages while there are no consumers connected? If so, how?
Append --from-beginning to the console consumer command to have it start consuming from the earliest offset. This is actually about the offset reset strategy, which is controlled by the config auto.offset.reset. Here is what this config means:
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
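The same idea applies to a programmatic consumer: with a group id that has no committed offsets yet, the reset policy decides where reading starts. A minimal sketch, reusing the topic from the question (the group id is a placeholder):
// A fresh consumer group with auto.offset.reset=earliest starts from the oldest retained record,
// so messages produced before any consumer connected are still delivered.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "my-new-group"); // placeholder group id
props.put("auto.offset.reset", "earliest");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("MyTopic"));
    ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(5));
    records.forEach(r -> System.out.println(r.key() + " @ offset " + r.offset()));
}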

Specify number of partitions on Kafka producer

I have the following code from the web page https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+Producer+Example
What seems to be missing is how to configure the number of partitions. I want to specify 4 partitions but always end up with the default of 2 partitions. How do I change the code to get 4 partitions (without changing the default)?
Properties props = new Properties();
props.put("metadata.broker.list", "localhost:9092,broker2:9092");
props.put("serializer.class", "kafka.serializer.StringEncoder");
props.put("partitioner.class", "com.gnip.kafka.SimplePartitioner");
props.put("request.required.acks", "1");
props.put("num.partitions", 4);
ProducerConfig config = new ProducerConfig(props);
Producer<String, String> producer = new Producer<String, String>(config);
Random rnd = new Random();
for (long nEvents = 0; nEvents < 1000; nEvents++) {
    long runtime = new Date().getTime();
    String ip = "192.168.2." + rnd.nextInt(255);
    String msg = runtime + ",www.example.com," + ip;
    KeyedMessage<String, String> data = new KeyedMessage<String, String>("page_visits2", ip, msg);
    producer.send(data);
}
producer.close();
The Kafka producer API does not let you set the number of partitions. If you produce to a topic that does not exist, the broker will first create the topic (if the auto.create.topics.enable property in the broker config is set to TRUE) and then start accepting data, but the number of partitions created for that topic is based on the num.partitions parameter defined in the broker configuration files (by default it is set to one).
Increasing the partition count of an existing topic is possible, but it will not move any existing data into the new partitions.
To create a topic with a different number of partitions, you need to create the topic first, which can be done with the console script that ships with the Kafka distribution. The following command creates a topic with 2 partitions (as specified by the --partition flag):
bin/kafka-create-topic.sh --zookeeper localhost:2181 --replica 3 --partition 2 --topic my-custom-topic
Unfortunately, as far as I understand, there is currently no direct alternative to achieve this from the producer.
The number of partitions is a broker property and will not have any effect for the producer, see here.
As the producer example page shows, you can use a custom partitioner to route messages as you prefer, but new partitions will not be created unless defined in the broker properties.
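On newer Kafka versions (0.11 and later) the topic can also be pre-created programmatically with the AdminClient instead of the shell script. A hedged sketch, reusing the topic name and replication factor from the example above (run inside a method that declares throws Exception):
// Create the topic with 4 partitions up front so the producer never relies on auto-creation.
Properties adminProps = new Properties();
adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
try (AdminClient admin = AdminClient.create(adminProps)) {
    NewTopic topic = new NewTopic("page_visits2", 4, (short) 3); // 4 partitions, replication factor 3
    admin.createTopics(Collections.singleton(topic)).all().get();
}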
