How to reset Kafka consumer offset when calling consumer many times - java

I'm trying to reset consumer offset whenever calling consumer so that when I call consumer many times it can still read record sent by producer. I'm setting props.put("auto.offset.reset","earliest"); and calling consumer.seekToBeginning(consumer.assignment()); but when I call the consumer the second time it will receive no records. How can I fix this?
public ConsumerRecords<String, byte[]> consumer(){
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
//props.put("group.id", String.valueOf(System.currentTimeMillis()));
props.put("auto.offset.reset","earliest");
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("topiccc"));
ConsumerRecords<String, byte[]> records = consumer.poll(100);
consumer.seekToBeginning(consumer.assignment());
/* List<byte[]> videoContents = new ArrayList<byte[]>();
for (ConsumerRecord<String, byte[]> record : records) {
System.out.printf("offset = %d, key = %s, value = %s\n", record.offset(), record.key(), record.value());
videoContents.add(record.value());
}*/
return records;
}
public String producer(#RequestParam("message") String message) {
Map<String, Object> props = new HashMap<>();
// list of host:port pairs used for establishing the initial connections to the Kakfa cluster
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
Producer<String, byte[]> producer = new KafkaProducer<>(props);
Path path = Paths.get("C:/Programming Files/video-2012-07-05-02-29-27.mp4");
ProducerRecord<String, byte[]> record = null;
try {
record = new ProducerRecord<>("topiccc", "keyyyyy"
, Files.readAllBytes(path));
} catch (IOException e) {
e.printStackTrace();
}
producer.send(record);
producer.close();
//kafkaSender.send(record);
return "Message sent to the Kafka Topic java_in_use_topic Successfully";
}

From the Kafka Java Code, the documentation on AUTO_OFFSET_RESET_CONFIG says the following:
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted): earliest: automatically reset the offset to the earliest offsetlatest: automatically reset the offset to the latest offsetnone: throw exception to the consumer if no previous offset is found for the consumer's groupanything else: throw exception to the consumer.
This can be found here in GitHub:
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerConfig.java
We can see from their comment that the setting is only used when the offset is not on the server. In the question, the offset is retrieved from the server and that's why the offset is not reset to the beginning but rather stays at the last offset, making it appear that there are no more records.
You would need to explicitly reset the offset on the server side to fix this as requested in the question.
Here is another answer that describes how that could be done.
https://stackoverflow.com/a/54492802/231860
This is a snippet of code that allowed me to reset the offset. NOTE: You can't call seekToBeginning if you call the subscribe method. I could only get it to work if I assign the partitions myself using the assign method. Pity.
// Create the consumer:
final Consumer<String, DataRecord> consumer = new KafkaConsumer<>(props);
// Get the partitions that exist for this topic:
List<PartitionInfo> partitions = consumer.partitionsFor(topic);
// Get the topic partition info for these partitions:
List<TopicPartition> topicPartitions = partitions.stream().map(info -> new TopicPartition(info.topic(), info.partition())).collect(Collectors.toList());
// Assign all the partitions to the topic so that we can seek to the beginning:
// NOTE: We can't use subscribe if we use assign, but we can't seek to the beginning if we use subscribe.
consumer.assign(topicPartitions);
// Make sure we seek to the beginning of the partitions:
consumer.seekToBeginning(topicPartitions);
Yes, it seems extremely complicated to achieve a seemingly rudimentary use case. This might indicate that the whole kafka world just seems to want to read streams once.

I am usually creating a new consumer with different group.id to read again records.
So do it like that:
props.put("group.id", Instant.now().getEpochSecond());

There is a workaround for this (not a production solution, though) which is to change the group.id configuration value each time you consume. Setting auto.offset.reset to earliest is not enough in many cases.

When you want one message to be consumed by consumers multiple time the ideal way is to create consumers with different consumer group so same message can be consumed.
But if you want the same consumer to consume the same message multiple time then you can play with commit and offset
You set the auto.commit very high or disable it and do commit as per your logic
You can refer to this for more details https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
This javadoc provides detail on how to manually manage offset

Related

How to verify if a Kafka Topic is not empty meaning has at least 1 message?

I am writing a little Kafka metrics exporter (Yes there are loads available like prometheus etc but I want a light weight custom one. Kindly excuse me on this).
As part of this I would like to know as soon as first message is received (or topic has messages) in a Kafka topic. I am using Spring Boot and Kafka.
I have the below code which gives the name of the topic and number of partitions. I want to know if the topic has messages? Kindly let me know how can I get this stat. Any lead is much appreciated!
#ReadOperation
public List<TopicManifest> kafkaTopic() throws ExecutionException, InterruptedException {
ListTopicsOptions listTopicsOptions = new ListTopicsOptions();
listTopicsOptions.listInternal(true);
ListTopicsResult listTopicsResult = adminClient.listTopics(listTopicsOptions);
Set<String> topics = listTopicsResult.names().get().stream().filter(topic -> !topic.startsWith("_")).collect(Collectors.toSet());
System.out.println(topics);
DescribeTopicsResult describeTopicsResult = adminClient.describeTopics(topics);
Map<String, KafkaFuture<TopicDescription>> topicNameValues = describeTopicsResult.topicNameValues();
List<TopicManifest> topicManifests = topicNameValues.entrySet().stream().map(entry -> {
try {
TopicDescription topicDescription = entry.getValue().get();
return TopicManifest.builder().name(entry.getKey())
.noOfPartitions(topicDescription.partitions().size())
.build();
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ExecutionException e) {
e.printStackTrace();
}
return null;
}).collect(Collectors.toList());
return topicManifests;
}
Create a KafkaConsumer and call endOffsets (the consumer does not need to be subscribed to the topic(s)).
#Bean
ApplicationRunner runner1(ConsumerFactory cf) {
return args -> {
try (Consumer consumer = cf.createConsumer()) {
System.out.println(consumer.endOffsets(List.of(new TopicPartition("ktest29", 0),
new TopicPartition("ktest29", 1),
new TopicPartition("ktest29", 2))));
}
};
}
Offsets stored on the topic never reduce. Getting the end offset doesn't guarantee you have a non-empty topic (start and end offsets for the topic partitions could be the same).
Instead, you will still create a consumer but set
auto.offset.reset=earliest
group.id=UUID.randomUUID()
Then subscribe and run
ConsumerRecords records = consumer.poll(Duration.ofSeconds(2));
boolean empty = records.count() == 0;
By setting auto.offset.earliest with a random group, you are guaranteed to start at the earliest offset and seek to the first available record, if it exists, at which point, you can try to poll any number of records, to see if any are returned within the specified timeout.
This should work for regular and compacted topics without needing to check committed offsets.

Does a Kafka Consumer default batch size?

Does Kafka provide a default batch size for reading messages from a topic? I have the following code that is reading messages from a topic.
while (true) {
final ConsumerRecords<String, User> consumerRecords =
consumer.poll(500));
if (consumerRecords.count() == 0) {
noRecordsCount++;
if (noRecordsCount > giveUp) break;
else continue;
}
consumerRecords.forEach(record -> {
User user = record.value();
userArray.add(user);
});
insertInBatch(user)
consumer.commitAsync();
}
consumer.close();
In the insertInBatch method, I persist data to a database. This method is getting called every 500 records, even though I haven't specified any batch size in creating the Consumer.
I don't think there's anything special about the way I'm creating it. Using Avro for the messages, but I don't think that's significant(?)
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("auto.commit.enable", "false");
props.put("auto.offset.reset", "earliest");
props.put("key.serializer",StringSerializer.class.getName());
props.put("value.serializer",KafkaAvroDeserializer.class.getName());
props.put("schema.registry","http://localhost:8081");
Yes, there's a default max.poll.records
https://kafka.apache.org/documentation/#consumerconfigs
If you are inserting to a database, though, you'd be better off using Kafka Connect than writing a consumer with apparently no error handling (yet?)

Kafka Stream to sort messages based on timestamp key in json message

I am publishing Kafka with JSON messages, eg:
"UserID":111,"UpdateTime":06-13-2018 12:13:43.200Z,"Comments":2,"Like":10
"UserID":111,"UpdateTime":06-13-2018 12:13:40.200Z,"Comments":0,"Like":6
"UserID":222,"UpdateTime":06-13-2018 12:13:43.200Z,"Comments":1,"Like":10
"UserID":111,"UpdateTime":06-13-2018 12:13:44.600Z,"Comments":3,"Like":12
I would like to sort messages based on UpdateTime in 10 second time window using Kafka Streams and push back sorted messages in another Kafka topic.
I have created a stream, which reads data from the input topic and then I am creating TimeWindowedKStream after groupByKey() where the UserID is the key in the message (Although its not necessary to groupByKey and then sort, but I could not get WindowedBy directly). But I am not able to sort messages in 10 second window based on UpdateTime further. My source code is:
public static void main(String[] args) throws Exception {
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-sorting");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker");
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("UnsortedMessages");
TimeWindowedKStream<String, String> countss = source.groupByKey().windowedBy(TimeWindows.of(10000L)
.until(10000L));
/*
SORTING CODE
*/
outputMessage.toStream().to("SortedMessages", Produced.with(Serdes.String(), Serdes.Long()));
final KafkaStreams streams = new KafkaStreams(builder.build(), props);
final CountDownLatch latch = new CountDownLatch(1);
// attach shutdown handler to catch control-c
Runtime.getRuntime().addShutdownHook(new Thread("streams-sorting-shutdown-hook") {
#Override
public void run() {
streams.close();
latch.countDown();
}
});
try {
streams.start();
latch.await();
} catch (Throwable e) {
System.exit(1);
}
System.exit(0);
}
Many thanks in advance.
If you want to sort messages ignoring the key, it makes only sense to do this based on partitions and also only if the input topic has the same number of partitions as the output topic. For this case, you should extract the partition number and use it as message key (cf: https://docs.confluent.io/current/streams/faq.html#accessing-record-metadata-such-as-topic-partition-and-offset-information)
For sorting, it's more tricky. Note, that Kafka Streams follows a "continuous output" model and does emit updates for each input record using the DSL. Thus, it might be better to use Processor API. You would use a Processor with an attached store and put records into the store. As an in-memory structure you keep a sorted list of records. While time advances, you can emit "finished" windows and delete the corresponding records from the store.
I don't think you can build this using the DSL.

Can't get KafkaProducer/KafkaConsumer to work in Scala

I'm trying to create a simple KafkaProducer and KafkaConsumer so I can send data to a topic on a broker, and then verify that the data was received. I have below the two methods I used to define my consumer and producer, and how I'm sending the message. The send method takes at lest 20 seconds to complete, and as far as I can tell the consumer.poll method never actually finishes, but the longest I've left it was 10 minutes.
Does anyone have a suggestion as to what I'm doing wrong? Is there some property for the producer/consumer that I'm not setting up correctly? Those properties are copied directly from the docs, so I don't understand why they won't work.
KafkaProducer docs
KafkaConsumer docs
"verify we can send to producer" in {
val consumer = createKafkaConsumer("address:9002")
val producer = createKafkaProducer("address:9002")
val message = "I am a message"
val record = new ProducerRecord[String, String]("myTopic", message)
producer.send(record)
TimeUnit.SECONDS.sleep(5)
val records = consumer.poll(5000)
println("records: "+records)
consumer1.close()
}
def createKafkaProducer(kafka: String): KafkaProducer[String,String] = {
val props = new Properties()
props.put("bootstrap.servers", kafka)
props.put("acks", "all")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
new KafkaProducer[String,String](props)
}
def createKafkaConsumer(kafka: String): KafkaConsumer[String, String] = {
val props = new Properties()
props.put("bootstrap.servers", kafka)
props.put("group.id", "test")
props.put("enable.auto.commit", "true")
props.put("auto.commit.interval.ms", "1000")
props.put("session.timeout.ms", "30000")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("myTopic"))
consumer
}
Edit: I've updated my code so that I now get the response from the send method, and it seems that that times out with org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
Turns out I had a DNS issue that meant that I wasn't actually connecting to the broker. Fixing this allowed the messages to go through, there was nothing wrong with the config.

Kafka 0.8.2 consumer

I am implementing a simple Kafka consumer in Java. Here is the code:
public class TestConsumer {
public static void main(String []a) throws Exception{
Properties props = new Properties();
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("partition.assignment.strategy", "round-robin");
props.put("group.id", "test");
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,"localhost:9092");
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
try{
consumer.subscribe("ay_sparktopic");
Map<String, ConsumerRecords<String, String>> msg = consumer.poll(100);
System.out.println(msg);
}catch(Exception e){
System.out.println("Exception");
}
}
}
Above consumer gives following error message:
16/03/30 18:01:07 WARN ConsumerConfig: The configuration group.id = test was supplied but isn't a known config.
16/03/30 18:01:07 WARN ConsumerConfig: The configuration partition.assignment.strategy = round-robin was supplied but isn't a known config.
Any documentation I check online gives either range or roundrobin as possible assignment strategies and groupId is a custom name to my knowledge. Not sure what would be right config values here.
It looks like you´re trying to use the new consumer API that´s only available in Kafka 0.9+. To use the older API you have to import classes from the kafka.javaapi.consumer.* package instead of the new org.apache.kafka.clients.consumer package.
consumer.subscribe and consumer.poll relates to the new API so if you really want to use the old API, you need to change your code accordingly. If you instead want to use the new consumer API, you need to run Kafka 0.9 or later.
Using the below dependency resolves the issue.
libraryDependencies += "org.apache.kafka" % "kafka_2.11" % "0.9.0.0"
Even when you are having previous version running E.g.,kafka 0.8.2.1.

Categories

Resources