I want to fetch the Kafka message offset from the high-level consumer in a Java program. Since I am using a custom commit-offset property, I want to test whether my custom commit offset is working correctly. Can anyone help me with how to get the offset?
I have come across a couple of Kafka tools (like GetOffsetShell), but they have not been helpful for my testing.
When getting a message from the ConsumerIterator you can also get the offset by doing the following:
ConsumerConnector consumerConnector = Consumer.createJavaConsumerConnector(getConsumerConfig());
KafkaStream<byte[], byte[]> stream = getKafkaStream(consumerConnector);
ConsumerIterator<byte[], byte[]> iterator = stream.iterator();
while(iterator.hasNext()) {
MessageAndMetadata<byte[], byte[]> messageAndMetadata = iterator.next();
String message = new String(messageAndMetadata.message());
long offset = messageAndMetadata.offset();
}
I am writing a little Kafka metrics exporter. (Yes, there are plenty available, like the Prometheus exporters, but I want a lightweight custom one. Kindly excuse me on this.)
As part of this I would like to know as soon as the first message is received in a Kafka topic (or whether the topic has messages at all). I am using Spring Boot and Kafka.
I have the code below, which gives the name of the topic and the number of partitions. I want to know whether the topic has any messages. Kindly let me know how I can get this stat. Any lead is much appreciated!
@ReadOperation
public List<TopicManifest> kafkaTopic() throws ExecutionException, InterruptedException {
ListTopicsOptions listTopicsOptions = new ListTopicsOptions();
listTopicsOptions.listInternal(true);
ListTopicsResult listTopicsResult = adminClient.listTopics(listTopicsOptions);
Set<String> topics = listTopicsResult.names().get().stream().filter(topic -> !topic.startsWith("_")).collect(Collectors.toSet());
System.out.println(topics);
DescribeTopicsResult describeTopicsResult = adminClient.describeTopics(topics);
Map<String, KafkaFuture<TopicDescription>> topicNameValues = describeTopicsResult.topicNameValues();
List<TopicManifest> topicManifests = topicNameValues.entrySet().stream().map(entry -> {
try {
TopicDescription topicDescription = entry.getValue().get();
return TopicManifest.builder().name(entry.getKey())
.noOfPartitions(topicDescription.partitions().size())
.build();
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ExecutionException e) {
e.printStackTrace();
}
return null;
}).collect(Collectors.toList());
return topicManifests;
}
Create a KafkaConsumer and call endOffsets (the consumer does not need to be subscribed to the topic(s)).
@Bean
ApplicationRunner runner1(ConsumerFactory cf) {
return args -> {
try (Consumer consumer = cf.createConsumer()) {
System.out.println(consumer.endOffsets(List.of(new TopicPartition("ktest29", 0),
new TopicPartition("ktest29", 1),
new TopicPartition("ktest29", 2))));
}
};
}
Offsets stored for a topic never decrease, so getting the end offsets doesn't guarantee you have a non-empty topic (the start and end offsets of a topic partition can be equal).
Instead, still create a consumer, but set
auto.offset.reset=earliest
group.id=UUID.randomUUID()
Then subscribe and run
ConsumerRecords records = consumer.poll(Duration.ofSeconds(2));
boolean empty = records.count() == 0;
By setting auto.offset.reset=earliest with a random group, you are guaranteed to start at the earliest offset and seek to the first available record, if it exists; at that point you can poll for any number of records to see whether any are returned within the specified timeout.
This should work for regular and compacted topics without needing to check committed offsets.
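Put together, a minimal sketch of that check might look like this (hasRecords is just an illustrative helper name, and the deserializers and the 2-second timeout are assumptions):
// Rough sketch, not production code: poll once with a throw-away group to see
// whether the topic currently has any readable records.
public boolean hasRecords(String bootstrapServers, String topic) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ConsumerConfig.GROUP_ID_CONFIG, UUID.randomUUID().toString());
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
        consumer.subscribe(Collections.singletonList(topic));
        // a single short poll; if the topic has records, at least some should come back
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(2));
        return records.count() > 0;
    }
}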
I am publishing JSON messages to Kafka, e.g.:
"UserID":111,"UpdateTime":06-13-2018 12:13:43.200Z,"Comments":2,"Like":10
"UserID":111,"UpdateTime":06-13-2018 12:13:40.200Z,"Comments":0,"Like":6
"UserID":222,"UpdateTime":06-13-2018 12:13:43.200Z,"Comments":1,"Like":10
"UserID":111,"UpdateTime":06-13-2018 12:13:44.600Z,"Comments":3,"Like":12
I would like to sort messages based on UpdateTime within a 10-second time window using Kafka Streams and push the sorted messages back to another Kafka topic.
I have created a stream which reads data from the input topic, and then I create a TimeWindowedKStream with groupByKey(), where UserID is the key in the message (although it is not necessary to groupByKey and then sort, I could not get windowedBy to work directly). But I am not able to go further and sort the messages in the 10-second window based on UpdateTime. My source code is:
public static void main(String[] args) throws Exception {
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-sorting");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker");
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("UnsortedMessages");
TimeWindowedKStream<String, String> countss = source.groupByKey().windowedBy(TimeWindows.of(10000L)
.until(10000L));
/*
SORTING CODE
*/
outputMessage.toStream().to("SortedMessages", Produced.with(Serdes.String(), Serdes.Long()));
final KafkaStreams streams = new KafkaStreams(builder.build(), props);
final CountDownLatch latch = new CountDownLatch(1);
// attach shutdown handler to catch control-c
Runtime.getRuntime().addShutdownHook(new Thread("streams-sorting-shutdown-hook") {
@Override
public void run() {
streams.close();
latch.countDown();
}
});
try {
streams.start();
latch.await();
} catch (Throwable e) {
System.exit(1);
}
System.exit(0);
}
Many thanks in advance.
If you want to sort messages ignoring the key, it only makes sense to do this per partition, and only if the input topic has the same number of partitions as the output topic. For this case, you should extract the partition number and use it as the message key (cf. https://docs.confluent.io/current/streams/faq.html#accessing-record-metadata-such-as-topic-partition-and-offset-information).
For the sorting itself, it's more tricky. Note that Kafka Streams follows a "continuous output" model and, in the DSL, emits an update for each input record. Thus, it is probably better to use the Processor API: you would use a Processor with an attached store, put the incoming records into the store, and keep a sorted list of records as the in-memory structure. As time advances, you can emit "finished" windows and delete the corresponding records from the store.
I don't think you can build this using the DSL.
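As a rough illustration of that idea (untested sketch only: it buffers in a plain in-memory list instead of an attached state store, uses the older Processor API to match the rest of this code, and extractTime() is a hypothetical helper that parses the UpdateTime field out of the JSON value):
// Sketch: collect records, then emit them in UpdateTime order every 10 seconds.
public class SortProcessor extends AbstractProcessor<String, String> {

    private final List<KeyValue<String, String>> buffer = new ArrayList<>();

    @Override
    public void init(ProcessorContext context) {
        super.init(context);
        context.schedule(10_000L); // call punctuate() roughly every 10 seconds
    }

    @Override
    public void process(String key, String value) {
        buffer.add(KeyValue.pair(key, value)); // just collect; emit later in punctuate()
    }

    @Override
    public void punctuate(long timestamp) {
        // sort the buffered window by the UpdateTime field and forward it downstream
        buffer.sort(Comparator.comparingLong((KeyValue<String, String> kv) -> extractTime(kv.value)));
        for (KeyValue<String, String> kv : buffer) {
            context().forward(kv.key, kv.value);
        }
        context().commit();
        buffer.clear();
    }
}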
I'm trying to reset the consumer offset whenever I call the consumer, so that when I call the consumer multiple times it can still read the records sent by the producer. I'm setting props.put("auto.offset.reset","earliest") and calling consumer.seekToBeginning(consumer.assignment()), but when I call the consumer a second time it receives no records. How can I fix this?
public ConsumerRecords<String, byte[]> consumer(){
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
//props.put("group.id", String.valueOf(System.currentTimeMillis()));
props.put("auto.offset.reset","earliest");
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("topiccc"));
ConsumerRecords<String, byte[]> records = consumer.poll(100);
consumer.seekToBeginning(consumer.assignment());
/* List<byte[]> videoContents = new ArrayList<byte[]>();
for (ConsumerRecord<String, byte[]> record : records) {
System.out.printf("offset = %d, key = %s, value = %s\n", record.offset(), record.key(), record.value());
videoContents.add(record.value());
}*/
return records;
}
public String producer(#RequestParam("message") String message) {
Map<String, Object> props = new HashMap<>();
// list of host:port pairs used for establishing the initial connections to the Kafka cluster
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
Producer<String, byte[]> producer = new KafkaProducer<>(props);
Path path = Paths.get("C:/Programming Files/video-2012-07-05-02-29-27.mp4");
ProducerRecord<String, byte[]> record = null;
try {
record = new ProducerRecord<>("topiccc", "keyyyyy"
, Files.readAllBytes(path));
} catch (IOException e) {
e.printStackTrace();
}
producer.send(record);
producer.close();
//kafkaSender.send(record);
return "Message sent to the Kafka Topic java_in_use_topic Successfully";
}
From the Kafka Java Code, the documentation on AUTO_OFFSET_RESET_CONFIG says the following:
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
This can be found here in GitHub:
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerConfig.java
We can see from this comment that the setting is only used when there is no offset on the server. In the question, the offset is retrieved from the server, and that's why the offset is not reset to the beginning but rather stays at the last committed offset, making it appear that there are no more records.
You would need to explicitly reset the offset on the server side to fix this as requested in the question.
Here is another answer that describes how that could be done.
https://stackoverflow.com/a/54492802/231860
This is a snippet of code that allowed me to reset the offset. NOTE: You can't call seekToBeginning if you call the subscribe method. I could only get it to work if I assign the partitions myself using the assign method. Pity.
// Create the consumer:
final Consumer<String, DataRecord> consumer = new KafkaConsumer<>(props);
// Get the partitions that exist for this topic:
List<PartitionInfo> partitions = consumer.partitionsFor(topic);
// Get the topic partition info for these partitions:
List<TopicPartition> topicPartitions = partitions.stream().map(info -> new TopicPartition(info.topic(), info.partition())).collect(Collectors.toList());
// Assign all the partitions to the topic so that we can seek to the beginning:
// NOTE: We can't use subscribe if we use assign, but we can't seek to the beginning if we use subscribe.
consumer.assign(topicPartitions);
// Make sure we seek to the beginning of the partitions:
consumer.seekToBeginning(topicPartitions);
Yes, it seems extremely complicated to achieve a seemingly rudimentary use case. This might indicate that the whole kafka world just seems to want to read streams once.
I usually create a new consumer with a different group.id to read the records again.
So do it like this:
props.put("group.id", String.valueOf(Instant.now().getEpochSecond()));
There is a workaround for this (not a production solution, though), which is to change the group.id configuration value each time you consume. Setting auto.offset.reset to earliest is not enough in many cases.
When you want one message to be consumed multiple times, the ideal way is to create consumers with different consumer groups, so the same message can be consumed by each of them.
But if you want the same consumer to consume the same message multiple times, then you can play with commits and offsets:
set the auto-commit interval very high, or disable auto-commit and commit as per your logic.
You can refer to this for more details: https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
This Javadoc provides detail on how to manually manage offsets.
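For example, a rough sketch of the manual-commit option, building on the question's properties (and assuming a client recent enough to have poll(Duration)): disable auto-commit and only commit when your own logic decides to.
props.put("enable.auto.commit", "false");

KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("topiccc"));

while (true) {
    ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, byte[]> record : records) {
        // process the record here ...
    }
    // Commit only when you are done with this batch; skip the commit (or seek back)
    // if you want the same messages to be delivered to this group again.
    consumer.commitSync();
}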
I am new to developing kafka-streams applications. My stream processor is meant to sort json messages based on a value of a user key in the input json message.
Message 1: {"UserID": "1", "Score":"123", "meta":"qwert"}
Message 2: {"UserID": "5", "Score":"780", "meta":"mnbvs"}
Message 3: {"UserID": "2", "Score":"0", "meta":"fghjk"}
I have read here Dynamically connecting a Kafka input stream to multiple output streams that there is no dynamic solution.
In my use case I know the user keys and the output topics that I need to sort the input stream into. So I am writing a separate processor application for each user, where each processor application matches a different UserID.
All the different stream processor applications read from the same JSON input topic in Kafka, but each one writes a message to the output topic of a specific user only if the preset user condition is met.
public class SwitchStream extends AbstractProcessor<String, String> {
@Override
public void process(String key, String value) {
HashMap<String, String> message = new HashMap<>();
ObjectMapper mapper = new ObjectMapper();
try {
message = mapper.readValue(value, HashMap.class);
} catch (IOException e){}
// User condition UserID = 1
if(message.get("UserID").equals("1")) {
context().forward(key, value);
context().commit();
}
}
public static void main(String[] args) throws Exception {
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sort-stream-processor");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
TopologyBuilder builder = new TopologyBuilder();
builder.addSource("Source", "INPUT_TOPIC");
builder.addProcessor("Process", SwitchStream::new, "Source");
builder.addSink("Sink", "OUTPUT_TOPIC", "Process");
KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();
}
}
Question 1:
Is it possible to achieve the same functionality easily using the high-level Streams DSL instead of the low-level Processor API? (I admit I found the other online examples of the high-level Streams DSL harder to understand and follow.)
Question 2:
The input JSON topic is receiving data at a high rate of 20K-25K EPS. My processor applications don't seem to be able to keep pace with this input stream. I have tried deploying multiple instances of each processor, but the results are nowhere close to where I want them to be. Ideally each processor instance should be able to process 3-5K EPS.
Is there a way to improve my processor logic, or to write the same processor logic using the high-level Streams DSL? Would that make a difference?
You can do this in the high-level DSL via filter() (you effectively implemented a filter, as you only forward a message if its UserID equals 1). You could generalize this filter pattern by using KStream#branch() (see the docs for further details: http://docs.confluent.io/current/streams/developer-guide.html#stateless-transformations). Also read the JavaDocs: http://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/streams
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> input = builder.stream("INPUT_TOPIC");
input.filter(new Predicate<String, String>() {
        @Override
        public boolean test(String key, String value) {
            // put your filter logic here, e.g. parse the JSON value
            // (as in your processor) and check the UserID
            try {
                Map<String, String> message = new ObjectMapper().readValue(value, HashMap.class);
                return "1".equals(message.get("UserID"));
            } catch (IOException e) {
                return false; // drop records that cannot be parsed
            }
        }
    })
    .to("OUTPUT_TOPIC");
About performance: a single instance should be able to process 10K+ records per second. It's hard to tell without any further information what the problem might be. I would recommend asking on the Kafka user list (see http://kafka.apache.org/contact).
I am trying to write a Kafka consumer to get messages which are produced and posted to a topic in Java. My consumer goes as follows.
consumer.java
import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.javaapi.message.ByteBufferMessageSet;
import kafka.message.MessageAndOffset;
public class KafkaConsumer extends Thread {
final static String clientId = "SimpleConsumerDemoClient";
final static String TOPIC = " AATest";
ConsumerConnector consumerConnector;
public static void main(String[] argv) throws UnsupportedEncodingException {
KafkaConsumer KafkaConsumer = new KafkaConsumer();
KafkaConsumer.start();
}
public KafkaConsumer(){
Properties properties = new Properties();
properties.put("zookeeper.connect","10.200.208.59:2181");
properties.put("group.id","test-group");
ConsumerConfig consumerConfig = new ConsumerConfig(properties);
consumerConnector = Consumer.createJavaConsumerConnector(consumerConfig);
}
@Override
public void run() {
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(TOPIC, new Integer(1));
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumerConnector.createMessageStreams(topicCountMap);
KafkaStream<byte[], byte[]> stream = consumerMap.get(TOPIC).get(0);
System.out.println(stream);
ConsumerIterator<byte[], byte[]> it = stream.iterator();
while(it.hasNext()) {
System.out.println("from it");
System.out.println(new String(it.next().message()));
}
}
private static void printMessages(ByteBufferMessageSet messageSet) throws UnsupportedEncodingException {
for(MessageAndOffset messageAndOffset: messageSet) {
ByteBuffer payload = messageAndOffset.message().payload();
byte[] bytes = new byte[payload.limit()];
payload.get(bytes);
System.out.println(new String(bytes, "UTF-8"));
}
}
}
When I run the above code I get nothing in the console, whereas the Java producer program behind the scenes is continuously posting data to the 'AATest' topic. Also, in the ZooKeeper console I get the following lines when I try running the above consumer.java:
[2015-04-30 15:57:31,284] INFO Accepted socket connection from /10.200.208.59:51780 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2015-04-30 15:57:31,284] INFO Client attempting to establish new session at /10.200.208.59:51780 (org.apache.zookeeper.server.ZooKeeperServer)
[2015-04-30 15:57:31,315] INFO Established session 0x14d09cebce30007 with negotiated timeout 6000 for client /10.200.208.59:51780 (org.apache.zookeeper.server.ZooKeeperServer)
Also, when I run a separate console consumer pointing to the AATest topic, I get all the data produced by the producer to that topic.
Both the consumer and the broker are on the same machine, whereas the producer is on a different machine. This actually resembles this question, but going through it didn't help me. Please help me.
Different answer, but in my case it happened to be the initial offset (auto.offset.reset) of the consumer. Setting auto.offset.reset=earliest fixed the problem in my scenario. It's because I was publishing the event first and then starting the consumer.
By default, the consumer only consumes events published after it started, because auto.offset.reset=latest by default.
eg. consumer.properties
bootstrap.servers=localhost:9092
enable.auto.commit=true
auto.commit.interval.ms=1000
session.timeout.ms=30000
auto.offset.reset=earliest
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
Test
class KafkaEventConsumerSpecs extends FunSuite {
case class TestEvent(eventOffset: Long, hashValue: Long, created: Date, testField: String) extends BaseEvent
test("given an event in the event-store, consumes an event") {
EmbeddedKafka.start()
//PRODUCE
val event = TestEvent(0l, 0l, new Date(), "data")
val config = new Properties() {
{
load(this.getClass.getResourceAsStream("/producer.properties"))
}
}
val producer = new KafkaProducer[String, String](config)
val persistedEvent = producer.send(new ProducerRecord(event.getClass.getSimpleName, event.toString))
assert(persistedEvent.get().offset() == 0)
assert(persistedEvent.get().checksum() != 0)
//CONSUME
val consumerConfig = new Properties() {
{
load(this.getClass.getResourceAsStream("/consumer.properties"))
put("group.id", "consumers_testEventsGroup")
put("client.id", "testEventConsumer")
}
}
assert(consumerConfig.getProperty("group.id") == "consumers_testEventsGroup")
val kafkaConsumer = new KafkaConsumer[String, String](consumerConfig)
assert(kafkaConsumer.listTopics().asScala.map(_._1).toList == List("TestEvent"))
kafkaConsumer.subscribe(Collections.singletonList("TestEvent"))
val events = kafkaConsumer.poll(1000)
assert(events.count() == 1)
EmbeddedKafka.stop()
}
}
But if the consumer is started first and the events are published afterwards, the consumer should be able to consume them without auto.offset.reset needing to be set to earliest.
References for kafka 0.10
https://kafka.apache.org/documentation/#consumerconfigs
In our case, we solved our problem with the following steps:
The first thing we found is that there is a config called retries for the KafkaProducer, and its default value means "no retry". Also, the send method of the KafkaProducer is asynchronous unless you call get() on the result of send. Without retries, there is no guarantee that a produced message is delivered to the corresponding broker. So you have to increase retries a bit, or use the idempotent or transactional mode of the KafkaProducer.
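For illustration, a producer configured along those lines might look like the sketch below (the topic name and the exact values are made up, not our actual configuration):
// Illustrative sketch: enable retries/idempotence and block on the broker ack so a
// failed send is not silently dropped.
void sendWithAck(String bootstrapServers) throws Exception {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.RETRIES_CONFIG, 5);               // older clients defaulted to 0, i.e. no retry
    props.put(ProducerConfig.ACKS_CONFIG, "all");
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // or use the transactional API instead

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
        // get() waits for the broker acknowledgement, so a delivery failure surfaces as an exception
        producer.send(new ProducerRecord<>("my-topic", "key", "value")).get();
    }
}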
The second issue is about the Kafka and ZooKeeper versions. We chose Kafka 1.0.0 and ZooKeeper 3.4.4. In particular, Kafka 1.0.0 had an issue with ZooKeeper connectivity: if Kafka loses its connection to ZooKeeper with an unexpected exception, it loses the leadership of the partitions which have not synced yet. There is a bug report about this issue:
https://issues.apache.org/jira/browse/KAFKA-2729
After we found entries in the Kafka logs indicating the same issue as the report above, we upgraded our Kafka broker version to 1.1.0.
It is also important to notice that a small number of partitions (like 100 or less) increases the throughput of the producer, so if there are not enough consumers, the available consumers end up with stuck threads and delayed messages (we measured delays of approximately 10-15 minutes). So you need to balance and configure the partition count and the thread counts of your application correctly according to your available resources.
There might also be a case where Kafka takes a long time to rebalance consumer groups when a new consumer is added to the same group id.
Check the Kafka logs to see whether the group is rebalanced after starting your consumer.