How to programmatically get the latest offset of a Kafka topic in Java

Here is what I am trying:
Collection<TopicPartition> partitions = consumer.partitionsFor(topic).stream();
I also want to know how to tell that I have reached the end of a partition, i.e. that there are no more messages to consume, and how to handle the case where my consumer's offset doesn't match the broker's end offset at that time.
Any suggestions?

In order to get the latest offset you can either use the command line:
./bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
--broker-list localhost:9092 \
--topic topicName
or programmatically in Java:
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProperties)) {
    consumer.subscribe(Arrays.asList("topicName"));
    Set<TopicPartition> assignment;
    while ((assignment = consumer.assignment()).isEmpty()) {
        consumer.poll(Duration.ofMillis(500));
    }
    consumer.endOffsets(assignment).forEach((partition, offset) -> System.out.println(partition + ": " + offset));
}
Now if you want to force the consumer to start consuming from the latest offset, you can either use the following property:
props.put("auto.offset.reset", "latest");
// or props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
or force it to seek to the latest offset of its assigned partitions using
consumer.seekToEnd(consumer.assignment());
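To address the second part of the question, a sketch (assuming the consumer set up above) for checking that you have caught up is to compare the consumer's current position with the broker's end offsets:
Set<TopicPartition> assignment = consumer.assignment();
Map<TopicPartition, Long> endOffsets = consumer.endOffsets(assignment);
// true once there is nothing left to consume right now in any assigned partition
boolean caughtUp = assignment.stream()
        .allMatch(tp -> consumer.position(tp) >= endOffsets.get(tp));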

Related

Run kafka consumer without specifying partition

I have been learning Kafka recently, and my consumer can't consume any records unless I specify the --partition 0 parameter. In other words, I can NOT consume records like:
kafka-console-consumer --bootstrap-server 127.0.0.10:9092 --topic first-topic
but works like:
kafka-console-consumer --bootstrap-server 127.0.0.10:9092 --topic first-topic --partition 0
THE MAIN PROBLEM IS: when I moved to Java code, my KafkaConsumer class can't fetch records, and I need to know how to specify the partition number in the Java KafkaConsumer.
My current Java code is:
public class ConsumerDemo {
    public static void main(String[] args) {
        Logger logger = LoggerFactory.getLogger(ConsumerDemo.class.getName());
        String bootstrapServer = "127.0.0.10:9092";
        String groupId = "my-kafka-java-app";
        String topic = "first-topic";

        // create consumer configs
        Properties properties = new Properties();
        properties.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServer);
        //properties.setProperty(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, partition);
        properties.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        // create consumer
        KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(properties);

        // subscribe consumer to our topic (subscription to one topic)
        consumer.subscribe(Collections.singleton(topic));

        // poll for new data
        while (true) {
            //consumer.poll(100); old way
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                logger.info("Key: " + record.key() + ", Value: " + record.value());
                logger.info("Partition: " + record.partition() + ", Offset: " + record.offset());
            }
        }
    }
}
After a lot of inspection, my solution turned out to be using consumer.assign and consumer.seek instead of consumer.subscribe, and without specifying a groupId. But I feel there should be a more optimal solution.
The Java code would then be:
// create consumer
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(properties);
// subscribe consumer to our topic
//consumer.subscribe(Collections.singleton(topic)); //means subscription to one topic
// assign and seek are mostly used to replay data or to fetch a specific message
TopicPartition partitionToReadFrom = new TopicPartition(topic, 0);
long offsetToReadFrom = 15L;
// assign
consumer.assign(Arrays.asList(partitionToReadFrom));
// seek: for a specific offset to read from
consumer.seek(partitionToReadFrom, offsetToReadFrom);
The way you are doing it is correct. You don't need to specify the partition when subscribing to a topic. It may be that your consumer group has already consumed all the messages in the topic and committed the latest offsets.
Make sure new messages are being produced while you run your application, or create a new consumer group to consume from the beginning (if you keep ConsumerConfig.AUTO_OFFSET_RESET_CONFIG set to "earliest").
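If you want to re-read the topic from the beginning with the same subscribe-based code and the same group id, one possible approach (a sketch, not part of the original answer) is to wait for the group assignment and then seek back:
consumer.subscribe(Collections.singleton(topic));
// poll until the group coordinator has assigned partitions to this consumer
while (consumer.assignment().isEmpty()) {
    consumer.poll(Duration.ofMillis(100));
}
// rewind every assigned partition to its earliest offset, regardless of committed offsets
consumer.seekToBeginning(consumer.assignment());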
As the name implies, the ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG property is meant to configure a partition assignment strategy, not to pin the consumer to a fixed partition the way the command-line --partition option does.
The default strategy is the RangeAssignor, which can be changed, for example, to the StickyAssignor as follows:
properties.setProperty(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, StickyAssignor.class.getName());
You can read more in the Kafka Client-side Assignment Proposal.

Retrieve from kafka last offsets for each partition in specified topic

Kafka gives useful command line tool kafka.tools.GetOffsetShell, but I need its functionality in my application.
I want to get all offsets for each partition in specified topic, like that:
bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list kafka:9092 --topic com.group.test.Foo
com.group.test.Foo:0:10
com.group.test.Foo:1:11
com.group.test.Foo:2:10
But I don't want to run process bin/kafka-run-class.sh kafka.tools.GetOffsetShell.
How can I do the same using kafka api in Java?
Do I have to create a consumer and invoke KafkaConsumer#position for each TopicPartition? Is there a simpler way?
By default, GetOffsetShell returns the end offset for each partition. You can retrieve those offsets programmatically like this:
......
try (final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProperties)) {
    consumer.subscribe(Arrays.asList("topicName"));
    Set<TopicPartition> assignment;
    while ((assignment = consumer.assignment()).isEmpty()) {
        consumer.poll(Duration.ofMillis(100));
    }
    consumer.endOffsets(assignment).forEach((tp, offset) -> System.out.println(tp + ": " + offset));
}
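If you would rather not create a consumer at all, newer client versions (roughly 2.5 and later) expose the same information through the Admin API. A minimal sketch, assuming a topic named topicName with 3 partitions and adminProperties containing bootstrap.servers:
try (Admin admin = Admin.create(adminProperties)) {
    // ask the broker for the latest (end) offset of each partition
    Map<TopicPartition, OffsetSpec> request = new HashMap<>();
    for (int p = 0; p < 3; p++) {
        request.put(new TopicPartition("topicName", p), OffsetSpec.latest());
    }
    ListOffsetsResult result = admin.listOffsets(request);
    for (TopicPartition tp : request.keySet()) {
        System.out.println(tp + ": " + result.partitionResult(tp).get().offset());
    }
}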

Kafka streams error : SerializationException: Size of data received by LongDeserializer is not 8

I am trying Kafka Streams. Writing a simple application where I am counting duplicate messages.
Message:
2019-02-27-11:16:56 :: session:prod-111656 :: Msg => Hello World: 2491
2019-02-27-11:16:56 :: session:prod-111656 :: Msg => Hello World: 2492
etc.
I am trying to split each message on session:prod-xxxx and use that as the key, and use session:prod-xxxx + Hello World: xxxx as the value. Then I group by key and see which messages got duplicated in each session.
Here's the code:
KStream<String, String> textLines = builder.stream("RegularProducer");
KTable<String, Long> ktable = textLines.map(
    (String key, String value) -> {
        try {
            String[] parts = value.split("::");
            String sessionId = parts[1];
            String message = ((parts[2]).split("=>"))[1];
            message = sessionId + ":" + message;
            return new KeyValue<String, String>(sessionId.trim().toLowerCase(), message.trim().toLowerCase());
        } catch (Exception e) {
            return new KeyValue<String, String>("Invalid-Message".trim().toLowerCase(), "Invalid Message".trim().toLowerCase());
        }
    })
    .groupBy((key, value) -> value)
    .count()
    .filter((String key, Long value) -> {
        return value > 1;
    });

ktable.toStream().to("RegularProducerDuplicates",
    Produced.with(Serdes.String(), Serdes.Long()));

Topology topology = builder.build();
topology.describe();
KafkaStreams streams = new KafkaStreams(topology, props);
streams.start();
The KTable topic RegularProducerDuplicates gets produced. But when I use the console consumer to view it, it crashes with an error. If I then use the --skip-message-on-error flag on the console consumer, I see thousands of lines like these:
session:prod-111656 : hello world: 994 [2019-02-28 16:25:18,081] ERROR Error processing message, skipping this message: (kafka.tools.ConsoleConsumer$)
org.apache.kafka.common.errors.SerializationException: Size of data received by LongDeserializer is not 8
Can anyone help me figure out what's going wrong here?
Your Kafka Streams application is OK and works properly.
The bug is in kafka-console-consumer (kafka.tools.ConsoleConsumer is the class that implements the script's logic).
It doesn't properly handle null during deserialization. When it gets null as the key or value of a message, it substitutes a default value (an array of bytes representing the string "null"). If you check the source code you can find the following function:
def write(deserializer: Option[Deserializer[_]], sourceBytes: Array[Byte]) {
  val nonNullBytes = Option(sourceBytes).getOrElse("null".getBytes(StandardCharsets.UTF_8))
  val convertedBytes = deserializer.map(_.deserialize(null, nonNullBytes).toString.
    getBytes(StandardCharsets.UTF_8)).getOrElse(nonNullBytes)
  output.write(convertedBytes)
}
As you can see, when the sourceBytes it receives for deserialization is null (sourceBytes == null), it sets a default value for it:
val nonNullBytes = Option(sourceBytes).getOrElse("null".getBytes(StandardCharsets.UTF_8))
In your case that is "null".getBytes(StandardCharsets.UTF_8). It then tries to deserialize those bytes with org.apache.kafka.common.serialization.LongDeserializer (your value deserializer). LongDeserializer checks the size of the byte array at the very beginning. It is now 4 (the byte representation of "null") and an exception is thrown.
If you use StringDeserializer, for example, it will not deserialize the value meaningfully, but at least it won't throw an exception, because it doesn't check the length of the byte array.
Long story short: the ConsoleConsumer formatter responsible for printing substitutes a default value for nulls, and that default can't be handled by some deserializers (LongDeserializer, IntegerDeserializer).
Regarding why your application produces null values for some keys:
KTable::filter has different semantics than KStream::filter. According to the javadoc for KTable:
for each record that gets dropped (i.e., does not satisfy the given predicate) a tombstone record is forwarded.
For your filter, when the count is <= 1, it forwards a null value (a tombstone) for that key.
The deserializer used for the values should match their type: here the values are Long, not String.
Specify it when creating the consumer on the CLI. For example:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
--topic name \
--from-beginning \
--formatter kafka.tools.DefaultMessageFormatter \
--property print.key=true \
--property print.value=true \
--skip-message-on-error \
--property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
--property value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
Note the last two lines when creating the consumer, and take care to match the types of your keys and values.
In my case both were strings; if the values had been of type Long, the last line would be:
--property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
I had the same issue and found that if I move the filter after toStream, the null values (tombstones) are not produced.
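A rough sketch of that rearrangement, based on the topology in the question (parse(...) is a hypothetical helper standing in for the try/catch mapping above):
textLines
    .map((key, value) -> parse(value))   // parse(...) is hypothetical: the session/message extraction from above
    .groupBy((key, value) -> value)
    .count()
    .toStream()
    .filter((key, count) -> count > 1)   // filtering the stream drops records instead of emitting tombstones
    .to("RegularProducerDuplicates", Produced.with(Serdes.String(), Serdes.Long()));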

Consumer: how to specify the partition to read? [kafka]

I am learning Kafka and I want to know how to specify the partition when I consume messages from a topic.
I have found several pictures like this:
It means that a consumer can consume messages from several partitions but a partition can only be read by a single consumer (within a consumer group).
Also, I have read several consumer examples, and they look like this:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "consumer-tutorial");
props.put("key.deserializer", StringDeserializer.class.getName());
props.put("value.deserializer", StringDeserializer.class.getName());
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
and:
Subscribe:
consumer.subscribe(Arrays.asList("foo", "bar"));
Poll
try {
    while (running) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> record : records)
            System.out.println(record.offset() + ": " + record.value());
    }
} finally {
    consumer.close();
}
How does this work? From which partition will I read messages?
There are two ways to tell which topics/partitions you want to consume: KafkaConsumer#assign() (you specify the partitions you want and the offset to start from) and KafkaConsumer#subscribe() (you join a consumer group, and partitions/offsets are assigned dynamically by the group coordinator depending on the consumers in the same consumer group, and may change at runtime).
In both cases, you need to poll to receive data.
See https://kafka.apache.org/0110/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html, especially the sections "Consumer Groups and Topic Subscriptions" and "Manual Partition Assignment".
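A minimal sketch of both approaches, assuming the props from the example above and a topic named foo with at least two partitions:
// Option 1: subscribe - join the consumer group; the group coordinator
// decides which partitions of "foo" this consumer receives.
KafkaConsumer<String, String> groupConsumer = new KafkaConsumer<>(props);
groupConsumer.subscribe(Arrays.asList("foo"));

// Option 2: assign - bypass group management and read exactly partition 1,
// starting from offset 0 (or wherever you seek to).
KafkaConsumer<String, String> manualConsumer = new KafkaConsumer<>(props);
TopicPartition p1 = new TopicPartition("foo", 1);
manualConsumer.assign(Arrays.asList(p1));
manualConsumer.seek(p1, 0L);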

Offset missing from Kafka logs - Simple Consumer unable to proceed

I have a 3-node Kafka cluster set up. I am using Storm to read messages from Kafka. Each topic in my system has 7 partitions.
Now I am facing a weird problem. Until 3 days ago, everything was working fine. However, now it seems my Storm topology is unable to read from 2 specific partitions, #1 and #4.
I tried to drill down to the problem and found that in my Kafka logs, for both of these partitions, one offset is missing, i.e. after 5964511 the next offset is 5964513, not 5964512.
Due to the missing offset, the SimpleConsumer is not able to proceed to the next offsets. Am I doing something wrong, or is this a known bug?
What could possibly be the reason for such behaviour?
I am using the following code to read the window of valid offsets:
public static long getLastOffset(SimpleConsumer consumer, String topic, int partition,
                                 long whichTime, String clientName) {
    TopicAndPartition topicAndPartition = new TopicAndPartition(topic, partition);
    Map<TopicAndPartition, PartitionOffsetRequestInfo> requestInfoMap = new HashMap<TopicAndPartition, PartitionOffsetRequestInfo>();
    requestInfoMap.put(topicAndPartition, new PartitionOffsetRequestInfo(kafka.api.OffsetRequest.LatestTime(), 100));
    OffsetRequest request = new OffsetRequest(requestInfoMap, kafka.api.OffsetRequest.CurrentVersion(), clientName);
    OffsetResponse response = consumer.getOffsetsBefore(request);
    long[] validOffsets = response.offsets(topic, partition);
    for (long validOffset : validOffsets) {
        System.out.println(validOffset + " : ");
    }
    long largestOffset = validOffsets[0];
    long smallestOffset = validOffsets[validOffsets.length - 1];
    System.out.println(smallestOffset + " : " + largestOffset);
    return largestOffset;
}
This gives me the following output:
4529948 : 6000878
So, the offset I am providing is well within the offset range.
Sorry for the late answer, but...
I handle this case by keeping a Long instance variable that holds the next offset to read, and then checking after the fetch whether the returned FetchResponse hasError(). If there was an error, I change the next-offset value to something reasonable (it could be the next offset or the last available offset) and try again.
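A rough sketch of that pattern with the old SimpleConsumer API (nextOffset is the instance variable described above, and getLastOffset is the helper from the question):
FetchRequest req = new FetchRequestBuilder()
        .clientId(clientName)
        .addFetch(topic, partition, nextOffset, 100000)
        .build();
FetchResponse fetchResponse = consumer.fetch(req);

if (fetchResponse.hasError()) {
    short code = fetchResponse.errorCode(topic, partition);
    if (code == ErrorMapping.OffsetOutOfRangeCode()) {
        // the requested offset is invalid or missing; jump to a known-valid offset
        // (here the latest one) and retry the fetch
        nextOffset = getLastOffset(consumer, topic, partition,
                kafka.api.OffsetRequest.LatestTime(), clientName);
    }
}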
