Kafka consumer for multiple topics - Java

I have a list of topics (currently 10) that may grow in the future. I know we can spawn one thread per topic to consume from each, but then the number of threads grows with the number of topics, which I do not want: the topics will not receive data very often, so the threads would sit idle.
Is there any way to have a single consumer consume from all topics? If so, how can we achieve it? Also, how will the offsets be maintained by Kafka? Please suggest answers.

We can subscribe to multiple topics using the following API:
consumer.subscribe(Arrays.asList(topic1, topic2), rebalanceListener); // rebalanceListener is a ConsumerRebalanceListener
Each polled record carries its topic and partition, and we can commit using consumer.commitAsync() or consumer.commitSync() by creating an OffsetAndMetadata object as follows.
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(timeoutMs)); // timeoutMs: poll timeout in milliseconds
for (TopicPartition partition : records.partitions()) {
    // records of one topic partition, in offset order
    List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
    for (ConsumerRecord<String, String> record : partitionRecords) {
        System.out.println(record.offset() + ": " + record.value());
    }
    long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
    // commit the offset of the next record to read for this partition
    consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
}
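The lastOffset + 1 in the commit above reflects that a committed offset names the next record to read, not the last one processed. The following is a minimal sketch of that arithmetic using plain Java collections instead of the Kafka client; the class name CommitOffsetSketch and the "topic-partition" string keys are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Kafka client types; not part of the Kafka API.
final class CommitOffsetSketch {

    /**
     * For each partition (keyed here as "topic-partition"), returns the offset
     * to commit: the offset of the last record in the batch plus one.
     */
    static Map<String, Long> offsetsToCommit(Map<String, List<Long>> polledOffsets) {
        Map<String, Long> toCommit = new LinkedHashMap<>();
        for (Map.Entry<String, List<Long>> entry : polledOffsets.entrySet()) {
            List<Long> offsets = entry.getValue();
            if (offsets.isEmpty()) {
                continue; // nothing was consumed from this partition in this batch
            }
            long lastOffset = offsets.get(offsets.size() - 1);
            toCommit.put(entry.getKey(), lastOffset + 1);
        }
        return toCommit;
    }

    public static void main(String[] args) {
        Map<String, List<Long>> polled = new LinkedHashMap<>();
        polled.put("topic1-0", Arrays.asList(5L, 6L, 7L));
        polled.put("topic2-0", Arrays.asList(40L));
        System.out.println(offsetsToCommit(polled));
    }
}
```

With the sample batch in main, the offsets to commit are 8 for topic1-0 and 41 for topic2-0, so a restarted consumer would resume at the first unread record of each partition.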

There is no need for multiple threads; a single consumer can consume from multiple topics.
With the Java KafkaConsumer used above, committed offsets are stored by the brokers in the internal __consumer_offsets topic, per consumer group, topic and partition. Only the old ZooKeeper-based consumer kept them in ZooKeeper.
Whenever the consumer commits, the offset of the next message to read is recorded for each partition. So even after a Kafka failure, the consumer resumes from the last committed offset; messages processed after that commit may be delivered again.

Related

How to manage Async in failure?

I'm working on a requirement where I need to consume messages from a Kafka broker. The message rate is very high, which is why I've chosen the async commit mechanism.
I want to know what happens if, while consuming messages, the connection to the broker breaks or the broker itself fails for any reason, so that the offsets cannot be committed back to the broker. After restarting, I have to consume the same messages again, because they were consumed earlier but not committed to the broker.
private static void startConsumer() {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("consumed: key = %s, value = %s, partition id= %s, offset = %s%n",
                    record.key(), record.value(), record.partition(), record.offset());
        }
        if (records.isEmpty()) {
            System.out.println("-- terminating consumer --");
            break;
        }
        printOffsets("before commitAsync() call", consumer, topicPartition);
        consumer.commitAsync();
        printOffsets("after commitAsync() call", consumer, topicPartition);
    }
    printOffsets("after consumer loop", consumer, topicPartition);
}
May I know what can be done to overcome this situation, so that I don't need to consume the same messages again after a restart?
You need to manage your offsets on your own in an atomic way. That means you need to build your own "transaction" around
fetching data from Kafka,
processing data, and
storing processed offsets externally (or in your case printing it to the logs).
The methods commitSync and commitAsync will not get you far here as they can only ensure at-most-once or at-least-once processing within the Consumer. In addition, it is beneficial that your processing is idempotent.
There is a nice blog that explains such an implementation making use of the ConsumerRebalanceListener and storing the offsets in your local file system. A full code example is also provided.
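As a rough illustration of the idea above (not the implementation from the blog), the following sketch keeps the processed results and the next offset to read in the same store and skips any record that was already handled. The class OffsetTrackingProcessor and its in-memory maps are hypothetical; a real system would write both pieces of state to a database or file in one transaction.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: an in-memory stand-in for an external store that keeps
// processed results and the next offset to read together.
final class OffsetTrackingProcessor {
    private final Map<Integer, Long> nextOffsetByPartition = new HashMap<>();
    private final List<String> processedValues = new ArrayList<>();

    /** Processes a record unless its offset was already handled; returns true if it was processed. */
    synchronized boolean process(int partition, long offset, String value) {
        long next = nextOffsetByPartition.getOrDefault(partition, 0L);
        if (offset < next) {
            return false; // redelivered after a lost commit: skip it
        }
        // In a real system these two writes would belong to one transaction.
        processedValues.add(value);
        nextOffsetByPartition.put(partition, offset + 1);
        return true;
    }

    /** Offset to resume from after a restart (0 if nothing was processed yet). */
    synchronized long nextOffset(int partition) {
        return nextOffsetByPartition.getOrDefault(partition, 0L);
    }

    /** Snapshot of the values processed so far. */
    synchronized List<String> processedValues() {
        return new ArrayList<>(processedValues);
    }
}
```

After a restart, the stored value of nextOffset(partition) is what one would pass to consumer.seek for that partition, typically from ConsumerRebalanceListener.onPartitionsAssigned, instead of relying on the offsets committed to Kafka.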

creating one kafka consumer for several topics

I want to create a single Kafka consumer for several topics. The consumer's subscribe method accepts a list of topics, like this:
private Consumer<String, byte[]> createConsumer() {
    Properties props = getConsumerProps();
    Consumer<String, byte[]> consumer = new KafkaConsumer<>(props);
    // build the fully qualified topic names and subscribe to all of them at once
    List<String> topicMISL = new ArrayList<>();
    for (String s : Connect2Redshift.kafkaTopics) {
        topicMISL.add(systemID + "." + s);
    }
    consumer.subscribe(topicMISL);
    return consumer;
}
private boolean consumeMessages(Duration duration, Consumer<String, byte[]> consumer) {
    try {
        long start = System.currentTimeMillis();
        ConsumerRecords<String, byte[]> consumerRecords = consumer.poll(duration);
        // (the rest of the method is omitted in the question)
    }
}
Afterwards I want to poll records from Kafka into a stream every 3 seconds and process them, but I wonder what happens inside this consumer: how are records from different topics polled? One topic first and then another, or in parallel? Could it be that one topic with a large number of messages is processed all the time while another topic with few messages has to wait?
In general, it depends on your topic settings. Kafka scales by using multiple partitions per topic.
If you have 3 partitions on 1 topic, Kafka can read from them in parallel.
The same is true for multiple topics: reading can happen in parallel.
If one partition receives many more messages than the others, you may see consumer lag on that particular partition. Tuning the batch size and consumer settings may help, as may compressing messages.
Ideally, distributing the load evenly avoids this scenario.
Look into this blog article, it gave me a good understanding of the internals: https://www.confluent.io/blog/configure-kafka-to-minimize-latency/
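ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(timeoutMs)); // timeoutMs: poll timeout in milliseconds
for (TopicPartition partition : records.partitions()) {
    List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
    for (ConsumerRecord<String, String> record : partitionRecords) {
        // process each record of this topic partition here
    }
}
You also need to commit the offsets: take the offset of the last record processed in each partition and commit the next offset (last offset + 1) using consumer.commitSync.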

Is there a way to get the last message from Kafka topic?

I have a Kafka topic with multiple partitions and I wonder if there is a way in Java to fetch the last message for the topic. I don't care about the partitions; I just want to get the latest message.
I have tried @KafkaListener, but it fetches a message only when the topic is updated. If nothing is published after the application starts, nothing is returned.
Maybe the listener is not the right approach to the problem at all?
The following snippet worked for me. You may try it. The explanation is in the comments.
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
consumer.subscribe(Collections.singletonList(topic));
consumer.poll(Duration.ofSeconds(10));
consumer.assignment().forEach(System.out::println);
AtomicLong maxTimestamp = new AtomicLong();
AtomicReference<ConsumerRecord<String, String>> latestRecord = new AtomicReference<>();
// get the last offsets for each partition
consumer.endOffsets(consumer.assignment()).forEach((topicPartition, offset) -> {
    System.out.println("offset: " + offset);
    // seek to the last offset of each partition
    consumer.seek(topicPartition, (offset == 0) ? offset : offset - 1);
    // poll to get the last record in each partition
    consumer.poll(Duration.ofSeconds(10)).forEach(record -> {
        // the latest record in the 'topic' is the one with the highest timestamp
        if (record.timestamp() > maxTimestamp.get()) {
            maxTimestamp.set(record.timestamp());
            latestRecord.set(record);
        }
    });
});
System.out.println(latestRecord.get());
You'll have to consume the latest message from each partition and then do a comparison on the client side (using the timestamp on the message, if it contains it). The reason for this is that Kafka does not guarantee inter-partition ordering. Inside a partition, you can be sure that the message with the largest offset is the latest message pushed to it.
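The client-side comparison described above can be sketched without the Kafka client as follows. Rec is a hypothetical, minimal stand-in for ConsumerRecord, and the list is assumed to hold the last record already fetched from each partition.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; Rec is a minimal stand-in for the client's ConsumerRecord.
final class LatestRecordSketch {

    static final class Rec {
        final int partition;
        final long offset;
        final long timestamp;
        final String value;

        Rec(int partition, long offset, long timestamp, String value) {
            this.partition = partition;
            this.offset = offset;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Returns the record with the highest timestamp, or empty if no partition had a record. */
    static Optional<Rec> latest(List<Rec> lastRecordPerPartition) {
        return lastRecordPerPartition.stream()
                .max(Comparator.comparingLong((Rec r) -> r.timestamp));
    }

    public static void main(String[] args) {
        List<Rec> tails = Arrays.asList(
                new Rec(0, 41L, 1_700_000_000_500L, "from p0"),
                new Rec(1, 7L, 1_700_000_000_900L, "from p1"),
                new Rec(2, 130L, 1_700_000_000_100L, "from p2"));
        latest(tails).ifPresent(r -> System.out.println(r.value));
    }
}
```

In the sample data, partition 2 has the highest offset but the oldest timestamp, which is why offsets cannot be compared across partitions and the timestamp is used instead.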

Kafka one consumer one partition

I have a use case where I have a single topic with 100 partitions where messages go in each partition with some logic and I have 100 consumers who reads this message. I want to map a specific partition to a specific consumer. How can I achieve that?
Check out the Javadoc for the KafkaConsumer, specifically the section "Manual Partition Assignment".
TL;DR
You can manually assign specific partitions to a consumer as follows:
String topic = "foo";
TopicPartition partition0 = new TopicPartition(topic, 0);
TopicPartition partition1 = new TopicPartition(topic, 1);
consumer.assign(Arrays.asList(partition0, partition1));
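For the 100-consumer case in the question, each consumer still has to know which partition numbers to pass to assign. One possible convention, sketched here as an assumption rather than anything Kafka prescribes, is to give every consumer a fixed index and derive its partitions from it. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; Kafka does not provide or require this mapping.
final class StaticPartitionMapping {

    /**
     * Returns the partition numbers for the consumer with the given index:
     * every partition p where p % consumerCount == consumerIndex.
     */
    static List<Integer> partitionsFor(int consumerIndex, int consumerCount, int partitionCount) {
        if (consumerCount <= 0 || consumerIndex < 0 || consumerIndex >= consumerCount) {
            throw new IllegalArgumentException("consumerIndex must be in [0, consumerCount)");
        }
        List<Integer> partitions = new ArrayList<>();
        for (int p = consumerIndex; p < partitionCount; p += consumerCount) {
            partitions.add(p);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // 100 consumers, 100 partitions: consumer 42 owns exactly partition 42
        System.out.println(partitionsFor(42, 100, 100));
    }
}
```

Each returned number would then be wrapped in a TopicPartition for the topic and handed to consumer.assign. With as many consumers as partitions the mapping is one-to-one; with fewer consumers, each one gets several partitions.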

Kafka pattern subscription. Rebalancing is not being triggered on new topic

According to the documentation on kafka javadocs if I:
Subscribe to a pattern
Create a topic that matches the pattern
A rebalance should occur, which makes the consumer read from that new topic. But that's not happening.
If I stop and start the consumer, it does pick up the new topic. So I know the new topic matches the pattern. There's a possible duplicate of this question in https://stackoverflow.com/questions/37120537/whitelist-filter-in-kafka-doesnt-pick-up-new-topics but that question got nowhere.
I've looked at the Kafka logs and there are no errors; it just doesn't trigger a rebalance. The rebalance is triggered when consumers join or die, but not when new topics are created (not even when partitions are added to existing topics, but that's another subject).
I'm using kafka 0.10.0.0, and the official Java client for the "New Consumer API", meaning broker GroupCoordinator instead of fat client + zookeeper.
This is the code for the sample consumer:
public class SampleConsumer {
    public static void main(String[] args) throws IOException {
        KafkaConsumer<String, String> consumer;
        try (InputStream props = Resources.getResource("consumer.props").openStream()) {
            Properties properties = new Properties();
            properties.load(props);
            properties.setProperty("group.id", "my-group");
            System.out.println(properties.get("group.id"));
            consumer = new KafkaConsumer<>(properties);
        }
        Pattern pattern = Pattern.compile("mytopic.+");
        consumer.subscribe(pattern, new SampleRebalanceListener());
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("%s %s\n", record.topic(), record.value());
            }
        }
    }
}
In the producer, I'm sending messages to topics named mytopic1, mytopic2, etc.
Patterns are pretty much useless if the rebalance is not triggered.
Do you know why the rebalance is not happening?
The documentation mentions "The pattern matching will be done periodically against topics existing at the time of check.". It turns out that "periodically" corresponds to the metadata.max.age.ms property. By setting that property (inside "consumer.props" in my code sample) to e.g. 5000, I can see it detect new topics and partitions every 5 seconds.
This is as designed, according to this jira ticket https://issues.apache.org/jira/browse/KAFKA-3854:
The final note on the JIRA stating that a later created topic that matches a consumer's subscription pattern would not be assigned to the consumer upon creation seems to be as designed. A repeat subscribe() to the same pattern would be needed to handle that case.
The refresh metadata polling does the "repeat subscribe()" mentioned in the ticket.
This is confusing coming from Kafka 0.8, where there was true triggering based on ZooKeeper watches instead of polling. IMO 0.9 is more of a downgrade for this scenario: instead of "just in time" rebalancing, this becomes either high-frequency polling with overhead, or low-frequency polling with long delays before it reacts to new topics/partitions.
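For readers who configure the consumer in code rather than in consumer.props, the setting from this answer could be applied roughly as in the sketch below. The class and method names are hypothetical, and the wouldMatch helper assumes the client tests topic names with a full-string regex match.

```java
import java.util.Properties;
import java.util.regex.Pattern;

// Hypothetical helper around the settings discussed in this answer.
final class PatternRefreshSketch {

    /** Returns a copy of the given properties with metadata.max.age.ms set to refreshMs. */
    static Properties withMetadataRefresh(Properties base, long refreshMs) {
        Properties props = new Properties();
        props.putAll(base);
        props.setProperty("metadata.max.age.ms", Long.toString(refreshMs));
        return props;
    }

    /** Assumption: a topic is picked up when the whole name matches the pattern. */
    static boolean wouldMatch(Pattern pattern, String topic) {
        return pattern.matcher(topic).matches();
    }

    public static void main(String[] args) {
        Properties props = withMetadataRefresh(new Properties(), 5000L);
        System.out.println(props.getProperty("metadata.max.age.ms"));
        System.out.println(wouldMatch(Pattern.compile("mytopic.+"), "mytopic1"));
    }
}
```

The trade-off is the one described above: a small value makes new topics visible sooner at the cost of more frequent metadata requests, while the default of 300000 ms (5 minutes) means a matching topic can go unnoticed for several minutes.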
To trigger a rebalance immediately, you can explicitly call poll after subscribing to the topic:
kafkaConsumer.poll(pollDuration);
Refer to:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-568%3A+Explicit+rebalance+triggering+on+the+Consumer
In your consumer code, use the following:
properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
and try again.
