I have a Kafka topic with multiple partitions and I wonder if there is a way in Java to fetch the last message for the topic. I don't care for the partitions I just want to get the latest message.
I have tried #KafkaListener but it fetches the message only when the topic is updated. If there is nothing published after the application is opened nothing is returned.
Maybe the listener is not the right approach to the problem at all?
This following snippet worked for me. You may try this. Explanation in the comments.
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
consumer.subscribe(Collections.singletonList(topic));
consumer.poll(Duration.ofSeconds(10));
consumer.assignment().forEach(System.out::println);
AtomicLong maxTimestamp = new AtomicLong();
AtomicReference<ConsumerRecord<String, String>> latestRecord = new AtomicReference<>();
// get the last offsets for each partition
consumer.endOffsets(consumer.assignment()).forEach((topicPartition, offset) -> {
System.out.println("offset: "+offset);
// seek to the last offset of each partition
consumer.seek(topicPartition, (offset==0) ? offset:offset - 1);
// poll to get the last record in each partition
consumer.poll(Duration.ofSeconds(10)).forEach(record -> {
// the latest record in the 'topic' is the one with the highest timestamp
if (record.timestamp() > maxTimestamp.get()) {
maxTimestamp.set(record.timestamp());
latestRecord.set(record);
}
});
});
System.out.println(latestRecord.get());
You'll have to consume the latest message from each partition and then do a comparison on the client side (using the timestamp on the message, if it contains it). The reason for this is that Kafka does not guarantee inter-partition ordering. Inside a partition, you can be sure that the message with the largest offset is the latest message pushed to it.
Related
I'm working on one requirement where I need to consume messages from Kafka broker. The frequency is very high, so that's why I've choosen Async mechanism.
I want to know, while consuming messages, lets say connection break down with broker or broker itself failed due to any reason and offset could not get commit back to broker. So after restarting, I've to consume same messages again which was consumed earlier but not commited back in broker.
private static void startConsumer() {
while (true) {
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
for (ConsumerRecord<String, String> record : records) {
System.out.printf("consumed: key = %s, value = %s, partition id= %s, offset = %s%n",
record.key(), record.value(), record.partition(), record.offset());
}
if (records.isEmpty()) {
System.out.println("-- terminating consumer --");
break;
}
printOffsets("before commitAsync() call", consumer, topicPartition);
consumer.commitAsync();
printOffsets("after commitAsync() call", consumer, topicPartition);
}
printOffsets("after consumer loop", consumer, topicPartition);
}
may I know please, what can be done to overcome this situation where I dont need to consume same message again after restart ?
You need to manage your offsets on your own in a atomic way. That means, you need build your own "transaction" around
fetching data from Kafka,
processing data, and
storing processed offsets externally (or in your case printing it to the logs).
The methods commitSync and commitAsync will not get you far here as they can only ensure at-most-once or at-least-once processing within the Consumer. In addition, it is beneficial that your processing is idempotent.
There is a nice blog that explains such an implementation making use of the ConsumerRebalanceListener and storing the offsets in your local file system. A full code example is also provided.
I have written a code to push data to kafka topic on daily basis, but there are few issue which i am not sure this code will be able to handle. my responsibility is to push complete data from a live table which holds 1 day data(refreshed every day morning)
my code will query "select * from mytable" and push it one by one to kafka topic as before pushing i need to validate/alter each row and push to topic.
below is my producer send code.
Properties configProperties = new Properties();
configProperties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, sBOOTSTRAP_SERVERS_CONFIG);
configProperties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer");
configProperties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
"org.apache.kafka.common.serialization.StringSerializer");
configProperties.put("acks", "all");
configProperties.put("retries", 0);
configProperties.put("batch.size", 15000);
configProperties.put("linger.ms", 1);
configProperties.put("buffer.memory", 30000000);
#SuppressWarnings("resource")
KafkaProducer<String, String> producer = new KafkaProducer<String, String>(configProperties);
System.out.println("Starting Kafka producer job " + new Date());
producer.send(new ProducerRecord<String, String>(eventName, jsonRec.toString()), new Callback() {
public void onCompletion(RecordMetadata metadata, Exception e) {
if (e != null) {
e.printStackTrace();
}
}
});
Now, i am not sure how to push data back again into topic in case of failure. Since i have selected all the records from table and few of it got failed and i do not know which all.
Below is what i want to address
how can process only those records which are not pushed to avoid duplicate record being push(avoid redundancy).
how to validate the records pushed are exactly same as in table. i mean the data integrity. like size of data and count of records been pushed.
You can use configProperties.put("enable.idempotence", true); - it will try to retry failed messages but make sure there will be just one of each record saved in kafka. Note that it implies that retries>0 acks=all and max.in.flight.requests.per.connection >=0. For details check https://kafka.apache.org/documentation/.
For 2nd question - if you mean that you need to save all records or none then you have to use kafka transactions, which brings a lot more questions, I would recommend reading https://www.confluent.io/blog/transactions-apache-kafka/
I have a use case where I have a single topic with 100 partitions where messages go in each partition with some logic and I have 100 consumers who reads this message. I want to map a specific partition to a specific consumer. How can I achieve that?
Checkout the Javadoc for the KafkaConsumer, specifically the section "Manual Partition Assignment".
TL/DR
You can manually assign specific partitions to a consumer as follows:
String topic = "foo";
TopicPartition partition0 = new TopicPartition(topic, 0);
TopicPartition partition1 = new TopicPartition(topic, 1);
consumer.assign(Arrays.asList(partition0, partition1));
I have a list of topics (for now it's 10) whose size can increase in future. I know we can spawn multiple threads (per topic) to consume from each topic, but in my case if the number of topics increases, then the number of threads consuming from the topics increases, which I do not want, since the topics are not going to get data too frequently, so the threads will sit ideal.
Is there any way to have a single consumer to consume from all topics? If yes, then how can we achieve it? Also how will the offset be maintained by Kafka? Please suggest answers.
We can subscribe for multiple topic using following API :
consumer.subscribe(Arrays.asList(topic1,topic2), ConsumerRebalanceListener obj)
Consumer has the topic info and we can comit using consumer.commitAsync or consumer.commitSync() by creating OffsetAndMetadata object as follows.
ConsumerRecords<String, String> records = consumer.poll(long value);
for (TopicPartition partition : records.partitions()) {
List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
for (ConsumerRecord<String, String> record : partitionRecords) {
System.out.println(record.offset() + ": " + record.value());
}
long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
}
There is no need for multiple threads, you can have one consumer, consuming from multiple topics.
Offsets are maintained by zookeeper, as kafka-server itself is stateless.
Whenever a consumer consumes a message,its offset is commited with zookeeper to keep a future track to process each message only once. So even in case of kafka failure, consumer will start consuming from the next of last commited offset.
According to the documentation on kafka javadocs if I:
Subscribe to a pattern
Create a topic that matches the pattern
A rebalance should occur, which makes the consumer read from that new topic. But that's not happening.
If I stop and start the consumer, it does pick up the new topic. So I know the new topic matches the pattern. There's a possible duplicate of this question in https://stackoverflow.com/questions/37120537/whitelist-filter-in-kafka-doesnt-pick-up-new-topics but that question got nowhere.
I'm seeing the kafka logs and there are no errors, it just doesn't trigger a rebalance. The rebalance is triggered when consumers join or die, but not when new topics are created (not even when partitions are added to existing topics, but that's another subject).
I'm using kafka 0.10.0.0, and the official Java client for the "New Consumer API", meaning broker GroupCoordinator instead of fat client + zookeeper.
This is the code for the sample consumer:
public class SampleConsumer {
public static void main(String[] args) throws IOException {
KafkaConsumer<String, String> consumer;
try (InputStream props = Resources.getResource("consumer.props").openStream()) {
Properties properties = new Properties();
properties.load(props);
properties.setProperty("group.id", "my-group");
System.out.println(properties.get("group.id"));
consumer = new KafkaConsumer<>(properties);
}
Pattern pattern = Pattern.compile("mytopic.+");
consumer.subscribe(pattern, new SampleRebalanceListener());
while (true) {
ConsumerRecords<String, String> records = consumer.poll(1000);
for (ConsumerRecord<String, String> record : records) {
System.out.printf("%s %s\n", record.topic(), record.value());
}
}
}
}
In the producer, I'm sending messages to topics named mytopic1, mytopic2, etc.
Patterns are pretty much useless if the rebalance is not triggered.
Do you know why the rebalance is not happening?
The documentation mentions "The pattern matching will be done periodically against topics existing at the time of check.". It turns out the "periodically" corresponds to the metadata.max.age.ms property. By setting that property (inside "consumer.props" in my code sample) to i.e. 5000 I can see it detects new topics and partitions every 5 seconds.
This is as designed, according to this jira ticket https://issues.apache.org/jira/browse/KAFKA-3854:
The final note on the JIRA stating that a later created topic that matches a consumer's subscription pattern would not be assigned to the consumer upon creation seems to be as designed. A repeat subscribe() to the same pattern would be needed to handle that case.
The refresh metadata polling does the "repeat subscribe()" mentioned in the ticket.
This is confusing coming from Kafka 0.8 where there was true triggering based on zookeper watches, instead of polling. IMO 0.9 is more of a downgrade for this scenario, instead of "just in time" rebalancing, this becomes either high frequency polling with overhead, or low frequency polling with long times before it reacts to new topics/partitions.
to trigger a rebalance immediately, you can explicitly make a poll call after subscribe to the topic:
kafkaConsumer.poll(pollDuration);
refer to:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-568%3A+Explicit+rebalance+triggering+on+the+Consumer
In your consumer code, use the following:
properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, EARLIEST)
and try again