Kafka one consumer one partition - Java

I have a use case with a single topic that has 100 partitions, where messages are routed to each partition by some logic, and I have 100 consumers that read these messages. I want to map a specific partition to a specific consumer. How can I achieve that?

Check out the Javadoc for KafkaConsumer, specifically the section "Manual Partition Assignment".
TL;DR
You can manually assign specific partitions to a consumer as follows:
String topic = "foo";
TopicPartition partition0 = new TopicPartition(topic, 0);
TopicPartition partition1 = new TopicPartition(topic, 1);
consumer.assign(Arrays.asList(partition0, partition1));
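For concreteness, here is a hedged sketch of the 1:1 mapping for your case: assume each of the 100 consumer processes is started with its own index (the consumerIndex argument below is an assumption, e.g. passed on the command line) and assigns itself exactly the partition with that index. Note that with assign() there is no group rebalancing, so failover and any offset commits are your own responsibility.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SinglePartitionConsumer {
    public static void main(String[] args) {
        // hypothetical: each of the 100 instances is started with its own index 0..99
        int consumerIndex = Integer.parseInt(args[0]);

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // pin this instance to exactly one partition of the 100-partition topic
            consumer.assign(Collections.singletonList(new TopicPartition("foo", consumerIndex)));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}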


creating one kafka consumer for several topics

I want to create a single Kafka consumer for several topics. The consumer's subscribe method accepts a list of topics, like this:
private Consumer<String, byte[]> createConsumer() {
    Properties props = getConsumerProps();
    Consumer<String, byte[]> consumer = new KafkaConsumer<>(props);
    ArrayList<String> topicMISL = new ArrayList<>();
    for (String s : Connect2Redshift.kafkaTopics) {
        topicMISL.add(systemID + "." + s);
    }
    consumer.subscribe(topicMISL);
    return consumer;
}

private boolean consumeMessages(Duration duration, Consumer<String, byte[]> consumer) {
    try {
        long start = System.currentTimeMillis();
        ConsumerRecords<String, byte[]> consumerRecords = consumer.poll(duration);
        // ... process consumerRecords ...
        return !consumerRecords.isEmpty();
    } catch (Exception e) {
        return false;
    }
}
Afterwards I want to poll records from Kafka into a stream every 3 seconds and process them, but I wonder what happens inside this consumer: how will records from different topics be polled - first one topic, then another, or in parallel? Could it be that one topic with a large amount of messages is processed all the time while another topic with a small amount of messages has to wait?
In general it depends on your topic settings. Kafka scales by using multiple partitions per topic.
If you have 3 partitions on 1 topic, Kafka can read from them in parallel.
The same is true for multiple topics: reading can happen in parallel.
If you have a partition that receives a lot more messages than the others, you may run into consumer lag for that particular partition. Tweaking the batch size and consumer settings may help here (see the sketch below), as may compressing messages.
Ideally, distributing the load evenly avoids this scenario.
Have a look at this blog article, it gave me a good understanding of the internals: https://www.confluent.io/blog/configure-kafka-to-minimize-latency/
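To make the "tweaking the batch size and consumer settings" remark concrete, here is a minimal sketch of the consumer properties that typically control how much data a single poll() returns; the values are illustrative assumptions, not recommendations, and getConsumerProps() is the helper from the question:

Properties props = getConsumerProps();
// upper bound on the number of records returned by a single poll()
props.put("max.poll.records", "500");
// the broker waits until at least this many bytes are available, or until fetch.max.wait.ms elapses
props.put("fetch.min.bytes", "1024");
props.put("fetch.max.wait.ms", "500");
// per-partition cap, so one busy partition cannot crowd out the others in a single fetch
props.put("max.partition.fetch.bytes", "1048576");
Consumer<String, byte[]> consumer = new KafkaConsumer<>(props);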
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (TopicPartition partition : records.partitions()) {
    List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
    for (ConsumerRecord<String, String> record : partitionRecords) {
        // process each record of this partition here
    }
    // commit the offset of the last processed record in this partition
    long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
    consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
}
You also need to commit the offsets yourself; the consumer.commitSync call above does that per partition, committing the offset just after the last processed record.

Is there a way to get the last message from Kafka topic?

I have a Kafka topic with multiple partitions, and I wonder if there is a way in Java to fetch the last message for the topic. I don't care about the partitions, I just want to get the latest message.
I have tried @KafkaListener, but it fetches a message only when the topic is updated. If nothing is published after the application starts, nothing is returned.
Maybe the listener is not the right approach to the problem at all?
The following snippet worked for me; you may try it. Explanation is in the comments.
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
consumer.subscribe(Collections.singletonList(topic));
consumer.poll(Duration.ofSeconds(10));
consumer.assignment().forEach(System.out::println);

AtomicLong maxTimestamp = new AtomicLong();
AtomicReference<ConsumerRecord<String, String>> latestRecord = new AtomicReference<>();

// get the last offsets for each partition
consumer.endOffsets(consumer.assignment()).forEach((topicPartition, offset) -> {
    System.out.println("offset: " + offset);

    // seek to the last offset of each partition
    consumer.seek(topicPartition, (offset == 0) ? offset : offset - 1);

    // poll to get the last record in each partition
    consumer.poll(Duration.ofSeconds(10)).forEach(record -> {
        // the latest record in the 'topic' is the one with the highest timestamp
        if (record.timestamp() > maxTimestamp.get()) {
            maxTimestamp.set(record.timestamp());
            latestRecord.set(record);
        }
    });
});
System.out.println(latestRecord.get());
You'll have to consume the latest message from each partition and then do a comparison on the client side (using the timestamp on the message, if it contains it). The reason for this is that Kafka does not guarantee inter-partition ordering. Inside a partition, you can be sure that the message with the largest offset is the latest message pushed to it.

Kafka consumer for multiple topics

I have a list of topics (for now it's 10) whose size can increase in the future. I know we can spawn multiple threads (one per topic) to consume from each topic, but in my case, if the number of topics increases, the number of threads consuming from them increases too, which I do not want, since the topics are not going to get data too frequently, so the threads would sit idle.
Is there any way to have a single consumer consume from all topics? If yes, how can we achieve it? Also, how will the offsets be maintained by Kafka? Please suggest.
We can subscribe to multiple topics using the following API:
consumer.subscribe(Arrays.asList(topic1,topic2), ConsumerRebalanceListener obj)
The consumer has the topic info, and we can commit offsets using consumer.commitAsync() or consumer.commitSync() by creating an OffsetAndMetadata object, as follows.
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (TopicPartition partition : records.partitions()) {
    List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
    for (ConsumerRecord<String, String> record : partitionRecords) {
        System.out.println(record.offset() + ": " + record.value());
    }
    long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
    consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
}
There is no need for multiple threads; you can have one consumer consuming from multiple topics.
Offsets were maintained in ZooKeeper by the old high-level consumer; with the new consumer API the brokers store them in the internal __consumer_offsets topic.
Whenever a consumer commits the offset of a consumed message, that offset is recorded so each message is processed only once going forward. So even in case of a failure, the consumer will resume from just after the last committed offset.

Kafka pattern subscription. Rebalancing is not being triggered on new topic

According to the documentation on kafka javadocs if I:
Subscribe to a pattern
Create a topic that matches the pattern
A rebalance should occur, which makes the consumer read from that new topic. But that's not happening.
If I stop and start the consumer, it does pick up the new topic. So I know the new topic matches the pattern. There's a possible duplicate of this question in https://stackoverflow.com/questions/37120537/whitelist-filter-in-kafka-doesnt-pick-up-new-topics but that question got nowhere.
I'm seeing the kafka logs and there are no errors, it just doesn't trigger a rebalance. The rebalance is triggered when consumers join or die, but not when new topics are created (not even when partitions are added to existing topics, but that's another subject).
I'm using kafka 0.10.0.0, and the official Java client for the "New Consumer API", meaning broker GroupCoordinator instead of fat client + zookeeper.
This is the code for the sample consumer:
public class SampleConsumer {
    public static void main(String[] args) throws IOException {
        KafkaConsumer<String, String> consumer;
        try (InputStream props = Resources.getResource("consumer.props").openStream()) {
            Properties properties = new Properties();
            properties.load(props);
            properties.setProperty("group.id", "my-group");
            System.out.println(properties.get("group.id"));
            consumer = new KafkaConsumer<>(properties);
        }
        Pattern pattern = Pattern.compile("mytopic.+");
        consumer.subscribe(pattern, new SampleRebalanceListener());
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("%s %s\n", record.topic(), record.value());
            }
        }
    }
}
In the producer, I'm sending messages to topics named mytopic1, mytopic2, etc.
Patterns are pretty much useless if the rebalance is not triggered.
Do you know why the rebalance is not happening?
The documentation mentions "The pattern matching will be done periodically against topics existing at the time of check." It turns out that "periodically" corresponds to the metadata.max.age.ms property. By setting that property (inside "consumer.props" in my code sample) to e.g. 5000, I can see that it detects new topics and partitions every 5 seconds.
This is as designed, according to this jira ticket https://issues.apache.org/jira/browse/KAFKA-3854:
The final note on the JIRA stating that a later created topic that matches a consumer's subscription pattern would not be assigned to the consumer upon creation seems to be as designed. A repeat subscribe() to the same pattern would be needed to handle that case.
The refresh metadata polling does the "repeat subscribe()" mentioned in the ticket.
This is confusing coming from Kafka 0.8, where there was true triggering based on ZooKeeper watches instead of polling. IMO 0.9 is more of a downgrade for this scenario: instead of "just in time" rebalancing, this becomes either high-frequency polling with overhead, or low-frequency polling with long delays before it reacts to new topics/partitions.
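A minimal sketch of that workaround, assuming the rest of the configuration is loaded from "consumer.props" as in the question: lowering metadata.max.age.ms makes the periodic metadata refresh, and therefore the pattern re-evaluation, happen more often, at the cost of more frequent metadata requests.

Properties properties = new Properties();
// ... bootstrap.servers, deserializers, group.id as before ...
// refresh metadata every 5 seconds, so topics newly matching the pattern are
// picked up within roughly that interval (the default is 5 minutes)
properties.setProperty("metadata.max.age.ms", "5000");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
consumer.subscribe(Pattern.compile("mytopic.+"), new SampleRebalanceListener());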
To trigger a rebalance immediately, you can explicitly make a poll call after subscribing to the topic:
kafkaConsumer.poll(pollDuration);
refer to:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-568%3A+Explicit+rebalance+triggering+on+the+Consumer
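For completeness: KIP-568 shipped in kafka-clients 2.5 as KafkaConsumer.enforceRebalance(), so on a client that new you can request a rebalance explicitly rather than waiting for the next metadata refresh. A hedged sketch (the newTopicExpected condition is a hypothetical, application-specific trigger, and the new topic is only picked up once the consumer's metadata knows about it):

// requires kafka-clients 2.5+ (KIP-568)
consumer.subscribe(Pattern.compile("mytopic.+"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
    // ... process records ...
    if (newTopicExpected) {          // hypothetical application-specific condition
        consumer.enforceRebalance(); // takes effect on the next poll()
    }
}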
In your consumer code, use the following:
properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
and try again

Specify number of partitions on Kafka producer

I have the following code from the web page https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+Producer+Example
What seems to be missing is how to configure the number of partitions. I want to specify 4 partitions but always end up with the default of 2 partitions. How do I change the code to have 4 partitions (without changing the default)?
Properties props = new Properties();
props.put("metadata.broker.list", "localhost:9092,broker2:9092");
props.put("serializer.class", "kafka.serializer.StringEncoder");
props.put("partitioner.class", "com.gnip.kafka.SimplePartitioner");
props.put("request.required.acks", "1");
props.put("num.partitions", 4);
ProducerConfig config = new ProducerConfig(props);
Producer<String, String> producer = new Producer<String, String>(config);

Random rnd = new Random();
for (long nEvents = 0; nEvents < 1000; nEvents++) {
    long runtime = new Date().getTime();
    String ip = "192.168.2." + rnd.nextInt(255);
    String msg = runtime + ",www.example.com," + ip;
    KeyedMessage<String, String> data = new KeyedMessage<String, String>("page_visits2", ip, msg);
    producer.send(data);
}
producer.close();
The Kafka producer API does not allow you to create a topic with a custom number of partitions. If you try to produce data to a topic which does not exist, the broker will first create the topic (provided the auto.create.topics.enable property in the broker config is set to true) and then start accepting data for it, but the number of partitions created for this topic will be based on the num.partitions parameter defined in the broker configuration files (by default it is set to one).
Increasing the partition count for an existing topic can be done, but it will not move any existing data into those partitions.
To create a topic with a different number of partitions you need to create the topic first, which can be done with the console script shipped with the Kafka distribution. The following command will create a topic with 2 partitions (as specified by the --partition flag):
bin/kafka-create-topic.sh --zookeeper localhost:2181 --replica 3 --partition 2 --topic my-custom-topic
Unfortunately, as far as I understand, there is currently no direct way to achieve this from the producer.
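The console script above is from the 0.8 era; with a modern kafka-clients dependency the same thing can also be done programmatically through the AdminClient. A minimal sketch, assuming a local broker and that the topic does not exist yet:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // topic name, number of partitions, replication factor
            NewTopic topic = new NewTopic("page_visits2", 4, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}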
The number of partitions is a broker property and will not have any effect on the producer, see here.
As the producer example page shows, you can use a custom partitioner to route messages as you prefer, but new partitions will not be created if they are not defined in the broker properties.
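If the goal is only to control which of the existing partitions a record lands in, the newer Java producer (org.apache.kafka.clients.producer, not the 0.8 Scala client used in the question) lets you plug in your own Partitioner. A rough sketch that hashes on the key, registered via props.put("partitioner.class", "com.example.IpPartitioner") (the class name is hypothetical):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class IpPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // route by key hash; fall back to partition 0 when there is no key
        return key == null ? 0 : (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void close() { }
}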
