Consume multiple topics at the same time from a single consumer - Java

There are multiple topics, and these topics have no key/value relation. The first payload, from topic1 (which has no key), looks like this:
{ "bookList": [{"bookId": "1"}, {"bookId": "2"}], "magazineList": [{"magazineId": "1"}, {"magazineId": "2"}] }
The second payload, from topic2 (keyed by a random integer), is:
{ "libraryId": "1", "cityId": "1" }
Let's assume these payloads are consumed at the same time t and both topics are read by the same consumer group. What I am trying to do is consume topic1 and topic2 at the same time (maybe using streams) and aggregate/process these payloads. What should I do?
Some sources say the topics should have the same key to process different topics together, but I am new to Kafka and couldn't find my answer. Is the approach below correct?
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
consumer.subscribe(Collections.singletonList("topic1"));
consumer.subscribe(Collections.singletonList("topic2")); // note: subscribe() is not incremental, this call replaces the previous subscription
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        System.out.println(record);
    }
}

consume these topic1 and topic2 at the same time
Use a regex pattern:
consumer.subscribe(Pattern.compile("topic[12]"));
aggregate/process these payloads. What should I do?
Don't use the consumer API for this. Use Kafka Streams.
Some source say the topics should have the same key
To join the data, yes; not merely to consume both topics in parallel.
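For the aggregation side, a minimal Kafka Streams sketch, assuming both values are plain JSON strings and that String serdes are acceptable (topic names, application id, and broker address are placeholders), could look like the following. merge() just interleaves the two streams; an actual join would require a common key, which is what the sources you read were referring to:
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class MergeTopicsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "merge-example");     // placeholder application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> books = builder.stream("topic1");     // bookList/magazineList payloads
        KStream<String, String> libraries = builder.stream("topic2"); // libraryId/cityId payloads

        // merge() interleaves records from both topics into one stream for downstream processing
        books.merge(libraries)
             .foreach((key, value) -> System.out.println(key + " -> " + value));

        new KafkaStreams(builder.build(), props).start();
    }
}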

Related

creating one kafka consumer for several topics

I want to create a single Kafka consumer for several topics. The consumer's subscribe method accepts a list of topics, like this:
private Consumer<String, byte[]> createConsumer() {
    Properties props = getConsumerProps();
    Consumer<String, byte[]> consumer = new KafkaConsumer<>(props);
    ArrayList<String> topicMISL = new ArrayList<>();
    for (String s : Connect2Redshift.kafkaTopics) {
        topicMISL.add(systemID + "." + s);
    }
    consumer.subscribe(topicMISL);
    return consumer;
}
private boolean consumeMessages(Duration duration, Consumer<String, byte[]> consumer) {
    try {
        long start = System.currentTimeMillis();
        ConsumerRecords<String, byte[]> consumerRecords = consumer.poll(duration);
        // ... process consumerRecords ... (body truncated in the original snippet)
        return !consumerRecords.isEmpty();
    } catch (Exception e) {
        return false;
    }
}
Afterwards I want to poll records from Kafka into a stream every 3 seconds and process them, but I wonder what happens inside this consumer: how will records from different topics be polled? First one topic, then another, or in parallel? Could it be that one topic with a large amount of messages is processed all the time while another topic with a small amount of messages has to wait?
In general it depends on your topic settings. Kafka scales by using multiple partitions per topic.
If you have 3 partitions on 1 topic, Kafka can read from them in parallel.
The same is true for multiple topics: reading can happen in parallel.
If you have a partition that receives a lot more messages than the others, you may run into consumer lag for that particular partition. Tweaking the batch size and consumer settings may help, as can compressing messages; see the sketch after this answer.
Ideally, distributing the load evenly avoids this scenario.
Look at this blog article, it gave me a good understanding of the internals: https://www.confluent.io/blog/configure-kafka-to-minimize-latency/
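For illustration only, a few of the consumer settings alluded to above (the property names are standard KafkaConsumer configs, but the values are arbitrary examples, not recommendations):
Properties props = getConsumerProps(); // the question's own helper, assumed to set bootstrap servers etc.
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500");              // cap on records returned per poll()
props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1024");              // broker waits for at least 1 KB per fetch
props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, "52428800");          // 50 MB upper bound per fetch response
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "1048576"); // 1 MB per partition per fetch
// compression is configured on the producer side, e.g. compression.type=lz4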
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (TopicPartition partition : records.partitions()) {
    List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
    for (ConsumerRecord<String, String> record : partitionRecords) {
        // process each record here
    }
}
You also need to commit the offsets: find the last processed offset per partition and commit it with consumer.commitSync().

Why is the "topics" argument of KafkaUtils.createStream() a Map rather than an array?

Definition in docs:
org.apache.spark.streaming.kafka
Class KafkaUtils
static JavaPairReceiverInputDStream<String,String> createStream(JavaStreamingContext jssc, String zkQuorum, String groupId, java.util.Map<String,Integer> topics)
Create an input stream that pulls messages from Kafka Brokers.
Why is topics a Map (rather than a string array)?
I understand that the string key is the topic name. But what about the integer value? What should I fill in?
Read the Javadoc:
public static JavaPairReceiverInputDStream createStream(JavaStreamingContext jssc,
String zkQuorum,
String groupId,
java.util.Map topics)
Create an input stream that pulls messages from Kafka Brokers. Storage level of the data will be the default StorageLevel.MEMORY_AND_DISK_SER_2.
Parameters:
jssc - JavaStreamingContext object
zkQuorum - Zookeeper quorum (hostname:port,hostname:port,..)
groupId - The group id for this consumer
topics - Map of (topic_name -> numPartitions) to consume. Each partition is consumed in its own thread
Returns:
DStream of (Kafka message key, Kafka message value)
The value of the Map is the number of partitions of the given topic name, which determines the number of threads that will be used to consume the topic.
If you see the documentation of the createStream method of KafkaUtils here, you'd see
topics - Map of (topic_name -> numPartitions) to consume. Each partition is consumed in its own thread
The Integer value is the number of partitions (and hence consumer threads) for the topic named by the map key.
From Javadoc: https://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/streaming/kafka/KafkaUtils.html
topics - Map of (topic_name -> numPartitions) to consume. Each partition is consumed in its own thread
So each number is the number of partitions you want to consume for that topic, each consumed in its own thread. A usage sketch follows.
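For illustration, a call with placeholder values (ZooKeeper address, group id, topic name, and batch interval are all assumptions) might look like this, using the createStream signature quoted above:
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class CreateStreamExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("createStream-example").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(3));

        Map<String, Integer> topics = new HashMap<>();
        topics.put("mytopic", 2); // consume "mytopic" with 2 threads, one per partition

        JavaPairReceiverInputDStream<String, String> stream =
                KafkaUtils.createStream(jssc, "zkhost:2181", "my-group", topics);
        stream.print(); // DStream of (Kafka message key, Kafka message value)

        jssc.start();
        jssc.awaitTermination();
    }
}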

Kafka one consumer one partition

I have a use case with a single topic of 100 partitions, where messages are routed to a partition by some logic, and 100 consumers that read these messages. I want to map a specific partition to a specific consumer. How can I achieve that?
Check out the Javadoc for KafkaConsumer, specifically the section "Manual Partition Assignment".
TL;DR
You can manually assign specific partitions to a consumer as follows:
String topic = "foo";
TopicPartition partition0 = new TopicPartition(topic, 0);
TopicPartition partition1 = new TopicPartition(topic, 1);
consumer.assign(Arrays.asList(partition0, partition1));
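For the 100-partitions / 100-consumers case, a rough sketch, assuming each consumer process is started with its own index as a (hypothetical) command-line argument:
int consumerIndex = Integer.parseInt(args[0]);            // hypothetical argument: 0..99, one per consumer process
TopicPartition myPartition = new TopicPartition("foo", consumerIndex);
consumer.assign(Collections.singletonList(myPartition));  // this consumer reads exactly one partition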

Kafka consumer for multiple topics

I have a list of topics (for now it's 10) whose size can increase in the future. I know we can spawn multiple threads (one per topic) to consume from each topic, but in my case, if the number of topics increases, the number of threads consuming from the topics increases as well, which I do not want, since the topics are not going to get data too frequently, so the threads would sit idle.
Is there any way to have a single consumer consume from all topics? If yes, how can we achieve it? Also, how will the offsets be maintained by Kafka? Please suggest answers.
We can subscribe to multiple topics using the following API:
consumer.subscribe(Arrays.asList(topic1, topic2), consumerRebalanceListener);
The consumer keeps track of the topic information, and we can commit using consumer.commitAsync() or consumer.commitSync() by creating an OffsetAndMetadata object as follows.
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (TopicPartition partition : records.partitions()) {
    List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
    for (ConsumerRecord<String, String> record : partitionRecords) {
        System.out.println(record.offset() + ": " + record.value());
    }
    long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
    consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
}
There is no need for multiple threads; you can have one consumer consuming from multiple topics.
Offsets are maintained in ZooKeeper by the old consumer; the new consumer commits them to Kafka's internal __consumer_offsets topic.
Whenever a consumer consumes a message, its offset is committed so that each message is tracked and processed only once. So even in case of a failure, the consumer will resume from the last committed offset.

Kafka pattern subscription. Rebalancing is not being triggered on new topic

According to the documentation in the Kafka Javadocs, if I:
Subscribe to a pattern
Create a topic that matches the pattern
A rebalance should occur, which makes the consumer read from that new topic. But that's not happening.
If I stop and start the consumer, it does pick up the new topic. So I know the new topic matches the pattern. There's a possible duplicate of this question in https://stackoverflow.com/questions/37120537/whitelist-filter-in-kafka-doesnt-pick-up-new-topics but that question got nowhere.
I'm seeing the kafka logs and there are no errors, it just doesn't trigger a rebalance. The rebalance is triggered when consumers join or die, but not when new topics are created (not even when partitions are added to existing topics, but that's another subject).
I'm using kafka 0.10.0.0, and the official Java client for the "New Consumer API", meaning broker GroupCoordinator instead of fat client + zookeeper.
This is the code for the sample consumer:
public class SampleConsumer {
    public static void main(String[] args) throws IOException {
        KafkaConsumer<String, String> consumer;
        try (InputStream props = Resources.getResource("consumer.props").openStream()) {
            Properties properties = new Properties();
            properties.load(props);
            properties.setProperty("group.id", "my-group");
            System.out.println(properties.get("group.id"));
            consumer = new KafkaConsumer<>(properties);
        }
        Pattern pattern = Pattern.compile("mytopic.+");
        consumer.subscribe(pattern, new SampleRebalanceListener());
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("%s %s\n", record.topic(), record.value());
            }
        }
    }
}
In the producer, I'm sending messages to topics named mytopic1, mytopic2, etc.
Patterns are pretty much useless if the rebalance is not triggered.
Do you know why the rebalance is not happening?
The documentation mentions "The pattern matching will be done periodically against topics existing at the time of check." It turns out the "periodically" corresponds to the metadata.max.age.ms property. By setting that property (inside "consumer.props" in my code sample) to e.g. 5000, I can see that it detects new topics and partitions every 5 seconds.
This is as designed, according to this jira ticket https://issues.apache.org/jira/browse/KAFKA-3854:
The final note on the JIRA stating that a later created topic that matches a consumer's subscription pattern would not be assigned to the consumer upon creation seems to be as designed. A repeat subscribe() to the same pattern would be needed to handle that case.
The refresh metadata polling does the "repeat subscribe()" mentioned in the ticket.
This is confusing coming from Kafka 0.8, where there was true triggering based on ZooKeeper watches instead of polling. IMO 0.9 is more of a downgrade for this scenario: instead of "just in time" rebalancing, you get either high-frequency polling with overhead, or low-frequency polling with long delays before it reacts to new topics/partitions. A config sketch follows.
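For reference, a minimal sketch of lowering the metadata refresh interval on the consumer (5000 ms is just the example value from above, not a recommendation):
Properties properties = new Properties();
// ... bootstrap.servers, group.id, key/value deserializers ...
properties.setProperty(ConsumerConfig.METADATA_MAX_AGE_CONFIG, "5000"); // re-check topic metadata (and the subscription pattern) every 5 s
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);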
To trigger a rebalance immediately, you can explicitly make a poll call after subscribing to the topic:
kafkaConsumer.poll(pollDuration);
refer to:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-568%3A+Explicit+rebalance+triggering+on+the+Consumer
In your consumer code, use the following:
properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
and try again
