Kafka: Single consumer group in multiple instances

Kafka: Single consumer group in multiple instances - java

I am working on implementing a Kafka based solution to our application.
As per the Kafka documentation, what i understand is one consumer in a consumer group (which is a thread) is internally mapped to one partition in the subscribed topic.
Let's say i have a topic with 40 partitions and i have a high level consumer running in 4 instances. I do not want one instance to consume the same messages consumed by another instance. But if one instance goes down, the other three instances should be able to process all the messages.
Should i go for same consumer group with 10 threads per instance?
- Stackoverflow says same consumer group between the instances act as traditional synchronous queue mechanism
In Apache Kafka why can't there be more consumer instances than partitions?
Or Should i go for different consumer group per instance?
Using simple consumer or low level consumer gives control over the partition but then if one instance goes down, the other three instances would not process the messages from the partitions consumed in first instance

First to explain the concept of Consumers & Consumer Groups,
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group.
The records will be effectively load balanced over the consumer instances in a consumer group. If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
Now to answer your questions,
1. I do not want one instance to consume the same messages consumed by another instance. But if one instance goes down, the other three instances should be able to process all the messages.
This is possible by default in Kafka architecture. You just have to label all the 4 instances with the same consumer group name.
2. Should i go for same consumer group with 10 threads per instance ?
Doing this will assign each thread a kafka partition from which it will consume data, which is optimal. Reducing the number of threads will load balance the record distribution among the consumer instances and MAY overload some of the consumer instances.
3. In Apache Kafka why can't there be more consumer instances than partitions?
In Kafka, a partition can be assigned only to one consumer instance. Thus, creating more consumer instances than partitions will lead to idle consumers who will not be consuming any records from kafka.
4. Should i go for different consumer group per instance?
No. This will lead to duplication of the records, as every record will be sent to all the instances, as they are from different consumer groups.
Hope this clarifies your doubts.

There are few things to note when designing your Kafka echo system:
Consumer is essentially a thread and you do not want multiple thread trying to change your offset mark. That's why the consumer system should be designed as one consumer one thread.
Offset commits, there a delicate balance between how frequently you want to perform offset commits. If the frequency is higher then it will have an adverse effect on performance of your system (Zk will be the bottleneck). If the frequency is two low then you may risk duplicate messages.

In Kafka you have both ways to do competing-consumers and publish-subscribe patterns:
competing consumers : it's possible putting consumers inside the same consumer group. So that each partition is accessible by only one consumer (of course a consumer can read more than one partition). It means that you can't have more consumers than partitions in a consumer group, because the other consumers will be idle without being assigned any partition. Of course if one consumer in the consumer group goes down, one of the idle consumer will take the partition.
publish subscribe : if you have different consumer groups, all consumers in different consumer groups will receive same messages. Inside the consumer group then, the above pattern will be applied.

Related

2 different consumer application instances with the same group id consuming messages from the same topic

What happens If there are 2 different consumer application instances with the same group id consume from the same topic ?
Example : If the consumer application is developed by 2 different teams, running parallel with the same group id and reading from the same kafka topic.

In Kafka we have Consumer and Consumer groups.
Kafka builds on the publish-subscribe model with the advantages of a message queuing system. It achieves this with-
the use of consumer groups
message retention by brokers
Consumers that share the same group id are part of the same Consumer group. The Consumers from the same Consumer Group works as message queuing system, so if a message is being consumed by one of the consumers, it won't be available to be consumed by another consumer in the same group.
Consider the below picture-
Here we have two different Consumer groups, A and B, and as we can see the Consumers in the same group are not assigned the same Topic Partition.
The consumers in a group divides the topic partitions as fairly amongst themselves as possible by establishing that each partition is only consumed by a single consumer from the group.

Kafka message delivery semantic

I'm reading Kafka documentation about consumers and faced the following message consumption definition:
Our topic is divided into a set of totally ordered partitions, each of
which is consumed by exactly one consumer within each subscribing
consumer group at any given time. This means that the position of
a consumer in each partition is just a single integer, the offset
of the next message to consume.
I interpreted the wording as follows:
A consumer group reads data from a topic consisting of a number of partitions. Then each consumer from the group is assigned with some subset of partitions that do not overlap with other consumer's partitions from the group.
Consider the following case:
A consumer group GRP consisting of 2 consumers C1 and C2 reads data from a topic TPC consisting of 2 partitions P1 and P2.
QUESTION: If at some point C1 reads from P1 and C2 reads from P2 can it be rebalanced so that C1 starts reading from P2 and C2 from P1. If so under which condition may that happen?
It does not contradict to the quote above.

I see a few things to be discussed in your question and comment.
Your interpretation of the quoted paragraph is correct.
Question "If so under which condition may that happen?":
Yes, this scenario can happen. A change in the assignment of a consumer to a TopicPartition is mainly triggered through a rebalancing. A consumer rebalance will be triggered in the following cases:
Consumer rebalances are initiated when
A Consumer leaves the Consumer group (either by failing to send a timely heartbeat or by explicitly requesting to leave)
A new Consumer joins the Consumer Group
A Consumer changes its Topic subscription
The Consumer Group notices a change to the Topic metadata for any subscribed Topic
(e.g. an increase in the number of Partitions)
[Source: Training Material of Confluent Kafka Developer]
Keep in mind, that during a Rebalance all consumers are paused.
Your comment "C1 read some message from P1 without commiting offset. Then it loses the connection to Kafka and processes the message succesfully. At the same time the other consumer C3 is created and assigned to the P1 reading the same message."
I see this scenario unrelated to a consumer rebalance, as your consumer C1 could just die after processing the data but before committing the back to Kafka. Now, if you restart the consumer C1 it will read the same messages again because it did not yet commit them.
This is called "at-least-once" delivery semantics and is different to "at-most-once" semantics when you have e.g. auto.commit enabled. I guess you are looking for the "holy grail" in distributed systems which is "exactly-once-semantics" :)
For this to achieve you need to consider the entire application from Kafka to the sink of your application. If the output of your application is not idempotent you are likely not able to achieve exactly-once semantics (EOS). But if your output sink e.g. is Kafka again you actually can achieve EOS.

How to successfully perform poll on second Kafka consumer within the same thread?

I have two Kafka consumers, subscribed to different topics and belonging to the same consumer group, having different consumer ids, running in the same thread. They are executing poll sequentially but after the first is done second seems to be stuck in poll.
I tried associating them with different consumer groups and that seems to be working but unfortunately, that is not a viable solution for me.
I found this in "Kafka: The Definitive Guide":
You can’t have multiple consumers that belong to the same group in one thread and you can’t have multiple threads safely use the same consumer. One consumer per thread is the rule.
That quote directs me towards some form of thread cooperation due to the specific order of message processing I need to do.
Can someone provide an explanation of why is it necessary to run different consumers belonging to the same consumer group, subscribed to different topics in separate threads?
Thank you.

Explanation from Kafka users mailing list
Matthias J. Sax: "Why does it block? Well, after the first consumer calls poll(), it will
join the group and the group will have 1 member. When the second
consumer calls poll() it will try to join the group. The broker side
group coordinator knows that was already one consumer in the group and
it will wait until the first consumer say "I am still here and still
part of the group" -- however, this will never happen, because the first
consumer would do this via calling poll(), but it can't, because the
second consumer is stuck in its own poll() call to actually join the
group. Eventually the first consumer would time out and be remove from
the group and the second consumer unblocks (the group has still one
member only). Afterward the first consumer will retry to join the
group... etc. etc. This ping pong game will continue forever..."

KafkaConsumer Java API subscribe() vs assign()

I am new with Kafka Java API and I am working on consuming records from a particular Kafka topic.
I understand that I can use method subscribe() to start polling records from the topic. Kafka also provides method assign() if I want to start polling records from selected partitions of the topics.
I want to understand if this is the only difference between the two?

Yes subscribe need group.id because each consumer in a group will dynamically assigned to partitions for list of topics provided in subscribe method and each partition can be consumed by one consumer thread in that group. This is achieved by balancing the partitions between all members in the consumer group so that each partition is assigned to exactly one consumer in the group
assign will manually assign a list of partitions to this consumer. and this method does not use the consumer's group management functionality (where no need of group.id)
The main difference is assign(Collection) will loose the controller over dynamic partition assignment and consumer group coordination
It is also possible for the consumer to manually assign specific partitions (similar to the older "simple" consumer) using assign(Collection). In this case, dynamic partition assignment and consumer group coordination will be disabled.
subscribe
public void subscribe(java.util.Collection<java.lang.String> topics)
The subscribe method Subscribe to the given list of topics to get dynamically assigned partitions. and if the given list of topics is empty, it is treated the same as unsubscribe().
As part of group management, the consumer will keep track of the list of consumers that belong to a particular group and will trigger a rebalance operation if one of the following events trigger -
Number of partitions change for any of the subscribed list of topics
Topic is created or deleted
An existing member of the consumer group dies
A new member is added to an existing consumer group via the join API
assign
public void assign(java.util.Collection<TopicPartition> partitions)
The assign method manually assign a list of partitions to this consumer. And if the given list of topic partitions is empty, it is treated the same as unsubscribe().
Manual topic assignment through this method does not use the consumer's group management functionality. As such, there will be no rebalance operation triggered when group membership or cluster and topic metadata change.

I'd like to add some useful information specifically to a consumer without a group.id. There is no default to this property (given no framework shenanigans - KafkaClient lib + Java). It's not official, but they're typically called a free consumer. a free consumer doesn't subscribe to topics, so it's required to assign topic partitions.
As noted above, the concepts of automatic partition assignment, rebalancing, offset persistence, partition exclusivity, consumer heartbeating and failure detection / liveness (all the things that are gifted with a consumer group) are thrown out the window with these free consumers. As such, it's up to the client (you) to keep track of any state the app has in relation to kafka, and that includes keeping track of offsets (a Map, for instance). This is because a free consumer doesn't commit their offsets to Kafka, and usually your own storage mechanism is used.

Kafka consumer - what's the relation of consumer processes and threads with topic partitions

I have been working with Kafka lately and have bit of confusion regarding the consumers under a consumer group. The center of the confusion is whether to implement consumers as processes or threads. For this question, assume I am using the high level consumer.
Let's consider a scenario that I have experimented with. In my topic there are 2 partitions (for simplicity let's assume replication factor is just 1). I created a consumer (ConsumerConnector) process consumer1 with group group1, then created a topic count map of size 2 and then spawned 2 consumer threads consumer1_thread1 and consumer1_thread2 under that process. It looks like consumer1_thread1 is consuming partition 0 and consumer1_thread2 is consuming partition 1. Is this behaviour always deterministic? Below is the code snippet. Class TestConsumer is my consumer thread class.
...
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topic, new Integer(2));
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);
executor = Executors.newFixedThreadPool(2);
int threadNumber = 0;
for (final KafkaStream stream : streams) {
executor.submit(new TestConsumer(stream, threadNumber));
threadNumber++;
}
...
Now, let's consider another scenario (which I haven't experimented but am curious) where I start 2 consumer processes consumer1 and consumer2 both having the same group group1 and each of them is a single threaded process. Now my questions are:
How will the two independent consumer processes (under the same group nevertheless) be related to the partitions in this case ? How is it different from the above single process multi-thread scenario?
In general, how are consumer threads or processes mapped / related to partitions in the topic?
The Kafka documentation does say that each consumer under a consumer group will consume one partition. However, does that refer to a consumer thread (like my above code example) or independent consumer processes?
Is there any subtle thing I am missing here regarding implementing consumers as processes vs threads? Thanks in advance.

A consumer group can have multiple consumer instances running (multiple process with the same group-id). While consuming each partition is consumed by exactly one consumer instance in the group.
E.g. if your topic contains 2 partitions and you start a consumer group group-A with 2 consumer instances then each one of them will be consuming messages from a particular partition of the topic.
If you start the same 2 consumer with different group id group-A & group-B then the message from both partitions of the topic will be broadcast to each one of them. So in that case the consumer instance running under group-A will have messages from both the partitions of the topic, and same is true for group-B as well.
Read more on this on their documentation
EDIT : Based on your comment which says,
I was wondering what is the effective difference between having 2 consumer threads under the same process as opposed to 2 consumer processes (group being the same in both cases)
The consumer group-id is same/global across the cluster. Suppose you have started process-one with 2 threads and then spawn another process (may be in a different machine) with the same groupId having 2 more threads then kafka will add these 2 new threads to consume messages from the topic. So eventually there will be 4 threads responsible for consuming from the same topic. Kafka will then trigger a re-balance to re-assign partitions to threads, so it could happen that for a particular partition which was being consumed by thread T1 of process P1may be allocated to be consumed by thread T2 of process P2. The below few lines are taken from the wiki page
When a new process is started with the same Consumer Group name, Kafka will add that processes' threads to the set of threads available to consume the Topic and trigger a 're-balance'. During this re-balance Kafka will assign available partitions to available threads, possibly moving a partition to another process. If you have a mixture of old and new business logic, it is possible that some messages go to the old logic.

The main design decision for opting for multiple consumer group instances with the same id vs a single consumer group instance is resiliency. For example if you have a single consumer with two threads then if this machine goes down you loose all consumers. If you have two separate consumer groups with the same id, each on different hosts then they can survive failure. Ideally each consumer group should have two threads in the above, therefore if one host goes down the other consumer group uses its dormant thread to take up the other partition. Indeed it is always desirable to have more threads than partitions to cover this factor.
You can run each consumer group on different hosts. With a single consumer group for a given name/id it will only ever run on a single host as it manages all its threads in a single runtime environment.
Kafka has an algorithm to determine which threads/consumer groups reads the various topic partitions. Kafka tries to evenly distribute these in a resilient fashion. When a consumer group fails, it enables other threads in other groups to read the given partition.
Refers to a single thread in the consumer group. If there are more threads than partitions then some of them will just remain dormant until other threads fail to offer resiliancy.
The preference relates to resilience. So with multiple consumer groups setup with the same id I can run on multiple hosts making my application tolerant to failure.

Thanks for the detailed answer from #user2720864, but I think the re-allocation case #user2720864 mentioned in the answer is not correct => one partition cannot be consumed by two consumers.
When there are more consumers (compared to the partitions), each partition will be exclusively allocated to one consumer only while the leftover consumers will stay lazy only until some working consumers being dead or being removed from the group.
Based on the Kafka Consumers document:
The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
And also its API specification at "Consumer Groups and Topic Subscriptions" section:
This is achieved by balancing the partitions between all members in the consumer group so that each partition is assigned to exactly one consumer in the group.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.