I'm reading Kafka documentation about consumers and faced the following message consumption definition:
Our topic is divided into a set of totally ordered partitions, each of
which is consumed by exactly one consumer within each subscribing
consumer group at any given time. This means that the position of
a consumer in each partition is just a single integer, the offset
of the next message to consume.
I interpreted the wording as follows:
A consumer group reads data from a topic consisting of a number of partitions. Each consumer in the group is assigned a subset of the partitions that does not overlap with the subsets assigned to the other consumers in the group.
Consider the following case:
A consumer group GRP consisting of 2 consumers C1 and C2 reads data from a topic TPC consisting of 2 partitions P1 and P2.
QUESTION: If at some point C1 reads from P1 and C2 reads from P2, can the group be rebalanced so that C1 starts reading from P2 and C2 from P1? If so, under which conditions may that happen?
It would not contradict the quote above.
I see a few things to be discussed in your question and comment.
Your interpretation of the quoted paragraph is correct.
Question "If so under which condition may that happen?":
Yes, this scenario can happen. A change in the assignment of a consumer to a TopicPartition is mainly triggered through a rebalancing. A consumer rebalance will be triggered in the following cases:
Consumer rebalances are initiated when
A Consumer leaves the Consumer group (either by failing to send a timely heartbeat or by explicitly requesting to leave)
A new Consumer joins the Consumer Group
A Consumer changes its Topic subscription
The Consumer Group notices a change to the Topic metadata for any subscribed Topic
(e.g. an increase in the number of Partitions)
[Source: Training Material of Confluent Kafka Developer]
Keep in mind that during a rebalance all consumers are paused.
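If you want to observe such a reassignment happening, you can pass a ConsumerRebalanceListener to subscribe(). Here is a minimal sketch with the plain Java client; the topic TPC and group GRP are taken from your example, while the bootstrap server and everything else are just illustrative assumptions:

import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumption
props.put(ConsumerConfig.GROUP_ID_CONFIG, "GRP");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("TPC"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called when a rebalance takes partitions away from this consumer,
        // e.g. because another consumer joined or left the group
        System.out.println("Revoked: " + partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Called with the new assignment; C1 may now own the partition C2 had before
        System.out.println("Assigned: " + partitions);
    }
});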
Your comment "C1 read some message from P1 without commiting offset. Then it loses the connection to Kafka and processes the message succesfully. At the same time the other consumer C3 is created and assigned to the P1 reading the same message."
I see this scenario as unrelated to a consumer rebalance, as your consumer C1 could just die after processing the data but before committing the offsets back to Kafka. Now, if you restart the consumer C1, it will read the same messages again because it did not yet commit them.
This is called "at-least-once" delivery semantics and is different to "at-most-once" semantics when you have e.g. auto.commit enabled. I guess you are looking for the "holy grail" in distributed systems which is "exactly-once-semantics" :)
To achieve this, you need to consider the entire application from Kafka to the sink of your application. If the output of your application is not idempotent, you are likely not able to achieve exactly-once semantics (EOS). But if your output sink is, e.g., Kafka again, you actually can achieve EOS.
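For that Kafka-to-Kafka case, a rough sketch of the transactional "consume-transform-produce" loop looks like the following. This assumes a recent Java client; the topic name "output", the transactional.id and the bootstrap server are placeholders, not anything from your setup, and the consumer is assumed to be configured separately:

import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringSerializer;

Properties producerProps = new Properties();
producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // assumption
producerProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-transactional-id"); // placeholder
producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
producer.initTransactions();

// 'consumer' is assumed to be subscribed to the input topic with
// enable.auto.commit=false and isolation.level=read_committed
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    if (records.isEmpty()) {
        continue;
    }
    producer.beginTransaction();
    try {
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (ConsumerRecord<String, String> record : records) {
            // transform and write to the output topic ("output" is a placeholder)
            producer.send(new ProducerRecord<>("output", record.key(), record.value()));
            offsets.put(new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1));
        }
        // the consumed offsets are committed as part of the same transaction
        producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
        producer.commitTransaction();
    } catch (KafkaException e) {
        producer.abortTransaction();
    }
}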
Related
tl;dr; I am trying to understand how a single consumer that is assigned multiple partitions handles consuming records for each partition.
For example:
Completely process a single partition before moving to the next.
Process a chunk of available records from each partition every time.
Process a batch of N records from first available partitions
Process a batch of N records from partitions in round-robin rotation
I found the partition.assignment.strategy configuration for the Range or RoundRobin assignors, but this only determines how partitions are assigned to consumers, not how a consumer consumes from the partitions it is assigned.
I started digging into the KafkaConsumer source:
#poll() led me to #pollForFetches()
#pollForFetches() then led me to fetcher#fetchedRecords() and fetcher#sendFetches()
This just led me to try to follow along the entire Fetcher class altogether, and maybe it is just late or maybe I just didn't dig in far enough, but I am having trouble untangling exactly how a consumer will process multiple assigned partitions.
Background
Working on a data pipeline backed by Kafka Streams.
At several stages in this pipeline, as records are processed by different Kafka Streams applications, the stream is joined to compacted topics fed by external data sources; these provide the data that is used to augment the records before they continue to the next stage of processing.
Along the way there are several dead letter topics for records that could not be matched to the external data sources that would have augmented them. This could be because the data is just not available yet (the Event or Campaign is not live yet) or because it is bad data and will never match.
The goal is to republish records from the dead letter topic whenever new augmented data is published, so that we can match previously unmatched records from the dead letter topic, update them, and send them downstream for additional processing.
Records have potentially failed to match on several attempts and could have multiple copies in the dead letter topic, so we only want to reprocess existing records (before the latest offset at the time the application starts) as well as records that were sent to the dead letter topic since the last time the application ran (after the previously saved consumer group offsets).
It works well as my consumer filters out any records arriving after the application has started, and my producer is managing my consumer group offsets by committing the offsets as part of the publishing transaction.
But I want to make sure that I will eventually consume from all partitions, as I have run into an odd edge case where unmatched records get reprocessed and land in the same partition as before in the dead letter topic, only to get filtered out by the consumer. And though it is not getting new batches of records to process, there are partitions that have not been reprocessed yet either.
Any help understanding how a single consumer processes multiple assigned partitions would be greatly appreciated.
You were on the right track looking at Fetcher, as most of the logic is there.
First as the Consumer Javadoc mentions:
If a consumer is assigned multiple partitions to fetch data from, it
will try to consume from all of them at the same time, effectively
giving these partitions the same priority for consumption.
As you can imagine, in practice, there are a few things to take into account.
Each time the consumer is trying to fetch new records, it will exclude partitions for which it already has records awaiting (from a previous fetch). Partitions that already have a fetch request in-flight are also excluded.
When fetching records, the consumer specifies fetch.max.bytes and max.partition.fetch.bytes in the fetch request. These are used by the brokers to determine how much data to return in total and per partition, respectively. This is applied equally to all partitions.
Using these 2 approaches, by default, the Consumer tries to consume from all partitions fairly. If that's not the case, changing fetch.max.bytes or max.partition.fetch.bytes usually helps.
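For reference, both are ordinary consumer configs; a minimal fragment, with purely illustrative values (not recommendations), assuming the rest of the consumer Properties are set elsewhere:

// illustrative values only; tune them against your message sizes and partition count
props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 52428800);          // fetch.max.bytes, total per fetch
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1048576); // max.partition.fetch.bytes, per partition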
In case you want to prioritize some partitions over others, you need to use pause() and resume() to manually control the consumption flow, as in the sketch below.
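A minimal sketch of that flow control, reusing an already-configured consumer; the topic name and partition number are just examples:

// Temporarily stop fetching from one partition while still polling the others
TopicPartition lowPriority = new TopicPartition("my-topic", 1);    // example partition
consumer.pause(Collections.singleton(lowPriority));                // poll() stops returning its records

ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
// ... process the higher-priority partitions first ...

consumer.resume(Collections.singleton(lowPriority));               // fetching continues where it left off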
I am new to the Kafka Java API and I am working on consuming records from a particular Kafka topic.
I understand that I can use the method subscribe() to start polling records from the topic. Kafka also provides the method assign() if I want to start polling records from selected partitions of the topic.
I want to understand if this is the only difference between the two?
Yes, subscribe needs group.id because each consumer in a group is dynamically assigned partitions from the list of topics provided in the subscribe method, and each partition can be consumed by only one consumer thread in that group. This is achieved by balancing the partitions between all members in the consumer group so that each partition is assigned to exactly one consumer in the group.
assign manually assigns a list of partitions to this consumer, and this method does not use the consumer's group management functionality (so there is no need for group.id).
The main difference is that with assign(Collection) you lose dynamic partition assignment and consumer group coordination:
It is also possible for the consumer to manually assign specific partitions (similar to the older "simple" consumer) using assign(Collection). In this case, dynamic partition assignment and consumer group coordination will be disabled.
subscribe
public void subscribe(java.util.Collection<java.lang.String> topics)
The subscribe method subscribes to the given list of topics to get dynamically assigned partitions. If the given list of topics is empty, it is treated the same as unsubscribe().
As part of group management, the consumer will keep track of the list of consumers that belong to a particular group and will trigger a rebalance operation if one of the following events occurs:
Number of partitions change for any of the subscribed list of topics
Topic is created or deleted
An existing member of the consumer group dies
A new member is added to an existing consumer group via the join API
assign
public void assign(java.util.Collection<TopicPartition> partitions)
The assign method manually assigns a list of partitions to this consumer. If the given list of topic partitions is empty, it is treated the same as unsubscribe().
Manual topic assignment through this method does not use the consumer's group management functionality. As such, there will be no rebalance operation triggered when group membership or cluster and topic metadata change.
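To make the difference concrete, here is a small side-by-side sketch; the bootstrap server, topic, group name and partition numbers are placeholders:

import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties base = new Properties();
base.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
base.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
base.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

// subscribe(): dynamic partition assignment via group management, so group.id is required
Properties groupProps = new Properties();
groupProps.putAll(base);
groupProps.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");            // placeholder
KafkaConsumer<String, String> subscribed = new KafkaConsumer<>(groupProps);
subscribed.subscribe(Collections.singletonList("my-topic"));

// assign(): you pick the partitions yourself; no group coordination, no rebalances
KafkaConsumer<String, String> assigned = new KafkaConsumer<>(base);
assigned.assign(Arrays.asList(new TopicPartition("my-topic", 0),
                              new TopicPartition("my-topic", 1)));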
I'd like to add some useful information specifically about a consumer without a group.id. There is no default for this property (given no framework shenanigans, i.e. the plain Kafka client lib + Java). It's not an official term, but such consumers are typically called free consumers. A free consumer doesn't subscribe to topics, so it is required to assign topic partitions itself.
As noted above, the concepts of automatic partition assignment, rebalancing, offset persistence, partition exclusivity, consumer heartbeating and failure detection / liveness (all the things that are gifted with a consumer group) are thrown out the window with these free consumers. As such, it's up to the client (you) to keep track of any state the app has in relation to Kafka, and that includes keeping track of offsets (a Map, for instance). This is because a free consumer doesn't commit its offsets to Kafka, and usually your own storage mechanism is used.
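A minimal sketch of such a free consumer; the consumer itself (created without a group.id), the topic name, the in-memory map and the process() call are placeholders, and a real application would persist the offsets somewhere durable:

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.common.TopicPartition;

// note: no group.id in the consumer's Properties, and no subscribe()
Map<TopicPartition, Long> positions = new HashMap<>();   // offsets tracked by the app itself

TopicPartition tp = new TopicPartition("my-topic", 0);   // placeholder partition
consumer.assign(Collections.singleton(tp));
consumer.seek(tp, positions.getOrDefault(tp, 0L));       // resume from wherever we left off

ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, String> record : records) {
    process(record);                                      // placeholder for your handling logic
    positions.put(tp, record.offset() + 1);               // remember the next offset to read
}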
I am working on implementing a Kafka based solution to our application.
As per the Kafka documentation, what I understand is that one consumer in a consumer group (which is a thread) is internally mapped to one partition of the subscribed topic.
Let's say I have a topic with 40 partitions and a high-level consumer running in 4 instances. I do not want one instance to consume the same messages consumed by another instance. But if one instance goes down, the other three instances should be able to process all the messages.
Should I go for the same consumer group with 10 threads per instance?
- Stack Overflow says that using the same consumer group across instances acts as a traditional synchronous queue mechanism:
In Apache Kafka why can't there be more consumer instances than partitions?
Or should I go for a different consumer group per instance?
Using the simple consumer or low-level consumer gives control over the partitions, but then if one instance goes down, the other three instances would not process the messages from the partitions consumed by the first instance.
First, to explain the concept of Consumers & Consumer Groups:
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group.
The records will be effectively load balanced over the consumer instances in a consumer group. If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
Now to answer your questions,
1. I do not want one instance to consume the same messages consumed by another instance. But if one instance goes down, the other three instances should be able to process all the messages.
This is possible by default in Kafka architecture. You just have to label all the 4 instances with the same consumer group name.
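In config terms this is just one line per instance; a tiny fragment assuming an otherwise already-configured consumer (topic and group names are placeholders):

// every one of the 4 instances uses the same group.id, so the 40 partitions are
// split across whichever instances are alive (~10 each); if one instance dies,
// its partitions are rebalanced onto the remaining three
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-app-group");
consumer.subscribe(Collections.singletonList("my-topic"));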
2. Should I go for the same consumer group with 10 threads per instance?
Doing this will assign each thread one Kafka partition from which it will consume data, which is optimal. Reducing the number of threads means some threads will be assigned more than one partition, which MAY overload some of the consumer instances.
3. In Apache Kafka why can't there be more consumer instances than partitions?
In Kafka, a partition can be assigned to only one consumer instance. Thus, creating more consumer instances than partitions will lead to idle consumers that will not be consuming any records from Kafka.
4. Should I go for a different consumer group per instance?
No. This will lead to duplication of the records, as every record will be sent to all the instances, as they are from different consumer groups.
Hope this clarifies your doubts.
There are a few things to note when designing your Kafka ecosystem:
A consumer is essentially a thread, and you do not want multiple threads trying to change your offset mark. That's why the consumer system should be designed as one consumer per thread.
Offset commits: there is a delicate balance in how frequently you want to perform offset commits. If the frequency is too high, it will have an adverse effect on the performance of your system (ZooKeeper will be the bottleneck). If the frequency is too low, you may risk duplicate messages.
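With the modern Java consumer (which commits offsets to Kafka rather than ZooKeeper), one common middle ground is to disable auto-commit and commit once per processed batch; a rough fragment, where the consumer setup and process() are placeholders:

props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");   // take over commit responsibility

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        process(record);            // placeholder for your handling logic
    }
    if (!records.isEmpty()) {
        consumer.commitSync();      // one commit per batch instead of per record
    }
}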
In Kafka you have both ways to do competing-consumers and publish-subscribe patterns:
competing consumers: this is achieved by putting consumers inside the same consumer group, so that each partition is accessible by only one consumer (of course a consumer can read more than one partition). It means that you can't usefully have more consumers than partitions in a consumer group, because the extra consumers will be idle without being assigned any partition. Of course, if one consumer in the consumer group goes down, one of the idle consumers will take over its partitions.
publish-subscribe: if you have different consumer groups, all the consumer groups will receive the same messages. Inside each consumer group, the above pattern is then applied.
I have been working with Kafka lately and have bit of confusion regarding the consumers under a consumer group. The center of the confusion is whether to implement consumers as processes or threads. For this question, assume I am using the high level consumer.
Let's consider a scenario that I have experimented with. In my topic there are 2 partitions (for simplicity let's assume replication factor is just 1). I created a consumer (ConsumerConnector) process consumer1 with group group1, then created a topic count map of size 2 and then spawned 2 consumer threads consumer1_thread1 and consumer1_thread2 under that process. It looks like consumer1_thread1 is consuming partition 0 and consumer1_thread2 is consuming partition 1. Is this behaviour always deterministic? Below is the code snippet. Class TestConsumer is my consumer thread class.
...
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topic, new Integer(2));            // ask for 2 streams (threads) for this topic
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);
executor = Executors.newFixedThreadPool(2);
int threadNumber = 0;
for (final KafkaStream<byte[], byte[]> stream : streams) {
    executor.submit(new TestConsumer(stream, threadNumber));  // one consumer thread per stream
    threadNumber++;
}
...
Now, let's consider another scenario (which I haven't experimented but am curious) where I start 2 consumer processes consumer1 and consumer2 both having the same group group1 and each of them is a single threaded process. Now my questions are:
How will the two independent consumer processes (under the same group nevertheless) be related to the partitions in this case? How is it different from the above single-process multi-thread scenario?
In general, how are consumer threads or processes mapped / related to partitions in the topic?
The Kafka documentation does say that each consumer under a consumer group will consume one partition. However, does that refer to a consumer thread (like my above code example) or independent consumer processes?
Is there any subtle thing I am missing here regarding implementing consumers as processes vs threads? Thanks in advance.
A consumer group can have multiple consumer instances running (multiple processes with the same group id). While consuming, each partition is consumed by exactly one consumer instance in the group.
E.g. if your topic contains 2 partitions and you start a consumer group group-A with 2 consumer instances, then each of them will be consuming messages from a particular partition of the topic.
If you start the same 2 consumers with different group ids group-A & group-B, then the messages from both partitions of the topic will be broadcast to each of them. So in that case the consumer instance running under group-A will receive messages from both partitions of the topic, and the same is true for group-B as well.
Read more on this on their documentation
EDIT: Based on your comment, which says,
I was wondering what is the effective difference between having 2 consumer threads under the same process as opposed to 2 consumer processes (group being the same in both cases)
The consumer group id is the same/global across the cluster. Suppose you have started process one with 2 threads and then spawn another process (maybe on a different machine) with the same group id and 2 more threads; Kafka will add these 2 new threads to consume messages from the topic. So eventually there will be 4 threads responsible for consuming from the same topic. Kafka will then trigger a rebalance to reassign partitions to threads, so it could happen that a particular partition which was being consumed by thread T1 of process P1 is reallocated to be consumed by thread T2 of process P2. The below few lines are taken from the wiki page:
When a new process is started with the same Consumer Group name, Kafka will add that processes' threads to the set of threads available to consume the Topic and trigger a 're-balance'. During this re-balance Kafka will assign available partitions to available threads, possibly moving a partition to another process. If you have a mixture of old and new business logic, it is possible that some messages go to the old logic.
The main design decision when opting for multiple consumer instances with the same group id vs a single consumer instance is resiliency. For example, if you have a single consumer with two threads, then if that machine goes down you lose all consumers. If you have two separate consumer instances with the same group id, each on a different host, then they can survive a failure. Ideally each consumer instance should still have two threads in the above setup, so that if one host goes down the other instance's dormant thread takes over the orphaned partition. Indeed, it is always desirable to have more threads than partitions to cover this case.
You can run each consumer instance on a different host. A single consumer process only ever runs on a single host, as it manages all its threads in a single runtime environment.
Kafka has an algorithm to determine which threads/consumer instances read the various topic partitions. Kafka tries to distribute these evenly and in a resilient fashion. When a consumer instance fails, the remaining threads in the group take over reading the given partitions.
It refers to a single thread in the consumer group. If there are more threads than partitions, some of them will just remain dormant until other threads fail, in order to offer resiliency.
The preference relates to resilience. With multiple consumer instances set up with the same group id, I can run on multiple hosts, making my application tolerant to failure.
Thanks for the detailed answer from @user2720864, but I think the re-allocation case @user2720864 mentioned in the answer is not correct: one partition cannot be consumed by two consumers.
When there are more consumers than partitions, each partition will be allocated exclusively to one consumer, while the leftover consumers stay idle until some working consumer dies or is removed from the group.
Based on the Kafka Consumers document:
The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
And also its API specification, in the "Consumer Groups and Topic Subscriptions" section:
This is achieved by balancing the partitions between all members in the consumer group so that each partition is assigned to exactly one consumer in the group.
I am testing how Kafka works with multiple consumers using the high-level Java APIs.
I created 1 topic with 5 partitions, 1 producer, and 2 consumers (C1, C2). Each consumer has only one thread, and partition.assignment.strategy is set to range.
C1 starts and claims all the partitions. Then C2 starts and ZK triggers a rebalance. After that, C1 claims (0, 1, 2) and C2 claims (3, 4). It works well until now.
Then I check the messages received by C1, expecting them to come only from partitions (0, 1, 2). But in my log file, I can find messages from all the partitions, and the same happens for C2. It is as if partition.assignment.strategy were set to roundrobin. Is this how Kafka dispatches messages, or must there be some mistake?
First of all, just to correct your approach: it's usually better to have as many consumers as you have partitions for a topic. In this way each consumer will claim only one partition and will stick to it, and you will get data from exactly that partition, in order, and not from others.
Now, to answer your question of why you are getting data from almost all the partitions in both consumers: because you have fewer consumers than partitions, each consumer thread will try to access multiple partitions.
Note also that if you have more consumers than partitions for a topic, some of the consumers will never get any data, since each partition is assigned to at most one consumer in the group.