I am new to the Kafka Java API and I am working on consuming records from a particular Kafka topic.
I understand that I can use the subscribe() method to start polling records from the topic. Kafka also provides the assign() method if I want to poll records from selected partitions of the topic.
Is this the only difference between the two?
Yes, subscribe() needs a group.id, because each consumer in a group is dynamically assigned partitions of the topics passed to subscribe(), and each partition is consumed by only one consumer in that group. This is achieved by balancing the partitions between all members of the consumer group so that each partition is assigned to exactly one consumer in the group.
assign() manually assigns a list of partitions to this consumer. It does not use the consumer's group management functionality, so no group.id is needed (unless you also want to commit offsets to Kafka).
The main difference is that with assign(Collection) you lose dynamic partition assignment and consumer group coordination.
It is also possible for the consumer to manually assign specific partitions (similar to the older "simple" consumer) using assign(Collection). In this case, dynamic partition assignment and consumer group coordination will be disabled.
subscribe
public void subscribe(java.util.Collection<java.lang.String> topics)
The subscribe method subscribes to the given list of topics to get dynamically assigned partitions. If the given list of topics is empty, it is treated the same as unsubscribe().
As part of group management, the consumer will keep track of the list of consumers that belong to a particular group and will trigger a rebalance operation if any of the following events occurs:
Number of partitions change for any of the subscribed list of topics
Topic is created or deleted
An existing member of the consumer group dies
A new member is added to an existing consumer group via the join API
assign
public void assign(java.util.Collection<TopicPartition> partitions)
The assign method manually assigns a list of partitions to this consumer. If the given list of topic partitions is empty, it is treated the same as unsubscribe().
Manual topic assignment through this method does not use the consumer's group management functionality. As such, there will be no rebalance operation triggered when group membership or cluster and topic metadata change.
I'd like to add some useful information specifically about a consumer without a group.id. There is no default for this property (assuming no framework shenanigans, just the Kafka client library plus Java). The name isn't official, but such a consumer is typically called a free consumer. A free consumer doesn't subscribe to topics, so it is required to assign topic partitions.
As noted above, the concepts of automatic partition assignment, rebalancing, offset persistence, partition exclusivity, consumer heartbeating and failure detection / liveness (all the things that come for free with a consumer group) are thrown out the window with free consumers. As such, it's up to the client (you) to keep track of any state the app has in relation to Kafka, and that includes keeping track of offsets (in a Map, for instance). This is because a free consumer doesn't commit its offsets to Kafka, so you usually use your own storage mechanism.
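To make the "keep track of offsets in a Map" idea concrete, here is a minimal JDK-only sketch. The class name FreeConsumerOffsets and its methods are hypothetical, not part of the Kafka client, and a real application would persist the map in its own storage rather than keep it in memory.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a Kafka class): remembers the next offset to read
// for each topic partition of a consumer that has no group.id.
final class FreeConsumerOffsets {
    // Key is "topic-partition" (for example "orders-0"); value is the next offset to read.
    private final Map<String, Long> nextOffsets = new HashMap<>();

    // Call once a record has been fully processed.
    public void markProcessed(String topic, int partition, long offset) {
        // The stored position is "last processed offset + 1" and never moves backwards.
        nextOffsets.merge(topic + "-" + partition, offset + 1, Long::max);
    }

    // Position to resume from; 0 means nothing has been processed yet.
    public long nextOffset(String topic, int partition) {
        return nextOffsets.getOrDefault(topic + "-" + partition, 0L);
    }
}
```

On startup, such an application would typically call assign() for its partitions and then seek() each one to the stored position before polling.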
I am trying to understand the concept of having multiple consumers in the same consumer group to consume from the same topic.
I have selected RangeAssignor as the partition.assignment.strategy.
If I have multiple instances deployed, then each instance/consumer should have its own client.id.
What I don't understand is that all those instances should be exactly the same. So how and when do these client.id values get assigned to each instance/consumer?
A Kafka consumer application should be configured with the group.id config, which is mandatory when using group management. This config is responsible for grouping the consumers (running across multiple instances). Kafka groups all the consumers based on this config and assigns topic partitions to each consumer based on the selected strategy. Within a group, at most one consumer gets assigned to a given partition to guarantee ordering, though a consumer can read from multiple partitions.
The client.id config is optional. If set, it allows you to easily correlate requests on the Kafka broker with the client instance that made them.
This helps with monitoring and debugging.
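As a rough illustration, the sketch below builds the consumer configuration with plain java.util.Properties. The config keys and the class names used as values are real Kafka names; the broker address, the group name "order-processors" and the instanceName parameter (for example a host or pod name read at startup) are assumptions made up for this example.

```java
import java.util.Properties;

// Hypothetical factory: every instance runs the same code, and only the
// instanceName passed in at startup differs between deployments.
final class ConsumerProps {
    static Properties forInstance(String instanceName) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.setProperty("group.id", "order-processors");        // identical on every instance
        props.setProperty("client.id", "order-processor-" + instanceName); // differs per instance
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.RangeAssignor");
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }
}
```

So the client.id is not handed out by Kafka: the application sets it itself when it builds its configuration, and if it is left out the client library generates one.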
I'm using the Quarkus Kafka consumer, and I need to know which partitions my consumer has been assigned by the Kafka broker.
Is there a listener that I can use, like the one the Kafka client provides?
Otherwise, how can I assign a specific partition on each of the nodes of my cluster?
Regards
From the Quarkus docs, I think you can use a rebalance listener.
It should be called, because the initial assignment of partitions to your client (from no partitions to some partitions) can be considered a rebalance too.
https://quarkus.io/guides/kafka#consumer-rebalance-listener
The listener is invoked every time the consumer topic/partition assignment changes. For example, when the application starts, it invokes the partitionsAssigned callback with the initial set of topics/partitions associated with the consumer. If, later, this set changes, it calls the partitionsRevoked and partitionsAssigned callbacks again, so you can implement custom logic.
I'm reading the Kafka documentation about consumers and came across the following definition of message consumption:
Our topic is divided into a set of totally ordered partitions, each of
which is consumed by exactly one consumer within each subscribing
consumer group at any given time. This means that the position of
a consumer in each partition is just a single integer, the offset
of the next message to consume.
I interpreted the wording as follows:
A consumer group reads data from a topic consisting of a number of partitions. Each consumer in the group is then assigned a subset of the partitions that does not overlap with the partitions of the other consumers in the group.
Consider the following case:
A consumer group GRP consisting of 2 consumers C1 and C2 reads data from a topic TPC consisting of 2 partitions P1 and P2.
QUESTION: If at some point C1 reads from P1 and C2 reads from P2, can the group be rebalanced so that C1 starts reading from P2 and C2 from P1? If so, under which conditions may that happen?
It does not contradict the quote above.
I see a few things to be discussed in your question and comment.
Your interpretation of the quoted paragraph is correct.
Question "If so under which condition may that happen?":
Yes, this scenario can happen. A change in the assignment of a consumer to a TopicPartition is mainly triggered by a rebalance. A consumer rebalance will be triggered in the following cases:
Consumer rebalances are initiated when
A Consumer leaves the Consumer group (either by failing to send a timely heartbeat or by explicitly requesting to leave)
A new Consumer joins the Consumer Group
A Consumer changes its Topic subscription
The Consumer Group notices a change to the Topic metadata for any subscribed Topic
(e.g. an increase in the number of Partitions)
[Source: Training Material of Confluent Kafka Developer]
Keep in mind that during a rebalance all consumers in the group are paused (the newer cooperative rebalancing protocol relaxes this).
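To see how a membership change can move a partition from one consumer to another, here is a simplified, JDK-only sketch of a range-style assignment for a single topic. It is only an approximation of what the real RangeAssignor does; the class name is hypothetical, partitions are numbered from 0, and the member ids reuse the C1/C2 names from the question.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified model of a range-style assignment for one topic (not Kafka's code).
final class RangeAssignmentSketch {
    // Members are sorted by id and each gets a contiguous block of partitions;
    // the first (partitionCount % memberCount) members get one extra partition.
    static Map<String, List<Integer>> assign(List<String> memberIds, int partitionCount) {
        List<String> sorted = new ArrayList<>(memberIds);
        Collections.sort(sorted);
        Map<String, List<Integer>> result = new TreeMap<>();
        if (sorted.isEmpty()) {
            return result;
        }
        int perMember = partitionCount / sorted.size();
        int extra = partitionCount % sorted.size();
        int next = 0;
        for (int i = 0; i < sorted.size(); i++) {
            int count = perMember + (i < extra ? 1 : 0);
            List<Integer> partitions = new ArrayList<>();
            for (int p = 0; p < count; p++) {
                partitions.add(next++);
            }
            result.put(sorted.get(i), partitions);
        }
        return result;
    }
}
```

With two partitions, members C1 and C2 get partitions 0 and 1 respectively. If C1 leaves, C2 owns both. If a new member C3 then joins, C2 ends up with partition 0 and C3 with partition 1, so C2 now reads the partition that C1 used to read.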
Your comment "C1 read some message from P1 without committing offset. Then it loses the connection to Kafka and processes the message successfully. At the same time the other consumer C3 is created and assigned to the P1 reading the same message."
I see this scenario as unrelated to a consumer rebalance, as your consumer C1 could just die after processing the data but before committing the offset back to Kafka. Now, if you restart the consumer C1, it will read the same messages again because it did not commit them.
This is called "at-least-once" delivery semantics. It differs from "at-most-once" semantics, which you can get when offsets are committed before processing has finished (for example, with enable.auto.commit enabled). I guess you are looking for the "holy grail" of distributed systems, which is "exactly-once semantics" :)
To achieve this you need to consider the entire application, from Kafka to the sink of your application. If the output of your application is not idempotent, you are likely not able to achieve exactly-once semantics (EOS). But if your output sink is, e.g., Kafka again, you actually can achieve EOS.
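The difference between the two semantics comes down to whether the offset is committed before or after the record is processed. The following JDK-only simulation (no Kafka involved, all names hypothetical) replays a partition with a crash in the middle and a restart from the last committed offset.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of commit ordering; it does not talk to Kafka.
final class DeliverySemanticsSketch {
    // First run crashes while handling the record at crashAtOffset, then the
    // consumer restarts from the committed position. Returns every offset whose
    // processing completed, in order.
    static List<Long> run(long recordCount, long crashAtOffset, boolean commitBeforeProcessing) {
        List<Long> processed = new ArrayList<>();
        long committed = 0; // next offset to read after a restart

        for (long offset = 0; offset < recordCount; offset++) {
            if (commitBeforeProcessing) {
                committed = offset + 1;               // commit first (at-most-once)
                if (offset == crashAtOffset) break;   // crash before processing
                processed.add(offset);
            } else {
                processed.add(offset);                // process first (at-least-once)
                if (offset == crashAtOffset) break;   // crash before committing
                committed = offset + 1;
            }
        }
        // Restart: resume from the last committed position.
        for (long offset = committed; offset < recordCount; offset++) {
            processed.add(offset);
        }
        return processed;
    }
}
```

With three records and a crash at offset 1, committing after processing yields offsets 0, 1, 1, 2 (record 1 is processed twice), while committing before processing yields 0, 2 (record 1 is lost).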
tl;dr: I am trying to understand how a single consumer that is assigned multiple partitions handles consuming records from each partition.
For example:
Completely processes a single partition before moving to the next.
Process a chunk of available records from each partition every time.
Process a batch of N records from first available partitions
Process a batch of N records from partitions in round-robin rotation
I found the partition.assignment.strategy configuration for the Range or RoundRobin assignors, but this only determines how consumers are assigned partitions, not how a consumer consumes from the partitions it is assigned.
I started digging into the KafkaConsumer source and
#poll() led me to #pollForFetches()
#pollForFetches() then led me to fetcher#fetchedRecords() and fetcher#sendFetches()
This just led me to try to follow the entire Fetcher class, and maybe it is just late or maybe I just didn't dig in far enough, but I am having trouble untangling exactly how a consumer will process multiple assigned partitions.
Background
Working on a data pipeline backed by Kafka Streams.
At several stages in this pipeline, as records are processed by different Kafka Streams applications, the stream is joined to compacted topics fed by external data sources that provide the data needed to augment the records before they continue to the next stage of processing.
Along the way there are several dead letter topics for records that could not be matched to the external data sources that would have augmented them. This could be because the data is just not available yet (the Event or Campaign is not live yet) or because it is bad data and will never match.
The goal is to republish records from the dead letter topic whenever new augmenting data is published, so that we can match previously unmatched records from the dead letter topic, update them, and send them downstream for additional processing.
Records have potentially failed to match on several attempts and could have multiple copies in the dead letter topic so we only want to reprocess existing records (before latest offset at the time the application starts) as well as records that were sent to the dead letter topic since the last time the application ran (after the previously saved consumer group offsets).
It works well as my consumer filters out any records arriving after the application has started, and my producer is managing my consumer group offsets by committing the offsets as part of the publishing transaction.
But I want to make sure that I will eventually consume from all partitions, as I have run into an odd edge case where unmatched records get reprocessed and land in the same partition of the dead letter topic as before, only to get filtered out by the consumer. And though the consumer is not getting new batches of records to process, there are partitions that have not been reprocessed yet either.
Any help understanding how a single consumer processes multiple assigned partitions would be greatly appreciated.
You were on the right track looking at Fetcher, as most of the logic is there.
First as the Consumer Javadoc mentions:
If a consumer is assigned multiple partitions to fetch data from, it
will try to consume from all of them at the same time, effectively
giving these partitions the same priority for consumption.
As you can imagine, in practice, there are a few things to take into account.
Each time the consumer is trying to fetch new records, it will exclude partitions for which it already has records awaiting (from a previous fetch). Partitions that already have a fetch request in-flight are also excluded.
When fetching records, the consumer specifies fetch.max.bytes and max.partition.fetch.bytes in the fetch request. These are used by the brokers to respectively determine how much data to return in total and per partition. This is equally applied to all partitions.
Using these 2 approaches, by default, the Consumer tries to consume from all partitions fairly. If that's not the case, changing fetch.max.bytes or max.partition.fetch.bytes usually helps.
If you want to prioritize some partitions over others, you need to use pause() and resume() to manually control the consumption flow.
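The exclusion rules above can be modelled as simple set arithmetic. This is a JDK-only sketch of the idea, not the Fetcher's actual code; the class and parameter names are made up, and partitions are represented as plain integers.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model of which partitions are eligible for the next fetch request.
final class FetchablePartitionsSketch {
    // Eligible = assigned partitions minus those that still have buffered records,
    // already have a fetch request in flight, or have been paused by the application.
    static Set<Integer> fetchable(Set<Integer> assigned, Set<Integer> buffered,
                                  Set<Integer> inFlight, Set<Integer> paused) {
        Set<Integer> result = new TreeSet<>(assigned);
        result.removeAll(buffered);
        result.removeAll(inFlight);
        result.removeAll(paused);
        return result;
    }
}
```

For example, with partitions 0 to 3 assigned, records from partition 0 still buffered, a fetch for partition 1 in flight and partition 3 paused, only partition 2 is included in the next fetch. Once the buffered records have been returned by poll(), partition 0 becomes eligible again, which is what keeps any single partition from starving the others.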
I am working on implementing a Kafka based solution to our application.
As per the Kafka documentation, what I understand is that one consumer in a consumer group (which is a thread) is internally mapped to one partition in the subscribed topic.
Let's say I have a topic with 40 partitions and a high-level consumer running in 4 instances. I do not want one instance to consume the same messages consumed by another instance. But if one instance goes down, the other three instances should be able to process all the messages.
Should I go for the same consumer group with 10 threads per instance?
- Stack Overflow says that the same consumer group shared between the instances acts as a traditional synchronous queue mechanism
In Apache Kafka why can't there be more consumer instances than partitions?
Or should I go for a different consumer group per instance?
Using the simple consumer or low-level consumer gives control over the partitions, but then if one instance goes down, the other three instances would not process the messages from the partitions consumed by the first instance.
First to explain the concept of Consumers & Consumer Groups,
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group.
The records will be effectively load balanced over the consumer instances in a consumer group. If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
Now to answer your questions,
1. I do not want one instance to consume the same messages consumed by another instance. But if one instance goes down, the other three instances should be able to process all the messages.
This is possible by default in Kafka architecture. You just have to label all the 4 instances with the same consumer group name.
2. Should I go for the same consumer group with 10 threads per instance?
Doing this will assign each thread one Kafka partition to consume from, which is optimal. Reducing the number of threads means each remaining consumer handles more partitions, which MAY overload some of the consumer instances.
3. In Apache Kafka why can't there be more consumer instances than partitions?
In Kafka, a partition can be assigned to only one consumer instance within a consumer group. Thus, creating more consumer instances than partitions will lead to idle consumers that do not consume any records from Kafka.
4. Should I go for a different consumer group per instance?
No. This will lead to duplication of the records, as every record will be sent to all the instances, since they are in different consumer groups.
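The arithmetic behind answers 2 and 3 can be checked with a small JDK-only sketch. It assumes the partitions are spread as evenly as possible over the consumers in one group, which is roughly what the built-in assignors aim for on a single topic; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Toy calculation of how many partitions each consumer in one group ends up with.
final class PartitionShareSketch {
    // Spreads partitionCount partitions as evenly as possible over consumerCount consumers.
    static int[] shares(int partitionCount, int consumerCount) {
        if (consumerCount <= 0) {
            throw new IllegalArgumentException("consumerCount must be positive");
        }
        int[] shares = new int[consumerCount];
        for (int p = 0; p < partitionCount; p++) {
            shares[p % consumerCount]++;
        }
        return shares;
    }

    // Consumers that receive no partition at all.
    static long idleConsumers(int partitionCount, int consumerCount) {
        return Arrays.stream(shares(partitionCount, consumerCount))
                .filter(share -> share == 0)
                .count();
    }
}
```

With 40 partitions and 40 consumer threads (4 instances with 10 threads each), every thread owns exactly one partition. If one instance goes down, 30 threads remain: 10 of them own two partitions and 20 own one. With 50 threads, 10 would sit idle.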
Hope this clarifies your doubts.
There are a few things to note when designing your Kafka ecosystem:
A consumer is essentially a thread, and you do not want multiple threads trying to change your offset mark. That's why the consumer system should be designed as one consumer per thread.
Offset commits: there is a delicate balance in how frequently you perform offset commits. If the frequency is too high, it will have an adverse effect on the performance of your system (with the old consumer, which stores offsets in ZooKeeper, ZooKeeper will be the bottleneck). If the frequency is too low, you risk duplicate messages.
In Kafka you have both ways to do competing-consumers and publish-subscribe patterns:
competing consumers: this is possible by putting consumers inside the same consumer group, so that each partition is accessible by only one consumer (of course, a consumer can read more than one partition). It means that you can't usefully have more consumers than partitions in a consumer group, because the extra consumers will be idle without any partition assigned. Of course, if one consumer in the consumer group goes down, one of the idle consumers will take over its partition.
publish-subscribe: if you have different consumer groups, all consumer groups will receive the same messages. Inside each consumer group, the above pattern applies.
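Both patterns can be seen side by side in a small JDK-only simulation. It is only a sketch: the class, group and consumer names are invented, consumer names are assumed to be unique across groups, and partition ownership inside a group is approximated by a simple modulo.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of record delivery across consumer groups; it does not talk to Kafka.
final class GroupDeliverySketch {
    // groups maps a group name to its consumer names. One record is produced per
    // partition, and each group delivers it to exactly one of its consumers
    // (the one that owns the partition in this simplified model).
    static Map<String, List<String>> deliver(Map<String, List<String>> groups, int partitionCount) {
        Map<String, List<String>> received = new LinkedHashMap<>();
        for (int partition = 0; partition < partitionCount; partition++) {
            String record = "record-from-P" + partition;
            for (List<String> consumers : groups.values()) {
                String owner = consumers.get(partition % consumers.size());
                received.computeIfAbsent(owner, name -> new ArrayList<>()).add(record);
            }
        }
        return received;
    }
}
```

With a group "billing" containing consumers b1 and b2, a group "audit" containing only a1, and two partitions, b1 and b2 each receive one record (competing consumers), while a1 receives both (publish-subscribe across groups).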