I insert a record with a Kafka producer and then call the consumer, which returns the inserted element (and previously inserted elements). Then I call the consumer again (without inserting a new record with the producer), and the consumer does not return any records.
As far as I know, the record should remain in the topic. I have no idea how to set acknowledge to false in the properties. Is this issue related to acknowledgment?
The consumer will only get "new" messages. If you have previously read until the end of the topic and there is nothing new, you won't get anything.
If you want to read from the beginning of the topic again, you have to "rewind" the consumer (or create a new one).
"I have no idea how to set acknowledge to false in properties (is this issue related to acknowledgment?)"
If you use consumer groups, the offsets for the consumer (how far it has read in each topic partition) are stored within Kafka (or ZooKeeper for older versions). You can control this by acknowledging (or not acknowledging) the receipt of messages. However, this only has an effect when the consumer is restarted, not for an already running instance.
If you don't use consumer groups, this offset tracking is purely done within the consumer instance itself.
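The behaviour described in this answer can be illustrated without a broker. The following is only a toy, in-memory sketch: ToyLog and ToyConsumer are hypothetical names, not Kafka classes. It shows why a second poll returns nothing and what "rewinding" means. In the real Java client, the rewind corresponds to KafkaConsumer#seekToBeginning or KafkaConsumer#seek.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one topic partition: an append-only list of records.
class ToyLog {
    private final List<String> records = new ArrayList<>();

    void append(String record) {
        records.add(record);
    }

    // Returns a copy of every record stored at or after the given offset.
    List<String> readFrom(int offset) {
        return new ArrayList<>(records.subList(offset, records.size()));
    }
}

// Toy consumer: remembers its own position, as a running consumer instance does.
class ToyConsumer {
    private final ToyLog log;
    private int position = 0;

    ToyConsumer(ToyLog log) {
        this.log = log;
    }

    // Returns only records not seen yet and advances the position past them.
    List<String> poll() {
        List<String> batch = log.readFrom(position);
        position += batch.size();
        return batch;
    }

    // "Rewind": the next poll() starts from the first record again.
    void seekToBeginning() {
        position = 0;
    }
}
```

The records never leave the log. Only the consumer's position moves, which is why the second poll comes back empty until something new is appended or the position is reset.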
I'm using the Quarkus Kafka consumer, and I need to know which partitions the Kafka broker has assigned to my consumer.
Is there a listener I can use, like the one the Kafka client provides?
Otherwise, how can I assign a specific partition on each of the nodes of my cluster?
From the Quarkus docs, I think you can use a rebalance listener.
It should be called, because the initial assignment of partitions to your client (from no partitions to some partitions) can be considered as rebalance too.
The listener is invoked every time the consumer topic/partition assignment changes. For example, when the application starts, it invokes the partitionsAssigned callback with the initial set of topics/partitions associated with the consumer. If, later, this set changes, it calls the partitionsRevoked and partitionsAssigned callbacks again, so you can implement custom logic.
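As a rough illustration of the "custom logic" mentioned above, a listener can simply keep a running picture of the partitions it currently owns. The sketch below is a plain-Java stand-in: AssignmentTracker and its method names are hypothetical and only mirror the callback names used in the quoted text. It is not the Quarkus interface. In the plain Kafka client, the corresponding interface is ConsumerRebalanceListener with onPartitionsAssigned and onPartitionsRevoked.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a rebalance listener; partitions are plain ints here.
class AssignmentTracker {
    private final Set<Integer> current = new TreeSet<>();

    // Called with partitions newly given to this consumer (including the initial set).
    synchronized void partitionsAssigned(Collection<Integer> partitions) {
        current.addAll(partitions);
    }

    // Called with partitions taken away from this consumer.
    synchronized void partitionsRevoked(Collection<Integer> partitions) {
        current.removeAll(partitions);
    }

    // Snapshot of the partitions this consumer currently owns.
    synchronized Set<Integer> currentAssignment() {
        return new TreeSet<>(current);
    }
}
```

Because the initial assignment also arrives through partitionsAssigned, the tracker is populated as soon as the application starts.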
I am new to the Kafka Java API and I am working on consuming records from a particular Kafka topic.
I understand that I can use method subscribe() to start polling records from the topic. Kafka also provides method assign() if I want to start polling records from selected partitions of the topics.
I want to understand if this is the only difference between the two?
Yes. subscribe() needs a group.id because each consumer in a group is dynamically assigned partitions for the list of topics passed to subscribe(), and each partition can be consumed by only one consumer thread in that group. This is achieved by balancing the partitions between all members of the consumer group so that each partition is assigned to exactly one consumer in the group.
assign() manually assigns a list of partitions to this consumer. This method does not use the consumer's group management functionality, so no group.id is needed.
The main difference is that with assign(Collection) you lose dynamic partition assignment and consumer group coordination.
It is also possible for the consumer to manually assign specific partitions (similar to the older "simple" consumer) using assign(Collection). In this case, dynamic partition assignment and consumer group coordination will be disabled.
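To make "each partition is assigned to exactly one consumer in the group" concrete, here is a minimal sketch of a round-robin split. RoundRobinSketch is a hypothetical name and this is not Kafka's actual assignor, which is configurable and more involved. It only illustrates the invariant that group management maintains and that assign() leaves to you.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RoundRobinSketch {
    // Spreads partitions 0..partitionCount-1 over the members so that every
    // partition has exactly one owner within the group.
    static Map<String, List<Integer>> assign(int partitionCount, List<String> members) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String member : members) {
            result.put(member, new ArrayList<>());
        }
        if (members.isEmpty()) {
            return result; // nobody to assign to
        }
        for (int p = 0; p < partitionCount; p++) {
            String owner = members.get(p % members.size());
            result.get(owner).add(p);
        }
        return result;
    }
}
```

With five partitions and members c1 and c2, c1 gets partitions 0, 2 and 4 and c2 gets 1 and 3. Calling the method again with only c1 models a rebalance after c2 leaves: c1 then owns all five.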
public void subscribe(java.util.Collection<java.lang.String> topics)
The subscribe method subscribes to the given list of topics to get dynamically assigned partitions. If the given list of topics is empty, it is treated the same as unsubscribe().
As part of group management, the consumer will keep track of the list of consumers that belong to a particular group and will trigger a rebalance operation if one of the following events occurs:
Number of partitions change for any of the subscribed list of topics
Topic is created or deleted
An existing member of the consumer group dies
A new member is added to an existing consumer group via the join API
public void assign(java.util.Collection<TopicPartition> partitions)
The assign method manually assigns a list of partitions to this consumer. If the given list of topic partitions is empty, it is treated the same as unsubscribe().
Manual topic assignment through this method does not use the consumer's group management functionality. As such, there will be no rebalance operation triggered when group membership or cluster and topic metadata change.
I'd like to add some useful information specifically about a consumer without a group.id. There is no default for this property (assuming no framework shenanigans, just the Kafka client library and Java). It's not an official term, but such a consumer is typically called a free consumer. A free consumer doesn't subscribe to topics, so it has to be assigned topic partitions.
As noted above, the concepts of automatic partition assignment, rebalancing, offset persistence, partition exclusivity, consumer heartbeating and failure detection / liveness (all the things that come with a consumer group) are thrown out the window with these free consumers. As such, it's up to the client (you) to keep track of any state the app has in relation to Kafka, and that includes keeping track of offsets (in a Map, for instance). This is because a free consumer doesn't commit its offsets to Kafka, and usually your own storage mechanism is used.
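The Map mentioned above could look roughly like the sketch below. OffsetBook is a hypothetical helper, and it keeps everything in memory. A real application would have to persist the map somewhere durable and, after a restart, position the consumer itself (for example with KafkaConsumer#seek).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: keeps the next offset to read per "topic-partition" key.
class OffsetBook {
    private final Map<String, Long> nextOffsets = new HashMap<>();

    private static String key(String topic, int partition) {
        return topic + "-" + partition;
    }

    // Records that the message at `offset` has been fully processed.
    // The stored value never moves backwards.
    void markProcessed(String topic, int partition, long offset) {
        nextOffsets.merge(key(topic, partition), offset + 1, Long::max);
    }

    // Offset to resume from after a restart; 0 if the partition was never read.
    long resumeFrom(String topic, int partition) {
        return nextOffsets.getOrDefault(key(topic, partition), 0L);
    }
}
```

Storing "last processed offset + 1" mirrors the usual convention that the saved position is the next record to read.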
We're using ActiveMQ (5.14.5).
We have a single producer, and multiple consumers on the same queue.
From time to time we set JMSXGroupID to group several messages together to be consumed on a single consumer. This works as expected.
In parallel, the producer continues to send non-grouped messages (i.e. without JMSXGroupID)
The problem:
We noticed that once a consumer was selected to process a specific group, it no longer gets the non-grouped messages. Even if it is completely idle. The non-grouped messages are always sent to the other consumers.
The rogue consumer returns to consume non-grouped messages only after we close the group that was assigned to it (by setting JMSXGroupSeq=-1).
Is this normal behavior? We expected that non-grouped messages would continue to be delivered in the usual round-robin fashion to all consumers.
We were unable to find a clear reference to this in ActiveMQ documentation.
There's a bit of a no-win situation for the message broker here. If there are active message groups in play, the broker has to assume that further messages will be produced that fall into those groups. So a message consumer that has become bound to a particular group needs to remain available to consume later messages of that group, rather than ungrouped messages. After all, an ungrouped message can be handled elsewhere, while a grouped message can't.
However, we also want to have a fair-ish distribution of messages between consumers. So it makes sense that a consumer that is bound to a group, or groups, could take some work when it is idle.
But how do we know it is idle? What happens if a consumer takes a bunch of ungrouped messages (and don't forget the default pre-fetch behaviour), and then new messages arrive that match its specific group?
The fact that closing a group restores the "group consumer" to default behaviour suggests to me that this is not a bug, but a deliberate attempt to make a reasonable compromise in a tricky situation. It seems reasonable to me to ask for a feature to be added, where "group consumers" can take part in ungrouped workload, but I would be inclined to see that as an enhancement.
Just my $0.02, of course.
I have a REST service, let's call it MDD, which has one Kafka consumer. When I first start the REST service, another service tells MDD's consumer to subscribe to a specific topic, and everything seems to go fine.
Then the service tells MDD's consumer to subscribe to another topic. The way I am doing it right now is via the consumer.assign() method. Basically, if a new topic is introduced that the consumer is not yet assigned to, I assign this new topic to the consumer. So the one consumer is now assigned to two different topics.
This consumer polls the messages and deposits them into HDFS.
Now what I have noticed, is when the subscription for the 2nd topic comes in, sometimes I get the error about failing to append to file in HDFS and when I looked at the logs, it was trying to append some data that should not have been appended till later on.
For example, data to kafka comes in this order A, B, C. When MDD is done appending A to HDFS, it tries to append C (rather than B) and simultaneously tries to append B as well. Also another note, no data is coming from the first topic at this point, only data from second topic is streaming in. So currently, only one kafka topic has data streaming in at any given time.
Does anyone have any idea what could be going on? Could threading issues arise when I assign ONE consumer to multiple topics? Everything seems to go fine when the consumer is assigned to ONE topic, but as soon as it's assigned to more than one topic, I get a failed-to-append-to-file error in HDFS because some other writer already owns the lease. This error does not happen frequently, just at random.
Also would a recommended fix be every time a new topic is created, create a new kafka consumer?
It's definitely valid and doable to have a single consumer read messages from multiple topics. The problem you ran into arises because Kafka does not currently support combining manual partition assignment (with KafkaConsumer#assign) and group management (with KafkaConsumer#subscribe) on the same consumer.
To support subscribing to newly created topics, you could try calling KafkaConsumer#subscribe with a regular expression that matches all the newly created topics.
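Before relying on a pattern subscription, it can help to check which topic names a given regular expression would actually pick up. The sketch below uses only java.util.regex. TopicPatternSketch and the "mdd." naming scheme are made up for illustration, and it assumes the pattern has to match the whole topic name.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class TopicPatternSketch {
    // Returns the topic names that the pattern matches in full.
    static List<String> matching(Pattern pattern, List<String> topicNames) {
        return topicNames.stream()
                .filter(name -> pattern.matcher(name).matches())
                .collect(Collectors.toList());
    }
}
```

For example, Pattern.compile("mdd\\..*") would select "mdd.clicks" and "mdd.orders" but not "audit". The same Pattern object is what you would hand to the subscribe overload that takes a java.util.regex.Pattern.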
I have been studying Apache Kafka for a month now. However, I am stuck at one point. My use case is this: I have two or more consumer processes running on different machines. I ran a few tests in which I published 10,000 messages to the Kafka server. Then, while these messages were being processed, I killed one of the consumer processes and restarted it. The consumers were writing processed messages to a file. After consumption finished, the file showed more than 10,000 messages, so some messages were duplicated.
In the consumer process I have disabled auto commit. Consumers commit offsets manually in batches; for example, once 100 messages have been written to the file, the consumer commits the offsets. When a single consumer process is running and it crashes and recovers, duplication is avoided in this manner. But when more than one consumer is running and one of them crashes and recovers, it writes duplicate messages to the file.
Is there any effective strategy to avoid these duplicate messages?
The short answer is, no.
What you're looking for is exactly-once processing. While it may often seem feasible, it should never be relied upon because there are always caveats.
Even in order to attempt to prevent duplicates you would need to use the simple consumer. How this approach works is for each consumer, when a message is consumed from some partition, write the partition and offset of the consumed message to disk. When the consumer restarts after a failure, read the last consumed offset for each partition from disk.
But even with this pattern the consumer can't guarantee it won't reprocess a message after a failure. What if the consumer consumes a message and then fails before the offset is flushed to disk? If you write to disk before you process the message, what if you write the offset and then fail before actually processing the message? This same problem would exist even if you were to commit offsets to ZooKeeper after every message.
There are some cases, though, where exactly-once processing is more attainable, but only for certain use cases. This simply requires that your offset be stored in the same location as the application's output. For instance, if you write a consumer that counts messages, by storing the last counted offset with each count you can guarantee that the offset is stored at the same time as the consumer's state. Of course, in order to guarantee exactly-once processing this would require that you consume exactly one message and update the state exactly once for each message, and that's completely impractical for most Kafka consumer applications. By its nature Kafka consumes messages in batches for performance reasons.
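The counting example can be sketched as follows for a single partition. OffsetAwareCounter is a hypothetical name, and the state lives in memory here. The point carries over to real storage only if the count and the offset are written in one atomic operation (one row, one file rename, one transaction).

```java
// Hypothetical state holder: the count and the last applied offset form one unit.
class OffsetAwareCounter {
    private long count = 0;
    private long lastAppliedOffset = -1;

    // Applies a message at most once; redelivered offsets are ignored.
    // Returns true if the message was counted.
    synchronized boolean apply(long offset) {
        if (offset <= lastAppliedOffset) {
            return false; // already reflected in the count
        }
        count++;
        lastAppliedOffset = offset;
        return true;
    }

    synchronized long count() {
        return count;
    }

    synchronized long lastAppliedOffset() {
        return lastAppliedOffset;
    }
}
```

If the consumer crashes and re-reads offsets it has already counted, apply() returns false for them and the count stays correct. This is the same property an idempotent design aims for.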
Usually your time will be better spent and your application will be much more reliable if you simply design it to be idempotent.
This is what the Kafka FAQ has to say on the subject of exactly-once:
How do I get exactly-once messaging from Kafka?
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded
Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
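The last alternative in the quoted passage, storing the offset with the loaded data and deduplicating on topic/partition/offset, might look like this in miniature. CoordinateKeyedSink is a hypothetical in-memory stand-in for whatever store you load into. In a database the same effect would typically come from a unique key on those three columns.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sink that stores each record under its topic/partition/offset coordinate.
class CoordinateKeyedSink {
    private final Map<String, String> rows = new LinkedHashMap<>();

    // Returns true if the record was stored, false if that coordinate was already loaded.
    boolean load(String topic, int partition, long offset, String value) {
        String key = topic + "/" + partition + "/" + offset;
        return rows.putIfAbsent(key, value) == null;
    }

    // Number of distinct records stored so far.
    int size() {
        return rows.size();
    }
}
```

Re-delivering a batch after a crash then just produces a run of false results instead of duplicate rows.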
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon
I agree with RaGe's suggestion to deduplicate on the consumer side. We use Redis to deduplicate Kafka messages.
Assume the Message class has a member called 'uniqId', which is filled in by the producer and is guaranteed to be unique. We use a 12-character random string (the regexp is '^[A-Za-z0-9]{12}$').
The consumer side uses Redis's SETNX to deduplicate and EXPIRE to purge expired keys automatically. Sample code:
import redis.clients.jedis.Jedis;

Message msg = ... // e.g. ConsumerIterator.next().message().fromJson();
Jedis jedis = ... // e.g. JedisPool.getResource();
String key = "SPOUT:" + msg.uniqId; // prefix name at will
String val = Long.toString(System.currentTimeMillis());
long rsps = jedis.setnx(key, val); // 1 if the key was newly set, 0 if it already existed
if (rsps <= 0) {
    // the key already exists, so this uniqId has been seen before
    log.warn("kafka dup: {}", msg.toJson()); // and other logic
} else {
    jedis.expire(key, 7200); // 2 hours is OK for a production environment
}
The above code did detect duplicate messages several times when Kafka (version 0.8.x) ran into problems. According to our input/output balance audit log, no messages were lost or duplicated.
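For readers who want to try the idea without a Redis instance, the following is an in-memory analogue of the SETNX-plus-EXPIRE logic. ExpiringDeduplicator is a hypothetical class, not part of Jedis or Redis, and the clock is passed in explicitly so the expiry can be exercised deterministically.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory analogue of SETNX + EXPIRE; hypothetical, not a Redis client.
class ExpiringDeduplicator {
    private final ConcurrentMap<String, Long> expiresAt = new ConcurrentHashMap<>();
    private final long ttlMillis;

    ExpiringDeduplicator(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Returns true the first time an id is seen within the TTL window,
    // false if it is a duplicate.
    boolean firstSeen(String uniqId, long nowMillis) {
        // Drop the entry if its TTL has passed, so the id can be accepted again.
        expiresAt.computeIfPresent(uniqId,
                (id, deadline) -> deadline <= nowMillis ? null : deadline);
        // Like SETNX: only the first caller gets to create the entry.
        return expiresAt.putIfAbsent(uniqId, nowMillis + ttlMillis) == null;
    }
}
```

One caveat for the Redis version: SETNX and EXPIRE are two separate commands, so a crash between them leaves a key that never expires. Redis's SET command accepts NX and EX options together, which sets the value and the expiry in a single step.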
There's a relatively new 'Transactional API' now in Kafka that can allow you to achieve exactly once processing when processing a stream. With the transactional API, idempotency can be built in, as long as the remainder of your system is designed for idempotency. See https://www.baeldung.com/kafka-exactly-once
Whatever is done on the producer side, we believe the best way to get exactly-once delivery from Kafka is still to handle it on the consumer side:
Produce the message with a UUID as the Kafka message key into topic T1
On the consumer side, read the message from T1 and write it to HBase with the UUID as the row key
Read it back from HBase with the same row key and write it to another topic T2
Have your end consumers actually consume from topic T2