Effective strategy to avoid duplicate messages in apache kafka consumer

Effective strategy to avoid duplicate messages in apache kafka consumer - java

I have been studying apache kafka for a month now. I am however, stuck at a point now. My use case is, I have two or more consumer processes running on different machines. I ran a few tests in which I published 10,000 messages in kafka server. Then while processing these messages I killed one of the consumer processes and restarted it. Consumers were writing processed messages in a file. So after consumption finished, file was showing more than 10k messages. So some messages were duplicated.
In consumer process I have disabled auto commit. Consumers manually commit offsets batch wise. So for e.g if 100 messages are written to file, consumer commits offsets. When single consumer process is running and it crashes and recovers duplication is avoided in this manner. But when more than one consumers are running and one of them crashes and recovers, it writes duplicate messages to file.
Is there any effective strategy to avoid these duplicate messages?

The short answer is, no.
What you're looking for is exactly-once processing. While it may often seem feasible, it should never be relied upon because there are always caveats.
Even in order to attempt to prevent duplicates you would need to use the simple consumer. How this approach works is for each consumer, when a message is consumed from some partition, write the partition and offset of the consumed message to disk. When the consumer restarts after a failure, read the last consumed offset for each partition from disk.
But even with this pattern the consumer can't guarantee it won't reprocess a message after a failure. What if the consumer consumes a message and then fails before the offset is flushed to disk? If you write to disk before you process the message, what if you write the offset and then fail before actually processing the message? This same problem would exist even if you were to commit offsets to ZooKeeper after every message.
There are some cases, though, where
exactly-once processing is more attainable, but only for certain use cases. This simply requires that your offset be stored in the same location as unit application's output. For instance, if you write a consumer that counts messages, by storing the last counted offset with each count you can guarantee that the offset is stored at the same time as the consumer's state. Of course, in order to guarantee exactly-once processing this would require that you consume exactly one message and update the state exactly once for each message, and that's completely impractical for most Kafka consumer applications. By its nature Kafka consumes messages in batches for performance reasons.
Usually your time will be more well spent and your application will be much more reliable if you simply design it to be idempotent.

This is what Kafka FAQ has to say on the subject of exactly-once:
How do I get exactly-once messaging from Kafka?
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded
Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon

I agree with RaGe's deduplicate on the consumer side. And we use Redis to deduplicate Kafka message.
Assume the Message class has a member called 'uniqId', which is filled by the producer side and is guaranteed to be unique. We use a 12 length random string. (regexp is '^[A-Za-z0-9]{12}$')
The consumer side use Redis's SETNX to deduplicate and EXPIRE to purge expired keys automatically. Sample code:
Message msg = ... // eg. ConsumerIterator.next().message().fromJson();
Jedis jedis = ... // eg. JedisPool.getResource();
String key = "SPOUT:" + msg.uniqId; // prefix name at will
String val = Long.toString(System.currentTimeMillis());
long rsps = jedis.setnx(key, val);
if (rsps <= 0) {
log.warn("kafka dup: {}", msg.toJson()); // and other logic
} else {
jedis.expire(key, 7200); // 2 hours is ok for production environment;
}
The above code did detect duplicate messages several times when Kafka(version 0.8.x) had situations. With our input/output balance audit log, no message lost or dup happened.

There's a relatively new 'Transactional API' now in Kafka that can allow you to achieve exactly once processing when processing a stream. With the transactional API, idempotency can be built in, as long as the remainder of your system is designed for idempotency. See https://www.baeldung.com/kafka-exactly-once

Whatever done on producer side, still the best way we believe to deliver exactly once from kafka is to handle it on consumer side:
Produce msg with a uuid as the Kafka message Key into topic T1
consumer side read the msg from T1, write it on hbase with uuid as rowkey
read back from hbase with the same rowkey and write to another topic T2
have your end consumers actually consume from topic T2

Related

Preserve messages during unprecedented JVM crash in a high throughput system

I am building a high volume system that will be processing up to a hundred million messages everyday. I have a microservice that is reading from a Kafka topic and doing some basic processing on them before forwarding them to the next microservice.
Kafka Topic -> Normalizer Microservice -> Ordering Microservice
Below is what the processing would look like:
Normalizer would be concurrently picking up messages from the Kafka topic.
Normalizer would read the messages from the topic and post them to an in-memory seda queue from where the message would be subsequently picked up, normalized and validated.
This normalization, validation and processing is expected to take around 1 second per message. Within this one second, the message will be stored to the database and will become persistent in the system.
My concern is that during this processing, if a message has been already read from the topic and posted to the seda queue and has either
not yet been picked up from the seda queue or,
has been picked up from the seda queue and is currently processing and has not yet been persisted to the database
and the Normalizer JVM crashes or is force-killed (kill -9), how do I ensure that I do NOT lose the message?
It is critical that I do NOT drop/lose any messages and even in case of a crash/failure, I should be able to retain the message such that I can trigger re-processing of that message if required.
One naïve approach that comes to mind is to push the message to a cache (which will be a very fast operation).
Read from topic -> Push to cache -> Push to seda queue
Needless to say, the problem still exists, it just makes it less probable that I will lose the message. Also, this is certainly not the smartest solution out there.
Please share your thoughts on how I can design this system such that I can preserve messages on my side once the messages have been read off of the Kafka topic even in the event of the Normalizer JVM crashing.

How to solve the queue multi-consuming concurrency problem?

Our program is using Queue.
Multiple consumers are processing messages.
Consumers do the following:
Receive on or off status message from the Queue.
Get the latest status from the repository.
Compare the state of the repository and the state received from the message.
If the on/off status is different, update the data. (At this time, other related data are also updated.)
Assuming that this process is handled by multiple consumers, the following problems are expected.
Producer sends messages 1: on, 2: off, and 3: on.
Consumer A receives message #1 and stores message #1 in the storage because there is no latest data.
Consumer A receives message #2.
At this time, consumer B receives message #3 at the same time.
Consumers A and B read the latest data from the storage at the same time (message 1).
Consumer B finishes processing first. Don't update the repository as the on/off state is unchanged.(1: on, 3: on)
Then consumer A finishes the processing. The on/off state has changed, so it processes and saves the work. (1: on, 2: off)
In normal case, the latest data remaining in the DB should be on.
(This is because the message was sent in the order of on -> off -> on.)
However, according to the above scenario, off remains the latest data.
Is there any good way to solve this problem?
For reference, the queue we use is using AWS Amazon MQ and the storage is using AWS dynamoDB. And using Spring Boot.

The fundamental problem here is that you need to consume these "status" messages in order, but you're using concurrent consumers which leads to race-conditions and out-of-order message processing. In short, your basic architecture using concurrent consumers is causing this problem.
You could possibly work up some kind of solution in the database with timestamps as suggested in the comments, but that would be extra work for the clients and extra data stored in the database that isn't strictly necessary.
The simplest way to solve the problem is to just consume the messages serially rather than concurrently. There are a handful of different ways to do this, e.g.:
Define just 1 consumer for the queue with the "status" messages.
Use ActiveMQ's "exclusive consumer" feature to ensure that only one consumer receives messages.
Use message groups to group all the "status" messages together to ensure they are processed serially (i.e. in order).

Is RabbitMQ or Kafka message queue a 1:1 messaging system?

As mentioned in the answer,
A message queue is a one-way pipe: one process writes to the queue, and another reads the data in the order
SysV message queue is one example
So, my understanding is,
one message queue is used by two processes, where one process(producer) insert an item in the queue and another process(consumer) consumes the item from the queue
1) Is RabbitMQ or Kafka message queue a 1:1 messaging system? used by only two processes, where one process writes and other process reads......
2) after the consumer consume the item, does the item get deleted? If no, why do we need queue data structure? Why not just shared memory?

Kafka is not strictly 1:1 messaging system. Multiple producers can write into a topic and multiple consumers can read from it. Moreover, in Kafka, multiple consumers can be assigned same or different consumer groups. Every message is consumed by only one consumer from every consumer group (load balancing) and all consumer groups receive a copy of every message (of course, if they are subscribed to corresponding topics and no messages are lost). A good description of this process can be found in this article: Scalability of Kafka Messaging using Consumer Groups.
In Kafka all messages are persisted on the disk and stored until the compaction reaps it, or the retention.ms passes, or the log size is exceeded. That's a very high-level point of view and there are a lot of nuances here. Like: the messages are stored in segments, every segment contains multiple messages. When the retention period passes for a message, it is not removed from the segment at that moment, instead Kafka waits until all messages in that segment are expired and delete the whole segment at once. Also, retention could come before the log exceeds the maximum size or vice versa: the log can exceed the size even before the retention period passes. And so on. Just read the docs and pay attention to topics about "log cleaner" and "retention".
After the Kafka consumer reads the message it is neither compacted, nor expired. So, it's not removed from the log and stays there. It also means that every message could be re-read by a consumer if needed (until it is deleted completely). It can be useful if some of your consumers went offline for some reason and were not able to process the messages as they come in. It also allows interesting features like transaction replays and so on. Persistence is one of the Kafka's features.
Shared memory? Well, strictly speaking shared memory is only allowed inside a single process. So you can't generally use "shared memory" when you need to access it from different processes. And there is absolutely no way to have "shared memory" when you app runs on multiple hosts. However, there are in-memory brokers. Like Redis can be used as a message broker, and it's all in-memory. However, if such a broker restarts for some reason you lose everything. Speaking about Redis: it has two persistence configurations specifically to handle the restarts.
I am not sure about RabbitMQ, but it probably deletes messages after the consumer acknowledged them by default. So it's closer to 1:1 mental model. However, RabbitMQ employs disk persistence as well.

Kafka Streams does not increment offset by 1 when producing to topic

I have implemented a simple Kafka Dead letter record processor.
It works perfectly when using records produced from the Console producer.
However I find that our Kafka Streams applications do not guarantee that producing records to the sink topics that the offsets will be incremented by 1 for each record produced.
Dead Letter Processor Background:
I have a scenario where records may be received before all data required to process it is published.
When records are not matched for processing by the streams app they are move to a Dead letter topic instead of continue to flow down stream. When new data is published we dump the latest messages from the Dead letter topic back in to the stream application's source topic for reprocessing with the new data.
The Dead Letter processor:
At the start of the run application records the ending offsets of each partition
The ending offsets marks the point to stop processing records for a given Dead Letter topic to avoid infinite loop if reprocessed records return to Dead Letter topic.
Application resumes from the last Offsets produced by the previous run via consumer groups.
Application is using transactions and KafkaProducer#sendOffsetsToTransaction to commit the last produced offsets.
To track when all records in my range are processed for a topic's partition my service compares its last produced offset from the producer to the the consumers saved map of ending offsets. When we reach the ending offset the consumer pauses that partition via KafkaConsumer#pause and when all partitions are paused (meaning they reached the saved Ending offset)then calls it exits.
The Kafka Consumer API States:
Offsets and Consumer Position
Kafka maintains a numerical offset for each record in a partition. This offset acts as a unique identifier of a record within that partition, and also denotes the position of the consumer in the partition. For example, a consumer which is at position 5 has consumed records with offsets 0 through 4 and will next receive the record with offset 5.
The Kafka Producer API references the next offset is always +1 as well.
Sends a list of specified offsets to the consumer group coordinator, and also marks those offsets as part of the current transaction. These offsets will be considered committed only if the transaction is committed successfully. The committed offset should be the next message your application will consume, i.e. lastProcessedMessageOffset + 1.
But you can clearly see in my debugger that the records consumed for a single partition are anything but incremented 1 at a time...
I thought maybe this was a Kafka configuration issue such as max.message.bytes but none really made sense.
Then I thought perhaps it is from joining but didn't see any way that would change the way the producer would function.
Not sure if it is relevant or not but all of our Kafka applications are using Avro and Schema Registry...
Should the offsets always increment by 1 regardless of method of producing or is it possible that using Kafka streams API does not offer the same guarantees as the normal Producer Consumer clients?
Is there just something entirely that I am missing?

It is not an official API contract that message offsets are increased by one, even if the JavaDocs indicate this (it seems that the JavaDocs should be updated).
If you don't use transactions, you get either at-least-once semantics or no guarantees (some call this at-most-once semantics). For at-least-once, records might be written twice and thus, offsets for two consecutive messages are not really increased by one as the duplicate write "consumes" two offsets.
If you use transactions, each commit (or abort) of a transaction writes a commit (or abort) marker into the topic -- those transactional markers also "consume" one offset (this is what you observe).
Thus, in general you should not rely on consecutive offsets. The only guarantee you get is, that each offset is unique within a partition.

I know that knowing offset of messages can be useful. However, Kafka will only guarantee that the offset of a message-X would be greater than the last message(X-1)'s offset. BTW an ideal solution should not be based on offset calculations.
Under the hood, kafka producer may try to resend messages. Also, if a broker goes down then re-balancing may occur. Exactly-once-semantics may append an additional message. Therefore, offset of your message may change if any of above events occur.
Kafka may add additional messages for internal purpose to the topic. But Kafka's consumer API might be discarding those internal messages. Therefore, you can only see your messages and your message's offsets might not necessarily increment by 1.

Do newer versions of Kafka producers still have "producer.type"?

Older versions' doc says it's one of the essential properties.
Newer versions' doc doesn't mention it at all.
Do newer versions of Kafka producers still have producer.type?
Or, new producers are always async, and I should call future.get() to make it sync?

New producers are always async, and you should call future.get() to make it sync. It's not worth making two apis methods when something as simple as adding future.get() gives you basically the same functionality.
From the documentation for send() here
https://kafka.apache.org/0110/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html
Since the send call is asynchronous it returns a Future for the
RecordMetadata that will be assigned to this record. Invoking get() on
this future will block until the associated request completes and then
return the metadata for the record or throw any exception that
occurred while sending the record.
If you want to simulate a simple blocking call you can call the get()
method immediately:
byte[] key = "key".getBytes();
byte[] value = "value".getBytes();
ProducerRecord<byte[],byte[]> record = new ProducerRecord<byte[],byte[]>("my-topic", key, value);
producer.send(record).get();

Why do you want to make the send() to sync ?
This is a kafka feature to batch message for better throughput.
Asynchronous send
Batching is one of the big drivers of efficiency, and to enable batching the Kafka producer will attempt to accumulate data in memory and to send out larger batches in a single request. The batching can be configured to accumulate no more than a fixed number of messages and to wait no longer than some fixed latency bound (say 64k or 10 ms). This allows the accumulation of more bytes to send, and few larger I/O operations on the servers. This buffering is configurable and gives a mechanism to trade off a small amount of additional latency for better throughput.
There is no way to do a send sync because of the api only support the async method, But there is a some configs you can specify to do some work arround.
You could set the batch.size to 0. In this case, the message bacthing is disabled.
However I think you should just leave the batch.size default and set the linger.ms to 0 (this is also default). In this case, if many message come in the same time, they will be batched in one send immediately .
The producer groups together any records that arrive in between request transmissions into a single batched request. Normally this occurs only under load when records arrive faster than they can be sent out.
And if you want to make sure the message is sent and persisted successfully, you coould set the acks to -1 or 1 and retries to 3 (e.g.)
More info about the producer config, you can refer https://kafka.apache.org/documentation/#producerconfigs

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.