Can we lose messages in Kafka Streams if we add new partitions? - java

Say, for example, I have 4 partitions.
A message msg1 with key 101 is put into partition 1 (out of 4) and is not consumed yet. Meanwhile a new partition is added, making a total of 5 partitions.
Then the next message msg2 with key 101 goes to, say, partition 4, because hash(101) % no_of_partitions = 4.
Now, in the Streams API, whenever a message is consumed by its key, partition 4 will be accessed for that key, because that is the partition hash(101) % no_of_partitions points to, and therefore it gets msg2 of key 101 from partition 4.
Now, what about msg1 of key 101 in partition 1? Is it consumed at all?

You won't lose data. However, depending on your application, adding partitions may not be supported and may break your application.
You can add partitions only if your application is stateless. If your application is stateful, it will most likely break and die with an exception.
Also note that Kafka Streams assumes that input data is partitioned by key. Thus, if the partitioning is changed, even if the application does not break, it will most likely compute an incorrect result, because adding a partition violates this partitioning assumption.
One way to approach this issue is to reset your application. However, this implies that you lose your current application state. Note that resetting will not address the partitioning problem, though, and your application might still compute incorrect results. To guard against the partitioning problem, you can insert a dummy map() operation that only forwards the data right after you read it from a topic: this results in a repartitioning step if required and thus restores key-based partitioning.
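To make that concrete, here is a minimal sketch of such a dummy map(); the topic name, types, and the downstream count() are illustrative assumptions, not taken from the question:

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();

// "input-topic" is a placeholder; default serdes are assumed to be configured
KStream<String, String> input = builder.stream("input-topic");

input
    // no-op map(): forwards every record unchanged, but flags the key as
    // "possibly changed", so Streams inserts a repartition topic before any
    // downstream stateful operation and restores key-based partitioning
    .map((key, value) -> KeyValue.pair(key, value))
    .groupByKey()
    .count();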

The msg1 of key 101 in partition 1 will be consumed.
In Kafka Streams, you do not "consume a message by its key". Every message in every partition will be consumed. If you need to filter on the key, that happens in the code of the Kafka Streams app.

It will be consumed, but order is not guaranteed. Be sure that your application logic is idempotent. One possible solution is to go through an intermediate topic with more partitions. KStream#through will help you produce to and consume from that topic in a single instruction: it writes the stream to the given topic and returns a new KStream that reads from it. In pseudo code:
.stream(...)
    // potential key transformation
    .through("inner_topic_with_more_partitions")
    .toTable(accountMaterializer);

Related

How does Kafka Consumer Consume from Multiple assigned Partition

tl;dr: I am trying to understand how a single consumer that is assigned multiple partitions handles consuming records for each partition.
For example:
Completely process a single partition before moving to the next.
Process a chunk of available records from each partition every time.
Process a batch of N records from the first available partitions.
Process a batch of N records from partitions in round-robin rotation.
I found the partition.assignment.strategy configuration for the Range and RoundRobin assignors, but this only determines how partitions are assigned to consumers, not how a consumer consumes from the partitions it is assigned.
I started digging into the KafkaConsumer source:
#poll() led me to #pollForFetches()
#pollForFetches() then led me to fetcher#fetchedRecords() and fetcher#sendFetches()
This just led me to try to follow the entire Fetcher class altogether, and maybe it is just late or maybe I just didn't dig far enough, but I am having trouble untangling exactly how a consumer processes multiple assigned partitions.
Background
Working on a data pipeline backed by Kafka Streams.
At several stages in this pipeline, as records are processed by different Kafka Streams applications, the stream is joined to compacted topics fed by external data sources; these provide the required data that augments the records before they continue to the next stage of processing.
Along the way there are several dead letter topics for records that could not be matched to the external data sources that would have augmented them. This could be because the data is just not available yet (the Event or Campaign is not live yet), or because it is bad data and will never match.
The goal is to republish records from the dead letter topic whenever new augmenting data is published, so that we can match previously unmatched records from the dead letter topic, update them, and send them downstream for additional processing.
Records have potentially failed to match on several attempts and could have multiple copies in the dead letter topic so we only want to reprocess existing records (before latest offset at the time the application starts) as well as records that were sent to the dead letter topic since the last time the application ran (after the previously saved consumer group offsets).
It works well as my consumer filters out any records arriving after the application has started, and my producer is managing my consumer group offsets by committing the offsets as part of the publishing transaction.
But I want to make sure that I will eventually consume from all partitions, as I have run into an odd edge case where unmatched records get reprocessed and land in the same partition as before in the dead letter topic, only to get filtered out by the consumer. And even though the application is not getting new batches of records to process, there are partitions that have not been reprocessed yet either.
Any help understanding how a single consumer processes multiple assigned partitions would be greatly appreciated.
You were on the right track looking at Fetcher, as most of the logic is there.
First as the Consumer Javadoc mentions:
If a consumer is assigned multiple partitions to fetch data from, it will try to consume from all of them at the same time, effectively giving these partitions the same priority for consumption.
As you can imagine, in practice, there are a few things to take into account.
Each time the consumer is trying to fetch new records, it will exclude partitions for which it already has records awaiting (from a previous fetch). Partitions that already have a fetch request in-flight are also excluded.
When fetching records, the consumer specifies fetch.max.bytes and max.partition.fetch.bytes in the fetch request. These are used by the brokers to respectively determine how much data to return in total and per partition. This is equally applied to all partitions.
Using these two mechanisms, by default, the consumer tries to consume from all partitions fairly. If that's not happening, changing fetch.max.bytes or max.partition.fetch.bytes usually helps.
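For illustration, these are plain consumer configs; the broker address, group id, and the values below are placeholders, with the defaults noted in the comments:

// imports: org.apache.kafka.clients.consumer.*, org.apache.kafka.common.serialization.StringDeserializer, java.util.Properties
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
// upper bound on the data returned for the whole fetch request (default 50 MB)
props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 52428800);
// upper bound on the data returned per partition per fetch (default 1 MB)
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1048576);
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);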
In case you want to prioritize some partitions over others, you need to use pause() and resume() to manually control the consumption flow.
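A minimal sketch of that flow control, assuming a consumer already subscribed or assigned to a hypothetical topic "events":

// assumes: a configured KafkaConsumer<String, String> named consumer,
// already subscribed/assigned to topic "events" (placeholder name)
TopicPartition lowPriority = new TopicPartition("events", 1);

// stop fetching from the low-priority partition; the assignment itself is kept
consumer.pause(Collections.singleton(lowPriority));

// poll() now returns records only from the non-paused partitions
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
records.forEach(r -> System.out.println(r.partition() + "-" + r.offset() + ": " + r.value()));

// once the prioritized partitions have caught up, restore equal priority
consumer.resume(consumer.paused());

Paused partitions stay assigned to the consumer; they are simply skipped by poll() until resumed.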

Kafka Streams does not increment offset by 1 when producing to topic

I have implemented a simple Kafka Dead letter record processor.
It works perfectly when using records produced from the Console producer.
However, I find that when our Kafka Streams applications produce records to the sink topics, the offsets are not guaranteed to be incremented by 1 for each record produced.
Dead Letter Processor Background:
I have a scenario where records may be received before all the data required to process them is published.
When records are not matched for processing by the streams app, they are moved to a dead letter topic instead of continuing to flow downstream. When new data is published, we dump the latest messages from the dead letter topic back into the stream application's source topic for reprocessing with the new data.
The Dead Letter processor:
At the start of a run, the application records the ending offsets of each partition.
The ending offsets mark the point at which to stop processing records for a given dead letter topic, to avoid an infinite loop if reprocessed records return to the dead letter topic.
The application resumes from the last offsets committed by the previous run via consumer groups.
The application uses transactions and KafkaProducer#sendOffsetsToTransaction to commit the last produced offsets.
To track when all records in my range are processed for a topic's partition, my service compares its last produced offset from the producer to the consumer's saved map of ending offsets. When we reach the ending offset, the consumer pauses that partition via KafkaConsumer#pause, and when all partitions are paused (meaning they have reached the saved ending offset) it exits.
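Not the actual code from this application, but a rough sketch of that stop-at-the-saved-ending-offsets logic using only standard consumer calls (republish(...) and the types are placeholders):

// assumes partitions of the dead letter topic were assigned via consumer.assign(...)
// and that republish(...) produces the record back as part of the producer transaction
Map<TopicPartition, Long> endingOffsets = consumer.endOffsets(consumer.assignment());

while (consumer.paused().size() < consumer.assignment().size()) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        republish(record); // hypothetical helper
    }
    // pause every partition whose position has reached the ending offset saved at startup
    for (TopicPartition tp : consumer.assignment()) {
        if (!consumer.paused().contains(tp)
                && consumer.position(tp) >= endingOffsets.getOrDefault(tp, 0L)) {
            consumer.pause(Collections.singleton(tp));
        }
    }
}
// all partitions paused: everything up to the startup ending offsets has been reprocessed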
The Kafka Consumer API States:
Offsets and Consumer Position
Kafka maintains a numerical offset for each record in a partition. This offset acts as a unique identifier of a record within that partition, and also denotes the position of the consumer in the partition. For example, a consumer which is at position 5 has consumed records with offsets 0 through 4 and will next receive the record with offset 5.
The Kafka Producer API documentation also implies that the next offset is always +1:
Sends a list of specified offsets to the consumer group coordinator, and also marks those offsets as part of the current transaction. These offsets will be considered committed only if the transaction is committed successfully. The committed offset should be the next message your application will consume, i.e. lastProcessedMessageOffset + 1.
But you can clearly see in my debugger that the records consumed for a single partition are anything but incremented 1 at a time...
I thought maybe this was a Kafka configuration issue such as max.message.bytes but none really made sense.
Then I thought perhaps it is from joining but didn't see any way that would change the way the producer would function.
Not sure if it is relevant or not but all of our Kafka applications are using Avro and Schema Registry...
Should offsets always increment by 1 regardless of the method of producing, or is it possible that the Kafka Streams API does not offer the same guarantees as the normal producer/consumer clients?
Is there just something I am missing entirely?
It is not an official API contract that message offsets are increased by one, even if the JavaDocs indicate this (it seems that the JavaDocs should be updated).
If you don't use transactions, you get either at-least-once semantics or no guarantees (some call this at-most-once semantics). For at-least-once, records might be written twice and thus, offsets for two consecutive messages are not really increased by one as the duplicate write "consumes" two offsets.
If you use transactions, each commit (or abort) of a transaction writes a commit (or abort) marker into the topic -- those transactional markers also "consume" one offset (this is what you observe).
Thus, in general you should not rely on consecutive offsets. The only guarantee you get is, that each offset is unique within a partition.
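As an illustrative sketch (not from the question), progress tracking that copes with such gaps compares against the last seen offset per partition rather than expecting lastOffset + 1:

Map<TopicPartition, Long> lastSeen = new HashMap<>();
for (ConsumerRecord<String, String> record : records) {
    TopicPartition tp = new TopicPartition(record.topic(), record.partition());
    // offsets are unique and strictly increasing within a partition, but gaps are
    // normal: duplicate writes and transaction commit/abort markers also consume offsets
    lastSeen.merge(tp, record.offset(), Math::max);
}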
I know that knowing the offset of messages can be useful. However, Kafka only guarantees that the offset of message X will be greater than the offset of the previous message X-1. By the way, an ideal solution should not be based on offset calculations.
Under the hood, the Kafka producer may try to resend messages. Also, if a broker goes down, rebalancing may occur. Exactly-once semantics may append additional messages. Therefore, the offset of your message may change if any of the above events occur.
Kafka may add additional messages to the topic for internal purposes, but Kafka's consumer API may discard those internal messages. Therefore, you only see your own messages, and your messages' offsets will not necessarily increment by 1.

Kafka offset management on Kafka topic as well as local database

I want to manage offsets in the Kafka topic as well as in a database, so that if I want to reprocess the queue from a certain point onwards, I can. How can I proceed with this? Thanks in advance.
Given a PartitionInfo you should be able to tell your consumer to seekToBeginning or seek to an offset in that partition.
A ConsumerRecord knows its topic, partition and offset. You could record these facts in a database.
But the catch here is if your topics are partitioned: your data will then only be chronological within each partition. So if you have two partitions and partition essentially by last name, the name changes for the first half of the alphabet will be sequential and the second half will be sequential, but it is not obvious how to get a single chronological view of the name changes across the system.
However, if you recorded partition and offset for a particular change in the database, you could seek to that partition and offset and reprocess the stream from that point.
(This becomes irrelevant if you only have one partition, but it's something to think about when/if your topic or streaming architecture needs multiple partitions)
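A minimal sketch of that replay, assuming the topic, partition, and offset were previously written to the database (offsetDao and the "name-changes" topic are hypothetical):

// load the stored position; offsetDao is a hypothetical database accessor
long savedOffset = offsetDao.loadOffset("name-changes", 0);

TopicPartition tp = new TopicPartition("name-changes", 0);
consumer.assign(Collections.singleton(tp));

consumer.seek(tp, savedOffset);                             // reprocess from the stored change
// or: consumer.seekToBeginning(Collections.singleton(tp)); // replay the whole partition

ConsumerRecords<String, String> replay = consumer.poll(Duration.ofSeconds(1));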
Stepping back from the actual question into theory, I'm not really sure why you would want to do this, as consumer groups record your committed offsets in Kafka itself, so if your stream processing app crashes you'll be able to pick up where you left off without worry. This offset committing either happens automatically, if you set the enable.auto.commit property, or you can control it manually by calling commitSync() on the consumer. Or perhaps you're trying to use an immutable data store (Kafka) the way one would use a mutable store, but that's just speculation based on the fact that you're not really descriptive about why you want to do what you want to do.

Why don't I see any output from the Kafka Streams reduce method?

Given the following code:
KStream<String, Custom> stream =
    builder.stream(Serdes.String(), customSerde, "test_in");

stream
    .groupByKey(Serdes.String(), customSerde)
    .reduce(new CustomReducer(), "reduction_state")
    .print(Serdes.String(), customSerde);
I have a println statement inside the apply method of the Reducer, which successfully prints out when I expect the reduction to take place. However, the final print statement shown above displays nothing. Likewise, if I use a to method rather than print, I see no messages in the destination topic.
What do I need after the reduce statement to see the result of the reduction? If one value is pushed to the input I don't expect to see anything. If a second value with the same key is pushed I expect the reducer to apply (which it does) and I also expect the result of the reduction to continue to the next step in the processing pipeline. As described I'm not seeing anything in subsequent steps of the pipeline and I don't understand why.
As of Kafka 0.10.1.0, all aggregation operators use an internal de-duplication cache to reduce the load on the result KTable changelog stream. For example, if you count and process two records with the same key directly after each other, the full changelog stream would be <key:1>, <key:2>.
With the new caching feature, the cache would receive <key:1> and store it, but not send it downstream right away. When <key:2> is computed, it replaces the first entry in the cache. Depending on the cache size, the number of distinct keys, the throughput, and your commit interval, the cache sends entries downstream. This happens either on cache eviction for a single key entry or as a complete flush of the cache (sending all entries downstream). Thus, the KTable changelog might only show <key:2> (because <key:1> got de-duplicated).
You can control the size of the cache via the Streams configuration parameter StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG. If you set the value to zero, you disable caching completely and the KTable changelog will contain all updates (effectively restoring the pre-0.10.1 behavior).
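A hedged sketch of turning the cache off; the application id and bootstrap servers are placeholders:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "reduce-example");    // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
// 0 disables the record cache, so every update produced by reduce() is forwarded downstream
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);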
Confluent documentation contains a section explaining the cache in more detail:
http://docs.confluent.io/current/streams/architecture.html#record-caches
http://docs.confluent.io/current/streams/developer-guide.html#streams-developer-guide-memory-management

Effective strategy to avoid duplicate messages in apache kafka consumer

I have been studying Apache Kafka for a month now. However, I am now stuck at a point. My use case is that I have two or more consumer processes running on different machines. I ran a few tests in which I published 10,000 messages to the Kafka server. Then, while processing these messages, I killed one of the consumer processes and restarted it. The consumers were writing processed messages to a file. So after consumption finished, the file was showing more than 10k messages, meaning some messages were duplicated.
In the consumer process I have disabled auto commit; consumers manually commit offsets batch-wise. So, for example, if 100 messages are written to the file, the consumer commits the offsets. When a single consumer process is running and it crashes and recovers, duplication is avoided in this manner. But when more than one consumer is running and one of them crashes and recovers, it writes duplicate messages to the file.
Is there any effective strategy to avoid these duplicate messages?
The short answer is, no.
What you're looking for is exactly-once processing. While it may often seem feasible, it should never be relied upon because there are always caveats.
Even to attempt to prevent duplicates you would need to use the simple consumer. With this approach, for each consumer, when a message is consumed from some partition, you write the partition and offset of the consumed message to disk. When the consumer restarts after a failure, it reads the last consumed offset for each partition from disk.
But even with this pattern the consumer can't guarantee it won't reprocess a message after a failure. What if the consumer consumes a message and then fails before the offset is flushed to disk? If you write to disk before you process the message, what if you write the offset and then fail before actually processing the message? This same problem would exist even if you were to commit offsets to ZooKeeper after every message.
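For concreteness, a sketch of the two orderings that paragraph describes; process() and offsetStore are hypothetical placeholders, and neither ordering closes the failure window on its own:

for (ConsumerRecord<String, String> record : records) {
    // at-least-once: a crash after process() but before saveOffset()
    // means the record is processed again on restart
    process(record);
    offsetStore.saveOffset(record.topic(), record.partition(), record.offset() + 1);

    // at-most-once is the reverse order: a crash after saveOffset()
    // but before process() means the record is never processed
    // offsetStore.saveOffset(record.topic(), record.partition(), record.offset() + 1);
    // process(record);
}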
There are some cases, though, where exactly-once processing is more attainable, but only for certain use cases. This simply requires that your offset be stored in the same location as your application's output. For instance, if you write a consumer that counts messages, by storing the last counted offset with each count you can guarantee that the offset is stored at the same time as the consumer's state. Of course, in order to guarantee exactly-once processing this would require that you consume exactly one message and update the state exactly once for each message, and that's completely impractical for most Kafka consumer applications. By its nature Kafka consumes messages in batches for performance reasons.
Usually your time will be better spent, and your application will be much more reliable, if you simply design it to be idempotent.
This is what Kafka FAQ has to say on the subject of exactly-once:
How do I get exactly-once messaging from Kafka?
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded
Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon
I agree with RaGe about deduplicating on the consumer side. We use Redis to deduplicate Kafka messages.
Assume the Message class has a member called 'uniqId', which is filled in by the producer side and is guaranteed to be unique. We use a 12-character random string (regexp '^[A-Za-z0-9]{12}$').
The consumer side uses Redis's SETNX to deduplicate and EXPIRE to purge expired keys automatically. Sample code:
Message msg = ... // eg. ConsumerIterator.next().message().fromJson();
Jedis jedis = ... // eg. JedisPool.getResource();
String key = "SPOUT:" + msg.uniqId; // prefix name at will
String val = Long.toString(System.currentTimeMillis());
long rsps = jedis.setnx(key, val);
if (rsps <= 0) {
log.warn("kafka dup: {}", msg.toJson()); // and other logic
} else {
jedis.expire(key, 7200); // 2 hours is ok for production environment;
}
The above code did detect duplicate messages several times when Kafka (version 0.8.x) had issues. With our input/output balance audit log, no message was lost or duplicated.
There's a relatively new 'Transactional API' in Kafka now that allows you to achieve exactly-once processing when processing a stream. With the transactional API, idempotency can be built in, as long as the remainder of your system is designed for idempotency. See https://www.baeldung.com/kafka-exactly-once
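A minimal producer-side sketch (broker address, transactional id, and topic are placeholders); on the consumer side you would additionally read with isolation.level=read_committed:

// imports: org.apache.kafka.clients.producer.*, org.apache.kafka.common.KafkaException,
// org.apache.kafka.common.serialization.StringSerializer, java.util.Properties
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "dedup-example-tx"); // placeholder

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("output-topic", "key", "value")); // placeholder topic
    producer.commitTransaction();
} catch (KafkaException e) {
    // for a sketch: abort so read_committed consumers never see these records
    producer.abortTransaction();
}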
Whatever is done on the producer side, we still believe the best way to deliver exactly-once from Kafka is to handle it on the consumer side:
Produce the message with a UUID as the Kafka message key into topic T1
On the consumer side, read the message from T1 and write it to HBase with the UUID as the row key
Read back from HBase with the same row key and write to another topic T2
Have your end consumers actually consume from topic T2
