Kafka Streams localstore partitions assignment unbalanced - java

First of all, sorry if my terminology is not precise; I am very new to Kafka and I have read as much as I could.
We have a service which uses Kafka Streams, Kafka version 2.3.1.
The stream app has a topology which reads from a topic "topicA", performs a conversion, and publishes into another topic "topicB", which is then consumed by another part of the topology and aggregated into a KTable (local store). A listener publishes the KTable changes into another topic.
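In code, the topology looks roughly like this (a simplified sketch; the topic, store, and function names here are stand-ins, not our real ones):
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

StreamsBuilder builder = new StreamsBuilder();

// stage 1: read topicA, convert, publish to topicB
builder.stream("topicA", Consumed.with(Serdes.String(), Serdes.String()))
    .mapValues(value -> convert(value)) // convert() stands in for our conversion logic
    .to("topicB");

// stage 2: read topicB and aggregate into a KTable backed by a local store
builder.stream("topicB", Consumed.with(Serdes.String(), Serdes.String()))
    .groupByKey()
    .aggregate(
        () -> "",
        (key, value, aggregate) -> aggregate + value, // stand-in aggregation
        Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("aggregation-store"))
    .toStream()
    .to("aggregation-changes"); // the listener that publishes KTable changes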
The topics have 24 partitions.
We have 2 instances of this service on different machines, with 4 stream threads each.
The problem is that the partitions that use the local store are all assigned to the same instance.
As a result, disk usage, rebalancing, and performance are awful.
Also, something unexpected to me: if I check the group assignments on the Kafka broker, I see:
(Removed other partitions for readability)
GROUP CONSUMER-ID HOST CLIENT-ID #PARTITIONS ASSIGNMENT
fj.TheAggregation.TST.V1.PERF fj.TheAggregation.TST.V1.PERF-6898e899-7722-421a-8841-f8e45b074981-StreamThread-3-consumer-c089baaa-343b-484f-add6-aca12572e2a5 10.11.200.115/10.11.200.115 fj.TheAggregation.TST.V1.PERF-6898e899-7722-421a-8841-f8e45b074981-StreamThread-3-consumer 54 fj.TheAggregationDocument.TST.V1.PERF(4,8,12,16,20)
fj.TheAggregation.TST.V1.PERF fj.TheAggregation.TST.V1.PERF-6898e899-7722-421a-8841-f8e45b074981-StreamThread-2-consumer-f5e2d4e3-feee-4778-8ab8-ec4dd770541a 10.11.200.115/10.11.200.115 fj.TheAggregation.TST.V1.PERF-6898e899-7722-421a-8841-f8e45b074981-StreamThread-2-consumer 54 fj.TheAggregationDocument.TST.V1.PERF(5,9,13,17,21)
fj.TheAggregation.TST.V1.PERF fj.TheAggregation.TST.V1.PERF-0733344b-bd8d-40d6-ad07-4fc93de76cf2-StreamThread-4-consumer-63371f35-118a-44e0-bc9b-d403fb59384d 10.11.200.114/10.11.200.114 fj.TheAggregation.TST.V1.PERF-0733344b-bd8d-40d6-ad07-4fc93de76cf2-StreamThread-4-consumer 54 fj.TheAggregationDocument.TST.V1.PERF(2)
fj.TheAggregation.TST.V1.PERF fj.TheAggregation.TST.V1.PERF-0733344b-bd8d-40d6-ad07-4fc93de76cf2-StreamThread-1-consumer-714f0fee-b001-4b16-8b5b-6ab8935becfd 10.11.200.114/10.11.200.114 fj.TheAggregation.TST.V1.PERF-0733344b-bd8d-40d6-ad07-4fc93de76cf2-StreamThread-1-consumer 54 fj.TheAggregationDocument.TST.V1.PERF(0)
fj.TheAggregation.TST.V1.PERF fj.TheAggregation.TST.V1.PERF-0733344b-bd8d-40d6-ad07-4fc93de76cf2-StreamThread-2-consumer-d14e2e20-9aad-4a20-a295-83621a76b099 10.11.200.114/10.11.200.114 fj.TheAggregation.TST.V1.PERF-0733344b-bd8d-40d6-ad07-4fc93de76cf2-StreamThread-2-consumer 54 fj.TheAggregationDocument.TST.V1.PERF(1)
fj.TheAggregation.TST.V1.PERF fj.TheAggregation.TST.V1.PERF-6898e899-7722-421a-8841-f8e45b074981-StreamThread-4-consumer-14f390d9-f4f4-4e70-8e8d-62a79427c4e6 10.11.200.115/10.11.200.115 fj.TheAggregation.TST.V1.PERF-6898e899-7722-421a-8841-f8e45b074981-StreamThread-4-consumer 54 fj.TheAggregationDocument.TST.V1.PERF(7,11,15,19,23)
fj.TheAggregation.TST.V1.PERF fj.TheAggregation.TST.V1.PERF-6898e899-7722-421a-8841-f8e45b074981-StreamThread-1-consumer-57d2f85b-50f8-4649-8080-bbaaa6ea500f 10.11.200.115/10.11.200.115 fj.TheAggregation.TST.V1.PERF-6898e899-7722-421a-8841-f8e45b074981-StreamThread-1-consumer 54 fj.TheAggregationDocument.TST.V1.PERF(6,10,14,18,22)
fj.TheAggregation.TST.V1.PERF fj.TheAggregation.TST.V1.PERF-0733344b-bd8d-40d6-ad07-4fc93de76cf2-StreamThread-3-consumer-184f3a99-1159-44d7-84c6-e7aa70c484c0 10.11.200.114/10.11.200.114 fj.TheAggregation.TST.V1.PERF-0733344b-bd8d-40d6-ad07-4fc93de76cf2-StreamThread-3-consumer 54 fj.TheAggregationDocument.TST.V1.PERF(3)
So each service instance has 54 partitions assigned in total, but they are not evenly distributed. Also, if I check the local store on each instance, I see that the stream KTables are all on the same node, even though the broker states that some of the partitions are assigned to the other instance. So the data provided by the broker does not seem to match the stream app's state.
Is there a way to ensure that the group leader assigns partitions evenly?
I would expect some way to specify that, or to assign some kind of "weight" to each stream, so the group leader can distribute resource-intensive streams evenly among the service instances, or at least less unbalanced.
By the way, is there a recommended Kafka users group for asking these kinds of questions?
Thanks

There were a lot of improvements to the Streams assignor in 2.6; you can read about them in KIP-441 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-441%3A+Smooth+Scaling+Out+for+Kafka+Streams).
I don't know whether they will fix your problem, but they should help: the new assignor treats stateful tasks like KTables differently and should balance them better.
If you cannot upgrade from 2.3.1, you might try different names. You might just be getting unlucky hashes.
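If you do try renaming, the names in play are the application id and the state store names. A sketch of where they are set (values here are hypothetical; note that a new application.id means a new consumer group and new internal topics, so all state is rebuilt):
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fj.TheAggregation.TST.V2.PERF"); // renamed app id (hypothetical)
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
KafkaStreams streams = new KafkaStreams(builder.build(), props);
The store name itself is whatever you pass to Materialized.as(...).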

Related

Kafka RoundRobin partitioner not distributing messages to 4 partitions

I have 4 brokers and 4 partitions, but when I push 4 messages with a null key they are not saved round-robin. I was expecting each message to be saved to a different partition once.
I am using kafka-clients 5.5.x (Confluent Platform versioning) to get the KafkaProducer, and it looks like for versions after 5.4.x:
If a key is not provided, behavior is Confluent Platform version-dependent:
In Confluent Platform versions 5.4.x and later, the partition is assigned with awareness to batching. If a batch of records is not full and has not yet been sent to the broker, it will select the same partition as a prior record. Partitions for newly created batches are assigned randomly. For more information, see KIP-480: Sticky Partitioner and the related Confluent blog post.
In Confluent Platform versions prior to 5.4.x, the partition is assigned in a round robin method, starting at a random partition.
https://docs.confluent.io/platform/current/clients/producer.html
Is my understanding correct?
A new partitioner (the sticky partitioner) was introduced in Kafka 2.4 to improve the way the producer distributes data across partitions.
Basically, it now sticks to one partition until a batch is full and then picks a new one, instead of doing round robin for each and every record.
For more details, you can refer to the link below; it explains everything in detail.
https://www.confluent.io/blog/5-things-every-kafka-developer-should-know/#tip-2-new-sticky-partitioner
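If you want the old per-record behavior back, you can opt out of the sticky partitioner by configuring the RoundRobinPartitioner that ships with the client. A minimal sketch (broker address and topic name are placeholders):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RoundRobinPartitioner;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
// replace the default (sticky since Kafka 2.4 / CP 5.4) partitioner with strict round robin
props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, RoundRobinPartitioner.class.getName());

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    for (int i = 0; i < 4; i++) {
        producer.send(new ProducerRecord<>("my-topic", null, "msg-" + i)); // null key
    }
}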

Understanding kafka streams partition assignor

I have two topics, one with 3 partitions and one with 48.
Initially I used the default assignor, but I got some problems when a consumer (a pod in Kubernetes) crashed.
What happened was that when the pod came up again, it was reassigned partitions from the topic with 3 partitions and none from the topic with 48.
The two pods that did not crash got assigned 16 and 32 partitions from the topic with 48 partitions.
I've fixed this by using a round-robin partition assignor, but now I don't feel confident in how the partitions are distributed, since I'm using KStream-KStream joins, and for those we need to guarantee that each consumer is assigned the same partitions across topics, e.g. C1: (t1:p0, t2:p0), C2: (t1:p1, t2:p1), etc.
One thing I thought of was that I could rekey the incoming events so they get repartitioned, and then I might be able to guarantee this?
Or maybe I don't understand how the default partitioning works... I'm confused.
Kafka Streams does not allow you to use a custom partition assignor. If you set one yourself, it will be overwritten with the StreamsPartitionAssignor [1]. This is needed to ensure that, if possible, partitions are re-assigned to the same consumers (a.k.a. stickiness) during rebalancing. Stickiness is important for Kafka Streams to be able to reuse state stores on the consumer side as much as possible. If a partition is not reassigned to the same consumer, the state stores used within that consumer need to be recreated from scratch after rebalancing.
[1] https://github.com/apache/kafka/blob/9bd0d6aa93b901be97adb53f290b262c7cf1f175/streams/src/main/java/org/apache/kafka/streams/StreamsConfig.java#L989
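On the rekeying idea from the question: rekeying the smaller stream before the join does restore co-partitioning, because selectKey() marks the stream as rekeyed and the join then routes it through an internal repartition topic. A rough sketch (topic names and the key extractor are made up):
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> small = builder.stream("topic-3-partitions");
KStream<String, String> big = builder.stream("topic-48-partitions");

small.selectKey((key, value) -> extractJoinKey(value)) // extractJoinKey() is hypothetical
     // the join repartitions the rekeyed side so both sides end up co-partitioned
     .join(big,
           (leftValue, rightValue) -> leftValue + "|" + rightValue,
           JoinWindows.of(Duration.ofMinutes(5)))
     .to("joined-output");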

Can we lose messages in Kafka Streams if we add new partitions?

Say for example, I have 4 partitions.
Say a message msg1 with key 101 is put into partition 1 (out of 4) and is not consumed yet. Meanwhile a new partition is added, making a total of 5 partitions.
Then the next message msg2 with key 101 goes to partition 4 (say, for example), because hash(101) % no_of_partitions = 4.
Now, in the Streams API, whenever a message is looked up by its key, partition 4 will be accessed for that key, because that is the partition hash(101) % no_of_partitions yields, and therefore it finds msg2 with key 101 in partition 4.
Now, what about msg1 with key 101 in partition 1? Is it consumed at all?
You won't lose data. However, depending on your application, adding partitions might not be supported and might break your application.
You can add partitions only if your application is stateless. If your application is stateful, it will most likely break and die with an exception.
Also note that Kafka Streams assumes that input data is partitioned by key. Thus, if the partitioning is changed, even if the application does not break, it will most likely compute an incorrect result, because adding a partition violates the partitioning assumption.
One way to approach this issue is to reset your application. However, this implies that you lose your current application state. Note that resetting will not address the partitioning problem, though, and your application might still compute incorrect results. To guard against the partitioning problem, you could insert a dummy map() operation that only forwards the data right after you read it from the topic, because this will trigger a data repartitioning if required and thus fix the key-based partitioning.
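A rough sketch of that dummy map() trick (topic names and serdes are made up):
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;

StreamsBuilder builder = new StreamsBuilder();
builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
    // identity mapping; its only purpose is to flag the key as "possibly changed",
    // which forces a repartition step before any downstream stateful operation
    .map((key, value) -> KeyValue.pair(key, value))
    .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
    .count()
    .toStream()
    .to("counts-topic", Produced.with(Serdes.String(), Serdes.Long()));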
The msg1 of key 101 in partition 1 will be consumed.
In Kafka Streams, you do not "consume a message by its key". Every message in every partition will be consumed. If something should filter on the key, it would be in the code of the Kafka Streams app.
It will be consumed, but order is not guaranteed. Make sure your application logic is idempotent. One possible solution is to go through an intermediate topic with more partitions. KStream#through will produce to and consume from that topic in a single instruction, and it returns a KStream. In pseudo code:
builder.stream(...)
    // potential key transformation
    .through("inner_topic_with_more_partitions")
    .toTable(accountMaterializer);

storm topology: one to many (random)

I'm using the KafkaSpout spout to read from all (6) partitions of a Kafka topic. The first bolt in the topology has to convert the byte stream into a struct (via an IDL definition), look up a value in a DB, and pass these values to a second bolt, which writes it all into Cassandra.
There are several issues occurring:
Many fails from the Kafka spout.
The first bolt reports a "capacity" of > 2.0 in the Storm UI.
I've tried to increase the parallelism, but it appears that Storm will only accept 1:1 from the KafkaSpout to the first bolt. I'm guessing that issue #1 is a result of timeouts in the first bolt.
What I want to do: have the Kafka spouts (limited to 1 per Kafka partition) send their bits to a random first bolt, so that I can run many more of these than the number of spouts. The first and second bolts would be 1:1, but spout to first bolt should be 1:many.
Currently I'm using LocalOrShuffleGrouping to connect spout->bolt->bolt, wired roughly as in the sketch below.
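For reference, the wiring looks roughly like this (bolt classes, the spout factory, and the parallelism numbers are placeholders):
import org.apache.storm.topology.TopologyBuilder;

TopologyBuilder builder = new TopologyBuilder();
// one spout executor per Kafka partition
builder.setSpout("kafka-spout", buildKafkaSpout(), 6); // buildKafkaSpout() is hypothetical
// the decode + db-lookup bolt, intended to run much wider than the spout
builder.setBolt("decode-bolt", new DecodeBolt(), 30)
       .localOrShuffleGrouping("kafka-spout");
builder.setBolt("cassandra-bolt", new CassandraBolt(), 30)
       .localOrShuffleGrouping("decode-bolt");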
Edit:
(Re)reading the Storm docs I see this passage:
Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a way such that each bolt is guaranteed to get an equal number of tuples.
Yet when I look at the load on the executors for my first bolt, I see everything concentrated on 6 of them, seemingly ignoring the other 24.
I'm missing some large clue here.

Effective strategy to avoid duplicate messages in apache kafka consumer

I have been studying Apache Kafka for a month now. However, I am stuck at a point. My use case is: I have two or more consumer processes running on different machines. I ran a few tests in which I published 10,000 messages to the Kafka server. Then, while processing these messages, I killed one of the consumer processes and restarted it. The consumers were writing processed messages to a file. So after consumption finished, the file showed more than 10k messages, i.e. some messages were duplicated.
In the consumer processes I have disabled auto-commit. The consumers manually commit offsets batch-wise; e.g. if 100 messages are written to the file, the consumer commits the offsets. When a single consumer process is running and it crashes and recovers, duplication is avoided in this manner. But when more than one consumer is running and one of them crashes and recovers, it writes duplicate messages to the file.
Is there any effective strategy to avoid these duplicate messages?
The short answer is, no.
What you're looking for is exactly-once processing. While it may often seem feasible, it should never be relied upon because there are always caveats.
Even to attempt to prevent duplicates you would need to use the simple consumer. How this approach works: for each consumer, when a message is consumed from some partition, write the partition and offset of the consumed message to disk. When the consumer restarts after a failure, read the last consumed offset for each partition from disk.
But even with this pattern the consumer can't guarantee it won't reprocess a message after a failure. What if the consumer consumes a message and then fails before the offset is flushed to disk? If you write to disk before you process the message, what if you write the offset and then fail before actually processing the message? This same problem would exist even if you were to commit offsets to ZooKeeper after every message.
There are some cases, though, where exactly-once processing is more attainable, but only for certain use cases. This simply requires that your offset be stored in the same location as your application's output. For instance, if you write a consumer that counts messages, by storing the last counted offset with each count you can guarantee that the offset is stored at the same time as the consumer's state. Of course, in order to guarantee exactly-once processing this would require that you consume exactly one message and update the state exactly once for each message, and that's completely impractical for most Kafka consumer applications. By its nature Kafka consumes messages in batches for performance reasons.
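For illustration, here is a rough sketch of that counting example, committing the output and the offset in one database transaction (the table layout and the PostgreSQL-style upsert are made up; assumes auto-commit is disabled on both the Kafka consumer and the JDBC connection):
import java.sql.Connection;
import java.sql.PreparedStatement;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

void process(ConsumerRecords<String, String> records, Connection db) throws Exception {
    for (ConsumerRecord<String, String> rec : records) {
        try (PreparedStatement out = db.prepareStatement(
                 "INSERT INTO counts (k, n) VALUES (?, 1) ON CONFLICT (k) DO UPDATE SET n = counts.n + 1");
             PreparedStatement off = db.prepareStatement(
                 "UPDATE offsets SET next_offset = ? WHERE topic = ? AND part = ?")) {
            out.setString(1, rec.key());
            out.executeUpdate();              // the application's output (the count)
            off.setLong(1, rec.offset() + 1); // the consumer's position
            off.setString(2, rec.topic());
            off.setInt(3, rec.partition());
            off.executeUpdate();
        }
    }
    db.commit(); // count and offset become visible atomically, or not at all
}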
Usually your time will be better spent, and your application will be much more reliable, if you simply design it to be idempotent.
This is what the Kafka FAQ has to say on the subject of exactly-once:
How do I get exactly-once messaging from Kafka?
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded
Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon
I agree with RaGe about deduplicating on the consumer side. We use Redis to deduplicate Kafka messages.
Assume the Message class has a member called 'uniqId', which is filled in by the producer side and is guaranteed to be unique. We use a 12-character random string (regexp '^[A-Za-z0-9]{12}$').
The consumer side uses Redis's SETNX to deduplicate and EXPIRE to purge expired keys automatically. Sample code:
Message msg = ... // e.g. ConsumerIterator.next().message().fromJson();
Jedis jedis = ... // e.g. JedisPool.getResource();
String key = "SPOUT:" + msg.uniqId; // prefix name at will
String val = Long.toString(System.currentTimeMillis());
long rsps = jedis.setnx(key, val); // 1 if the key was newly set, 0 if it already existed
if (rsps <= 0) {
    log.warn("kafka dup: {}", msg.toJson()); // and other logic
} else {
    jedis.expire(key, 7200); // 2 hours is OK for a production environment
}
The above code did detect duplicate messages several times when Kafka (version 0.8.x) had incidents. With our input/output balance audit log, no message was lost or duplicated.
There's a relatively new Transactional API in Kafka now that can allow you to achieve exactly-once processing when processing a stream. With the transactional API, idempotency can be built in, as long as the rest of your system is designed for idempotency. See https://www.baeldung.com/kafka-exactly-once
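A bare-bones sketch of that transactional flow on the producer side (bootstrap address, transactional id, topic, and the offsets map are placeholders):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-app-tx-1"); // must be stable per producer instance
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("output-topic", key, value));
    // commit the consumed offsets within the same transaction
    // (consumer is the source KafkaConsumer, offsetsToCommit its current positions)
    producer.sendOffsetsToTransaction(offsetsToCommit, consumer.groupMetadata());
    producer.commitTransaction();
} catch (KafkaException e) {
    // readers with isolation.level=read_committed will never see the aborted writes
    producer.abortTransaction();
}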
Whatever is done on the producer side, we still believe the best way to deliver exactly once from Kafka is to handle it on the consumer side:
Produce the message with a UUID as the Kafka message key into topic T1
On the consumer side, read the message from T1 and write it to HBase with the UUID as the row key
Read it back from HBase with the same row key and write it to another topic T2
Have your end consumers actually consume from topic T2
