Kafka Streams 2.6, Partition Assignor and Rebalancing Strategy

Kafka Streams 2.6, Partition Assignor and Rebalancing Strategy - java

In my current Kafka version which is 2.6, i am using Streams API and i have a question. When i start a stream, it writes Streams,Admin,Consumer and Produces configs. I noticed something strange that although i provide configuration
streamsConfiguration.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, CooperativeStickyAssignor.class.getName());
like above, i see some different strategies in consumer and stream logs.
Here is consumer logs that shows consumer configs
2021-01-20 15:52:32.611 INFO 111980 --- [alytics.event-4] o.a.k.clients.consumer.ConsumerConfig : ConsumerConfig values:
allow.auto.create.topics = true
auto.commit.interval.ms = 500
auto.offset.reset = none
bootstrap.servers = [XXX:9092, XXX:9092, XXX:9092]
check.crcs = true
client.dns.lookup = use_all_dns_ips
client.id = APPID-dd747646-8b51-42b0-8ad9-2fb26435a588-StreamThread-2-restore-consumer
client.rack =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = null
group.instance.id = null
heartbeat.interval.ms = 25000
interceptor.classes = []
internal.leave.group.on.close = false
internal.throw.on.fetch.stable.offset.unsupported = false
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 1000
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = DEBUG
metrics.sample.window.ms = 30000
partition.assignment.strategy = [org.apache.kafka.clients.consumer.CooperativeStickyAssignor]
but also i saw logs like below
2021-01-20 15:52:32.740 INFO 111980 --- [alytics.event-4] o.a.k.clients.consumer.ConsumerConfig : ConsumerConfig values:
allow.auto.create.topics = false
auto.commit.interval.ms = 500
auto.offset.reset = latest
bootstrap.servers = [XXX:9092, XXX:9092, XXX:9092]
check.crcs = true
client.dns.lookup = use_all_dns_ips
client.id = APPID-dd747646-8b51-42b0-8ad9-2fb26435a588-StreamThread-2-consumer
client.rack =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
heartbeat.interval.ms = 25000
interceptor.classes = []
internal.leave.group.on.close = false
internal.throw.on.fetch.stable.offset.unsupported = false
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 1000
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = DEBUG
metrics.sample.window.ms = 30000
partition.assignment.strategy = [org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor]
when i check those two consumer logs, i only noticed that their client.id values are different.
I am a bit consufed that did i enable CooperativeStickyAssignor or not ?
what are the differences between these two consumers that causes to use different partition assignment strategy ?
Is it normal that i see different consumer configuration in a same kafka streams application ?
Thank you

The first consumer logs in your question is that of a "restore" consumer which manages state store recovery. You can find the word "restore" in the client id.
The second consumer logs that you showed in your question is that of your own defined consumer. It seems that the strategy that is used by your consumer is "StreamsPartitionAssignor".

Related

Consumer CommitFailedException - time between subsequent calls to poll() was longer than the configured max.poll.interval.ms

I have problem with kafka consumer which from time to time throws exception.
ERROR [*KafkaConsumerWorker] (Thread-125) [] Kafka Consumer thread 235604751 Exception while polling Kafka.: org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(ConsumerCoordinator.java:820) [kafka-clients-2.3.0.jar:]
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:692) [kafka-clients-2.3.0.jar:]
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1368) [kafka-clients-2.3.0.jar:]
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1330) [kafka-clients-2.3.0.jar:]
at *.kafka.KafkaConsumerWorker.run(KafkaConsumerWorker.java:64) [classes:]
at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_51]
I can't find why is this happening, because the consumer is not processing any messages, while this exception occurs. These exceptions occurred 2 - 3 times daily.
Some of my consumer configurations are as follow:
allow.auto.create.topics = true
auto.commit.interval.ms = 5000
auto.offset.reset = latest
bootstrap.servers = [*]
check.crcs = true
client.dns.lookup = default
client.id = 52c94040-05d9-4b57-8006-afcc862f9b62
client.rack =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = TEST
group.instance.id = null
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 10
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
Implementation:
{
logger.info("Kafka Consumer thread {} start", hashCode());
Consumer<String, Message> consumer = null;
try {
consumer = KafkaConsumerClient.createConsumer();
while (start) {
try {
ConsumerRecords<String, Message> notifications =
consumer.poll(300000);
if (!notifications.isEmpty()) {
//processing.....
}
consumer.commitSync();
} catch (Exception e) {
logger.error("Kafka Consumer thread {} Exception while polling Kafka.", hashCode(), e);
}
}
logger.info("Kafka Consumer thread {} exit", hashCode());
} finally {
if (consumer != null) {
logger.info("Kafka Consumer thread {} closing consumer.", hashCode());
consumer.close();
}
}
}
I know that with this version of the kafka clinet, the heartbeatis sent from another thread which I guess that eliminates that the consumer spent too much time for processing (even that there is nothing to process). I guess that this is something with config timeoutes but can't find which exactly.

Assuming you want to handle records in order, you should append events into an in memory queue from the consumer loop, then hand off that queue object into a completely new dequeue-ing Thread for the processing... logic
The error suggests whatever you're doing there is slow enough to stop and rebalance your consumer
I'd also recommend a higher level library that can handle backpressure such as the Connect / Streams API or Vertx or Smallrye Messaging or Akka Streams

You should set the Duration of Consumer#poll(Duration) to lower than max.poll.interval.ms which is the maximum of time Consumer can stay idle before fetching more records. In Kafka document:
If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member
By the time you you commit your offset, your consumer is already failed, and partition had already revoked, group is rebalancing.

Kafka AdminClient seems to start twice

I hope I'm able to explain this!
I'm starting a dockerized java spring boot application that will connect to a single dockerized Kafka instance.
To do this I have setup a link in the docker-compose file that will allow the application to connect to the kafka docker, named kafka-cluster on port 9092.
When I start the both containers I get an error in the java application saying that it's unable to connect to KafkaAdmin:
[AdminClient clientId=adminclient-2] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.
But it's trying to connect to localhost/127.0.0.1.
Further up in the logs, I can see it started a connection to KafkaAdmin twice:
first:
2020-03-25 13:53:15.515 INFO 7 --- [ main] o.a.k.clients.admin.AdminClientConfig : AdminClientConfig values:
bootstrap.servers = [kafka-cluster:9092]
client.dns.lookup = default
client.id =
connections.max.idle.ms = 300000
<more properties>
and then again straight after (but on localhost):
2020-03-25 13:53:15.780 INFO 7 --- [ main] o.a.k.clients.admin.AdminClientConfig : AdminClientConfig values:
bootstrap.servers = [localhost:9092]
client.dns.lookup = default
client.id =
connections.max.idle.ms = 300000
<more properties>
This is the config:
#Bean
public KafkaAdmin kafkaAdmin() {
Map<String, Object> configs = new HashMap<>();
configs.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
return new KafkaAdmin(configs);
}
#Bean
public NewTopic sysCcukCdcAssetsCreate() {
return new NewTopic(newPanelTopic, 1, (short) 1);
}
#Bean
public NewTopic sysCcukCdcAssetsUpdate() {
return new NewTopic(updatedPanelTopic, 1, (short) 1);
}
where bootstrapServers = kafka-cluster:9092
I can't see why it seems like KafkaAdmin has 2 sets of configs but it seems like it is causing the error.
Any guidance or suggestions massively appreciated :)

So! This was a great rubber duck.
Turns out the spring.kafka.bootstrap.servers property hadn't been set in the properties file and it defaults to localhost setting this to kafka-cluster:9092 fixed it :)

Frequent "offset out of range" messages, partitions deserted by consumer

We are running 3 node Kafka 0.10.0.1 cluster. We have a consumer application which has a single consumer group connecting to multiple topics. We are seeing strange behaviour in consumer logs. With these lines
Fetch offset 1109143 is out of range for partition email-4, resetting offset
Fetch offset 952168 is out of range for partition email-7, resetting offset
Fetch offset 945796 is out of range for partition email-5, resetting offset
Fetch offset 950900 is out of range for partition email-0, resetting offset
Fetch offset 953163 is out of range for partition email-3, resetting offset
Fetch offset 1118389 is out of range for partition email-6, resetting offset
Fetch offset 1112177 is out of range for partition email-2, resetting offset
Fetch offset 1109539 is out of range for partition email-1, resetting offset
Some time later we saw these logs
[2018-06-08 19:45:28] :: INFO :: ConsumerCoordinator:333 - Revoking previously assigned partitions [sms-4, sms-3, sms-0, sms-2, sms-1] for group notifications-consumer
[2018-06-08 19:45:28] :: INFO :: AbstractCoordinator:381 - (Re-)joining group notifications-consumer
[2018-06-08 19:45:28] :: INFO :: AbstractCoordinator$1:349 - Successfully joined group notifications-consumer with generation 3063
[2018-06-08 19:45:28] :: INFO :: AbstractCoordinator$1:349 - Successfully joined group notifications-consumer with generation 3063
[2018-06-08 19:45:28] :: INFO :: AbstractCoordinator$1:349 - Successfully joined group notifications-consumer with generation 3063
[2018-06-08 19:45:28] :: INFO :: AbstractCoordinator$1:349 - Successfully joined group notifications-consumer with generation 3063
[2018-06-08 19:45:28] :: INFO :: AbstractCoordinator$1:349 - Successfully joined group notifications-consumer with generation 3063
[2018-06-08 19:45:28] :: INFO :: AbstractCoordinator$1:349 - Successfully joined group notifications-consumer with generation 3063
[2018-06-08 19:45:28] :: INFO :: AbstractCoordinator$1:349 - Successfully joined group notifications-consumer with generation 3063
[2018-06-08 19:45:28] :: INFO :: AbstractCoordinator$1:349 - Successfully joined group notifications-consumer with generation 3063
[2018-06-08 19:45:28] :: INFO :: ConsumerCoordinator:225 - Setting newly assigned partitions [sms-8, sms-7, sms-9, sms-6, sms-5] for group notifications-consumer
I noticed that one of our topics was not seen in the list of Setting newly assigned partitions. Then that topic had no consumers attached to it for 8 hours at least. It's only when someone restarted application it started consuming from that topic. What can be going wrong here?
Here is consumer config
auto.commit.interval.ms = 3000
auto.offset.reset = latest
bootstrap.servers = [x.x.x.x:9092, x.x.x.x:9092, x.x.x.x:9092]
check.crcs = true
client.id =
connections.max.idle.ms = 540000
enable.auto.commit = true
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = otp-notifications-consumer
heartbeat.interval.ms = 3000
interceptor.classes = null
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 50
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.ms = 50
request.timeout.ms = 305000
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.mechanism = GSSAPI
security.protocol = SSL
send.buffer.bytes = 131072
session.timeout.ms = 300000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = /x/x/client.truststore.jks
ssl.truststore.password = [hidden]
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
The topic which went orphan has 10 partitions, retention.ms=1800000, segment.ms=1800000.
Please help.

The offset out of range message you are seeing usually indicates the offset the consumer is at has been deleted on the broker. Upon hitting that the consumer will use auto.offset.reset to restart consuming.
With retention.ms=1800000 (30mins), you are only keeping data for a very short amount of time so it's expected that if you restart the consumer after several hours, the data is gone.

Java Producer not recognizing property "Partitioner.class" property

I am running Cloudera provided Kafka version 0.9.0 and I have written a custom Producer
I am adding the following properties to my ProducerConfig:
Properties props = new Properties();
props.put("bootstrap.servers", brokers);
props.put("acks", "all");
//props.put("metadata.broker.list", this.getBrokers());
//props.put("serializer.class", "kafka.serializer.StringEncoder");
props.put("retries", 0);
props.put("batch.size", 16384);
props.put("linger.ms", 1);
props.put("buffer.memory", 33554432);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("partitioner.class", "com.vikas.MyPartitioner");
However when I am running my program, I am getting this error
16/08/23 18:00:33 INFO producer.ProducerConfig: ProducerConfig values:
value.serializer = class org.apache.kafka.common.serialization.StringSerializer
key.serializer = class org.apache.kafka.common.serialization.StringSerializer
block.on.buffer.full = true
retry.backoff.ms = 100
buffer.memory = 33554432
batch.size = 16384
metrics.sample.window.ms = 30000
metadata.max.age.ms = 300000
receive.buffer.bytes = 32768
timeout.ms = 30000
max.in.flight.requests.per.connection = 5
bootstrap.servers = [server01:9092, server02:9092]
metric.reporters = []
client.id =
compression.type = none
retries = 0
max.request.size = 1048576
send.buffer.bytes = 131072
acks = all
reconnect.backoff.ms = 10
linger.ms = 1
metrics.num.samples = 2
metadata.fetch.timeout.ms = 60000
16/08/23 18:00:33 WARN producer.ProducerConfig: The configuration partitioner.class = null was supplied but isn't a known config.
As per the documentation on http://kafka.apache.org/090/documentation.html#producerconfigs
partitioner,class is a valid property but I am not sure why is kafka complaining about it being an unknown config.

Take a look at this article http://www.javaworld.com/article/3066873/big-data/big-data-messaging-with-kafka-part-2.html it has sample on how to use custom partitioner.
As you can see the error message is
"16/08/23 18:00:33 WARN producer.ProducerConfig: The configuration partitioner.class = null was supplied but isn't a known config." which means Kafka does not understand the key partitioner.class
You might want to set partitioner like this
configProperties.put(ProducerConfig.PARTITIONER_CLASS_CONFIG,"com.vikas.MyPartitioner");

Two instances not doing automatic load balancing in Quartz scheduler on the same machine

Two instances in a clustered JobStore are not working on the same machine inspite of having the correct quartz properties (As given in the docs).
The problem seems that the load balancing is not happening and everytime it's the same instance that picks the job (The one that starts later).
============================================================================
Configure Main Scheduler Properties
org.quartz.scheduler.instanceName = MyClusteredScheduler
org.quartz.scheduler.instanceId = instance1
============================================================================
Configure ThreadPool
org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 25
org.quartz.threadPool.threadPriority = 5
============================================================================
Configure JobStore
org.quartz.jobStore.misfireThreshold = 60000
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.useProperties = false
org.quartz.jobStore.dataSource = mySQLDS
org.quartz.jobStore.tablePrefix = QRTZ_
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 1000
============================================================================
Configure Datasources
org.quartz.dataSource.mySQLDS.driver = com.mysql.jdbc.Driver
org.quartz.dataSource.mySQLDS.URL = jdbc:mysql://localhost:3306/quartz2
org.quartz.dataSource.mySQLDS.user = quartz2
org.quartz.dataSource.mySQLDS.password = quartz2123
org.quartz.dataSource.mySQLDS.maxConnections = 5
org.quartz.dataSource.mySQLDS.validationQuery=select 0 from dual
The other instance had instance id as instance2.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Kafka Streams 2.6, Partition Assignor and Rebalancing Strategy - java

Related

Consumer CommitFailedException - time between subsequent calls to poll() was longer than the configured max.poll.interval.ms

Kafka AdminClient seems to start twice

Frequent "offset out of range" messages, partitions deserted by consumer

Java Producer not recognizing property "Partitioner.class" property

Two instances not doing automatic load balancing in Quartz scheduler on the same machine

Categories

Resources