We are using 60 partitions and have 70 consumers (10 more than partitions). There is no consumer lag on any of the partitions, but rebalancing keeps happening in a continuous loop.
Processing time for each Kafka event is 10 ms.
The following is the commit-failed exception that I get:
org.apache.kafka.common.errors.RebalanceInProgressException: Offset commit cannot be completed since the consumer is undergoing a rebalance for auto partition assignment. You can try completing the rebalance by calling poll() and then retry the operation.
I am using reactor-kafka 1.3.12, reactor-core 3.4.18, kafka-clients 3.0.1, Java 17, and the Tomcat web server.
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "CommentedbootStrapServersUrl");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "user-groupid-1");
props.put(ConsumerConfig.CLIENT_ID_CONFIG, "client-1");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true);
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 360000);
props.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, 300500);
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 30500);
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000);
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 5);
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");

ReceiverOptions<String, String> receiverOptions = ReceiverOptions.<String, String>create(props)
        .addAssignListener(partitions -> LOGGER.info("onPartitionAssigned {}", partitions))
        .addRevokeListener(partitions -> LOGGER.info("onPartitionRevoked {}", partitions))
        .commitRetryInterval(Duration.ofMillis(3000))
        .commitBatchSize(3)
        .commitInterval(Duration.ofMillis(2000))
        .maxDelayRebalance(Duration.ofSeconds(60))
        .subscription(Collections.singleton("commentedTopicName"));
Flux<ReceiverRecord<String, String>> inboundFlux = KafkaReceiver.create(receiverOptions)
        .receive()
        .onErrorContinue((throwable, o) -> LOGGER.error("Error while consuming in this pod"))
        .retryWhen(Retry.backoff(3, Duration.ofSeconds(3)).transientErrors(true))
        .repeat();

try {
    Scheduler schedulers = Schedulers.newBoundedElastic(2, 2, "UThread");
    inboundFlux
            .publishOn(schedulers)
            .doOnNext(kafkaEvent -> {
                callBusinessLogic(kafkaEvent);
                kafkaEvent.receiverOffset().acknowledge();
            })
            .doOnError(e -> e.printStackTrace())
            .subscribe();
} catch (Exception e) {
    e.printStackTrace();
}
I need help fixing this rebalancing issue. The consumers are stable at times, but once a rebalance starts, it never stops.
Related
I'm using Java 11 and kafka-client 2.0.0.
I'm using the following code to generate a consumer:
public Consumer createConsumer(Properties properties, String regex) {
    log.info("Creating consumer and listener..");
    Consumer consumer = new KafkaConsumer<>(properties);
    ConsumerRebalanceListener listener = new ConsumerRebalanceListener() {
        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
            log.info("The following partitions were revoked from consumer : {}", Arrays.toString(partitions.toArray()));
        }

        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            log.info("The following partitions were assigned to consumer : {}", Arrays.toString(partitions.toArray()));
        }
    };
    consumer.subscribe(Pattern.compile(regex), listener);
    log.info("consumer subscribed");
    return consumer;
}
}
My poll loop is in a different place in the code:
public <K, V> void startWorking(Consumer<K, V> consumer) {
    try {
        while (true) {
            ConsumerRecords<K, V> records = consumer.poll(600);
            if (records.count() > 0) {
                log.info("Polled {} records", records.count());
            } else {
                log.info("polled 0 records.. going to sleep..");
                Thread.sleep(200);
            }
        }
    } catch (WakeupException | InterruptedException e) {
        log.error("Consumer is shutting down", e);
    } finally {
        consumer.close();
    }
}
When I run the code and use this function, the consumer is created and the log contains the following messages:
Creating consumer and listener..
consumer subscribed
polled 0 records.. going to sleep..
polled 0 records.. going to sleep..
polled 0 records.. going to sleep..
The log doesn't contain any info regarding the partition assignment/revocation.
In addition, I'm able to see in the log the properties that the consumer uses (group.id is set):
2020-07-09 14:31:07.959 DEBUG 7342 --- [ main] o.a.k.clients.consumer.ConsumerConfig : ConsumerConfig values:
auto.commit.interval.ms = 5000
auto.offset.reset = latest
bootstrap.servers = [server1:9092]
check.crcs = true
client.id =
group.id=mygroupid
key.deserializer=..
value.deserializer=..
So I tried to use the kafka-console-consumer with the same configuration in order to consume from one of the topics that the regex (mytopic.*) should catch (in this case I used the topic mytopic-1):
/usr/bin/kafka-console-consumer.sh --bootstrap-server server1:9092 --topic mytopic-1 --property print.timestamp=true --consumer.config /data/scripts/kafka-consumer.properties --from-beginning
I have a poll loop in another part of my code that times out every 10 minutes. So, bottom line: the problem is that partitions aren't assigned to the Java consumer. The log statements inside the listener never fire, and the consumer doesn't have any partitions to listen to.
It turned out I was missing the SSL property in my properties file. Don't forget to specify security.protocol=SSL if your brokers use SSL. It seems the kafka-clients API doesn't throw an exception when the cluster uses SSL and you try to access it without the SSL parameters configured.
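For reference, a minimal sketch of what the fixed consumer properties might look like (the broker address, truststore path, and password below are placeholders, not values from the original setup):

Properties properties = new Properties();
properties.put("bootstrap.servers", "server1:9092");
properties.put("group.id", "mygroupid");
properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
// The missing piece: tell the client that the brokers speak SSL.
properties.put("security.protocol", "SSL");
properties.put("ssl.truststore.location", "/path/to/truststore.jks"); // placeholder
properties.put("ssl.truststore.password", "changeit");                // placeholder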
I have a problem with a Kafka consumer that throws an exception from time to time:
ERROR [*KafkaConsumerWorker] (Thread-125) [] Kafka Consumer thread 235604751 Exception while polling Kafka.: org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.sendOffsetCommitRequest(ConsumerCoordinator.java:820) [kafka-clients-2.3.0.jar:]
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.commitOffsetsSync(ConsumerCoordinator.java:692) [kafka-clients-2.3.0.jar:]
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1368) [kafka-clients-2.3.0.jar:]
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1330) [kafka-clients-2.3.0.jar:]
at *.kafka.KafkaConsumerWorker.run(KafkaConsumerWorker.java:64) [classes:]
at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_51]
I can't figure out why this is happening, because the consumer is not processing any messages when the exception occurs. These exceptions occur 2-3 times daily.
Some of my consumer configurations are as follows:
allow.auto.create.topics = true
auto.commit.interval.ms = 5000
auto.offset.reset = latest
bootstrap.servers = [*]
check.crcs = true
client.dns.lookup = default
client.id = 52c94040-05d9-4b57-8006-afcc862f9b62
client.rack =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = TEST
group.instance.id = null
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 10
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
Implementation:
public void run() {
    logger.info("Kafka Consumer thread {} start", hashCode());
    Consumer<String, Message> consumer = null;
    try {
        consumer = KafkaConsumerClient.createConsumer();
        while (start) {
            try {
                ConsumerRecords<String, Message> notifications = consumer.poll(300000);
                if (!notifications.isEmpty()) {
                    // processing.....
                }
                consumer.commitSync();
            } catch (Exception e) {
                logger.error("Kafka Consumer thread {} Exception while polling Kafka.", hashCode(), e);
            }
        }
        logger.info("Kafka Consumer thread {} exit", hashCode());
    } finally {
        if (consumer != null) {
            logger.info("Kafka Consumer thread {} closing consumer.", hashCode());
            consumer.close();
        }
    }
}
I know that with this version of the kafka client, the heartbeat is sent from a separate thread, which I assume rules out the consumer spending too much time on processing (especially since there is nothing to process). I suspect this is something with the timeout configuration, but I can't find which setting exactly.
Assuming you want to handle records in order, you should append events to an in-memory queue from the consumer loop, then hand that queue off to a completely separate dequeueing thread for the processing logic (see the sketch below).
The error suggests that whatever you're doing there is slow enough to stall and rebalance your consumer.
I'd also recommend a higher-level library that can handle backpressure, such as the Connect/Streams API, Vert.x, SmallRye Messaging, or Akka Streams.
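A minimal sketch of that hand-off pattern, assuming a single processing thread and made-up class/method names (an illustration of the idea, not the original application's code):

import java.time.Duration;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

public class QueueHandoffConsumer {
    private final BlockingQueue<ConsumerRecord<String, String>> queue = new LinkedBlockingQueue<>(10_000);

    public void start(Consumer<String, String> consumer) {
        // Worker thread: drains the queue and runs the slow processing off the polling thread.
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    process(queue.take()); // placeholder for the real business logic
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        // Poll loop: stays fast, so poll() is always called well within max.poll.interval.ms.
        try {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    queue.put(record); // blocks (natural back-pressure) if the worker falls behind
                }
                // Note: committing here marks queued-but-unprocessed records as consumed, so a
                // crash could skip them; track processed offsets if you need stronger guarantees.
                consumer.commitSync();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void process(ConsumerRecord<String, String> record) {
        // slow business logic goes here
    }
}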
You should set the Duration passed to Consumer#poll(Duration) lower than max.poll.interval.ms, which is the maximum amount of time the consumer can stay idle before fetching more records. From the Kafka documentation:
If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member
By the time you commit your offsets, your consumer has already been considered failed, its partitions have already been revoked, and the group is rebalancing.
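A minimal sketch of that change applied to the loop from the question (the one-second timeout is only an illustrative value; it uses the java.time.Duration overload of poll):

// max.poll.interval.ms stays at 300000, but each poll() now uses a much shorter timeout,
// so the loop comes back around and commits while the group still considers the consumer alive.
while (start) {
    ConsumerRecords<String, Message> notifications = consumer.poll(Duration.ofSeconds(1));
    if (!notifications.isEmpty()) {
        // processing.....
    }
    consumer.commitSync();
}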
To limit the batch size when using Spark Streaming, I referenced this answer.
There are about 50 million records backed up (waiting to be consumed) in Kafka.
The topic has 3 partitions:
TOPIC          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG       CONSUMER-ID  HOST  CLIENT-ID
zhihu_comment  0          10906153        28668062        17761909  -            -     -
zhihu_comment  1          10972464        30271728        19299264  -            -     -
zhihu_comment  2          10906395        28662007        17755612  -            -     -
My consumer app:
public final class SparkConsumer {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) throws Exception {
        String brokers = "device1:9092,device2:9092,device3:9092";
        String groupId = "spark";
        String topics = "zhihu_comment";

        // Create context with a certain seconds batch interval
        SparkConf sparkConf = new SparkConf().setAppName("TestKafkaStreaming");
        sparkConf.set("spark.streaming.backpressure.enabled", "true");
        sparkConf.set("spark.streaming.backpressure.initialRate", "10000");
        sparkConf.set("spark.streaming.kafka.maxRatePerPartition", "10000");
        JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(10));

        Set<String> topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
        kafkaParams.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        kafkaParams.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        kafkaParams.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        kafkaParams.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        kafkaParams.put("enable.auto.commit", true);
        kafkaParams.put("max.poll.records", "500");

        // Create direct kafka stream with brokers and topics
        JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.Subscribe(topicsSet, kafkaParams));

        // Get the lines, split them into words, count the words and print
        JavaDStream<String> lines = messages.map(ConsumerRecord::value);
        lines.count().print();

        jssc.start();
        jssc.awaitTermination();
    }
}
I have limited the consumption rate of Spark Streaming: I set maxRatePerPartition to 10000, which means it consumes 300,000 records per batch in my case (10,000 records/partition/second × 3 partitions × 10-second batch interval).
The question is: although Spark Streaming handles records at this specific rate, the current offset shown by Kafka is not the offset that Spark Streaming is actually processing. Kafka's current offset suddenly jumps to the latest offset (the lag drops to almost nothing):
TOPIC          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                      HOST            CLIENT-ID
zhihu_comment  0          28700537        28700676        139  consumer-1-ddcb0abd-e206-470d-925a-63ca4dc1d62a  /192.168.0.102  consumer-1
zhihu_comment  1          30305102        30305224        122  consumer-1-ddcb0abd-e206-470d-925a-63ca4dc1d62a  /192.168.0.102  consumer-1
zhihu_comment  2          28695033        28695146        113  consumer-1-ddcb0abd-e206-470d-925a-63ca4dc1d62a  /192.168.0.102  consumer-1
It appears that Spark Streaming does not commit the offset after each batch; it commits the latest offset right at the beginning when it starts to consume!
Is there any way to make Spark Streaming commit after each batch?
Spark Streaming log, showing the number of records consumed in each batch:
20/05/04 22:28:13 INFO scheduler.DAGScheduler: Job 15 finished: print at SparkConsumer.java:65, took 0.012606 s
-------------------------------------------
Time: 1588602490000 ms
-------------------------------------------
300000
20/05/04 22:28:13 INFO scheduler.JobScheduler: Finished job streaming job 1588602490000 ms.0 from job set of time 1588602490000 ms
You need to disable auto-commit:
kafkaParams.put("enable.auto.commit", false);
and instead commit the offsets yourself:
messages.foreachRDD(rdd -> {
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    // do some transformations and actions on the rdd here, typically like:
    rdd.foreachPartition(it -> {
        it.forEachRemaining(row -> ...);
    });
    // commit the offsets for this batch
    ((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges);
});
as described in the Spark + Kafka Integration Guide.
You could also use commitSync for synchronous commits.
I set up a Kafka consumer with this configuration:
kafkaconfig:
acks: 1
autoCommit: true
bootstrapServers: example.com:9092
topic: item
groupId: EWok-group
keyDeserializer: org.apache.kafka.common.serialization.StringDeserializer
valueDeserializer: org.apache.kafka.common.serialization.StringDeserializer
maxPollRecords: 1
pollMillisTime: 15
retries: 5
heartBeatInterval: 300
sessionTimeout: 100000
maxPollInterval: 30000
Code:
while (true) {
    try {
        ConsumerRecords<String, String> consumerRecords = eWokIntegrationConsumer.poll(Duration.of(kafkaCommConfig.getPollMillisTime(), ChronoUnit.SECONDS));
        if (!consumerRecords.isEmpty()) {
            LOG.info("Consumed Record Count: {}", consumerRecords.count());
            consumerRecords.forEach(record -> {
                System.out.printf("offset = %d, key = %s, value = %s\n", record.offset(), record.key(), record.value());
                eWokMessageProcessor.onMessage(record.value());
                eWokIntegrationConsumer.commitSync();
            });
        } else {
            LOG.info("Polling returned without any records.");
        }
    } catch (Exception exception) {
        LOG.error("Consumer was interrupted. But still continue to poll. Exception:", exception);
        eWokIntegrationConsumer.close();
    }
}
Processing the data received from the Kafka consumer takes about 10000 ms, but I am getting an exception saying:
java.lang.IllegalStateException: This consumer has already been closed.
Exception Logs
java.lang.IllegalStateException: This consumer has already been closed.
at org.apache.kafka.clients.consumer.KafkaConsumer.acquireAndEnsureOpen(KafkaConsumer.java:2202)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1332)
at org.apache.kafka.clients.consumer.KafkaConsumer.commitSync(KafkaConsumer.java:1298)
Kafka version : kafka-clients-2.0.1
Could anyone please suggest what the Kafka consumer configuration should look like?
I had put System.exit(0) in another place in the source code. That is why the consumer left the group and was marked as closed.
I removed System.exit(0) from the source code, and now it's working fine.
I have a Java application with the properties below:
kafkaProperties = new Properties();
kafkaProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaBrokersList);
kafkaProperties.put(ConsumerConfig.GROUP_ID_CONFIG, consumerGroupName);
kafkaProperties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
kafkaProperties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
kafkaProperties.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, consumerSessionTimeoutMs);
kafkaProperties.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, maxPartitionFetchBytes);
kafkaProperties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
I've created 15 consumer threads and let them process the runnable below. I don't have any other consumers consuming with this consumer group name.
@Override
public void run() {
    try {
        logger.info("Starting ConsumerWorker, consumerId={}", consumerId);
        consumer.subscribe(Arrays.asList(kafkaTopic), offsetLoggingCallback);
        while (true) {
            boolean isPollFirstRecord = true;
            logger.debug("consumerId={}; about to call consumer.poll() ...", consumerId);
            ConsumerRecords<String, String> records = consumer.poll(pollIntervalMs);
            Map<Integer, Long> partitionOffsetMap = new HashMap<>();
            for (ConsumerRecord<String, String> record : records) {
                if (isPollFirstRecord) {
                    isPollFirstRecord = false;
                    logger.info("Start offset for partition {} in this poll : {}", record.partition(), record.offset());
                }
                messageProcessor.processMessage(record.value(), record.offset());
                partitionOffsetMap.put(record.partition(), record.offset());
            }
            if (!records.isEmpty()) {
                logger.info("Invoking commit for partition/offset : {}", partitionOffsetMap);
                consumer.commitAsync(offsetLoggingCallback);
            }
        }
    } catch (WakeupException e) {
        logger.warn("ConsumerWorker [consumerId={}] got WakeupException - exiting ... Exception: {}",
                consumerId, e.getMessage());
    } catch (Exception e) {
        logger.error("ConsumerWorker [consumerId={}] got Exception - exiting ... Exception: {}",
                consumerId, e.getMessage());
    } finally {
        logger.warn("ConsumerWorker [consumerId={}] is shutting down ...", consumerId);
        consumer.close();
    }
}
I also have an OffsetCommitCallbackImpl like the one below. It basically maintains the partitions and their committed offsets as a map. It also logs whenever an offset is committed.
@Override
public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets, Exception exception) {
    if (exception == null) {
        offsets.forEach((topicPartition, offsetAndMetadata) -> {
            partitionOffsetMap.put(topicPartition, offsetAndMetadata);
            logger.info("Offset position during the commit for consumerId : {}, partition : {}, offset : {}", Thread.currentThread().getName(), topicPartition.partition(), offsetAndMetadata.offset());
        });
    } else {
        offsets.forEach((topicPartition, offsetAndMetadata) -> logger.error("Offset commit error, and partition offset info : {}, partition : {}, offset : {}", exception.getMessage(), topicPartition.partition(), offsetAndMetadata.offset()));
    }
}
Problem/Issue:
I noticed that I miss events/messages whenever I restart (bring the application down and bring it back up). So I looked closely at the logging: by comparing the offsets committed before shutdown (using the OffsetCommitCallback logging) against the offsets picked up for processing after restart, I can see that for certain partitions we do not pick up from the offset where we left off before shutdown. Sometimes the start offset for certain partitions is about 1000 higher than the committed offset.
NOTE: This happens to about 8 out of 40 partitions.
If you look closely at the logging in the run method, there is one log statement where I print the offset before invoking the async commit. For example, if that last log before shutdown shows 10 for partition 1, then after restart the first offset we process for partition 1 would be something like 100, and I validated that we are missing exactly 90 messages.
Can anyone think of a reason why this would be happening?