Consider the following real (obfuscated) logs:
19:33:48,409 99733391 (pool-6-thread-11) ERROR [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] [] [Consumer clientId=app2.maria1.mcdonalnds_service_msg, groupId=mcdonalnds_service_msg] Offset commit failed on partition service_megaman_mt-mcdonalnds_service_msg-1 at offset 75796: This is not the correct coordinator.
19:33:48,410 99733392 (pool-6-thread-11) INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [] [Consumer clientId=app2.maria1.mcdonalnds_service_msg, groupId=mcdonalnds_service_msg] Group coordinator kafka1.maria4.internal:9092 (id: 2147483646 rack: null) is unavailable or invalid, will attempt rediscovery
19:33:48,414 99733396 (kafka-producer-network-thread | producer-1) WARN [org.apache.kafka.clients.producer.internals.Sender] [] [Producer clientId=producer-1] Got error produce response with correlation id 16386 on topic-partition service_megaman_mo-mcdonalnds_service_msg-1, retrying (99 attempts left). Error: NOT_LEADER_FOR_PARTITION
19:33:48,510 99733492 (pool-6-thread-11) INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [] [Consumer clientId=app2.maria1.mcdonalnds_service_msg, groupId=mcdonalnds_service_msg] Discovered group coordinator kafka3.maria4.internal:9092 (id: 2147483644 rack: null)
19:33:48,528 99733510 (pool-6-thread-11) ERROR [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] [] [Consumer clientId=app2.maria1.mcdonalnds_service_msg, groupId=mcdonalnds_service_msg] Offset commit failed on partition service_megaman_mt-mcdonalnds_service_msg-1 at offset 75796: The coordinator is not aware of this member.
19:33:48,528 99733510 (pool-6-thread-11) ERROR [com.bob.kafka.consumer.ListenableKafkaConsumer] [] Aborting consumer [mcdonalnds_service_msg] for topics [[service_megaman_mt-mcdonalnds_service_msg]] operation due to failure! Cause:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
As far as I understand, the exception message about poll() is not really the cause. So what happened was:
1. Coordinator was not available
2. Consumer found new coordinator
3. New coordinator did not recognise offset so rejected the commit
What I am trying to figure out are the options for recovering from this situation. This is not an intermittent issue; it happened once in a year, so tuning the poll settings would not have helped when the leader died.
What happens now: the original application code was simply closing the consumers, which is wrong; it caused alerts and woke up just about everyone because the application stopped consuming messages :-)
What I want to happen:
The consumer is restarted and does not die if it loses the connection to the coordinator.
What I am not sure about:
Why the coordinator is not aware of this member.
Whether I understand the issue correctly. :-)
On the service side, with the Java Kafka client's KafkaConsumer class, should I call close() and subscribe(), or unsubscribe() and subscribe(), to fulfil my consumer recovery scenario? And what is going to happen to the processed offset that was rejected by the new coordinator? Since the offset was not committed, I assume the consumer will re-read the same messages?
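For reference, here is a minimal sketch of the recovery loop I have in mind, assuming the plain Java KafkaConsumer API; the bootstrap address, group and processing step are placeholders, and whether this fits our ListenableKafkaConsumer wrapper is exactly what I am unsure about. The idea: instead of closing the consumer on CommitFailedException, swallow it and keep polling, so the next poll() re-joins the group and the uncommitted records are redelivered.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.CommitFailedException;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ResilientConsumerLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1.maria4.internal:9092"); // placeholder
        props.put("group.id", "mcdonalnds_service_msg");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("service_megaman_mt-mcdonalnds_service_msg"));
            while (true) {
                try {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    // handle(records);  // application-specific processing (placeholder)
                    consumer.commitSync();
                } catch (CommitFailedException e) {
                    // The group has already rebalanced and revoked our partitions, so this
                    // commit is lost. Do not close the consumer: the next poll() re-joins
                    // the group and the uncommitted records are redelivered (re-processed).
                }
            }
        }
    }
}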
The following post for spring-kafka looks like a very similar issue, but the service does not use Spring, so it is of limited use to me.
According to the Kafka documentation:
"Kafka provides the guarantee that a topic-partition is assigned to only one consumer within a group."
But I'm observing different behaviour in my service. Here are some details:
I'm using Kafka 2.8 and spring-kafka 2.2.13.
Initially I had one Kafka topic, topic.1, with 5 partitions. This topic was consumed in my service using Spring's @KafkaListener annotation and a ConcurrentKafkaListenerContainerFactory with concurrency == 5. This configuration worked fine for me.
Later I started consuming topic.2 (3 partitions) in the same service, using the same ConcurrentKafkaListenerContainerFactory and the same group ID.
For some time it also worked correctly, but during one of the rebalances one of the consumer threads was not included in the rebalancing process and, for some reason, continued processing its previously assigned partition. That is, a new consumer was assigned to that partition while the old consumer kept processing it, so the same records were processed twice. In the logs I can see that records from that partition were consumed and processed twice, in two different consumer threads, over several days.
After a service restart the consumers were assigned correctly again.
Here is my code, ConcurrentKafkaListenerContainerFactory bean creation:
@Configuration
public class Config {

    @Autowired
    private KafkaProperties kafkaProperties;

    @Bean
    public ConcurrentKafkaListenerContainerFactory<Object, Object> transactionalKafkaContainerFactory(
            ConcurrentKafkaListenerContainerFactoryConfigurer configurer,
            KafkaTransactionManager kafkaTransactionManager,
            ConsumerFactory<Object, Object> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<Object, Object> factory = new ConcurrentKafkaListenerContainerFactory<>();
        configurer.configure(factory, consumerFactory);
        // If an exception is thrown, then we want to seek back for the whole batch
        factory.setBatchErrorHandler(new FixedBackoffSeekToCurrentBatchErrorHandler());
        // Enable transactional consumer for exactly-once processing
        factory.getContainerProperties().setTransactionManager(kafkaTransactionManager);
        // Enable batch processing
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.BATCH);
        factory.setBatchListener(true);
        factory.setConcurrency(5);
        return factory;
    }

    @Bean
    public KafkaTransactionManager<?, ?> kafkaTransactionManager(
            @Qualifier("transactionalProducerFactory") ProducerFactory producerFactory) {
        return new KafkaTransactionManager(producerFactory);
    }

    @Bean
    public ProducerFactory<?, ?> transactionalProducerFactory() {
        DefaultKafkaProducerFactory<?, ?> producerFactory = new DefaultKafkaProducerFactory<>(
                kafkaProperties.buildProducerProperties()
        );
        producerFactory.setTransactionIdPrefix("tx-");
        return producerFactory;
    }
}
The @KafkaListener methods:

@Service
public class Processor {

    @KafkaListener(topics = "topic.1",
            groupId = "my-group",
            containerFactory = "transactionalKafkaContainerFactory")
    public void listener1(List<ConsumerRecord<String, MyObject>> records) {
        // process records
    }

    // Listener for topic.2, added later
    @KafkaListener(topics = "topic.2",
            groupId = "my-group",
            containerFactory = "transactionalKafkaContainerFactory")
    public void listener2(List<ConsumerRecord<String, MyObject>> records) {
        // process records
    }
}
Here are some logs. One of the consumers (consumer-7) was not included in the rebalancing process and continued consuming its old partition (topic.1-3), while a new consumer (consumer-4) was assigned to the same partition:
03/30/2022 10:23:47.484 [Consumer clientId=consumer-8, groupId=my-group] Setting newly assigned partitions [topic.1-1, topic.1-0]
03/30/2022 10:23:47.484 [Consumer clientId=consumer-4, groupId=my-group] Setting newly assigned partitions [topic.1-3]
03/30/2022 10:23:47.484 [Consumer clientId=consumer-2, groupId=my-group] Setting newly assigned partitions [topic.2-0]
03/30/2022 10:23:47.484 [Consumer clientId=consumer-6, groupId=my-group] Setting newly assigned partitions [topic.1-4]
03/30/2022 10:23:47.483 [Consumer clientId=consumer-5, groupId=my-group] Setting newly assigned partitions [topic.2-2]
03/30/2022 10:23:47.483 [Consumer clientId=consumer-1, groupId=my-group] Setting newly assigned partitions [topic.1-2]
03/30/2022 10:23:47.483 [Consumer clientId=consumer-3, groupId=my-group] Setting newly assigned partitions [topic.2-1]
...
03/30/2022 10:53:55.728 [Consumer clientId=consumer-7, groupId=my-group] Discovered group coordinator ... (id: ... rack: null)
03/30/2022 10:53:55.627 [Consumer clientId=consumer-7, groupId=my-group] Group coordinator ... (id: ... rack: null) is unavailable or invalid, will attempt rediscovery
03/30/2022 10:53:55.627 [Consumer clientId=consumer-7, groupId=my-group] Discovered group coordinator ... (id: ... rack: null)
03/30/2022 10:53:55.507 [Consumer clientId=consumer-7, groupId=my-group] Group coordinator ... (id: ... rack: null) is unavailable or invalid, will attempt rediscovery
03/30/2022 10:53:55.507 [Consumer clientId=consumer-7, groupId=my-group] Discovered group coordinator ... (id: ... rack: null)
Example of processing the same record twice (domain specific details are excluded):
03/30/2022 11:55:05.144 PM +0300 Processing payload -> EventId: 289f43b4-b07b-4a1f-b768-0453e0c42719, Topic: topic.1, Partition: 3, Offset: 10903844 org.springframework.kafka.KafkaListenerEndpointContainer#1-4-C-1
03/30/2022 11:55:05.143 PM +0300 Processing payload -> EventId: 289f43b4-b07b-4a1f-b768-0453e0c42719, Topic: topic.1, Partition: 3, Offset: 10903844 org.springframework.kafka.KafkaListenerEndpointContainer#1-0-C-1
Note: business logic details are not important here, so I used simple names: topic.1, Processor, my-group, etc.
My question is: how can this behaviour be explained? Why was consumer-7 able to consume messages from its old partition after the rebalancing?
Could adding a new @KafkaListener with the same group ID cause this issue? (At least I didn't see this behaviour when I had only one @KafkaListener in my service.)
I don't see any evidence of your assumption in those logs.
One possibility, though, is that consumer-7 took too long to process the records from its poll and was forcibly evicted from the group (until the next poll). So, yes, that is possible; it will continue to process any records from the previous poll until the next poll, when it will detect the rebalance. You should ensure that you can process max.poll.records within max.poll.interval.ms to avoid such situations.
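As an illustration only (this is not your code, and the numbers are arbitrary placeholders, not recommendations), those two properties can be set on the consumer configuration behind the ConsumerFactory, for example when building it from Boot's KafkaProperties:

@Bean
public ConsumerFactory<Object, Object> consumerFactory(KafkaProperties kafkaProperties) {
    // Hypothetical tuning sketch; the values below are placeholders.
    Map<String, Object> props = kafkaProperties.buildConsumerProperties();
    props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);          // fewer records per poll()
    props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600_000);  // more time allowed between polls
    return new DefaultKafkaConsumerFactory<>(props);
}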
In any case, it is not good practice to use the same group in this situation, because a rebalance on one topic causes an unnecessary rebalance on the other topic, which is undesirable.
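For example (a sketch only; the group names here are made up, not from your code), giving each listener its own group keeps a rebalance on topic.2 from disturbing topic.1's assignments:

@KafkaListener(topics = "topic.1",
        groupId = "my-group-topic1",   // hypothetical separate group per topic
        containerFactory = "transactionalKafkaContainerFactory")
public void listener1(List<ConsumerRecord<String, MyObject>> records) {
    // process records
}

@KafkaListener(topics = "topic.2",
        groupId = "my-group-topic2",   // hypothetical separate group per topic
        containerFactory = "transactionalKafkaContainerFactory")
public void listener2(List<ConsumerRecord<String, MyObject>> records) {
    // process records
}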
You should also upgrade to a supported spring-kafka version; 2.2.x is long out of support.
https://spring.io/projects/spring-kafka#support
The current version is 2.8.5; 2.7.x goes out of OSS support soon.
I have a Kafka Streams application, with kafka-streams and kafka-clients both at 2.4.0, and the following configs:
properties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
properties.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
brokers = ip1:port1,ip2:port2,ip3:port3
topic partitions: 3
topic replication: 3
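For context, this is roughly how the app is wired up (a simplified sketch; the application id, serdes and topology below are placeholders, not my real ones):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class MyStreamApp {
    public static void main(String[] args) {
        String brokers = "ip1:port1,ip2:port2,ip3:port3";                // placeholder broker list
        Properties properties = new Properties();
        properties.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app1");  // placeholder app id
        properties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
        properties.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        properties.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        properties.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");                // placeholder topology

        KafkaStreams streams = new KafkaStreams(builder.build(), properties);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}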
Scenario 1: I start only 2 brokers (the stream app still lists all three broker IPs in its bootstrap setting), and when I start my stream app the following error occurs:
2020-02-13 13:28:19.711 WARN 18756 --- [-1-0_0-producer] org.apache.kafka.clients.NetworkClient : [Producer clientId=my-app1-a4c8867f-b914-49bb-bc58-203349700828-StreamThread-1-0_0-producer, transactionalId=my-app1-0_0] Connection to node -2 (/ip2:port2) could not be established. Broker may not be available.
and later, after 1 minute:
org.apache.kafka.streams.errors.StreamsException: stream-thread [my-app1-a4c8867f-b914-49bb-bc58-203349700828-StreamThread-1] Failed to rebalance.
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:852)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:743)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:698)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:671)
Caused by: org.apache.kafka.streams.errors.StreamsException: stream-thread [my-app1-a4c8867f-b914-49bb-bc58-203349700828-StreamThread-1] task [0_0] Failed to initialize task 0_0 due to timeout.
at org.apache.kafka.streams.processor.internals.StreamTask.initializeTransactions(StreamTask.java:966)
at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:254)
at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:176)
at org.apache.kafka.streams.processor.internals.StreamThread$TaskCreator.createTask(StreamThread.java:355)
at org.apache.kafka.streams.processor.internals.StreamThread$TaskCreator.createTask(StreamThread.java:313)
at org.apache.kafka.streams.processor.internals.StreamThread$AbstractTaskCreator.createTasks(StreamThread.java:298)
at org.apache.kafka.streams.processor.internals.TaskManager.addNewActiveTasks(TaskManager.java:160)
at org.apache.kafka.streams.processor.internals.TaskManager.createTasks(TaskManager.java:120)
at org.apache.kafka.streams.processor.internals.StreamsRebalanceListener.onPartitionsAssigned(StreamsRebalanceListener.java:77)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.invokePartitionsAssigned(ConsumerCoordinator.java:272)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:400)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:421)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:340)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:471)
at org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1267)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1231)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:843)
... 3 common frames omitted
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
I was testing high-availability scenarios. I think Kafka should still work, as the replicas are present on the two running brokers (I have checked using a Kafka GUI tool).
Scenario 2: Today I noticed that when I start only 2 brokers and give only the IPs of these two brokers (i.e. the stream app only has the IPs of the two working brokers), I get the same failure:
2020-02-16 16:18:24.818 INFO 5741 --- [-StreamThread-1] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=my-app-0a357371-525b-46cf-9fe1-34ee94fa4158-StreamThread-1-consumer, groupId=my-app] Group coordinator ip2:port2 (id: 2147483644 rack: null) is unavailable or invalid, will attempt rediscovery
2020-02-16 16:18:24.818 ERROR 5741 --- [-StreamThread-1] o.a.k.s.p.internals.StreamThread : stream-thread [my-app-0a357371-525b-46cf-9fe1-34ee94fa4158-StreamThread-1] Encountered the following unexpected Kafka exception during processing, this usually indicate Streams internal errors:
org.apache.kafka.streams.errors.StreamsException: stream-thread [my-app-0a357371-525b-46cf-9fe1-34ee94fa4158-StreamThread-1] Failed to rebalance.
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:852)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:743)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:698)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:671)
Caused by: org.apache.kafka.streams.errors.StreamsException: stream-thread [my-app-0a357371-525b-46cf-9fe1-34ee94fa4158-StreamThread-1] task [0_0] Failed to initialize task 0_0 due to timeout.
at org.apache.kafka.streams.processor.internals.StreamTask.initializeTransactions(StreamTask.java:966)
at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:254)
at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:176)
at org.apache.kafka.streams.processor.internals.StreamThread$TaskCreator.createTask(StreamThread.java:355)
at org.apache.kafka.streams.processor.internals.StreamThread$TaskCreator.createTask(StreamThread.java:313)
at org.apache.kafka.streams.processor.internals.StreamThread$AbstractTaskCreator.createTasks(StreamThread.java:298)
at org.apache.kafka.streams.processor.internals.TaskManager.addNewActiveTasks(TaskManager.java:160)
at org.apache.kafka.streams.processor.internals.TaskManager.createTasks(TaskManager.java:120)
at org.apache.kafka.streams.processor.internals.StreamsRebalanceListener.onPartitionsAssigned(StreamsRebalanceListener.java:77)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.invokePartitionsAssigned(ConsumerCoordinator.java:272)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:400)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:421)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:340)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:471)
at org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1267)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1231)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:843)
... 3 common frames omitted
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
Note: this is not the case if I don't set EXACTLY_ONCE in the properties. Then it works as intended.
I tried increasing retries and the backoff max ms, but it didn't help.
Can anyone explain what I am missing?
Logs of broker 2 when broker 1 is down:
[2020-02-17 02:29:00,302] INFO [ReplicaFetcher replicaId=1, leaderId=3, fetcherId=0] Retrying leaderEpoch request for partition __consumer_offsets-36 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
The Kafka logs are filled with the above line.
Now one major observation:
When I turn off broker 2 (i.e. brokers 1 and 3 are running), my stream application runs fine.
My app shuts down only when broker 1 is down. I'm guessing some critical information that should be distributed between all brokers is only stored on broker 1.
When using Kafka, I intermittently get two network-related errors:
1. Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest: Connection to broker was disconnected before the response was read
2. Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest: Connection to broker1 (id: 1 rack: null) failed
[Configuration environment]
Brokers: 5 / server.properties: "kafka_manager_heap_s=1g", "kafka_manager_heap_x=1g", "offsets.commit.required.acks=1", "offsets.commit.timeout.ms=5000"; most settings are the default.
Zookeepers: 3
Servers: 5
Kafka: 0.10.1.2
Zookeeper: 3.4.6
Both of these errors are caused by a loss of network communication.
When these errors occur, Kafka expands or shrinks the ISR of the partition several times.
expanding-ex) INFO Partition [my-topic,7] on broker 1: Expanding ISR for partition [my-topic,7] from 1,2 to 1,2,3
shrinking-ex) INFO Partition [my-topic,7] on broker 1: Shrinking ISR for partition [my-topic,7] from 1,2,3 to 1,2
I understand that these errors are caused by network problems, but I'm not sure why the break in the network is occurring.
And if this network disconnection persists, I get the following additional error:
Error when handling request {topics=null} java.lang.OutOfMemoryError: Java heap space
I wonder what causes these errors and how I can improve this?
The network error tells you that one of the brokers is not running, which means the client cannot connect to it. In my experience, the minimum heap size you should assign is 2 GB.
As I understand it, one of the brokers is selected as the group coordinator, which takes care of consumer rebalancing.
Discovered coordinator host:9092 (id: 2147483646 rack: null) for group good_group
I have 3 nodes with a replication factor of 3 and 3 partitions.
Everything is great: when I kill Kafka on the non-coordinator nodes, the consumer still receives messages.
But when I kill the specific node hosting the coordinator, rebalancing does not happen and my Java consumer app does not receive any messages.
2018-05-29 16:34:22.668 INFO AbstractCoordinator:555 - Discovered coordinator host:9092 (id: 2147483646 rack: null) for group good_group.
2018-05-29 16:34:22.689 INFO AbstractCoordinator:600 - Marking the coordinator host:9092 (id: 2147483646 rack: null) dead for group good_group
2018-05-29 16:34:22.801 INFO AbstractCoordinator:555 - Discovered coordinator host:9092 (id: 2147483646 rack: null) for group good_group.
2018-05-29 16:34:22.832 INFO AbstractCoordinator:600 - Marking the coordinator host:9092 (id: 2147483646 rack: null) dead for group good_group
2018-05-29 16:34:22.933 INFO AbstractCoordinator:555 - Discovered coordinator host:9092 (id: 2147483646 rack: null) for group good_group.
2018-05-29 16:34:23.044 WARN ConsumerCoordinator:535 - Auto offset commit failed for group good_group: Offset commit failed with a retriable exception. You should retry committing offsets.
Am I doing something wrong and is there a way around this?
But when I kill the specific node hosting the coordinator, rebalancing does not happen and my Java consumer app does not receive any messages.
The group coordinator receives heartbeats from all consumers in the consumer group. It maintains a list of active consumers and initiates a rebalance when this list changes; the group leader then executes the rebalance.
That's why rebalancing stops if you kill the group coordinator.
UPDATE
When the group coordinator broker shuts down, ZooKeeper is notified and an election automatically promotes a new group coordinator from the active brokers. So the problem has nothing to do with the group coordinator itself. Let's look at the log:
2018-05-29 16:34:23.044 WARN ConsumerCoordinator:535 - Auto offset commit failed for group good_group: Offset commit failed with a retriable exception. You should retry committing offsets.
The replication factor of the internal topic __consumer_offsets probably has the default value 1. Can you check what values default.replication.factor and offsets.topic.replication.factor have in the server.properties files? If the value is the default of 1, it should be changed to a bigger one. Otherwise, when the broker acting as group coordinator shuts down, the offset manager stops without a backup, so offsets can no longer be committed.
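If you prefer to check the live cluster rather than the config files, a rough sketch with the Java AdminClient would look like this (the bootstrap address is a placeholder):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class CheckOffsetsTopicReplication {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "host:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin
                    .describeTopics(Collections.singleton("__consumer_offsets"))
                    .values().get("__consumer_offsets").get();
            // All partitions of __consumer_offsets share the same replication factor
            int replicationFactor = description.partitions().get(0).replicas().size();
            System.out.println("__consumer_offsets replication factor: " + replicationFactor);
        }
    }
}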
I'm working with Kafka 0.8 and ZooKeeper 3.3.5. We have a dozen topics that we consume without any issue.
Recently, we started to feed and consume a new topic that shows weird behavior: the consumed offset was suddenly reset. It respects the auto.offset.reset policy we set (smallest, in fact), but I cannot understand why the offset of this topic was suddenly reset.
I'm using the high-level consumer.
Here are some ERROR logs I found.
We have a bunch of these:
[2015-03-26 05:21:17,789] INFO Fetching metadata from broker id:1,host:172.16.23.1,port:9092 with correlation id 47 for 1 topic(s) Set(MyTopic) (kafka.client.ClientUtils$)
[2015-03-26 05:21:17,789] ERROR Producer connection to 172.16.23.1:9092 unsuccessful (kafka.producer.SyncProducer)
java.nio.channels.ClosedByInterruptException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:681)
at kafka.network.BlockingChannel.connect(BlockingChannel.scala:57)
at kafka.producer.SyncProducer.connect(SyncProducer.scala:141)
at kafka.producer.SyncProducer.getOrMakeConnection(SyncProducer.scala:156)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:68)
at kafka.producer.SyncProducer.send(SyncProducer.scala:112)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:53)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:88)
at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
Each time this issue happens, I see this WARN log:
[2015-03-26 05:21:30,596] WARN Reconnect due to socket error: null (kafka.consumer.SimpleConsumer)
And then the real problem happens:
[2015-03-26 05:21:47,551] INFO Connected to 172.16.23.5:9092 for producing (kafka.producer.SyncProducer)
[2015-03-26 05:21:47,552] INFO Disconnecting from 172.16.23.5:9092 (kafka.producer.SyncProducer)
[2015-03-26 05:21:47,553] INFO [ConsumerFetcherManager-1427047649942] Added fetcher for partitions ArrayBuffer([[MyTopic,0], initOffset 45268422051 to broker id:5,host:172.16.23.5,port:9092] ) (kafka.consumer.ConsumerFetcherManager)
[2015-03-26 05:21:47,553] INFO [ConsumerFetcherThread-MyTopic_group-1427047649884-699191d4-0-5], Starting (kafka.consumer.ConsumerFetcherThread)
[2015-03-26 05:21:50,388] ERROR [ConsumerFetcherThread-MyTopic_group-1427047649884-699191d4-0-5], Current offset 45268422051 for partition [MyTopic,0] out of range; reset offset to 1948447612 (kafka.consumer.ConsumerFetcherThread)
[2015-03-26 05:21:50,490] ERROR [ConsumerFetcherThread-MyTopic_group-1427047649884-699191d4-0-5], Current offset 1948447612 for partition [MyTopic,0] out of range; reset offset to 1948447612 (kafka.consumer.ConsumerFetcherThread)
[2015-03-26 05:21:50,591] ERROR [ConsumerFetcherThread-MyTopic_group-1427047649884-699191d4-0-5], Current offset 1948447612 for partition [MyTopic,0] out of range; reset offset to 1948447612 (kafka.consumer.ConsumerFetcherThread)
[2015-03-26 05:21:50,692] ERROR [ConsumerFetcherThread-MyTopic_group-1427047649884-699191d4-0-5], Current offset 1948447612 for partition [MyTopic,0] out of range; reset offset to 1948447612 (kafka.consumer.ConsumerFetcherThread)
Now the questions:
Has anyone already experienced this behavior?
Can someone tell me when Kafka decides to reset its offset, whether auto.offset.reset is largest or smallest?
Thank you.
What's happening is that you have been draining your topic too slowly for a while.
Kafka's retention model is not based on whether the consumer got the data, but on disk usage and/or a time period. At some point you fall too far behind, and the next message you need has already been wiped out and is no longer available, because Kafka has cleaned up the data. Hence the "Current offset 45268422051 for partition [MyTopic,0] out of range; reset offset to 1948447612" messages.
Your consumer then applies your reset policy to bootstrap itself again, in your case smallest.
It's a common issue when you have bursty workflows and sometimes fall outside the data retention range. It probably disappeared because you improved your draining speed, or increased your retention enough to survive the bursts.
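If you ever move to the newer Java consumer (this API does not exist in the 0.8 client you are on, so treat it purely as a sketch under that assumption), you can monitor how close your committed offset is to the start of the log and alert before retention overtakes it:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class RetentionMarginCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "172.16.23.1:9092"); // placeholder
        props.put("group.id", "MyTopic_group");              // placeholder group
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        TopicPartition tp = new TopicPartition("MyTopic", 0);
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            OffsetAndMetadata committed = consumer.committed(tp);
            long earliest = consumer.beginningOffsets(Collections.singletonList(tp)).get(tp);
            if (committed == null) {
                System.out.println("No committed offset yet for " + tp);
            } else {
                // How far the committed offset is ahead of the oldest retained message;
                // if this reaches 0, the next fetch will be "out of range".
                System.out.println("Retention margin for " + tp + ": " + (committed.offset() - earliest));
            }
        }
    }
}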