I'm working with Kafka 0.8 and ZooKeeper 3.3.5. We have about a dozen topics that we consume without any issue.
Recently, we started to feed and consume a new topic that shows a weird behavior: the consumed offset was suddenly reset. It respects the auto.offset.reset policy we set (smallest, in our case), but I cannot understand why the offset was suddenly reset.
I'm using the high-level consumer.
Here are some ERROR logs I found.
We get a bunch of entries like this one:
[2015-03-26 05:21:17,789] INFO Fetching metadata from broker id:1,host:172.16.23.1,port:9092 with correlation id 47 for 1 topic(s) Set(MyTopic) (kafka.client.ClientUtils$)
[2015-03-26 05:21:17,789] ERROR Producer connection to 172.16.23.1:9092 unsuccessful (kafka.producer.SyncProducer)
java.nio.channels.ClosedByInterruptException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:681)
at kafka.network.BlockingChannel.connect(BlockingChannel.scala:57)
at kafka.producer.SyncProducer.connect(SyncProducer.scala:141)
at kafka.producer.SyncProducer.getOrMakeConnection(SyncProducer.scala:156)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:68)
at kafka.producer.SyncProducer.send(SyncProducer.scala:112)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:53)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:88)
at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
Each time this issue happens, I see this WARN log:
[2015-03-26 05:21:30,596] WARN Reconnect due to socket error: null (kafka.consumer.SimpleConsumer)
And then the real problem happens:
[2015-03-26 05:21:47,551] INFO Connected to 172.16.23.5:9092 for producing (kafka.producer.SyncProducer)
[2015-03-26 05:21:47,552] INFO Disconnecting from 172.16.23.5:9092 (kafka.producer.SyncProducer)
[2015-03-26 05:21:47,553] INFO [ConsumerFetcherManager-1427047649942] Added fetcher for partitions ArrayBuffer([[MyTopic,0], initOffset 45268422051 to broker id:5,host:172.16.23.5,port:9092] ) (kafka.consumer.ConsumerFetcherManager)
[2015-03-26 05:21:47,553] INFO [ConsumerFetcherThread-MyTopic_group-1427047649884-699191d4-0-5], Starting (kafka.consumer.ConsumerFetcherThread)
[2015-03-26 05:21:50,388] ERROR [ConsumerFetcherThread-MyTopic_group-1427047649884-699191d4-0-5], Current offset 45268422051 for partition [MyTopic,0] out of range; reset offset to 1948447612 (kafka.consumer.ConsumerFetcherThread)
[2015-03-26 05:21:50,490] ERROR [ConsumerFetcherThread-MyTopic_group-1427047649884-699191d4-0-5], Current offset 1948447612 for partition [MyTopic,0] out of range; reset offset to 1948447612 (kafka.consumer.ConsumerFetcherThread)
[2015-03-26 05:21:50,591] ERROR [ConsumerFetcherThread-MyTopic_group-1427047649884-699191d4-0-5], Current offset 1948447612 for partition [MyTopic,0] out of range; reset offset to 1948447612 (kafka.consumer.ConsumerFetcherThread)
[2015-03-26 05:21:50,692] ERROR [ConsumerFetcherThread-MyTopic_group-1427047649884-699191d4-0-5], Current offset 1948447612 for partition [MyTopic,0] out of range; reset offset to 1948447612 (kafka.consumer.ConsumerFetcherThread)
Now the questions:
Has anyone experienced this behavior before?
Can anyone tell me under what conditions Kafka decides to reset its offset, whether auto.offset.reset is largest or smallest?
Thank you.
What's happening is that you drained your topic too slowly for a while.
Kafka's retention model is not based on whether a consumer got the data, but on disk usage and/or age. At some point you fall too far behind, and the next message you need has already been wiped out by Kafka's cleanup, so it isn't available anymore. Hence the Current offset 45268422051 for partition [MyTopic,0] out of range; reset offset to 1948447612 messages.
Your consumer then applies your reset policy to bootstrap itself again, in your case smallest.
It's a common issue when you have bursty workloads and sometimes fall out of the data retention range. It probably disappeared because you improved your consumption speed, or increased your retention settings enough to survive the bursts.
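For reference, the reset policy lives in the consumer configuration. A minimal sketch with the 0.8 high-level consumer, assuming placeholder ZooKeeper address and group id:

import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.javaapi.consumer.ConsumerConnector;

public class OffsetResetExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");  // placeholder ZooKeeper address
        props.put("group.id", "my-consumer-group");  // placeholder group id
        // Only consulted when the stored offset is missing or out of range:
        // "smallest" restarts from the oldest retained message, "largest" skips to new ones.
        props.put("auto.offset.reset", "smallest");

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        // ... create message streams and consume as usual ...
    }
}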
Related
I recently tried to switch my subscriptions in GCP Pub/Sub to the "exactly-once" delivery strategy. However, I started seeing the following warnings ~10 times every 30 minutes in my application logs:
com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Some acknowledgement ids in the request were invalid. This could be because the acknowledgement ids have expired or the acknowledgement ids were malformed.
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:92)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:98)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:67)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1041)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1215)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:771)
at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:574)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:544)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:535)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:744)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Some acknowledgement ids in the request were invalid. This could be because the acknowledgement ids have expired or the acknowledgement ids were malformed.
at io.grpc.Status.asRuntimeException(Status.java:535)
... 17 more
They're immediately followed by the following INFO log messages in the same thread:
Permanent error invalid ack id message, will not resend
I didn't see any problems caused by these warnings, but it's a bit hard to tell because my application is processing a decent number of messages (~1000/hour). I initially thought that these warnings were just short-term "aftershocks" from switching to the "exactly-once" strategy. However, I waited for about 2 hours and they kept occurring with the same frequency, showing no sign of disappearing. I then disabled the "exactly-once" strategy and they went away immediately after. Can anyone tell me whether these warnings are dangerous, what side effects I can expect, and most importantly how I can get rid of them?
I'm using version 3.4.0 of spring-cloud-gcp-dependencies and spring-cloud-gcp-starter-pubsub. I'm also using Spring Cloud Stream to process the incoming messages and I rely on it to automatically acknowledge the messages.
I have the following configuration set in my application.yaml file:
spring:
  cloud:
    gcp:
      pubsub:
        subscriber:
          executor-threads: 15
          max-ack-extension-period: 23400 # 6 hours and 30 minutes
          acknowledgement-deadline: 600 # Maximum value
For context: The messages in my application represent jobs for execution, and they could take quite a while to finish - hence the 6h30m maximum acknowledgement extension period.
I also saw the following StackOverflow question:
How to handle errors during message acknowledgement using google pubsub java library?
From what I understand, the consequence of these warnings is that the messages will be redelivered to my application, but this is exactly what I want to avoid.
Thanks for the question, Alexander.
The errors you are seeing happen when a modifyAckDeadline or Acknowledgement request to the service fails because the acknowledgement id has already expired. In such cases, the service considers the expired acknowledgement id invalid, since a newer delivery might already be in flight. This is as per the guarantees of exactly-once delivery. There could be a few reasons for it:
The request was delayed due to network delays, and by the time it arrived at the server, the acknowledgement id lease had already expired.
The task issuing the modifyAckDeadline or Acknowledgement requests is overwhelmed (high CPU/memory/network usage), leading to delays in issuing those requests.
I suggest setting min-duration-per-ack-extension to a high number to reduce the issues mentioned above. This helps mitigate cases where the acknowledgement id lease expires. The highest value you can set for this field is 600 seconds.
Additionally, as mentioned in the other Stack Overflow question, you should consider checking the responses of your acknowledgement operations. This can tell your application whether it should expect a redelivery. Sample.
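If you drop down to the plain Google Pub/Sub client (rather than the Spring Cloud Stream auto-ack path), checking the ack response looks roughly like this. A sketch only, with a placeholder project, subscription, and message handler:

import java.util.concurrent.TimeUnit;

import com.google.cloud.pubsub.v1.AckReplyConsumerWithResponse;
import com.google.cloud.pubsub.v1.AckResponse;
import com.google.cloud.pubsub.v1.MessageReceiverWithAckResponse;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class ExactlyOnceSubscriberSketch {
    public static void main(String[] args) {
        ProjectSubscriptionName subscription =
                ProjectSubscriptionName.of("my-project", "my-subscription"); // placeholders

        MessageReceiverWithAckResponse receiver =
                (PubsubMessage message, AckReplyConsumerWithResponse consumer) -> {
                    handle(message); // placeholder for the long-running job
                    try {
                        // With exactly-once delivery, ack() reports whether the ack was accepted.
                        AckResponse response = consumer.ack().get(30, TimeUnit.SECONDS);
                        if (response != AckResponse.SUCCESSFUL) {
                            // e.g. INVALID: the ack id expired, so a redelivery should be expected.
                            System.out.println("Ack failed: " + response);
                        }
                    } catch (Exception e) {
                        System.out.println("Ack failed: " + e);
                    }
                };

        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
        subscriber.startAsync().awaitRunning();
    }

    private static void handle(PubsubMessage message) { /* ... */ }
}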
I am getting:
Offset commit failed on partition app-KSTREAM-MAP-0000000017-repartition-2 at offset 2768614: The request timed out.
I have already increased the request timeout to 1 minute but it didn't help. I am using versions:
spring-kafka: 2.1.12.RELEASE
kafka-clients, kafka-streams: 2.1.1
kafka_2.11: 2.1.1
Try reducing the batch size via ConsumerConfig.MAX_POLL_RECORDS_CONFIG.
Also look at tuning these (see the sketch after this list):
ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG
ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG
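In a Kafka Streams application these consumer settings are passed through the consumer prefix. A rough sketch, with a made-up application id and bootstrap server, and values that are illustrative rather than recommendations:

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsTuningSketch {
    public static Properties streamsProperties() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "app");            // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        // Smaller batches are processed and committed faster, so a commit is less
        // likely to time out or land after a rebalance.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 100);
        // Give the poll loop more headroom before the group considers the member gone.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 600_000);
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 30_000);
        return props;
    }
}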
Consider the following real (obfuscated) logs:
19:33:48,409 99733391 (pool-6-thread-11) ERROR [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] [] [Consumer clientId=app2.maria1.mcdonalnds_service_msg, groupId=mcdonalnds_service_msg] Offset commit failed on partition service_megaman_mt-mcdonalnds_service_msg-1 at offset 75796: This is not the correct coordinator.
19:33:48,410 99733392 (pool-6-thread-11) INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [] [Consumer clientId=app2.maria1.mcdonalnds_service_msg, groupId=mcdonalnds_service_msg] Group coordinator kafka1.maria4.internal:9092 (id: 2147483646 rack: null) is unavailable or invalid, will attempt rediscovery
19:33:48,414 99733396 (kafka-producer-network-thread | producer-1) WARN [org.apache.kafka.clients.producer.internals.Sender] [] [Producer clientId=producer-1] Got error produce response with correlation id 16386 on topic-partition service_megaman_mo-mcdonalnds_service_msg-1, retrying (99 attempts left). Error: NOT_LEADER_FOR_PARTITION
19:33:48,510 99733492 (pool-6-thread-11) INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [] [Consumer clientId=app2.maria1.mcdonalnds_service_msg, groupId=mcdonalnds_service_msg] Discovered group coordinator kafka3.maria4.internal:9092 (id: 2147483644 rack: null)
19:33:48,528 99733510 (pool-6-thread-11) ERROR [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] [] [Consumer clientId=app2.maria1.mcdonalnds_service_msg, groupId=mcdonalnds_service_msg] Offset commit failed on partition service_megaman_mt-mcdonalnds_service_msg-1 at offset 75796: The coordinator is not aware of this member.
19:33:48,528 99733510 (pool-6-thread-11) ERROR [com.bob.kafka.consumer.ListenableKafkaConsumer] [] Aborting consumer [mcdonalnds_service_msg] for topics [[service_megaman_mt-mcdonalnds_service_msg]] operation due to failure! Cause:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
As far as I understand, the exception message about poll() is not really the cause. So what happened is:
1. Coordinator was not available
2. Consumer found new coordinator
3. New coordinator did not recognise offset so rejected the commit
What I am trying to figure out are the options to recover from this situation. This is not an intermittent issue; it happened once in a year, so tuning the poll settings would not have helped when the leader died.
What happens now: the original application code was simply closing the consumers, which is wrong; it caused alerts and woke up just about everyone because the application stopped consuming messages :-)
What I want to happen:
The consumer is restarted and doesn't die if it loses the connection to the coordinator
What I am not sure about:
Why the coordinator is not aware of this member
Whether I understand the issue correctly :-)
On the service side, with the Java Kafka library's KafkaConsumer class, I should call close and subscribe, or unsubscribe and subscribe, to fulfill my consumer recovery scenario.
What is going to happen to the processed offset which was rejected by the new coordinator? Since the offset was not committed, I assume the consumer will re-read the same messages?
The following post for Spring Kafka looks like a very similar issue, but the service does not use Spring, so it is of limited use to me.
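A minimal sketch of such a recovery loop with the plain KafkaConsumer; class and method names here are made up for illustration, and it assumes re-processing of the uncommitted batch is acceptable:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.CommitFailedException;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RestartingConsumerSketch {
    public static void run(Properties props, String topic) {
        while (true) {
            // try-with-resources closes the old consumer before we rejoin the group.
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList(topic));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record); // illustrative business logic
                    }
                    consumer.commitSync();
                }
            } catch (CommitFailedException e) {
                // The group rebalanced and the partitions moved, so the commit was rejected.
                // The uncommitted batch will be redelivered; processing must be idempotent.
                // Loop around: create a new consumer and subscribe again instead of dying.
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) { /* ... */ }
}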
We have a Hazelcast client (3.7.4):
// Initializes the Hazelcast client config
ClientConfig aHazelcastClientConfig = new ClientConfig();
String aHazelcastUrl = this.getHost() + ":" + this.getPort().toString();

ClientNetworkConfig aHazelcastNetworkConfig = aHazelcastClientConfig.getNetworkConfig();
aHazelcastNetworkConfig.addAddress(aHazelcastUrl);

GroupConfig group = new GroupConfig(getGroupName(), getGroupPassword());
aHazelcastClientConfig.setGroupConfig(group);

HazelcastInstance aHazelcastClient = HazelcastClient.newHazelcastClient(aHazelcastClientConfig);
...
IMap aMonitoredMap = aHazelcastClient.getMap(getMonitoredMap());
that periodically checks one HZ server (3.7.4), and we have observed that the following exceptions sometimes appear on the client side:
InitializeDistributedObjectOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2017-02-07 18:07:30.329. Total elapsed time: 120189 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2017-02-07 18:05:37.489. Invocation{op=com.hazelcast.spi.impl.proxyservice.impl.operations.InitializeDistributedObjectOperation{serviceName='hz:impl:mapService', identityHash=9759664, partitionId=-1, replicaIndex=0, callId=0, invocationTime=1486487130140 (2017-02-07 18:05:30.140), waitTimeout=-1, callTimeout=60000}, tryCount=1, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1486487130140, firstInvocationTime='2017-02-07 18:05:30.140', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 01:00:00.000', target=[10.118.152.82]:5720, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=7, /172.22.191.200:5720->/10.118.152.82:42563, endpoint=[10.118.152.82]:5720, alive=true, type=MEMBER]}
It seems the maximum call waiting timeout (60000 ms by default) is being reached. In the above example, the total elapsed time is more than 2 minutes (120189 ms).
The problem appears sporadically, without any regular pattern.
The network seems to be working correctly when it appears, so we can rule out a network connectivity issue.
Any hint or recommendation about what could provoke it?
Thanks a lot.
Best Regards,
Jorge
I found out that when I attach a debugger to the application and start debugging, the connection to the Terracotta server is lost (?), and the following messages appear in the Terracotta server logs:
2012-03-30 13:45:06,758 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1
2012-03-30 13:45:27,761 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1
2012-03-30 13:45:31,761 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 2
...
2012-03-30 13:46:37,768 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 10. But its too long. No more retries
2012-03-30 13:46:38,768 [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 is DEAD
2012-03-30 13:46:38,768 [HealthChecker] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: DSO Server - Declared connection dead ConnectionID(1.0b1994ac80f14b7191080bdc3f38582a) idle time 45317ms
2012-03-30 13:46:38,768 [L2_L1:TCWorkerComm # 0_R] WARN com.tc.net.protocol.transport.ServerMessageTransport - ConnectionID(1.0b1994ac80f14b7191080bdc3f38582a): CLOSE EVENT : com.tc.net.core.TCConnectionJDK14#5158277: connected: false, closed: true local=127.0.0.1:9510 remote=127.0.0.1:55112 connect=[Fri Mar 30 13:34:22 BST 2012] idle=2001ms [207584 read, 229735 write]. STATUS : DISCONNECTED
...
2012-03-30 13:46:38,799 [L2_L1:TCWorkerComm # 0_R] INFO com.tc.objectserver.persistence.sleepycat.SleepycatPersistor - Deleted client state for ChannelID=[1]
2012-03-30 13:46:38,801 [WorkerThread(channel_life_cycle_stage, 0)] INFO com.tc.objectserver.handler.ChannelLifeCycleHandler - : Received transport disconnect. Shutting down client ClientID[1]
2012-03-30 13:46:38,801 [WorkerThread(channel_life_cycle_stage, 0)] INFO com.tc.objectserver.persistence.impl.TransactionStoreImpl - shutdownClient() : Removing txns from DB : 0
After this has happened, any operation on the cache, like getWithLoader, just doesn't respond until the Terracotta server is restarted.
Question: how can this be fixed/reconfigured? I assume it can also happen in production (and actually sometimes does) if for some reason the application hangs, stalls, etc.
This is just to get you started.
TC connections between server and client are considered dead when the applicable HealthCheck fails. The default values for the HealthCheck assume a very stable and performant network. I recommend you familiarize yourself with the details and the calculations at:
http://www.terracotta.org/documentation/3.5.2/terracotta-server-array/high-availability#85916
So typically you begin with:
a) making sure your network doesn't hiccup occasionally
b) setting the TC HealthCheck values a bit higher (see the example below)
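For illustration, the server-side HealthChecker that produced the log above is tuned via tc.properties (or the tc-properties block in tc-config.xml). The property names below are taken from the Terracotta HA documentation linked above; the values are only placeholders to show the idea of loosening the checks, not recommendations:

# Idle time before the HealthChecker starts probing a client connection (ms)
l2.healthcheck.l1.ping.idletime = 15000
# Interval between probes (ms) and number of probes before escalating
l2.healthcheck.l1.ping.interval = 5000
l2.healthcheck.l1.ping.probes = 6
# Socket-connect verification: timeout (in ping intervals) and allowed grace rounds
l2.healthcheck.l1.socketConnectTimeout = 10
l2.healthcheck.l1.socketConnectCount = 5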
If the problem persists, I'd recommend posting directly on the TC forums (they'll help you even if you only use the open-source edition; it may take a few days to get a reply, though).