I have set up a Kafka Streams application which consumes messages from one topic, transforms them, and writes them to another topic; if any serialization error occurs, it sends the record to an error topic.
The message load is huge (in the millions). The streams app was working perfectly fine until a few days ago; we had loaded around 70M records and it was still doing well. Then, the day before yesterday, we added another stream to the same application and started streaming data, and now the application crashes with OOM exceptions. Each of the streams has its own topics and consumer group.
The application runs for an hour or so and then crashes with "java.lang.OutOfMemoryError: Java heap space" errors.
The application is behaving very strangely. We increased the heap size (Xmx) to 2G on each node. Our topology is two nodes running the application, connected to a Kafka cluster of three broker nodes.
There were no network issues, but I frequently see "Attempt to heartbeat failed since group is rebalancing" and consumer rebalances in the logs, only for the newly added stream.
kafka clients version - 2.3.1
kafka broker - 2.11
Kafka streams configuration:
```
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaProperties.getBootstrapServers());
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, SendAndContinueExceptionHandler.class);
props.put(StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG, CustomProductionExceptionHandler.class);
props.put(ProducerConfig.RETRIES_CONFIG, "1");
props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, CustomPartitioner.class);
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
```
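The SendAndContinueExceptionHandler and CustomProductionExceptionHandler referenced above are custom classes that are not shown in the post. Purely as an illustration of the shape such a deserialization handler takes (the error-topic forwarding is only hinted at in a comment, since the real implementation is unknown), a minimal sketch could be:

```
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.errors.DeserializationExceptionHandler;
import org.apache.kafka.streams.processor.ProcessorContext;

// Hypothetical sketch of a "log/forward and continue" deserialization handler.
public class SendAndContinueExceptionHandler implements DeserializationExceptionHandler {

    @Override
    public DeserializationHandlerResponse handle(ProcessorContext context,
                                                 ConsumerRecord<byte[], byte[]> record,
                                                 Exception exception) {
        // A real implementation would forward the raw record to an error topic here,
        // e.g. via a separately managed KafkaProducer.
        System.err.printf("Failed to deserialize record from %s-%d at offset %d: %s%n",
                record.topic(), record.partition(), record.offset(), exception.getMessage());
        return DeserializationHandlerResponse.CONTINUE; // skip the bad record, keep processing
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // no-op; a real handler could read the error topic name from these configs
    }
}
```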
Streams creation code:
```
@Bean
public Set<KafkaStreams> kStreamJson(StreamsBuilder builder) {
    Serde<JsonNode> jsonSerde = Serdes.serdeFrom(jsonSerializer, jsonDeserializer);
    final KStream<String, JsonNode> infoStream = builder.stream(inputTopic, Consumed.with(Serdes.String(), jsonSerde));
    Properties infoProps = kStreamsConfigs().asProperties();
    infoProps.put(StreamsConfig.APPLICATION_ID_CONFIG, migrationMOIProfilesGroupId);
    infoStream
        .map(IProcessX::process)
        .through(
            outputTopic,
            Produced.with(Serdes.String(), new JsonPOJOSerde<>(Message.class)));
    return Sets.newHashSet(
        new KafkaStreams(builder.build(), infoProps)
    );
}
```
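JsonPOJOSerde is likewise a custom class not included in the post. As a rough, hypothetical sketch only (class body assumed, based on the common Jackson-backed Serde pattern), it might look like:

```
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serializer;

// Hypothetical Jackson-backed Serde for a POJO type such as Message.
public class JsonPOJOSerde<T> implements Serde<T> {

    private final ObjectMapper mapper = new ObjectMapper();
    private final Class<T> type;

    public JsonPOJOSerde(Class<T> type) {
        this.type = type;
    }

    @Override
    public Serializer<T> serializer() {
        return (topic, data) -> {
            try {
                return data == null ? null : mapper.writeValueAsBytes(data);
            } catch (Exception e) {
                throw new RuntimeException("JSON serialization failed", e);
            }
        };
    }

    @Override
    public Deserializer<T> deserializer() {
        return (topic, bytes) -> {
            try {
                return bytes == null ? null : mapper.readValue(bytes, type);
            } catch (Exception e) {
                throw new RuntimeException("JSON deserialization failed", e);
            }
        };
    }
}
```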
Errors received:
[8/6/20, 22:22:54:070 GST] 00000076 SystemOut O 2020-08-06 22:22:54.070 INFO 83225 --- [s-streams-group] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=streams-group-5c15c2c1-798f-4b4d-91a8-24bd9a093fe6-StreamThread-18-consumer, groupId=streams-group] Discovered group coordinator kafka.broker:9092 (id: 2147483644 rack: null)
[8/6/20, 22:23:22:831 GST] 00000d95 SystemOut O 2020-08-06 21:27:59.979 ERROR 83225 --- [| producer-3356] o.apache.kafka.common.utils.KafkaThread : Uncaught exception in thread 'kafka-producer-network-thread | producer-3356':
java.lang.OutOfMemoryError: Java heap space
I have checked session.timeout.ms, heartbeat.interval.ms, max.poll.interval.ms and max.poll.records, but I'm not sure what values to set for them.
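For reference, in Kafka Streams those consumer-level settings can be passed through the same props object by using the consumer prefix; a small sketch extending the configuration above (the values are placeholders, not recommendations):

```
// Consumer-level tuning forwarded to the Streams app's internal consumers.
// Values below are illustrative placeholders only.
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 30000);
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), 3000);
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 300000);
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 500);
```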
Please help me solve the issue.
Related
We have a KStream app that uses in-memory KV StateStore but with changelog disabled.
```
String stateStoreName = "statestore-v1";

StoreBuilder<KeyValueStore<String, Event>> keyValueStoreBuilder =
    Stores.keyValueStoreBuilder(Stores.inMemoryKeyValueStore(stateStoreName),
        Serdes.String(), new JsonSerde<>(Event.class));

keyValueStoreBuilder.withLoggingDisabled();
streamsBuilder.addStateStore(keyValueStoreBuilder);
```
We now want to enable the changelog with a different configuration and a different name.
```
String stateStoreName = "statestore-v2";

StoreBuilder<KeyValueStore<String, Event>> keyValueStoreBuilder =
    Stores.keyValueStoreBuilder(Stores.inMemoryKeyValueStore(stateStoreName),
        Serdes.String(), new JsonSerde<>(Event.class));

Map<String, String> changelogConfig = new HashMap<>();
changelogConfig.put("retention.ms", "43200000"); // 12 hours
changelogConfig.put("cleanup.policy", "delete");
changelogConfig.put("auto.offset.reset", "latest");

keyValueStoreBuilder.withLoggingEnabled(changelogConfig);
streamsBuilder.addStateStore(keyValueStoreBuilder);
```
When we run our application, we get into an infinite loop with these messages:
2022-10-11 13:02:32.761 app=myapp INFO 54561 --- [-StreamThread-3]
o.a.k.s.p.i.StoreChangelogReader : stream-thread [myapp-StreamThread-3]
End offset for changelog myapp-statestore-v2-changelog-4 cannot be found;
will retry in the next time.
2022-10-11 13:02:32.761 app=myapp INFO 54561 --- [-StreamThread-3]
o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=myapp-StreamThread-3-restore-consumer, groupId=null]
Unsubscribed all topics or patterns and assigned partitions
It does not appear that the changelog topic is ever created... At least kafka-topics does not show it.
I am using io.confluent packages version 7.2.2-ccs, which I think translates to Apache Kafka version 3.2.x
Any ideas on how to fix the infinite loop and get the changelog topics created?
Thanks!
The infinite loop was caused by our Blue/Green deployment. We learned that we cannot do this if we are changing anything about the StateStore (configuration, or disabling/re-enabling changelogs).
We just did a complete shutdown of the old version and then deployed the new version. That worked fine.
Another option would be to use the kafka-streams-application-reset tool as OneKricketeer suggested.
Is there any way to ignore oversized messages without Flink job restarting?
If I try to produce (using KafkaSink) a message which is too large (greater than max.message.bytes), a RecordTooLargeException occurs and the Flink job restarts, and then this "exception and restart" cycle repeats endlessly!
I don't need to increase the message size limits such as max.message.bytes (Kafka topic config) and max.request.size (Flink producer config); they are fine and already large. I just want to handle the situation where an unrealistically large message is about to be produced: such a message should be ignored, an error should be logged, no runtime exception should occur, and the endless restart loop should not start.
I tried to use a ProducerInterceptor -> it cannot intercept/reject a message, it can only modify it.
I tried to ignore oversized messages in the SerializationSchema (implemented a custom wrapper of SerializationSchema) -> it cannot discard a message from being produced either.
I am trying to override the KafkaWriter and KafkaSink classes, but it seems to be challenging.
I will be grateful for any advice!
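One possible direction, offered here only as a hedged sketch and not part of the original post, is to drop oversized records before they ever reach the sink, e.g. with a filter on the serialized size. The 1,000,000-byte limit below is an assumption chosen to stay under max.request.size, and the stream and sink variables are the ones from the code sample further down:

```
// Drop records whose UTF-8 size exceeds an assumed limit below max.request.size.
final int maxRecordBytes = 1_000_000;

DataStream<String> boundedStream = stream
        .filter(value -> {
            int size = value.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
            if (size > maxRecordBytes) {
                // Log and discard instead of letting the producer fail the whole job.
                org.slf4j.LoggerFactory.getLogger("oversized-filter")
                        .error("Dropping record of {} bytes (limit {})", size, maxRecordBytes);
                return false;
            }
            return true;
        })
        .name("drop-oversized-records");

boundedStream.sinkTo(sink).setParallelism(1).name("output-producer");
```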
A few quick environment details:
Kafka version is 2.8.1
Flink code is Java code based on the newer KafkaSource/KafkaSink API, not the older KafkaConsumer/KafkaProducer API.
The flink-clients and flink-connector-kafka version is 1.15.0
Code sample which throws the RecordTooLargeException:
```
int numberOfRows = 1;
int rowsPerSecond = 1;

DataStream<String> stream = environment.addSource(
        new DataGeneratorSource<>(
                RandomGenerator.stringGenerator(1050000), // max.message.bytes=1048588
                rowsPerSecond,
                (long) numberOfRows),
        TypeInformation.of(String.class))
    .setParallelism(1)
    .name("string-generator");

KafkaSinkBuilder<String> builder = KafkaSink.<String>builder()
    .setBootstrapServers("localhost:9092")
    .setDeliverGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
    .setRecordSerializer(
        KafkaRecordSerializationSchema.builder().setTopic("test.output")
            .setValueSerializationSchema(new SimpleStringSchema())
            .build());

KafkaSink<String> sink = builder.build();

stream.sinkTo(sink).setParallelism(1).name("output-producer");
```
Exception Stack Trace:
```
2022-06-02/14:01:45.066/PDT [flink-akka.actor.default-dispatcher-4] INFO output-producer: Writer -> output-producer: Committer (1/1) (a66beca5a05c1c27691f7b94ca6ac025) switched from RUNNING to FAILED on 271b1b90-7d6b-4a34-8116-3de6faa8a9bf # 127.0.0.1 (dataPort=-1).
org.apache.flink.util.FlinkRuntimeException: Failed to send data to Kafka null with FlinkKafkaInternalProducer{transactionalId='null', inTransaction=false, closed=false}
    at org.apache.flink.connector.kafka.sink.KafkaWriter$WriterCallback.throwException(KafkaWriter.java:440) ~[flink-connector-kafka-1.15.0.jar:1.15.0]
    at org.apache.flink.connector.kafka.sink.KafkaWriter$WriterCallback.lambda$onCompletion$0(KafkaWriter.java:421) ~[flink-connector-kafka-1.15.0.jar:1.15.0]
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50) ~[flink-streaming-java-1.15.0.jar:1.15.0]
    at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) ~[flink-streaming-java-1.15.0.jar:1.15.0]
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsNonBlocking(MailboxProcessor.java:353) ~[flink-streaming-java-1.15.0.jar:1.15.0]
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:317) ~[flink-streaming-java-1.15.0.jar:1.15.0]
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:201) ~[flink-streaming-java-1.15.0.jar:1.15.0]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:804) ~[flink-streaming-java-1.15.0.jar:1.15.0]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:753) ~[flink-streaming-java-1.15.0.jar:1.15.0]
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948) ~[flink-runtime-1.15.0.jar:1.15.0]
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927) ~[flink-runtime-1.15.0.jar:1.15.0]
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:741) ~[flink-runtime-1.15.0.jar:1.15.0]
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563) ~[flink-runtime-1.15.0.jar:1.15.0]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Caused by: org.apache.kafka.common.errors.RecordTooLargeException: The message is 1050088 bytes when serialized which is larger than 1048576, which is the value of the max.request.size configuration.
```
I have created a sample application to check my producer's code. My application runs fine when I send data without a partitioning key, but when I specify a key for data partitioning I get the error:
[kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Error while fetching metadata with correlation id 37 : {myTest=UNKNOWN_TOPIC_OR_PARTITION}
[kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Error while fetching metadata with correlation id 38 : {myTest=UNKNOWN_TOPIC_OR_PARTITION}
[kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Error while fetching metadata with correlation id 39 : {myTest=UNKNOWN_TOPIC_OR_PARTITION}
for both consumer and producer. I have searched a lot on the internet, and the suggestions were to verify the kafka.acl settings. I'm using Kafka on HDInsight and I have no idea how to verify them and solve this issue.
My cluster has the following configuration:
Head Node: 2
Worker Node: 4
Zookeeper: 3
My producer code:
```
public static void produce(String brokers, String topicName) throws IOException {

    // Set properties used to configure the producer
    Properties properties = new Properties();

    // Set the brokers (bootstrap servers)
    properties.setProperty("bootstrap.servers", brokers);
    properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

    // specify the protocol for Domain Joined clusters
    // To create an idempotent producer
    properties.setProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
    properties.setProperty(ProducerConfig.ACKS_CONFIG, "all");
    properties.setProperty(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
    properties.setProperty(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "test-transactional-id");

    KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
    producer.initTransactions();

    // So we can generate random sentences
    Random random = new Random();
    String[] sentences = new String[] {
        "the cow jumped over the moon",
        "an apple a day keeps the doctor away",
        "four score and seven years ago",
        "snow white and the seven dwarfs",
        "i am at two with nature",
    };

    for (String sentence : sentences) {
        // Send the sentence to the test topic
        try {
            String key = sentence.substring(0, 2);
            producer.beginTransaction();
            producer.send(new ProducerRecord<String, String>(topicName, key, sentence)).get();
        } catch (Exception ex) {
            System.out.print(ex.getMessage());
            throw new IOException(ex.toString());
        }
        producer.commitTransaction();
    }
}
```
Also, my topic consists of 3 partitions with replication factor = 3.
I made the replication factor less than the number of partitions and it worked for me. It sounds odd, but yes, it started working after that.
The error clearly states that the topic (or partition) you are producing to does not exist.
Ultimately, you will need to describe the topic (via CLI kafka-topics --describe --topic <topicName> or other means) to verify if this is true
Regarding "Kafka on HDInsight and I have no idea how to verify it and solve this issue":
ACLs are only set up if you installed the cluster with them, but I believe you can still list ACLs via zookeeper-shell or by SSHing into one of the Hadoop masters.
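As a programmatic alternative to the CLI describe step mentioned above (not required by the answer, just a hedged illustration; the topic name myTest comes from the error logs and the bootstrap address is an assumption):

```
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker list

        try (AdminClient admin = AdminClient.create(props)) {
            // The future fails with UnknownTopicOrPartitionException if the topic does not exist.
            TopicDescription description = admin.describeTopics(Collections.singleton("myTest"))
                    .all().get().get("myTest");
            description.partitions().forEach(p ->
                    System.out.printf("partition %d leader=%s isr=%s%n",
                            p.partition(), p.leader(), p.isr()));
        }
    }
}
```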
I too had the same issue while creating a new topic. And when I described the topic, I could see that leaders were not assigned to the topic partitions.
Topic: xxxxxxxxx Partition: 0 Leader: none Replicas: 3,2,1 Isr:
Topic: xxxxxxxxx Partition: 1 Leader: none Replicas: 1,3,2 Isr:
After some googling, I figured out that this can happen when there is some issue with the controller broker, so I restarted the controller broker.
And everything worked as expected!
If the topic exists but you're still seeing this error, it could mean that the supplied list of brokers is incorrect. Check the bootstrap.servers value; it should point to the Kafka cluster where the topic resides.
I saw the same issue: I have multiple Kafka clusters, and although the topic clearly existed, my list of brokers was incorrect.
Possible duplicate of Kafka - This server is not the leader for that topic-partition, but there is no accepted answer nor a clear solution there.
I have a simple java program to produce the message to Kafka:
```
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("retries", 1);
props.put("batch.size", 16384);
props.put("linger.ms", 100);
props.put("buffer.memory", 33554432);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "com.company.project.KafkaValueSerializer");

MyMessage message = new MyMessage();
Producer<String, MyMessage> producer = new KafkaProducer<>(props);
Future<RecordMetadata> future =
    producer.send(new ProducerRecord<String, MyMessage>("My_Topic", "", message));
```
I am getting the following exception:
Exception in thread "main" java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:94)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:64)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:29)
at
Caused by: org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
When I tried with kafka-console-producer, I am getting the following error:
D:\kafka_2.11-0.10.2.0\bin\windows>kafka-console-producer.bat --broker-list localhost:9092 --topic My_Topic
hello
[2018-10-25 17:05:27,225] WARN Got error produce response with correlation id 3 on topic-partition My_Topic-0, retrying (2 attempts left). Error: NOT_LEADER_FOR_PARTITION (org.apache.kafka.clients.producer.internals.Sender)
[2018-10-25 17:05:27,335] WARN Got error produce response with correlation id 5 on topic-partition My_Topic-0, retrying (1 attempts left). Error: NOT_LEADER_FOR_PARTITION (org.apache.kafka.clients.producer.internals.Sender)
[2018-10-25 17:05:27,444] WARN Got error produce response with correlation id 7 on topic-partition My_Topic-0, retrying (0 attempts left). Error: NOT_LEADER_FOR_PARTITION (org.apache.kafka.clients.producer.internals.Sender)
[2018-10-25 17:05:27,544] ERROR Error when sending message to topic My_Topic with key: null, value: 5 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
When I describe my topic, I have the following information:
Topic:My_Topic PartitionCount:1 ReplicationFactor:1 Configs:
Topic: My_Topic Partition: 0 Leader: 0 Replicas: 0 Isr: 0
I tried creating a new topic and producing messages as described in the Quick Start guide, and those steps worked well.
What correction should I make to My_Topic or to the producer configuration so that I can publish messages to this topic successfully?
Had it been the case of "console client working, but Java program not working", then changing the retry limit might have helped.
Since both the Java program and the built-in command-line producer are unable to produce to Kafka, I suspect that the problem is due to stale configuration
(for example: the topic was deleted and created again with the same name, but with a different partition count).
Deleting the old ZooKeeper and Kafka log files and creating the topic again, or creating a topic with a different name, will resolve the issue.
Consider following real obfuscated logs:
19:33:48,409 99733391 (pool-6-thread-11) ERROR [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] [] [Consumer clientId=app2.maria1.mcdonalnds_service_msg, groupId=mcdonalnds_service_msg] Offset commit failed on partition service_megaman_mt-mcdonalnds_service_msg-1 at offset 75796: This is not the correct coordinator.
19:33:48,410 99733392 (pool-6-thread-11) INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [] [Consumer clientId=app2.maria1.mcdonalnds_service_msg, groupId=mcdonalnds_service_msg] Group coordinator kafka1.maria4.internal:9092 (id: 2147483646 rack: null) is unavailable or invalid, will attempt rediscovery
19:33:48,414 99733396 (kafka-producer-network-thread | producer-1) WARN [org.apache.kafka.clients.producer.internals.Sender] [] [Producer clientId=producer-1] Got error produce response with correlation id 16386 on topic-partition service_megaman_mo-mcdonalnds_service_msg-1, retrying (99 attempts left). Error: NOT_LEADER_FOR_PARTITION
19:33:48,510 99733492 (pool-6-thread-11) INFO [org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [] [Consumer clientId=app2.maria1.mcdonalnds_service_msg, groupId=mcdonalnds_service_msg] Discovered group coordinator kafka3.maria4.internal:9092 (id: 2147483644 rack: null)
19:33:48,528 99733510 (pool-6-thread-11) ERROR [org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] [] [Consumer clientId=app2.maria1.mcdonalnds_service_msg, groupId=mcdonalnds_service_msg] Offset commit failed on partition service_megaman_mt-mcdonalnds_service_msg-1 at offset 75796: The coordinator is not aware of this member.
19:33:48,528 99733510 (pool-6-thread-11) ERROR [com.bob.kafka.consumer.ListenableKafkaConsumer] [] Aborting consumer [mcdonalnds_service_msg] for topics [[service_megaman_mt-mcdonalnds_service_msg]] operation due to failure! Cause:
org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
As far as I understand, the exception message about poll() is not really the cause. So what happened:
1. Coordinator was not available
2. Consumer found new coordinator
3. New coordinator did not recognise offset so rejected the commit
What I am trying to figure out is how to recover from this situation. This is not an intermittent issue; it happened once in a year, so tuning the poll settings would not have helped when the leader died.
What happens now: the original application code was simply closing the consumers, which is wrong; it caused alerts and woke up just about everyone, because the application stopped consuming messages :-)
What I want to happen:
The consumer is restarted and does not die if it loses the connection to the coordinator
What I am not sure about:
Why the coordinator is not aware of this member
Whether I understand the issue correctly :-)
On the service side, using the Java Kafka library's KafkaConsumer class, should I call close() and then subscribe(), or unsubscribe() and then subscribe(), to fulfill my consumer recovery scenario?
What is going to happen to the processed offset which was rejected by the new coordinator? Since the offset was not committed, I assume the consumer will re-read the same messages?
The following post for Spring Kafka looks like a very similar issue, but the service does not use Spring, so it is of limited use to me.
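For illustration only (this is not from the original post), here is a minimal sketch of the recovery behaviour being asked about: a poll loop that catches CommitFailedException and keeps the consumer alive, so the next poll() rejoins the group and the uncommitted batch is redelivered. The consumer is assumed to be already subscribed, and process() is a placeholder:

```
import java.time.Duration;
import org.apache.kafka.clients.consumer.CommitFailedException;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ResilientPollLoop {

    // "consumer" is assumed to be already subscribed to the desired topics.
    static void runLoop(KafkaConsumer<String, String> consumer) {
        while (true) {
            try {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);
                }
                consumer.commitSync();
            } catch (CommitFailedException e) {
                // The group rebalanced while this batch was being processed.
                // Do NOT close the consumer: the next poll() rejoins the group,
                // and the uncommitted batch is redelivered and reprocessed.
                System.err.println("Commit failed after rebalance; batch will be reprocessed: " + e);
            }
        }
    }

    // Placeholder for application-specific processing.
    static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}
```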