Kafka Streams - Offset commit failed. The request timed out - java

I am getting the following error:
Offset commit failed on partition app-KSTREAM-MAP-0000000017-repartition-2 at offset 2768614: The request timed out.
I have already increased the request timeout to 1 minute, but it didn't help. I am using these versions:
spring-kafka: 2.1.12.RELEASE
kafka-clients, kafka-streams: 2.1.1
kafka_2.11: 2.1.1

Try reducing the number of records returned per poll via ConsumerConfig.MAX_POLL_RECORDS_CONFIG.
Also look at tuning these settings:
ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG
ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG
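A minimal sketch of how such consumer settings can be passed to a Kafka Streams application; the application id, bootstrap servers, and the concrete values are illustrative, not taken from the question:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "app");               // illustrative
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
// consumerPrefix() routes these settings to the consumers that Kafka Streams creates internally.
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 100);
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 300000);
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 30000);
// props is then passed to new KafkaStreams(topology, props) as usual.

The idea is simply to keep each poll/commit cycle comfortably inside the configured timeouts; which values actually help depends on how long your records take to process.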

Related

How to manage RecordTooLargeException without restarting the Flink job

Is there any way to ignore oversized messages without the Flink job restarting?
If I try to produce (using KafkaSink) a message that is too large (greater than max.message.bytes), a RecordTooLargeException occurs, the Flink job restarts, and this "exception & restart" cycle repeats endlessly!
I don't want to increase the message size limits such as max.message.bytes (Kafka topic config) and max.request.size (Flink producer config); they are already large enough. I just want to handle the case where an unrealistically large message is about to be produced: the oversized message should be ignored, an error should be logged, no runtime exception should be thrown, and the endless restart loop should not start.
I tried to use a ProducerInterceptor -> it cannot intercept/reject a message, it can only modify it.
I tried to ignore oversized messages in a SerializationSchema (I implemented a custom wrapper around SerializationSchema) -> it cannot discard a record either.
I am trying to override the KafkaWriter and KafkaSink classes, but that seems challenging.
I will be grateful for any advice!
A few quick environment details:
Kafka version is 2.8.1
The Flink code is Java and is based on the newer KafkaSource/KafkaSink API, not the older KafkaConsumer/KafkaProducer API.
The flink-clients and flink-connector-kafka version is 1.15.0
Code sample that throws the RecordTooLargeException:
int numberOfRows = 1;
int rowsPerSecond = 1;
DataStream<String> stream = environment.addSource(
        new DataGeneratorSource<>(
                RandomGenerator.stringGenerator(1050000), // max.message.bytes=1048588
                rowsPerSecond,
                (long) numberOfRows),
        TypeInformation.of(String.class))
        .setParallelism(1)
        .name("string-generator");
KafkaSinkBuilder<String> builder = KafkaSink.<String>builder()
        .setBootstrapServers("localhost:9092")
        .setDeliverGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
        .setRecordSerializer(
                KafkaRecordSerializationSchema.builder().setTopic("test.output")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build());
KafkaSink<String> sink = builder.build();
stream.sinkTo(sink).setParallelism(1).name("output-producer");
Exception Stack Trace:
2022-06-02/14:01:45.066/PDT [flink-akka.actor.default-dispatcher-4] INFO output-producer: Writer -> output-producer: Committer (1/1) (a66beca5a05c1c27691f7b94ca6ac025) switched from RUNNING to FAILED on 271b1b90-7d6b-4a34-8116-3de6faa8a9bf @ 127.0.0.1 (dataPort=-1).
org.apache.flink.util.FlinkRuntimeException: Failed to send data to Kafka null with FlinkKafkaInternalProducer{transactionalId='null', inTransaction=false, closed=false}
    at org.apache.flink.connector.kafka.sink.KafkaWriter$WriterCallback.throwException(KafkaWriter.java:440) ~[flink-connector-kafka-1.15.0.jar:1.15.0]
    at org.apache.flink.connector.kafka.sink.KafkaWriter$WriterCallback.lambda$onCompletion$0(KafkaWriter.java:421) ~[flink-connector-kafka-1.15.0.jar:1.15.0]
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50) ~[flink-streaming-java-1.15.0.jar:1.15.0]
    at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) ~[flink-streaming-java-1.15.0.jar:1.15.0]
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsNonBlocking(MailboxProcessor.java:353) ~[flink-streaming-java-1.15.0.jar:1.15.0]
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:317) ~[flink-streaming-java-1.15.0.jar:1.15.0]
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:201) ~[flink-streaming-java-1.15.0.jar:1.15.0]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:804) ~[flink-streaming-java-1.15.0.jar:1.15.0]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:753) ~[flink-streaming-java-1.15.0.jar:1.15.0]
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948) ~[flink-runtime-1.15.0.jar:1.15.0]
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927) ~[flink-runtime-1.15.0.jar:1.15.0]
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:741) ~[flink-runtime-1.15.0.jar:1.15.0]
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563) ~[flink-runtime-1.15.0.jar:1.15.0]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
Caused by: org.apache.kafka.common.errors.RecordTooLargeException: The message is 1050088 bytes when serialized which is larger than 1048576, which is the value of the max.request.size configuration.
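One possible direction, offered only as a hedged sketch rather than a confirmed solution: drop oversized records with a filter placed in front of the sink, so the Kafka producer never sees them. The class name is made up, and the 1048576-byte threshold just mirrors the max.request.size value from the stack trace:

import java.nio.charset.StandardCharsets;
import org.apache.flink.api.common.functions.FilterFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical pre-filter that logs and drops oversized records instead of
// letting the Kafka sink fail with RecordTooLargeException.
public class OversizedRecordFilter implements FilterFunction<String> {
    private static final Logger LOG = LoggerFactory.getLogger(OversizedRecordFilter.class);
    private static final int MAX_RECORD_BYTES = 1048576; // assumed to match max.request.size

    @Override
    public boolean filter(String value) {
        int size = value.getBytes(StandardCharsets.UTF_8).length;
        if (size > MAX_RECORD_BYTES) {
            LOG.error("Dropping oversized record of {} bytes", size);
            return false;
        }
        return true;
    }
}

// Applied between the generator and the sink:
stream.filter(new OversizedRecordFilter())
        .sinkTo(sink)
        .setParallelism(1)
        .name("output-producer");

Note that this only measures the UTF-8 size of the value; the actual producer record also carries key, headers, and protocol overhead, so the threshold should leave some margin below max.request.size.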

Redisson client: RedisTimeoutException issue

I am using a Google Cloud managed Redis cluster (v5) via Redisson (3.12.5).
The following is my single-server configuration in a YAML file:
singleServerConfig:
  idleConnectionTimeout: 10000
  connectTimeout: 10000
  timeout: 3000
  retryAttempts: 3
  retryInterval: 1500
  password: null
  subscriptionsPerConnection: 5
  clientName: null
  address: "redis://127.0.0.1:6379"
  subscriptionConnectionMinimumIdleSize: 1
  subscriptionConnectionPoolSize: 50
  connectionMinimumIdleSize: 40
  connectionPoolSize: 250
  database: 0
  dnsMonitoringInterval: 5000
threads: 0
nettyThreads: 0
codec: !<org.redisson.codec.JsonJacksonCodec> {}
I am getting the following exceptions when I increase the load on my application:
org.redisson.client.RedisTimeoutException: Unable to acquire connection! Increase connection pool size and/or retryInterval settings Node source: NodeSource
org.redisson.client.RedisTimeoutException: Command still hasn't been written into connection! Increase nettyThreads and/or retryInterval settings. Payload size in bytes: 34. Node source: NodeSource
There seems to be no issue on the Redis cluster side, so I think I need to tweak my client-side connection pool configuration (shown above) to make this work.
Please suggest the changes I need to make to my configuration.
I am also curious whether I should close the Redis connection after making get/set calls. I have tried to find this out but found nothing conclusive on how to close Redis connections.
One last thing I want to ask: is there any mechanism in Redisson to get Redis connection pool stats (active connections, idle connections, etc.)?
Edit 1:
I have tried changing the following values in 3 different iterations:
Iteration 1:
idleConnectionTimeout: 30000
connectTimeout: 30000
timeout: 30000
Iteration 2:
nettyThreads: 0
Iteration 3:
connectionMinimumIdleSize: 100
connectionPoolSize: 750
I have tried these things, but nothing has worked for me.
Any help is appreciated.
Thanks in advance
This answer assumes you are getting low-memory alerts on your cache JVM.
You may have to analyze the traffic and determine two things:
Too many parallel cache writes.
Huge chunks of data being persisted.
Both can be determined from the traffic on your server.
For case 1, configuring the pool size should solve your issue; for case 2, you may have to refactor your code to persist data in smaller chunks.
Also try setting nettyThreads to 64.
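For reference, a minimal sketch of the equivalent programmatic Redisson configuration; the values simply mirror the YAML above plus the nettyThreads suggestion and are not tuned recommendations:

import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

Config config = new Config();
config.setNettyThreads(64); // the value suggested in this answer
config.useSingleServer()
      .setAddress("redis://127.0.0.1:6379")
      .setConnectionMinimumIdleSize(40)
      .setConnectionPoolSize(250)
      .setTimeout(3000)
      .setRetryAttempts(3)
      .setRetryInterval(1500);

RedissonClient client = Redisson.create(config);

In the YAML form shown in the question, the same change is just nettyThreads: 64 at the top level of the file.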

My nginx takes a long time to process requests. How can I investigate this further? ($upstream_header_time)

1. Environment
OS version and kernel: CentOS Linux release 7.9.2009, 3.10.0-1160.el7.x86_64
nginx version: nginx-1.14.2 (community nginx)
Upstream server (Tomcat) version: Apache Tomcat 8.0.53
JDK version: Oracle jdk1.8.0_144
nginx has keepalive enabled (with both the client and the upstream server)
HTTP protocol: HTTP/1.1
2. nginx access log format and Tomcat access log pattern:
(1) nginx access log format
'$remote_addr - $remote_user [$time_iso8601] '
' "$request" $status $body_bytes_sent $bytes_sent '
' $request_trace_id '
' ["$upstream_addr" "$upstream_status" "$upstream_bytes_received" "$upstream_response_length" "$upstream_cache_status" '
' "$upstream_response_time" "$upstream_connect_time" "$upstream_header_time"] '
' "$request_time" "$http_referer" "$http_user_agent" "$http_x_forwarded_for" "$ssl_protocol"';
Self-defined variable $request_trace_id:
#trace.setting
set $request_trace_id $http_x_request_id;
if ( $request_trace_id = '' ) {
set $request_trace_id $pid-$connection-$bytes_sent-$msec;
}
(2) tomcat access log pattern:
"[%{Y-M-d H:m:s.S+z}t] real_ip:%{X-Real-IP}i remote:%h requestid:%{X-Request-ID}i first_line:"%r" status:%s bytes:%b cost:%Dms commit_time:%Fms Agent:"%{User-Agent}i" %{Connection}i %{Connection}o %{Keep-Alive}i %{Keep-Alive}o"
3. Problematic log contents
(1) nginx logs
192.168.26.73 - cgpadmin [2021-09-09T09:58:23+08:00] "POST /cgp2-oauth/oauth/check_token HTTP/1.1" 200 12983 13364 6462-1025729-0-1631152697.976 ["127.0.0.1:8801" "200" "13353" "12991" "-" "5.026" "0.000" "5.026"] "5.026" "-" "Java/1.8.0_144" "-" "-"
The timestamp 1631152697.976 corresponds to 2021-09-09 09:58:17.976.
(2) tomcat logs
[2021-9-9 9:58:17.993+CST] real_ip:192.168.26.73 remote:127.0.0.1 requestid:6462-1025729-0-1631152697.976 first_line:"POST /cgp2-oauth/oauth/check_token HTTP/1.1" status:200 bytes:12991 cost:17ms commit_time:16ms Agent:"Java/1.8.0_144" - - - -
4. My judgment and analysis
nginx timings:
- variables (per the log format above, the bracketed values "5.026" "0.000" "5.026" are $upstream_response_time, $upstream_connect_time, and $upstream_header_time, in that order)
$request_time: 5.026 seconds
$upstream_response_time: 5.026 seconds
$upstream_connect_time: 0.000 seconds
$upstream_header_time: 5.026 seconds
- log timestamps
nginx started handling the proxy_pass at 2021-09-09 09:58:17.976
nginx finished processing at 2021-09-09T09:58:23
Tomcat timings:
- attributes
%D: 17 milliseconds
%F: 16 milliseconds
- log timestamp
Tomcat finished processing at 2021-9-9 9:58:17.993
Analysis of which stage the time is spent in:
The total nginx processing time ($request_time) comes from the upstream processing time ($upstream_response_time).
During upstream processing, nginx first prepares the request and then establishes a connection to Tomcat. The connection time is very short (0 in the log excerpted here, possibly because the connection is kept alive; I have looked at other requests and they are also very short, though not always 0), so essentially all of the time is in $upstream_header_time.
After nginx connects to Tomcat, the sequence is: nginx starts sending the request to Tomcat, Tomcat receives it (I understand there may be a queue before Tomcat picks it up), Tomcat processes it, Tomcat returns the response, and nginx receives the first byte of the response headers. This span is what takes so long, so let's work out which part of it is the problem.
nginx records its completion time as 9:58:23 (without milliseconds, because I use the default time variable), and the total upstream time is 5.026 seconds. Subtracting 5.026 seconds from 9:58:23 gives roughly 9:58:18, which is close to the time Tomcat returned its response (9:58:17.993), so the time spent between nginx and the client is not significant, and we can basically conclude that nginx established its connection to Tomcat at about 9:58:18. We also happen to have $request_trace_id, which is set when nginx starts processing the reverse proxy; its timestamp corresponds to 2021-09-09 09:58:17.976, which basically confirms this guess.
Tomcat returned its response at 9:58:17.993. Subtracting Tomcat's internal processing time (%D = 17 ms) gives 9:58:17.976 as the moment Tomcat started processing the request, which matches the 09:58:17.976 in our $request_trace_id. So the time for nginx to start sending the request to Tomcat, including any wait for a thread from Tomcat's pool, is essentially zero.
Therefore, the 5 seconds of upstream time is spent after Tomcat has prepared the response, while the response is being sent back to nginx, i.e. before nginx receives the first byte of the response headers. Since the upstream is 127.0.0.1 there should be no network problem, and monitoring shows none.
Is there a queue inside nginx? Isn't the upstream handling an epoll-based event callback?
This system sometimes shows the problem before the upstream call and sometimes after it (in this case, after). Please give me some ideas on how to move this problem forward. Thank you!
5. References
https://www.nginx.com/blog/using-nginx-logging-for-application-performance-monitoring/
Tomcat log: what's the difference between %D and %F
http://nginx.org/en/docs/http/ngx_http_upstream_module.html#variables
https://tomcat.apache.org/tomcat-8.0-doc/config/valve.html
https://juejin.cn/post/6844903887757901832
https://cloud.tencent.com/developer/article/1778734

Azure App Service - Spring Boot - Hikari Errors

I have deployed a Spring Boot application on App Service that has a database-based queue of jobs.
Yesterday I performed a few scale-out and scale-in operations while the application was running, to see how it would behave.
At some point (not necessarily related to the scaling operations) the application started to throw Hikari errors:
com.zaxxer.hikari.pool.PoolBase : HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection@1ae66f34 (This connection has been closed.). Possibly consider using a shorter maxLifetime value.
com.zaxxer.hikari.pool.ProxyConnection : HikariPool-1 - Connection org.postgresql.jdbc.PgConnection@1ef85079 marked as broken because of SQLSTATE(08006), ErrorCode(0)
The following are stack traces from my scheduled job in Spring, plus other information:
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
Caused by: javax.net.ssl.SSLException: Connection reset by peer (Write failed)
Suppressed: java.net.SocketException: Broken pipe (Write failed)
Caused by: java.net.SocketException: Connection reset by peer (Write failed)
Next, the following stack of errors:
WARN 1 --- [ scheduling-1] com.zaxxer.hikari.pool.PoolBase : HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection@48d0d6da (This connection has been closed.). Possibly consider using a shorter maxLifetime value.
org.springframework.jdbc.support.MetaDataAccessException: Error while extracting DatabaseMetaData; nested exception is java.sql.SQLException: Connection is closed
Caused by: java.sql.SQLException: Connection is closed
The code that is invoked periodically (every 500 milliseconds) is here:
@Scheduled(fixedDelayString = "${worker.delay}")
@Transactional
public void execute() {
    jobManager.next(jobClass).ifPresent(this::handleJob);
}
Update: the above code does nothing almost all the time, since there was no traffic on the website.
Update 2: I've checked the Postgres logs and found this:
2020-07-11 22:48:09 UTC-5f0866f0.f0-LOG: checkpoint starting: immediate force wait
2020-07-11 22:48:10 UTC-5f0866f0.f0-LOG: checkpoint complete (240): wrote 30 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.046 s, sync=0.046 s, total=0.437 s; sync files=13, longest=0.009 s, average=0.003 s; distance=163 kB, estimate=13180 kB
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: received immediate shutdown request
2020-07-11 22:48:10 UTC-5f0a3f41.8914-WARNING: terminating connection because of crash of another server process
2020-07-11 22:48:10 UTC-5f0a3f41.8914-DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
// Same text about 10 times
2020-07-11 22:48:10 UTC-5f0866f2.7c-HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: src/port/kill.c(84): Process (272) exited OOB of pgkill.
2020-07-11 22:48:10 UTC-5f0866f1.fc-WARNING: terminating connection because of crash of another server process
2020-07-11 22:48:10 UTC-5f0866f1.fc-DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-07-11 22:48:10 UTC-5f0866f1.fc-HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: archiver process (PID 256) exited with exit code 1
2020-07-11 22:48:11 UTC-5f0866ee.68-LOG: database system is shut down
It looks like a problem with the Azure PostgreSQL server, and it shut itself down. Am I reading this right?
As mentioned in your logs, have you tried setting the maxLifetime property for HikariCP? I think this issue should be resolved after setting that property.
From the HikariCP documentation (https://github.com/brettwooldridge/HikariCP):
maxLifetime
This property controls the maximum lifetime of a connection in the pool. An in-use connection will never be retired, only when it is closed will it then be removed. On a connection-by-connection basis, minor negative attenuation is applied to avoid mass-extinction in the pool. We strongly recommend setting this value, and it should be several seconds shorter than any database or infrastructure imposed connection time limit. A value of 0 indicates no maximum lifetime (infinite lifetime), subject of course to the idleTimeout setting. The minimum allowed value is 30000ms (30 seconds). Default: 1800000 (30 minutes)
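As a hedged illustration of that recommendation (not part of the original answer), maxLifetime can be set on a plain HikariConfig; in Spring Boot the equivalent property is spring.datasource.hikari.max-lifetime. The URL, credentials, and the 10-minute value below are placeholders:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://<host>:5432/<database>"); // placeholder
config.setUsername("<user>");                                  // placeholder
config.setPassword("<password>");                              // placeholder
// Retire connections before the server or any Azure-side limit closes them.
config.setMaxLifetime(600000);   // 10 minutes, illustrative
config.setKeepaliveTime(300000); // optional, HikariCP 4.0.1+: ping idle connections every 5 minutes
HikariDataSource dataSource = new HikariDataSource(config);

The key point from the documentation quoted above is to pick a maxLifetime several seconds shorter than whatever connection lifetime the database or the Azure infrastructure enforces.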

How to solve network and memory issues in Kafka brokers?

When using Kafka, I intermittently get two network-related errors:
1. Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest Connection to broker was disconnected before the response was read
2. Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest Connection to broker1 (id: 1 rack: null) failed
[configuration environment]
Brokers: 5 / server.properties: kafka_manager_heap_s=1g, kafka_manager_heap_x=1g, offsets.commit.required.acks=1, offsets.commit.timeout.ms=5000; most settings are the defaults.
Zookeepers: 3
Servers: 5
Kafka: 0.10.1.2
Zookeeper: 3.4.6
Both of these errors are caused by a loss of network communication.
When these errors occur, Kafka expands or shrinks the ISR of the affected partitions several times:
Expanding example: INFO Partition [my-topic,7] on broker 1: Expanding ISR for partition [my-topic,7] from 1,2 to 1,2,3
Shrinking example: INFO Partition [my-topic,7] on broker 1: Shrinking ISR for partition [my-topic,7] from 1,2,3 to 1,2
I understand that these errors are caused by network problems, but I am not sure why the network is breaking.
And if this network disconnection persists, I get the following additional error:
Error when handling request {topics=null} java.lang.OutOfMemoryError: Java heap space
I wonder what causes these errors and how I can fix them.
The network error tells you that one of the brokers is not running or is unreachable, which means the fetcher cannot connect to it. In my experience, the minimum heap size you should assign to a broker is 2 GB.
