I'm using AWS Lambda (Java runtime) to process some files when their URLs are inserted into DynamoDB via another API.
When the Lambda dies due to a timeout, it is triggered again with the same event and the same process starts over. The process dies again due to the timeout, and the cycle repeats.
How can I stop the trigger from retrying when a timeout occurs?
The logs are as follows:
10:53:25 START RequestId: 494ec72d-45bb-409f-bdd4-7653033eefda Version: $LATEST
10:53:25
Received event: com.amazonaws.services.lambda.runtime.events.DynamodbEvent#1e4a7dd4
10:53:25
Got an INSERT EVENT
10:53:25
KEY IS 3c39f7ea-76cf-484f-8a11-39bc2d7c1fd8/aws_dummy.pdf
10:55:25
END RequestId: 494ec72d-45bb-409f-bdd4-7653033eefda
10:55:25
REPORT RequestId: 494ec72d-45bb-409f-bdd4-7653033eefda Duration: 120035.00 ms Billed Duration: 120000 ms Memory Size: 512 MB Max Memory Used: 362 MB Init Duration: 2179.60 ms
10:55:25
2019-09-21T10:55:25.761Z 494ec72d-45bb-409f-bdd4-7653033eefda Task timed out after 120.03 seconds
10:55:28
START RequestId: 494ec72d-45bb-409f-bdd4-7653033eefda Version: $LATEST
10:55:28
Received event: com.amazonaws.services.lambda.runtime.events.DynamodbEvent#1e4a7dd4
10:55:28
Got an INSERT EVENT
10:55:28
KEY IS 3c39f7ea-76cf-484f-8a11-39bc2d7c1fd8/aws_dummy.pdf
// Handler for the DynamoDB Stream trigger; DynamodbStreamRecord is
// com.amazonaws.services.lambda.runtime.events.DynamodbEvent.DynamodbStreamRecord
public Integer handleRequest(DynamodbEvent event, Context context) {
    for (DynamodbStreamRecord record : event.getRecords()) {
        if ("INSERT".equals(record.getEventName())) {
            // DO SOME WORK
        }
    }
    return SOME_INTEGER; // placeholder result
}
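One way to keep a failing record from being retried indefinitely is to cap the retries on the stream's event source mapping itself. Below is a minimal sketch with the AWS SDK for Java v1; the mapping UUID and the retry/age values are placeholders, and these settings apply to stream event sources such as DynamoDB streams:

import com.amazonaws.services.lambda.AWSLambda;
import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
import com.amazonaws.services.lambda.model.UpdateEventSourceMappingRequest;

public class CapStreamRetries {
    public static void main(String[] args) {
        AWSLambda lambda = AWSLambdaClientBuilder.defaultClient();
        // Placeholder UUID of the DynamoDB stream event source mapping.
        lambda.updateEventSourceMapping(new UpdateEventSourceMappingRequest()
                .withUUID("00000000-0000-0000-0000-000000000000")
                .withMaximumRetryAttempts(2)           // stop retrying a failing batch after 2 attempts
                .withMaximumRecordAgeInSeconds(3600)); // and discard records older than an hour
    }
}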
I would suggest: DynamoDB Stream -> Lambda (a new, lightweight Lambda) -> SNS -> SQS -> your current Lambda, with a dead-letter queue (DLQ) to capture the failed messages.
A couple of reasons for this proposal:
1. The SNS -> SQS pattern is a best practice for delivering an event and makes it easy to extend the communication later.
2. A DLQ handles your failures and gives you control over how to process the failed messages (see the sketch below).
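A minimal sketch of wiring such a work queue to a DLQ, assuming the AWS SDK for Java v1; the queue names and the maxReceiveCount of 3 are illustrative:

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.CreateQueueRequest;
import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;

public class QueueSetup {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

        // Create the DLQ first and look up its ARN.
        String dlqUrl = sqs.createQueue("pdf-work-dlq").getQueueUrl();
        String dlqArn = sqs.getQueueAttributes(
                new GetQueueAttributesRequest(dlqUrl).withAttributeNames("QueueArn"))
                .getAttributes().get("QueueArn");

        // Main work queue: after 3 failed receives a message is moved to the DLQ
        // instead of being retried forever.
        String redrivePolicy = "{\"maxReceiveCount\":\"3\",\"deadLetterTargetArn\":\"" + dlqArn + "\"}";
        sqs.createQueue(new CreateQueueRequest("pdf-work-queue")
                .addAttributesEntry("RedrivePolicy", redrivePolicy));
    }
}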
Thanks,
If I understood your requirement properly, you want your app to always be active on Lambda and to avoid cold starts.
If the requirement is to keep the app up 24 hours a day, I would suggest going with EC2 or ELB, since the AWS Lambda cost would be about the same and you would not need any hacks.
Now, to your question of how to do that:
Configure CloudWatch.
From there, go to Events and click Create rule. Set the event type to Schedule, and we'll run this event every 1 minute.
Select the Lambda function you want to target from the Targets list and Save. You'll then need to give the rule a name and description.
Now the rule pings your Lambda function every 1, 5, 10, or 15 minutes, as per your need.
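On the Lambda side, the handler can short-circuit these pings so they only keep the container warm and do no real work. A hedged sketch; the "warmup" flag is an assumed constant JSON input configured on the rule, not anything standard:

import java.util.Map;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

public class WarmableHandler implements RequestHandler<Map<String, Object>, String> {
    @Override
    public String handleRequest(Map<String, Object> input, Context context) {
        // Assumed marker: the scheduled rule sends {"warmup": true} as its constant input.
        if (input != null && Boolean.TRUE.equals(input.get("warmup"))) {
            return "warmed"; // return immediately on keep-warm pings
        }
        // ... real processing here ...
        return "done";
    }
}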
Related
When the application has started but is not yet warmed up (the JIT needs time), it cannot handle the expected RPS.
The problem is the incoming request queue. While the IO thread keeps accepting requests, many requests pile up in the queue and the GC cannot clean them up. Once the survivor generation overflows, the GC starts performing major pauses, which slows request execution down even further, and after some time the application dies with an OOM.
My application has a self-warming readinessProbe (3k random requests).
I tried to configure the thread counts and queue size:
application.yml
micronaut:
  server:
    port: 8080
    netty:
      parent:
        threads: 2
      worker:
        threads: 2
  executors:
    io:
      n-threads: 1
      parallelism: 1
      type: FIXED
    scheduled:
      n-threads: 1
      parallelism: 1
      corePoolSize: 1
And some system properties:
System.setProperty("io.netty.eventLoop.maxPendingTasks", "16")
System.setProperty("io.netty.eventexecutor.maxPendingTasks", "16")
System.setProperty("io.netty.eventLoopThreads", "1")
But the queue keeps filling up.
I want to find some way to restrict the incoming queue size in Micronaut, so that the application does not fail under high load.
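One workaround sketch, not a cap on Netty's internal queue but load shedding at the HTTP layer: reject excess concurrent requests in a server filter. This assumes Micronaut's HttpServerFilter API and Project Reactor on the classpath; the limit of 200 is an illustrative number:

import java.util.concurrent.atomic.AtomicInteger;

import org.reactivestreams.Publisher;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

import io.micronaut.http.HttpRequest;
import io.micronaut.http.HttpResponse;
import io.micronaut.http.HttpStatus;
import io.micronaut.http.MutableHttpResponse;
import io.micronaut.http.annotation.Filter;
import io.micronaut.http.filter.HttpServerFilter;
import io.micronaut.http.filter.ServerFilterChain;

@Filter("/**")
public class LoadSheddingFilter implements HttpServerFilter {

    private static final int MAX_IN_FLIGHT = 200; // assumed limit, tune to your RPS
    private final AtomicInteger inFlight = new AtomicInteger();

    @Override
    public Publisher<MutableHttpResponse<?>> doFilter(HttpRequest<?> request, ServerFilterChain chain) {
        if (inFlight.incrementAndGet() > MAX_IN_FLIGHT) {
            // Reject immediately instead of letting requests pile up on the heap.
            inFlight.decrementAndGet();
            MutableHttpResponse<?> rejected = HttpResponse.status(HttpStatus.TOO_MANY_REQUESTS);
            return Mono.just(rejected);
        }
        return Flux.from(chain.proceed(request))
                   .doFinally(signal -> inFlight.decrementAndGet());
    }
}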
1. Environment
OS version and kernel: CentOS Linux release 7.9.2009, 3.10.0-1160.el7.x86_64
nginx version: nginx-1.14.2 (community nginx)
upstream server (Tomcat) version: Apache Tomcat 8.0.53
JDK version: Oracle jdk1.8.0_144
nginx has keepalive enabled (with both the client and the upstream server)
HTTP protocol: HTTP/1.1
2. nginx access log format and Tomcat access log pattern:
(1) nginx access log format
'$remote_addr - $remote_user [$time_iso8601] '
' "$request" $status $body_bytes_sent $bytes_sent '
' $request_trace_id '
' ["$upstream_addr" "$upstream_status" "$upstream_bytes_received" "$upstream_response_length" "$upstream_cache_status" '
' "$upstream_response_time" "$upstream_connect_time" "$upstream_header_time"] '
' "$request_time" "$http_referer" "$http_user_agent" "$http_x_forwarded_for" "$ssl_protocol"';
The self-defined variable $request_trace_id is set as follows:
#trace.setting
set $request_trace_id $http_x_request_id;
if ( $request_trace_id = '' ) {
set $request_trace_id $pid-$connection-$bytes_sent-$msec;
}
(2) tomcat access log pattern:
"[%{Y-M-d H:m:s.S+z}t] real_ip:%{X-Real-IP}i remote:%h requestid:%{X-Request-ID}i first_line:"%r" status:%s bytes:%b cost:%Dms commit_time:%Fms Agent:"%{User-Agent}i" %{Connection}i %{Connection}o %{Keep-Alive}i %{Keep-Alive}o"
3. Problematic log contents
(1) nginx logs
192.168.26.73 - cgpadmin [2021-09-09T09:58:23+08:00] "POST /cgp2-oauth/oauth/check_token HTTP/1.1" 200 12983 13364 6462-1025729-0-1631152697.976 ["127.0.0.1:8801" "200" "13353" "12991" "-" "5.026" "0.000" "5.026"] "5.026" "-" "Java/1.8.0_144" "-" "-"
The 1631152697.976 in $request_trace_id is a Unix timestamp ($msec):
2021-09-09 09:58:17.976
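That conversion can be double-checked with a throwaway snippet; the +08:00 offset is taken from the $time_iso8601 value in the log above:

import java.time.Instant;
import java.time.ZoneOffset;

public class MsecCheck {
    public static void main(String[] args) {
        // $msec from the trace id: seconds.milliseconds since the epoch
        Instant t = Instant.ofEpochMilli(1631152697976L);
        // Prints 2021-09-09T09:58:17.976+08:00, matching the 09:58:17.976 quoted above
        System.out.println(t.atOffset(ZoneOffset.ofHours(8)));
    }
}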
(2) tomcat logs
[2021-9-9 9:58:17.993+CST] real_ip:192.168.26.73 remote:127.0.0.1 requestid:6462-1025729-0-1631152697.976 first_line:"POST /cgp2-oauth/oauth/check_token HTTP/1.1" status:200 bytes:12991 cost:17ms commit_time:16ms Agent:"Java/1.8.0_144" - - - -
4. My judgment and analysis
nginx timings:
- variables
$request_time: 5.026 seconds
$upstream_response_time: 5.026 seconds
$upstream_header_time: 5.026 seconds
$upstream_connect_time: 0.000 seconds
- log timestamps
When nginx starts handling proxy_pass: 2021-09-09 09:58:17.976
When nginx finishes processing: 2021-09-09T09:58:23
Tomcat timings:
- attributes
%D: 17 ms
%F: 16 ms
- log timestamp
When Tomcat finished processing: 2021-9-9 9:58:17.993
Analysis of the problem, stage by stage:
The total nginx processing time ($request_time) is accounted for by the upstream processing time ($upstream_response_time).
During upstream processing, nginx first prepares the request message and then establishes a connection with Tomcat. The connection time is very short (0 in the log excerpted here, probably because the connection is kept alive; I have looked at other requests and the connect time is also very short, even when it is not 0), so the time is essentially all in $upstream_header_time.
After nginx connects to Tomcat, the sequence is: nginx starts sending the request message to Tomcat; Tomcat receives it (as I understand it, there may be a queue before Tomcat picks it up); Tomcat processes it; Tomcat returns the response message; nginx receives the first byte of the response header. This is the part that takes a long time, so the question is which segment is slow.
The nginx completion time is logged as 9:58:23 (no milliseconds here, because I use the default variable), while the total upstream time is 5.026 seconds. The time when Tomcat returned the message is 9:58:17.993. Subtracting 5.026 seconds from 9:58:23 gives about 9:58:18, which is close to the time Tomcat returned the message, indicating that the time spent between nginx and the client is not long. We can basically conclude that nginx established the connection to Tomcat at about 9:58:18.
We happen to have $request_trace_id, which is set when nginx starts processing the reverse proxy: it is 2021-09-09 09:58:17.976, which basically confirms this conjecture.
The time when Tomcat returned the message is 9:58:17.993. Subtracting Tomcat's internal processing time (%D, 17 ms) gives 9:58:17.976 as the time Tomcat started processing the request, which is essentially identical to our $request_trace_id of 09:58:17.976. This indicates that once nginx starts sending the request message to Tomcat, the time spent waiting for a thread from Tomcat's pool is essentially zero.
Therefore, the 5 seconds of upstream processing time fall between Tomcat preparing the response message and nginx receiving the first byte of the response header. Since the upstream is 127.0.0.1, there should be essentially no network problem, and monitoring does not show one.
Is there a queue inside nginx? Isn't upstream handling based on epoll event callbacks?
The delay in this system sometimes appears before Tomcat processes the request and sometimes after it (in this case, after).
Please give me some ideas on how to move this problem forward. Thank you!
5. References
https://www.nginx.com/blog/using-nginx-logging-for-application-performance-monitoring/
Tomcat log: what's the difference between %D and %F
http://nginx.org/en/docs/http/ngx_http_upstream_module.html#variables
https://tomcat.apache.org/tomcat-8.0-doc/config/valve.html
https://juejin.cn/post/6844903887757901832
https://cloud.tencent.com/developer/article/1778734
Below is a description of a problem we faced in production. Please note that I could not reproduce the issue in a test or local environment and therefore cannot provide you with test code.
We have a Hazelcast cluster with two members (M1, M2) and three clients (C1, C2, C3). The Hazelcast version is 3.9.
Clients use IMap.tryLock() method with timeout of 10 seconds. After getting the lock, critical and long running operations are performed and finally the lock is released using IMap.unlock() method.
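A minimal sketch of that pattern, with an illustrative map name and key and the 10-second timeout mentioned above:

import java.util.concurrent.TimeUnit;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class LockedWork {
    static void processWithLock(HazelcastInstance client, String key) throws InterruptedException {
        IMap<String, String> map = client.getMap("work-items"); // "work-items" is illustrative
        if (map.tryLock(key, 10, TimeUnit.SECONDS)) {
            try {
                // critical, long-running operations on the entry go here
            } finally {
                map.unlock(key); // always release, even if the work throws
            }
        }
    }
}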
The problem that occurred in production is as follows:
At some time instant t, we first saw a heartbeat failure to M2 at client C2. Afterwards there are errors in fetching the partition table, caused by com.hazelcast.spi.exception.TargetDisconnectedException:
[hz.client_0.internal-2 ] WARN [] HeartbeatManager - hz.client_0 [mygroup] [3.9] HeartbeatManager failed to connection: .....
[hz.client_0.internal-3 ] WARN [] ClientPartitionService - hz.client_0 [mygroup] [3.9] Error while fetching cluster partition table!
java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.TargetDisconnectedException: Heartbeat timed out to owner connection ClientConnection{alive=true, connectionId=1, ......
Around 250 ms after initial heartbeat failure, client gets disconnected and then reconnects in 20 ms.
[hz.client_0.cluster- ] INFO [] LifecycleService - hz.client_0 [mygroup] [3.9] HazelcastClient 3.9 (20171023 - b29f549) is CLIENT_DISCONNETED
[hz.client_0.cluster- ] INFO [] LifecycleService - hz.client_0 [mygroup] [3.9] HazelcastClient 3.9 (20171023 - b29f549) is CLIENT_CONNECTED
The problem we are having is that, for some keys previously acquired by C2, C1 and C3 cannot acquire the lock even though it appears to have been released by C2. C2 can get the lock, but this puts unacceptable delays on the application. All clients should be able to get the lock once it is released.
We were notified of the problem after receiving complaints, and then restarted the client application C2.
As documented in http://docs.hazelcast.org/docs/latest-development/manual/html/Distributed_Data_Structures/Lock.html, locks acquired by the restarted member (C2 in my case) seem to have been removed after the restart.
Currently the issue seems to have gone away, but we are not sure whether it will recur.
Do you have any suggestions about the probable cause and more importantly do you have any recommendations?
Would enabling redo-operation in client help for this problem case?
As I tried to explain, the client seems to recover from the problem, but the keys remain locked in the cluster, and this is fatal to my application.
Thanks
It looks like the client lost ownership of the lock because of its disconnection from the cluster. You can use the IMap#forceUnlock API in cases such as the one you faced. It releases the lock regardless of the lock owner; it always unlocks successfully, never blocks, and returns immediately.
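A minimal sketch of such a cleanup; the map name and key are illustrative placeholders:

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class StaleLockCleanup {
    // Releases a lock left behind by a disconnected client, regardless of the owner.
    static void releaseStaleLock(HazelcastInstance client, String key) {
        IMap<String, String> map = client.getMap("work-items"); // illustrative map name
        if (map.isLocked(key)) {  // best-effort check
            map.forceUnlock(key); // unlocks regardless of the lock owner
        }
    }
}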
We have a Hazelcast client (3.7.4):
//Initializes Hazelcast client config
ClientConfig aHazelcastClientConfig = new ClientConfig();
String aHazelcastUrl = this.getHost() + ":" + this.getPort().toString();
ClientNetworkConfig aHazelcastNetworkConfig = aHazelcastClientConfig.getNetworkConfig();
aHazelcastNetworkConfig.addAddress(aHazelcastUrl);
GroupConfig group = new GroupConfig(getGroupName(), getGroupPassword());
aHazelcastClientConfig.setGroupConfig(group);
HazelcastInstance aHazelcastClient = HazelcastClient.newHazelcastClient(aHazelcastClientConfig);
...
IMap aMonitoredMap = aHazelcastClient.getMap(getMonitoredMap());
that periodically checks one HZ server (3.7.4), and we have observed that sometimes the following exceptions appear on the client side:
InitializeDistributedObjectOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2017-02-07 18:07:30.329. Total elapsed time: 120189 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2017-02-07 18:05:37.489. Invocation{op=com.hazelcast.spi.impl.proxyservice.impl.operations.InitializeDistributedObjectOperation{serviceName='hz:impl:mapService', identityHash=9759664, partitionId=-1, replicaIndex=0, callId=0, invocationTime=1486487130140 (2017-02-07 18:05:30.140), waitTimeout=-1, callTimeout=60000}, tryCount=1, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1486487130140, firstInvocationTime='2017-02-07 18:05:30.140', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 01:00:00.000', target=[10.118.152.82]:5720, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=7, /172.22.191.200:5720->/10.118.152.82:42563, endpoint=[10.118.152.82]:5720, alive=true, type=MEMBER]}
It seems the maximum call waiting timeout (by default 60000 ms) is being reached. In the above example, the total elapsed time is more than 2 minutes (120189 ms).
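For reference, that 60000 ms default corresponds to the hazelcast.operation.call.timeout.millis property; a hedged sketch of raising it on the member config follows (the value is illustrative, and raising it only hides whatever is stalling the operation):

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class MemberWithLongerCallTimeout {
    public static void main(String[] args) {
        Config config = new Config();
        // Default is 60000 ms; 120000 is an illustrative value, not a recommendation.
        config.setProperty("hazelcast.operation.call.timeout.millis", "120000");
        HazelcastInstance member = Hazelcast.newHazelcastInstance(config);
    }
}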
This problem appears sporadically, without any regular pattern.
The network seems to be working correctly when it appears, so we can rule out a network connectivity issue.
Any hint or recommendation about what could be causing it?
Thanks a lot.
Best Regards,
Jorge
I found out that when I connect a debugger to the application and start debugging, the connection to the Terracotta server is lost (?) and the following messages appear in the Terracotta server logs:
2012-03-30 13:45:06,758 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1
2012-03-30 13:45:27,761 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1
2012-03-30 13:45:31,761 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 2
...
2012-03-30 13:46:37,768 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 10. But its too long. No more retries
2012-03-30 13:46:38,768 [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 is DEAD
2012-03-30 13:46:38,768 [HealthChecker] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: DSO Server - Declared connection dead ConnectionID(1.0b1994ac80f14b7191080bdc3f38582a) idle time 45317ms
2012-03-30 13:46:38,768 [L2_L1:TCWorkerComm # 0_R] WARN com.tc.net.protocol.transport.ServerMessageTransport - ConnectionID(1.0b1994ac80f14b7191080bdc3f38582a): CLOSE EVENT : com.tc.net.core.TCConnectionJDK14#5158277: connected: false, closed: true local=127.0.0.1:9510 remote=127.0.0.1:55112 connect=[Fri Mar 30 13:34:22 BST 2012] idle=2001ms [207584 read, 229735 write]. STATUS : DISCONNECTED
...
2012-03-30 13:46:38,799 [L2_L1:TCWorkerComm # 0_R] INFO com.tc.objectserver.persistence.sleepycat.SleepycatPersistor - Deleted client state for ChannelID=[1]
2012-03-30 13:46:38,801 [WorkerThread(channel_life_cycle_stage, 0)] INFO com.tc.objectserver.handler.ChannelLifeCycleHandler - : Received transport disconnect. Shutting down client ClientID[1]
2012-03-30 13:46:38,801 [WorkerThread(channel_life_cycle_stage, 0)] INFO com.tc.objectserver.persistence.impl.TransactionStoreImpl - shutdownClient() : Removing txns from DB : 0
After this happens, any operation on the cache, like getWithLoader, simply does not answer until the Terracotta server is restarted.
Question: how can this be fixed/reconfigured? I assume it can also happen in production (and it actually sometimes does) if, for some (any) reason, the application hangs/stalls/etc.
This is just to get you started.
TC connections between server and client are considered dead when the applicable HealthCheck fails. The default values for the HealthCheck assume a very stable and performant network. I recommend you familiarize yourself with the details and the calculations at
http://www.terracotta.org/documentation/3.5.2/terracotta-server-array/high-availability#85916
So typically you begin with
a) making sure your network doesn't hiccup occasionally
b) setting the TC HealthCheck values a bit higher
If the problem persists, I'd recommend posting directly on the TC forums (they'll help you even if you only use the open-source edition; it may take a few days to get a reply though).