When the application has started but is not yet warmed up (the JIT needs time), it cannot handle the expected RPS.
The problem is the incoming queue. While the IO thread keeps accepting requests, many requests accumulate in the queue and the GC cannot clean them up. After the survivor generation overflows, the GC starts performing major pauses, which slows request processing down even further, and after some time the application dies with an OOM.
My application has a self-warming readinessProbe (3k random requests).
I tried to configure the thread counts and queue sizes:
application.yml
micronaut:
  server:
    port: 8080
    netty:
      parent:
        threads: 2
      worker:
        threads: 2
  executors:
    io:
      n-threads: 1
      parallelism: 1
      type: FIXED
    scheduled:
      n-threads: 1
      parallelism: 1
      corePoolSize: 1
And I set some system properties:
System.setProperty("io.netty.eventLoop.maxPendingTasks", "16")
System.setProperty("io.netty.eventexecutor.maxPendingTasks", "16")
System.setProperty("io.netty.eventLoopThreads", "1")
But the queue keeps filling up.
I want to find some way to restrict the input queue size in Micronaut, so that the application does not fail under high load.
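For example, something along these lines is the kind of back-pressure I have in mind. This is only a rough sketch, assuming Micronaut's HttpServerFilter API and Project Reactor on the classpath; the class name and the limit of 64 in-flight requests are placeholders:

import io.micronaut.http.HttpRequest;
import io.micronaut.http.HttpResponse;
import io.micronaut.http.HttpStatus;
import io.micronaut.http.MutableHttpResponse;
import io.micronaut.http.annotation.Filter;
import io.micronaut.http.filter.HttpServerFilter;
import io.micronaut.http.filter.ServerFilterChain;
import org.reactivestreams.Publisher;
import reactor.core.publisher.Flux;

import java.util.concurrent.Semaphore;

// Hypothetical load-shedding filter (not part of the original application):
// cap the number of in-flight requests and reject the rest with 503 so work
// cannot pile up behind a cold JIT.
@Filter("/**")
public class LoadShedFilter implements HttpServerFilter {

    private final Semaphore inFlight = new Semaphore(64);

    @Override
    public Publisher<MutableHttpResponse<?>> doFilter(HttpRequest<?> request,
                                                      ServerFilterChain chain) {
        if (!inFlight.tryAcquire()) {
            // Fail fast instead of queueing; callers can back off and retry.
            MutableHttpResponse<?> tooBusy = HttpResponse.status(HttpStatus.SERVICE_UNAVAILABLE);
            return Flux.just(tooBusy);
        }
        return Flux.from(chain.proceed(request))
                   .doFinally(signal -> inFlight.release());
    }
}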
I'm using AWS Lambda (Java runtime) to process some files when their URLs are inserted into DynamoDB via another API.
When the Lambda dies due to a timeout, it is triggered again with the same event and the same process starts over. The process dies again due to the timeout, and the cycle repeats.
How can I stop the trigger when a timeout occurs?
The logs are as follows:
10:53:25 START RequestId: 494ec72d-45bb-409f-bdd4-7653033eefda Version: $LATEST
10:53:25
Received event: com.amazonaws.services.lambda.runtime.events.DynamodbEvent@1e4a7dd4
10:53:25
Got an INSERT EVENT
10:53:25
KEY IS 3c39f7ea-76cf-484f-8a11-39bc2d7c1fd8/aws_dummy.pdf
10:55:25
END RequestId: 494ec72d-45bb-409f-bdd4-7653033eefda
10:55:25
REPORT RequestId: 494ec72d-45bb-409f-bdd4-7653033eefda Duration: 120035.00 ms Billed Duration: 120000 ms Memory Size: 512 MB Max Memory Used: 362 MB Init Duration: 2179.60 ms
10:55:25
2019-09-21T10:55:25.761Z 494ec72d-45bb-409f-bdd4-7653033eefda Task timed out after 120.03 seconds
10:55:28
START RequestId: 494ec72d-45bb-409f-bdd4-7653033eefda Version: $LATEST
10:55:28
Received event: com.amazonaws.services.lambda.runtime.events.DynamodbEvent@1e4a7dd4
10:55:28
Got an INSERT EVENT
10:55:28
KEY IS 3c39f7ea-76cf-484f-8a11-39bc2d7c1fd8/aws_dummy.pdf
public Integer handleRequest(DynamodbEvent event, Context context) {
    for (DynamodbStreamRecord record : event.getRecords()) {
        if (record.getEventName().equals("INSERT")) {
            // DO SOME WORK
        }
    }
    return SOME_INTEGER;
}
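The only mitigation I can think of so far is to stop work before the timeout so the invocation still completes successfully and the stream does not redeliver the batch. A rough sketch (processRecord() is a placeholder for the real work, and the 10-second margin is arbitrary):

public Integer handleRequest(DynamodbEvent event, Context context) {
    for (DynamodbStreamRecord record : event.getRecords()) {
        if (!"INSERT".equals(record.getEventName())) {
            continue;
        }
        // Leave a safety margin before the configured timeout instead of
        // letting the whole invocation time out and be retried from the start.
        if (context.getRemainingTimeInMillis() < 10_000) {
            break;
        }
        processRecord(record); // placeholder for the real work
    }
    return SOME_INTEGER;
}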
I would suggest:
DynamoDB Stream -> (new) Lambda -> SNS -> SQS -> your current Lambda, with a Dead Letter Queue to capture the failed messages.
A couple of reasons for this proposal:
1. The SNS -> SQS pattern is a best practice for sending an event, and it makes it easy to extend the communication to more consumers later.
2. Having a DLQ lets you handle failures and gives you control over how to reprocess the failed messages.
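For example, a rough sketch (AWS SDK for Java v1; the queue URL and DLQ ARN are placeholders) of attaching a redrive policy so messages that keep failing are moved to the DLQ after a few receives:

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.SetQueueAttributesRequest;

public class ConfigureDlq {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

        // Placeholder URL/ARN: after 3 failed receives a message moves to the DLQ.
        String workQueueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue";
        String dlqArn = "arn:aws:sqs:us-east-1:123456789012:work-queue-dlq";
        String redrivePolicy =
                "{\"maxReceiveCount\":\"3\",\"deadLetterTargetArn\":\"" + dlqArn + "\"}";

        sqs.setQueueAttributes(new SetQueueAttributesRequest()
                .withQueueUrl(workQueueUrl)
                .addAttributesEntry("RedrivePolicy", redrivePolicy));
    }
}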
Thanks,
If I understood your requirement properly, you want your app to always be active on Lambda and to avoid cold starts.
If the requirement is to keep your app up 24 hours a day, I would suggest going with EC2 or ELB instead, since the Lambda cost works out about the same and you don't need any hacks.
Now, coming to your question of how to do that:
Configure CloudWatch.
From there, go to Events and click Create rule. Set the event type to Schedule, and run this event every 1 minute.
Select the Lambda function you want to target from the Targets list and Save. You'll then need to enter a name and description.
You are now pinging the Lambda function every 1, 5, 10, or 15 minutes, as per your need.
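For illustration, a minimal sketch of a handler the scheduled rule could invoke (a hypothetical no-op "keep warm" handler; alternatively, your existing handler can just detect the scheduled event and return early):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.ScheduledEvent;

// Does nothing except keep a Lambda container warm when pinged on a schedule.
public class KeepWarmHandler implements RequestHandler<ScheduledEvent, String> {
    @Override
    public String handleRequest(ScheduledEvent event, Context context) {
        return "warm";
    }
}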
I am making a blocking service call in one of my worker verticles, which logged a warning. This was "addressed" by increasing the time limit, but I am more curious about how to read the naming of the thread in the log line: vert.x-worker-thread-3,5,main. The full log was this -
io.vertx.core.impl.BlockedThreadChecker
WARNING: Thread Thread[vert.x-worker-thread-3,5,main] has been blocked for 64134 ms, time limit is 60000
io.vertx.core.VertxException: Thread blocked
What does the 3,5,main indicate? Is it some kind of trace from the main verticle? Thanks.
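For context, raising the limit can be done along these lines (a sketch, assuming Vert.x 3.x, where setMaxWorkerExecuteTime takes a value in nanoseconds):

import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;

import java.util.concurrent.TimeUnit;

public class VertxSetup {
    public static void main(String[] args) {
        // Raise the worker blocked-thread warning limit from the 60s default to 120s.
        Vertx vertx = Vertx.vertx(new VertxOptions()
                .setMaxWorkerExecuteTime(TimeUnit.SECONDS.toNanos(120)));
    }
}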
That is just the standard java.lang.Thread.toString() format: thread name, priority, and thread group name. It is not a trace from the main verticle.
ThreadName: vert.x-worker-thread-3
ThreadPriority: 5
ThreadGroup: main
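A quick way to see this is plain java.lang.Thread, whose toString() produces exactly the name,priority,group shape in the warning:

public class ThreadNameDemo {
    public static void main(String[] args) {
        Thread t = Thread.currentThread();
        System.out.println(t);                              // e.g. Thread[main,5,main]
        System.out.println(t.getName());                    // main
        System.out.println(t.getPriority());                // 5
        System.out.println(t.getThreadGroup().getName());   // main (the thread group)
    }
}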
I have an Akka (Java) application with a camel-jetty consumer. Under some minimal load (about 10 TPS), our client starts seeing HTTP 503 errors. I tried to reproduce the problem in our lab, and it seems Jetty can't handle overlapping HTTP requests. Below is the output from Apache Bench (ab):
ab sends 10 requests using a single thread (i.e. one request at a time):
ab -n 10 -c 1 -p bad.txt http://192.168.20.103:8899/pim
Benchmarking 192.168.20.103 (be patient).....done
Server Software: Jetty(8.1.16.v20140903)
Server Hostname: 192.168.20.103
Server Port: 8899
Document Path: /pim
Document Length: 33 bytes
Concurrency Level: 1
Time taken for tests: 0.61265 seconds
Complete requests: 10
Failed requests: 0
Requests per second: 163.23 [#/sec] (mean)
Time per request: 6.126 [ms] (mean)
Time per request: 6.126 [ms] (mean, across all concurrent requests)
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 1.0 1 2
Processing: 3 4 1.8 5 7
Waiting: 2 4 1.8 5 7
Total: 3 5 1.9 6 8
Percentage of the requests served within a certain time (ms)
50% 6
66% 6
75% 6
80% 8
90% 8
95% 8
98% 8
99% 8
100% 8 (longest request)
ab sends 10 requests using two threads (up to 2 requests at the same time):
ab -n 10 -c 2 -p bad.txt http://192.168.20.103:8899/pim
Benchmarking 192.168.20.103 (be patient).....done
Server Software: Jetty(8.1.16.v20140903)
Server Hostname: 192.168.20.103
Server Port: 8899
Document Path: /pim
Document Length: 33 bytes
Concurrency Level: 2
Time taken for tests: 30.24549 seconds
Complete requests: 10
Failed requests: 1
(Connect: 0, Length: 1, Exceptions: 0)
// omitted for clarity
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.9 1 2
Processing: 3 3005 9492.9 4 30023
Waiting: 2 3005 9492.7 3 30022
Total: 3 3006 9493.0 5 30024
Percentage of the requests served within a certain time (ms)
50% 5
66% 5
75% 7
80% 7
90% 30024
95% 30024
98% 30024
99% 30024
100% 30024 (longest request)
I don't believe Jetty is this bad. Hopefully, it's just a configuration issue. This is the setting for my Camel consumer URI:
"jetty:http://0.0.0.0:8899/pim?replyTimeout=70000&autoAck=false"
I am using Akka 2.3.12 and camel-jetty 2.15.2.
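For reference, the consumer is essentially an akka-camel consumer actor on that endpoint. A stripped-down sketch (using Akka 2.3's Java API; the reply logic of the real actor is omitted):

import akka.camel.CamelMessage;
import akka.camel.javaapi.UntypedConsumerActor;

public class PimConsumer extends UntypedConsumerActor {
    @Override
    public String getEndpointUri() {
        return "jetty:http://0.0.0.0:8899/pim?replyTimeout=70000&autoAck=false";
    }

    @Override
    public void onReceive(Object message) {
        if (message instanceof CamelMessage) {
            CamelMessage camelMessage = (CamelMessage) message;
            String body = camelMessage.getBodyAs(String.class, getCamelContext());
            // Reply to complete the Jetty continuation for this request.
            getSender().tell("ack: " + body, getSelf());
        } else {
            unhandled(message);
        }
    }
}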
Jetty is certainly not that bad and should be able to handle tens of thousands of connections with many thousands of TPS.
It is hard to diagnose from what you have said, other than that Jetty does not send 503s when it is under load... unless perhaps the Denial of Service protection filter is deployed? (ab would look like a DoS attack, which it basically is, and it is not a great load generator for benchmarking.)
So you need to track down who/what is sending that 503 and why.
It was my bad code: the sender (client) info was overwritten by overlapping requests. The 503 error message was sent due to a Jetty continuation timeout.
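For anyone hitting something similar, the bug class was roughly the following (a hypothetical illustration, not the actual code): per-request state kept in a field shared by overlapping requests.

public class ReplyRouter {
    private String currentClientId;               // shared across requests: unsafe

    public void onRequest(String clientId) {
        this.currentClientId = clientId;          // request B overwrites request A's value here
        String result = callSlowBackend();
        reply(this.currentClientId, result);      // A's reply can go to B, and A is never answered
    }

    private String callSlowBackend() {
        return "result";                          // stand-in for the slow downstream call
    }

    private void reply(String clientId, String result) {
        System.out.println(clientId + " <- " + result);
    }
}

With two overlapping requests, one caller never receives its reply, so the Jetty continuation eventually times out and a 503 is returned.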
I am trying to benchmark Qpid with the following use case (a stripped-down sketch of the harness follows below):
- Default Qpid configs are used (e.g. 2 GB is the max memory set); broker and client are on the same machine.
- There is 1 connection and 256 sessions per connection; each session has a producer and a consumer, so there are 256 producers and 256 consumers.
- All the producers/consumers are created before they start producing/consuming messages. Each producer/consumer is a thread, and they run in parallel.
- Consumers start consuming first (they wait in .receive()). All consumers are durable subscribers.
- Producers then start producing; each producer produces only 1 message, so 256 messages are produced in total.
- A fanout exchange is used (topic.fanout=fanout://amq.fanout//fanOutTopic); with 256 consumers, each consumer receives all 256 messages, so 256*256 messages are received in total.
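Here is the stripped-down sketch of the harness (JMS API; the JNDI/connection settings are placeholders that match the fanout address above, not my exact test code, and producers/consumers are simplified to separate sessions):

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Topic;
import javax.jms.TopicSubscriber;
import javax.naming.Context;
import javax.naming.InitialContext;
import java.util.Properties;

public class QpidFanoutBench {
    public static void main(String[] args) throws Exception {
        // JNDI settings are placeholders; the fanout address matches the one above.
        Properties props = new Properties();
        props.put(Context.INITIAL_CONTEXT_FACTORY,
                "org.apache.qpid.jndi.PropertiesFileInitialContextFactory");
        props.put("connectionfactory.qpidConnectionFactory",
                "amqp://guest:guest@clientid/test?brokerlist='tcp://localhost:5672'");
        props.put("topic.fanout", "fanout://amq.fanout//fanOutTopic");

        Context ctx = new InitialContext(props);
        ConnectionFactory factory = (ConnectionFactory) ctx.lookup("qpidConnectionFactory");
        Topic fanout = (Topic) ctx.lookup("fanout");

        Connection connection = factory.createConnection();
        connection.setClientID("bench-client");
        connection.start();

        int count = 256;

        // 256 durable subscribers, each waiting in receive() on its own session/thread.
        for (int i = 0; i < count; i++) {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            TopicSubscriber subscriber = session.createDurableSubscriber(fanout, "sub-" + i);
            new Thread(() -> {
                try {
                    while (true) {
                        TextMessage m = (TextMessage) subscriber.receive();
                        long rt = System.currentTimeMillis() - Long.parseLong(m.getText());
                        // record rt for the min/max/avg/percentile stats
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }, "consumer-" + i).start();
        }

        // 256 producers, each sending a single timestamped message to the fanout exchange.
        for (int i = 0; i < count; i++) {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(fanout);
            producer.send(session.createTextMessage(Long.toString(System.currentTimeMillis())));
        }
    }
}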
The following are the response times (RTs) for the messages. Response time is defined as the difference between the time at which a message is sent to the broker and the time at which it is received at the consumer:
min: 144.0 ms
max: 350454.0 ms
average: 151933.02 ms
stddev: 113347.89 ms
95th percentile: 330559.0 ms
Is there anything that I am doing wrong fundamentally? I am worried about the average response time of ~152 seconds. Is this expected from Qpid? I also see a pattern: as the test runs, the RTs increase linearly over time.
Thank you,
Siva.
I found out that when I attach a debugger to the application and start debugging, the connection to the Terracotta server is lost (?) and the following messages appear in the Terracotta server logs:
2012-03-30 13:45:06,758 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1
2012-03-30 13:45:27,761 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1
2012-03-30 13:45:31,761 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 2
...
2012-03-30 13:46:37,768 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 10. But its too long. No more retries
2012-03-30 13:46:38,768 [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 is DEAD
2012-03-30 13:46:38,768 [HealthChecker] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: DSO Server - Declared connection dead ConnectionID(1.0b1994ac80f14b7191080bdc3f38582a) idle time 45317ms
2012-03-30 13:46:38,768 [L2_L1:TCWorkerComm # 0_R] WARN com.tc.net.protocol.transport.ServerMessageTransport - ConnectionID(1.0b1994ac80f14b7191080bdc3f38582a): CLOSE EVENT : com.tc.net.core.TCConnectionJDK14@5158277: connected: false, closed: true local=127.0.0.1:9510 remote=127.0.0.1:55112 connect=[Fri Mar 30 13:34:22 BST 2012] idle=2001ms [207584 read, 229735 write]. STATUS : DISCONNECTED
...
2012-03-30 13:46:38,799 [L2_L1:TCWorkerComm # 0_R] INFO com.tc.objectserver.persistence.sleepycat.SleepycatPersistor - Deleted client state for ChannelID=[1]
2012-03-30 13:46:38,801 [WorkerThread(channel_life_cycle_stage, 0)] INFO com.tc.objectserver.handler.ChannelLifeCycleHandler - : Received transport disconnect. Shutting down client ClientID[1]
2012-03-30 13:46:38,801 [WorkerThread(channel_life_cycle_stage, 0)] INFO com.tc.objectserver.persistence.impl.TransactionStoreImpl - shutdownClient() : Removing txns from DB : 0
After this happens, any cache operation, such as getWithLoader, simply does not respond until the Terracotta server is restarted.
Question: how can this be fixed/reconfigured? I assume it can also happen in production (and it actually sometimes does) if, for some (any) reason, the application hangs/stalls/etc.
This is just to get you started.
TC connections between server and client are considered dead when the applicable HealthCheck fails. The default values for the HealthCheck assume a very stable and performant network. I recommend you familiarize yourself with the details and the calculations at
http://www.terracotta.org/documentation/3.5.2/terracotta-server-array/high-availability#85916
So typically you begin with
a) making sure your network doesn't hiccup occasionally
b) setting the TC HealthCheck values a bit higher
If the problem persists, I'd recommend posting directly on the TC forums (they'll help you even if you only use the open-source edition; it may take a few days to get a reply, though).