Kubernetes pod (Java) restarts with 137 TERMINATED

We are running Kubernetes (1.18) with Docker 1.19 & systemd on an on-prem deployment with 3 masters and 3 workers. OS is RedHat 7.8.
The container is a Java 13 based Spring Boot app (base image openjdk:13-alpine); the memory settings are below.
Pod:
  memory - min 448M and max 2500M
  cpu - min 0.1
Container:
  Xms: 256M, Xmx: 512M
When traffic is sent for an extended period, the container suddenly restarts, and in Prometheus I can see that the pod memory is below the max level (only around 1300MB).
In the pod events I can see warnings for the liveness and readiness probes, and the pod getting restarted.
State:          Running
  Started:      Sun, 23 Aug 2020 15:39:13 +0530
Last State:     Terminated
  Reason:       Error
  Exit Code:    137
  Started:      Sun, 23 Aug 2020 15:23:03 +0530
  Finished:     Sun, 23 Aug 2020 15:39:12 +0530
Ready:          True
Restart Count:  14
Which logs can I refer to in order to figure out why a restart was triggered? The application log is not helping at all: after the last log line of the running app, the next line is the first line of the startup log.
What are the recommended approaches to troubleshoot this?
Thanks

Exit code 137 means 128 + 9, so the process was killed with signal 9 (SIGKILL, as with kill -9).
https://tldp.org/LDP/abs/html/exitcodes.html
Have a look at the pod and application logs.
Maybe the container needs more resources to start?
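The 128 + N convention above can be checked with a few lines of Python. This is only a minimal sketch: the helper name describe_exit_code is made up for illustration, and the signal names come from the platform running the script (on Linux, signal 9 is SIGKILL).

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the shell's 128 + N convention."""
    if code > 128:
        signum = code - 128
        try:
            # Map the signal number to its symbolic name on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            # Unknown signal number on this platform: fall back to the number.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

The exit code only says that something sent SIGKILL. It does not say whether the sender was the kernel OOM killer, the kubelet after a failed probe, or something else, so the pod events and node logs are still needed.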

Related

Kubernetes Startup probe when endpoint is not reachable

I have the following deployment config. The test-worker-health and health endpoints are both unreachable because the application is failing due to an error. After the startup probe fails, the container keeps getting restarted because restartPolicy is Always, and the pods enter the CrashLoopBackOff state. Is there a way to fail such a startup probe?
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 20
  successThreshold: 1
  timeoutSeconds: 30
startupProbe:
  httpGet:
    path: /test-worker-health
    port: 8080
  failureThreshold: 12
  periodSeconds: 10

The startup probe keeps restarting the container after failing
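A failing startupProbe does restart your container: once failureThreshold is exhausted, the kubelet kills the container and restartPolicy applies. The livenessProbe only starts running after the startupProbe has succeeded.
The pods enter CrashLoopBackoff state. Is there a way to fail such startup probe?
Removing only the livenessProbe will not stop this while the startupProbe keeps failing. If you do not want restarts, you may want to use a readinessProbe instead, since a failing readinessProbe only takes the pod out of the Service endpoints.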
Is there a way to fail such startup probe?
What do you mean? It is already "failing" as you say. You want automatic rollback? That is provided by e.g. Canary Deployment, but is a more advanced topic.

According to your configuration, the startupProbe is tried for up to 120 seconds (failureThreshold 12 x periodSeconds 10); if it does not succeed at least once during that period, it fails.
If your application needs more than 120 seconds to start up, the startupProbe will keep restarting it before it has booted completely.
I'd suggest increasing the failureThreshold to give your application sufficient time to boot.
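The arithmetic behind that window can be sketched as follows. The function names are hypothetical, initialDelaySeconds is assumed to be 0 because the config does not set it, and the 300-second startup time is an invented example. The result is an approximation of the budget, not an exact kubelet timing.

```python
import math


def startup_budget_seconds(failure_threshold: int, period_seconds: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time the startupProbe allows before the container is killed."""
    return initial_delay_seconds + failure_threshold * period_seconds


def min_failure_threshold(startup_seconds: float, period_seconds: int,
                          initial_delay_seconds: int = 0) -> int:
    """Smallest failureThreshold whose budget covers the given startup time."""
    remaining = max(startup_seconds - initial_delay_seconds, 0)
    return max(math.ceil(remaining / period_seconds), 1)


# Values from the question: failureThreshold 12, periodSeconds 10.
print(startup_budget_seconds(12, 10))

# Hypothetical application that needs about 300 seconds to boot.
print(min_failure_threshold(300, 10))
```

In practice it is worth leaving some headroom above the computed threshold, since startup time usually varies between nodes and under load.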

Prevent Spring Boot application closing until all current requests are finished

We have a Spring Boot (2.0.4) application exposing a number of endpoints, one of which enables clients to retrieve sometimes very large files (~200 GB). The application is exposed in a Pod via a Kubernetes deployment configured with the rolling-update strategy.
When we update our deployment by setting the image to the latest version the pods get destroyed and new ones spun up. Our service provision is seamless for new requests. However current requests can and do get severed and this can be annoying for clients in the middle of downloading very large files.
We can configure Container Lifecycle Pre-Stop hooks in our deployment spec to inject a pause before shutdown signals are sent to the app via its PID. This helps prevent any new traffic going to pods which have been set to Terminate. Is there a way to then pause the application shutdown process until all current requests have been completed (this may take tens of minutes)?
Here's what we have tried from within the Spring Boot application:
Implementing a shutdown listener which intercepts ContextClosedEvent; unfortunately we can't reliably retrieve a list of active requests. Any Actuator metrics which may have been useful are unavailable at this stage of the shutdown process.
Counting active sessions by implementing an HttpSessionListener and overriding the sessionCreated/sessionDestroyed methods to update a counter. This fails because the methods are not invoked on a separate thread, so the shutdown listener always sees the same value.
Any other strategy we should try? From within the app itself, or the container, or directly through Kubernetes resource descriptors? Advice/Help/Pointers would be much appreciated.
Edit: We manage the cluster so we're only trying to mitigate service outages to currently connected clients during a managed update of our deployment via a modified pod spec

You could increase terminationGracePeriodSeconds; the default is 30 seconds. Unfortunately, there's nothing to prevent a cluster admin from force deleting your pod, and there are all sorts of reasons the whole node could go away.

We did a combination of the above to resolve our problem.
Increased the terminationGracePeriodSeconds to the absolute maximum we expect to see in production.
Added a livenessProbe and a readinessProbe to prevent Traefik routing to our pod too soon.
Introduced a pre-stop hook that injects a pause and invokes a monitoring script, which:
  monitors netstat for ESTABLISHED connections to our process (pid 1) with a Foreign Address of our cluster Traefik service
  sends TERM to pid 1
Note that because we send TERM to pid 1 from the monitoring script, the pod terminates at this point and the terminationGracePeriodSeconds never gets hit (it's there as a precaution).
Here's the script:
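#!/bin/sh
# Wait until no ESTABLISHED connections from the Traefik service remain on the Java process (pid 1).
while [ "$(/bin/netstat -ap 2>/dev/null | /bin/grep 'http-alt.*ESTABLISHED.*1/java' | grep -c traefik-ingress-service)" -gt 0 ]
do
  sleep 1
done
# All tracked connections are gone: ask the application to shut down.
kill -TERM 1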
Here's the new pod spec:
containers:
- env:
  - name: spring_profiles_active
    value: dev
  image: container.registry.host/project/app:##version##
  imagePullPolicy: Always
  lifecycle:
    preStop:
      exec:
        command:
        - /bin/sh
        - -c
        - sleep 5 && /monitoring.sh
  livenessProbe:
    httpGet:
      path: /actuator/health
      port: 8080
    initialDelaySeconds: 60
    periodSeconds: 20
    timeoutSeconds: 3
  name: app
  ports:
  - containerPort: 8080
  readinessProbe:
    httpGet:
      path: /actuator/health
      port: 8080
    initialDelaySeconds: 60
  resources:
    limits:
      cpu: 2
      memory: 2Gi
    requests:
      cpu: 2
      memory: 2Gi
imagePullSecrets:
- name: app-secret
serviceAccountName: vault-auth
terminationGracePeriodSeconds: 86400

Try to shut down your Spring Boot application gracefully.
This might help:
https://dzone.com/articles/graceful-shutdown-spring-boot-applications
This implementation is meant to ensure that none of your active connections are killed: the application waits for them to finish before shutting down.

Docker container disappeared and job too slow

I have a multi-threaded data-processing job that completes in around 5 hours on an EC2 instance. When the same code runs in a Docker container (which I configured with 7 GB of RAM before creating it), the job runs slowly for 12+ hours and then the container disappears. How can we fix this? Why would the job be so slow in the container? CPU processing was very slow in the container, not just network I/O. Slow network I/O is fine, but I'm wondering what could make CPU processing so much slower than on the EC2 instance. Also, where can I find a detailed trace of what happened in the host operating system to cause the container to die?
docker logs <container_id>
19-Feb-2019 22:49:42.098 INFO [main] org.apache.coyote.AbstractProtocol.start Starting ProtocolHandler ["http-nio-8080"]
19-Feb-2019 22:49:42.105 INFO [main] org.apache.coyote.AbstractProtocol.start Starting ProtocolHandler ["ajp-nio-8009"]
19-Feb-2019 22:49:42.106 INFO [main] org.apache.catalina.startup.Catalina.start Server startup in 27468 ms
19-Feb-2019 22:55:12.122 INFO [localhost-startStop-2] org.apache.catalina.startup.HostConfig.deployDirectory Deploying web application directory [/usr/local/tomcat/webapps/logging]
19-Feb-2019 22:55:12.154 INFO [localhost-startStop-2] org.apache.catalina.startup.HostConfig.deployDirectory Deployment of web application directory [/usr/local/tomcat/webapps/logging] has finished in [32] ms
searchResourcePath=[null], isSearchResourceAvailable=[false]
knowledgeCommonResourcePath=[null], isKnowledgeCommonResourceAvailable=[false]
Load language resource fail...
blah blah blah some application log
bash: line 1: 10 Killed /usr/local/tomcat/bin/catalina.sh run
Error in Tomcat run: 137 ... failed!
On running dmesg -T | grep docker, this is what I see. What does the -500 next to dockerd and docker-proxy mean? How should I interpret the output below?
[Tue Feb 19 14:49:04 2019] docker0: port 1(vethc30f313) entered blocking state
[Tue Feb 19 14:49:04 2019] docker0: port 1(vethc30f313) entered forwarding state
[Tue Feb 19 14:49:04 2019] docker0: port 1(vethc30f313) entered disabled state
[Tue Feb 19 14:49:07 2019] docker0: port 1(vethc30f313) entered blocking state
[Tue Feb 19 14:49:07 2019] docker0: port 1(vethc30f313) entered forwarding state
[Wed Feb 20 04:09:23 2019] [10510] 0 10510 197835 12301 111 0 -500 dockerd
[Wed Feb 20 04:09:23 2019] [11241] 0 11241 84733 5434 53 0 0 docker
[Wed Feb 20 04:09:23 2019] [11297] 0 11297 29279 292 18 0 -500 docker-proxy
[Wed Feb 20 04:09:30 2019] docker0: port 1(vethc30f313) entered disabled state
[Wed Feb 20 04:09:30 2019] docker0: port 1(vethc30f313) entered disabled state
[Wed Feb 20 04:09:30 2019] docker0: port 1(vethc30f313) entered disabled state
At 04:09:23 the output above shows -500 for dockerd etc., and the output below shows that at 04:09:24 it killed Java process 11369 with a score. What does this mean? Did it not kill the docker process? Did it kill the Java process running inside the docker container?
dmesg -T | grep java
[Wed Feb 20 04:09:23 2019] [ 3281] 503 3281 654479 38824 145 0 0 java
[Wed Feb 20 04:09:23 2019] [11369] 0 11369 3253416 1757772 4385 0 0 java
[Wed Feb 20 04:09:24 2019] Out of memory: Kill process 11369 (java) score 914 or sacrifice child
[Wed Feb 20 04:09:24 2019] Killed process 11369 (java) total-vm:13013664kB, anon-rss:7031088kB, file-rss:0kB, shmem-rss:0kB

TL;DR you need to increase the memory on your VM/host, or reduce the memory usage of your application.
The OS is killing Java which is running inside the container because the host ran out of memory. When the process inside the container dies, the container itself goes into an exited state. You can see these non-running containers with docker ps -a.
By default, docker does not limit the CPU or memory of a container. You can add these limits on containers, and if your container exceeds its memory limit, the process inside it gets killed. That result will be visible as an OOMKilled state when you inspect the stopped container.
The -500 you see on the docker processes is their OOM score adjustment, set to prevent the OS from killing docker itself when the host runs out of memory. Instead, the process inside the container gets killed, and you can have a restart policy configured in docker to restart that container.
You can read more about memory limits, and configuring the OOM score for container processes at: https://docs.docker.com/engine/reference/run/
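To put the numbers from the kill record in context, here is a small sketch that parses the "Killed process" line from the question and converts it to GiB. The regular expression and helper names are illustrative, not part of any tool, and the 4 KiB page size is an assumption (it is the usual value on x86_64 Linux).

```python
import re

PAGE_KIB = 4  # assumption: 4 KiB pages, the usual size on x86_64 Linux

KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_killed_line(line: str) -> dict:
    """Extract pid, process name and memory figures from an OOM kill record."""
    match = KILLED_RE.search(line)
    if match is None:
        raise ValueError("not an OOM 'Killed process' line")
    return {
        "pid": int(match["pid"]),
        "name": match["name"],
        "total_vm_kib": int(match["total_vm"]),
        "anon_rss_kib": int(match["anon_rss"]),
    }


def kib_to_gib(kib: int) -> float:
    return kib / (1024 * 1024)


# Kill record copied from the dmesg output in the question.
line = ("[Wed Feb 20 04:09:24 2019] Killed process 11369 (java) "
        "total-vm:13013664kB, anon-rss:7031088kB, file-rss:0kB, shmem-rss:0kB")
info = parse_killed_line(line)
print(f"{info['name']} (pid {info['pid']}): "
      f"{kib_to_gib(info['anon_rss_kib']):.2f} GiB resident")

# The per-process table row for pid 11369 lists rss in pages (1757772);
# multiplied by the page size it matches the anon-rss of the kill record.
print(1757772 * PAGE_KIB == info["anon_rss_kib"])
```

The resident size works out to roughly 6.7 GiB, which is close to the 7 GB the container was configured with, so the Java process was using nearly all of that memory when the host ran out.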

Deploying a website application done in java on cybera

I have a website that consists of .xml, .jsp, servlet and .java files, a persistence layer and a database. I created a server on Cybera as shown below:
Using username "ubuntu".
Authenticating with public key "imported-openssh-key"
Welcome to Ubuntu 14.04.5 LTS (GNU/Linux 3.13.0-121-generic x86_64)
Documentation: https://help.ubuntu.com/
System information as of Tue Mar 27 18:55:56 UTC 2018
System load: 0.0 Processes: 83
Usage of /: 4.5% of 19.65GB Users logged in: 1
Memory usage: 3% IP address for eth0: 10.1.10.0
Swap usage: 0%
Graph this data and manage this system at:
https://landscape.canonical.com/
Get cloud support with Ubuntu Advantage Cloud Guest:
http://www.ubuntu.com/business/services/cloud
0 packages can be updated.
0 updates are security updates.
Cloud Image Helper Scripts
To enable automatic updates please run:
/usr/local/bin/enableAutoUpdate
To install the latest OpenStack tools please run:
/usr/local/bin/installOpenStackTools
To use the local software update proxy please run:
/usr/local/bin/localSUS
To remove this message from your message of the day please run:
sudo rm /etc/motd
Last login: Tue Mar 27
ubuntu@kinstance:~$
I am wondering if there is a way to deploy the website using this method. I am still playing around with Cybera to get it running but I thought I could use some guidance if my method is just plain off.

Tomcat silently dies resulting in 502 server error - running as container

Tomcat version: 8.0.47
Tomcat is running as a container using: "FROM tomcat:8.0.47" in the Dockerfile.
Tomcat starts up and is able to serve my web application with no problems.
After approximately 7-10 minutes, I get a 502 server error:
Error: Server Error
The server encountered a temporary error and could not complete your
request.
Please try again in 30 seconds.
When I check the container with ps -ax, I still see the tomcat process running.
When I check catalina logs, the last log line: INFO [main] org.apache.catalina.startup.Catalina.start Server startup in 72786 ms
When I check application-specific logs, there are no errors.
My question: how do I figure out what is wrong, and why is it silently dying while the process is still active? The first 7 minutes show that the application and Tomcat were deployed correctly, since everything works with no problems during that time.
