I have the following deployment config. The /test-worker-health and /health endpoints are both unreachable because the application is failing with an error. The startup probe keeps restarting the container after failing, since restartPolicy is Always, and the pods enter the CrashLoopBackOff state. Is there a way to fail such a startup probe?
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 20
  successThreshold: 1
  timeoutSeconds: 30
startupProbe:
  httpGet:
    path: /test-worker-health
    port: 8080
  failureThreshold: 12
  periodSeconds: 10
The startup probe keeps restarting the container after failing
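A failing startupProbe does restart your container: once it has failed failureThreshold times, the kubelet kills the container, and restartPolicy: Always starts it again. The livenessProbe only begins running after the startupProbe has succeeded.
The pods enter CrashLoopBackOff state. Is there a way to fail such startup probe?
If you remove both the startupProbe and the livenessProbe, the kubelet will no longer restart the container because of probe failures, although a container that exits on its own will still be restarted. You may want to use a readinessProbe instead?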
Is there a way to fail such startup probe?
What do you mean? It is already "failing" as you say. You want automatic rollback? That is provided by e.g. Canary Deployment, but is a more advanced topic.
According to your configuration, the startupProbe is tried for up to 120 seconds (failureThreshold 12 × periodSeconds 10); if it does not succeed at least once during that period, it fails.
If your application needs more time to start up, i.e. more than 120 seconds, the startupProbe will keep restarting it before it has booted completely.
I'd suggest increasing the failureThreshold to give your application sufficient time to boot.
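The arithmetic behind that suggestion can be sketched in a few lines of shell. This is only an illustration: the function names are made up for this example, the 300-second boot time is a hypothetical figure, and initialDelaySeconds is assumed to be left at its default of 0.

```shell
#!/bin/sh
# Worst-case startup window granted by a startupProbe:
# failureThreshold consecutive failures, one probe every periodSeconds.
startup_window() {
  failure_threshold=$1
  period_seconds=$2
  echo $(( failure_threshold * period_seconds ))
}

# Smallest failureThreshold that covers a given boot time (rounds up).
min_failure_threshold() {
  boot_seconds=$1
  period_seconds=$2
  echo $(( (boot_seconds + period_seconds - 1) / period_seconds ))
}

startup_window 12 10          # values from the question's startupProbe
min_failure_threshold 300 10  # hypothetical 300-second boot time
```

With the values from the question the window is 120 seconds; an application that needed 300 seconds to boot would need a failureThreshold of at least 30 at the same periodSeconds.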
I tried to deploy some applications in Spring Cloud Data Flow.
Normally each deployment takes a few minutes and either succeeds or fails.
But this time the deployment took longer than usual. At one point I pressed "undeploy",
since the system was not responding.
Under Streams, everything now flickers in the UNKNOWN state.
It is not possible to redeploy.
When I try to deploy from the UI, I get the error: Failed to upload the package. Package [test-orders:1.0.0] in Repository [local] already exists.
When I request the status of the pods, I see 2 pods with CrashLoopBackOff status.
I restarted all pods: kubectl -n **** rollout restart deploy
I tried running dataflow:>stream undeploy --name test-orders
I deleted the new docker image from EKS
Changed skipper_status from FAILED to DELETED
The problem still exists.
I'm really at a loss.
OK,
I seem to have been able to solve the problem.
Because of the CrashLoopBackOff status, I suspected that the system could not pull the image or that the image was corrupt.
I overwrote all the images in EKS that are associated with the project.
I changed the problematic skipper_status.status_code to DELETED (update skipper_status set status_code = 'DELETED' where id =***).
In the skipper_release table I added
backoffLimit: 6
completions: 1
parallelism: 1
This way, if the system keeps crashing, the run ends after several attempts.
I restarted all the pods.
And then I pressed the undeploy button in the UI.
Edit 1
I noticed that there were pods left that did not close.
I closed them like this:
kubectl -n foobar delete deployment foo-bar-v1
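To spot leftovers like these, the STATUS column of a pod listing can be filtered. The sketch below assumes the leftover pods are the ones reported as CrashLoopBackOff and works on text in the default `kubectl get pods` column layout; the function name and the sample listing (including the pod names) are hypothetical.

```shell
#!/bin/sh
# Print the names of pods whose STATUS column is CrashLoopBackOff.
# Reads `kubectl get pods` style text from stdin
# (columns: NAME READY STATUS RESTARTS AGE); the header row is skipped.
crashlooping_pods() {
  awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1 }'
}

# Hypothetical listing, for illustration only.
listing='NAME READY STATUS RESTARTS AGE
foo-bar-v1-abc 0/1 CrashLoopBackOff 12 40m
foo-baz-v2-def 1/1 Running 0 3d'

printf '%s\n' "$listing" | crashlooping_pods
```

Against a real cluster the same filter would be fed from `kubectl -n foobar get pods | crashlooping_pods`.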
We have a Spring Boot (2.0.4) application exposing a number of endpoints, one of which enables clients to retrieve sometimes very large files (~200 GB). The application is exposed in a Pod via a Kubernetes deployment configured with the rolling-update strategy.
When we update our deployment by setting the image to the latest version the pods get destroyed and new ones spun up. Our service provision is seamless for new requests. However current requests can and do get severed and this can be annoying for clients in the middle of downloading very large files.
We can configure Container Lifecycle Pre-Stop hooks in our deployment spec to inject a pause before sending shutdown signals to the app via its PID. This helps prevent any new traffic going to pods which have been set to Terminate. Is there a way to then pause the application shutdown process until all current requests have been completed (this may take tens of minutes)?
Here's what we have tried from within the Spring Boot application:
Implementing a shutdown listener which intercepts ContextClosedEvent; unfortunately we can't reliably retrieve a list of active requests. Any Actuator metrics which may have been useful are unavailable at this stage of the shutdown process.
Counting active sessions by implementing an HttpSessionListener and overriding the sessionCreated/sessionDestroyed methods to update a counter. This fails because the methods are not invoked on a separate thread, so the shutdown listener always sees the same value.
Any other strategy we should try? From within the app itself, or the container, or directly through Kubernetes resource descriptors? Advice/Help/Pointers would be much appreciated.
Edit: We manage the cluster so we're only trying to mitigate service outages to currently connected clients during a managed update of our deployment via a modified pod spec
You could increase the terminationGracePeriodSeconds; the default is 30 seconds. But unfortunately, there's nothing to prevent a cluster admin from force deleting your pod, and there are all sorts of reasons the whole node could go away.
We did a combination of the above to resolve our problem.
Increased terminationGracePeriodSeconds to the absolute maximum we expect to see in production.
Added readiness and liveness probes to prevent Traefik routing to our pod too soon.
Introduced a pre-stop hook that injects a pause and then invokes a monitoring script, which:
monitors netstat for ESTABLISHED connections to our process (pid 1) with a Foreign Address of our cluster Traefik service, and
sends TERM to pid 1 once there are none left.
Note that because we send TERM to pid 1 from the monitoring script the pod will terminate at this point and the terminationGracePeriodSeconds never gets hit (it's there as a precaution)
Here's the script:
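#!/bin/sh
# Wait while the Java process (pid 1) still has ESTABLISHED connections from the Traefik ingress service.
while [ "$(/bin/netstat -ap 2>/dev/null | /bin/grep 'http-alt.*ESTABLISHED.*1/java' | grep -c traefik-ingress-service)" -gt 0 ]
do
  sleep 1
done
# No active connections remain: ask the application to shut down.
kill -TERM 1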
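To see what the loop condition actually counts, the same grep pipeline can be run against canned netstat output instead of a live container. This is a sketch only: the function name is made up for this example and the sample lines are hypothetical, not real output from our pods.

```shell
#!/bin/sh
# Count ESTABLISHED connections owned by the Java process (pid 1) whose
# peer is the Traefik ingress service. Reads netstat -ap style text from stdin.
count_active_connections() {
  grep 'http-alt.*ESTABLISHED.*1/java' | grep -c 'traefik-ingress-service'
}

# Hypothetical sample of `netstat -ap` output, for illustration only.
sample='tcp 0 0 app:http-alt traefik-ingress-service.kube-system:51234 ESTABLISHED 1/java
tcp 0 0 app:http-alt 10.0.0.7:40000 ESTABLISHED 1/java
tcp 0 0 app:http-alt traefik-ingress-service.kube-system:51240 TIME_WAIT -'

printf '%s\n' "$sample" | count_active_connections
```

Only the first sample line is counted: the second has no Traefik peer and the third is not ESTABLISHED, so the monitoring loop would keep waiting on exactly one connection here.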
Here's the new pod spec:
containers:
- env:
  - name: spring_profiles_active
    value: dev
  image: container.registry.host/project/app:##version##
  imagePullPolicy: Always
  lifecycle:
    preStop:
      exec:
        command:
        - /bin/sh
        - -c
        - sleep 5 && /monitoring.sh
  livenessProbe:
    httpGet:
      path: /actuator/health
      port: 8080
    initialDelaySeconds: 60
    periodSeconds: 20
    timeoutSeconds: 3
  name: app
  ports:
  - containerPort: 8080
  readinessProbe:
    httpGet:
      path: /actuator/health
      port: 8080
    initialDelaySeconds: 60
  resources:
    limits:
      cpu: 2
      memory: 2Gi
    requests:
      cpu: 2
      memory: 2Gi
imagePullSecrets:
- name: app-secret
serviceAccountName: vault-auth
terminationGracePeriodSeconds: 86400
Try gracefully shutting down your Spring Boot application.
This might help:
https://dzone.com/articles/graceful-shutdown-spring-boot-applications
This implementation makes sure that none of your active connections are killed; the application waits for them to finish before it shuts down.
Tomcat version: 8.0.47
Tomcat is running as a container using: "FROM tomcat:8.0.47" in the Dockerfile.
Tomcat starts up and is able to serve my web application with no problems.
After approximately 7-10 minutes, I get a 502 server error:
Error: Server Error
The server encountered a temporary error and could not complete your
request.
Please try again in 30 seconds.
When I check the container with ps -ax, I still see the Tomcat process running.
When I check catalina logs, the last log line: INFO [main] org.apache.catalina.startup.Catalina.start Server startup in 72786 ms
When I check application-specific logs, there are no errors.
My question: how do I figure out what is wrong, and why does it silently stop responding while the process is still active? The application works with no problems for the first 7 minutes after startup, which shows that the application and Tomcat were deployed correctly.
Tomcat logs INFO: Server startup in 181667 ms, but in Eclipse the server never reaches the Started state; it stays in the Starting, Synchronized state. I'm using Eclipse Juno. I have gone through related posts, but the issue is not resolved.
I get the following error:
Server Liferay v6.0 CE Server (Tomcat 6) at localhost was unable to start within 900 seconds.
If the server requires more time, try increasing the timeout in the server editor.
Please let me know the cause of this.
You can increase your Tomcat start timeout.
Double-click your server, find the Timeouts option, and increase the start timeout (in seconds).
Let me know if you have any problems.
I have a problem when starting a Tomcat 6 server from Eclipse; it pops up the following message:
Server Tomcat v6.0 Server at localhost was unable to start within 45 seconds. If the server requires more time, try increasing the timeout in the server editor.
It could be that it just takes too long to start up. You can check your logs to see whether startup was still midway through when the default 45-second cut-off occurred. If so, double-click the Tomcat instance you are deploying in the Servers view, and in the Overview tab, open the Timeouts section in the right column and give your Tomcat a bit more time to start up.