We have a Spring Boot (2.0.4) application exposing a number of endpoints, one of which enables clients to retrieve sometimes very large files (~200 GB). The application is exposed in a Pod via a Kubernetes deployment configured with the rolling-update strategy.
When we update our deployment by setting the image to the latest version, the pods are destroyed and new ones spun up. Service is seamless for new requests. However, in-flight requests can and do get severed, which is annoying for clients in the middle of downloading very large files.
We can configure container lifecycle preStop hooks in our deployment spec to inject a pause before shutdown signals are sent to the app via its PID. This helps prevent new traffic from going to pods that have been set to Terminating. Is there a way to then pause the application shutdown process until all current requests have completed (this may take tens of minutes)?
Here's what we have tried from within the Spring Boot application:
Implementing a shutdown listener that intercepts ContextClosedEvent; unfortunately, we can't reliably retrieve a list of active requests. Any Actuator metrics that might have been useful are unavailable at this stage of the shutdown process.
Counting active sessions by implementing an HttpSessionListener and overriding the sessionCreated/sessionDestroyed methods to update a counter. This fails because the methods are not invoked on a separate thread, so the shutdown listener always sees the same value.
Any other strategy we should try? From within the app itself, or the container, or directly through Kubernetes resource descriptors? Advice/Help/Pointers would be much appreciated.
Edit: We manage the cluster so we're only trying to mitigate service outages to currently connected clients during a managed update of our deployment via a modified pod spec
You could increase terminationGracePeriodSeconds; the default is 30 seconds. Unfortunately, nothing prevents a cluster admin from force-deleting your pod, and there are all sorts of reasons the whole node could go away.
We did a combination of the above to resolve our problem:
- increased terminationGracePeriodSeconds to the absolute maximum we expect to see in production
- added liveness and readiness probes (the readinessProbe is what keeps Traefik from routing to our pod too soon)
- introduced a pre-stop hook that injects a pause and then invokes a monitoring script, which:
  - monitors netstat for ESTABLISHED connections to our process (pid 1) with a Foreign Address of our cluster Traefik service
  - sends TERM to pid 1 once none remain
Note that because the monitoring script sends TERM to pid 1, the pod terminates at that point and terminationGracePeriodSeconds is never reached (it is only there as a precaution).
Here's the script:
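    #!/bin/sh
    # Wait until no ESTABLISHED connections from Traefik to the Java process (pid 1) remain
    while [ "$(/bin/netstat -ap 2>/dev/null | /bin/grep 'http-alt.*ESTABLISHED.*1/java' | grep -c traefik-ingress-service)" -gt 0 ]
    do
      sleep 1
    done
    kill -TERM 1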
Here's the new pod spec:
    containers:
    - env:
      - name: spring_profiles_active
        value: dev
      image: container.registry.host/project/app:##version##
      imagePullPolicy: Always
      lifecycle:
        preStop:
          exec:
            command:
            - /bin/sh
            - -c
            - sleep 5 && /monitoring.sh
      livenessProbe:
        httpGet:
          path: /actuator/health
          port: 8080
        initialDelaySeconds: 60
        periodSeconds: 20
        timeoutSeconds: 3
      name: app
      ports:
      - containerPort: 8080
      readinessProbe:
        httpGet:
          path: /actuator/health
          port: 8080
        initialDelaySeconds: 60
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 2
          memory: 2Gi
    imagePullSecrets:
    - name: app-secret
    serviceAccountName: vault-auth
    terminationGracePeriodSeconds: 86400
Try to shut down your Spring Boot application gracefully.
This might help:
This implementation makes sure that none of your active connections are killed; the application waits for them to finish before it shuts down.
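As a rough illustration of what "waiting for active connections to finish" involves, the sketch below counts in-flight requests and blocks shutdown until the count reaches zero or a timeout expires. It is written in Python for brevity and is not the implementation referred to above; `InFlightTracker`, `handle_request`, and the timings are hypothetical names chosen for this example. In a Spring Boot app the same counting would typically live in a servlet filter, and Spring Boot 2.3 and later offer a built-in equivalent via `server.shutdown=graceful`.

```python
import threading
import time


class InFlightTracker:
    """Counts requests currently being served and lets shutdown wait for zero."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def __enter__(self):
        # Called when a request starts.
        with self._cond:
            self._count += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        # Called when a request ends, successfully or not.
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()
        return False

    def active(self):
        with self._cond:
            return self._count

    def wait_for_drain(self, timeout):
        """Block until no requests are active; return False if the timeout expires first."""
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout=timeout)


def handle_request(tracker, seconds):
    # Stand-in for a long download: the request counts as active while it sleeps.
    with tracker:
        time.sleep(seconds)


if __name__ == "__main__":
    tracker = InFlightTracker()
    worker = threading.Thread(target=handle_request, args=(tracker, 0.5))
    worker.start()
    time.sleep(0.1)
    print("active requests:", tracker.active())
    print("drained in time:", tracker.wait_for_drain(timeout=5.0))
    worker.join()
```

In a real shutdown hook, `wait_for_drain` would be called with a timeout below terminationGracePeriodSeconds, so that the process still exits on its own before Kubernetes sends SIGKILL.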
I tried to deploy some applications in Spring Cloud Data Flow.
Normally each deploy takes a few minutes and either succeeds or fails.
But this time the deployment took longer than usual. At one point I pressed "undeploy",
since the system was not responding.
Under Streams, everything now flickers in UNKNOWN status.
It is not possible to redeploy.
When I try to deploy, I get this error from the UI: Failed to upload the package. Package [test-orders:1.0.0] in Repository [local] already exists.
When I request the status of the pods, I get 2 pods with CrashLoopBackOff status.
I restarted all pods: kubectl -n **** rollout restart deploy
I tried to run dataflow:>stream undeploy --name test-orders
I deleted the new Docker image from EKS
I changed skipper_status from FAILED to DELETED
The problem still exists.
I'm really at a loss.
I seem to have been able to solve the problem.
From the CrashLoopBackOff status, I realized that the system was unable to pull the image or that the image was corrupt.
I overwrote all the images in EKS that are associated with the project.
I changed the problematic skipper_status.status_code to DELETED (update skipper_status set status_code = 'DELETED' where id =***).
In the skipper_release table I added
backoffLimit: 6
completions: 1
parallelism: 1
This way, a crash after several attempts ends the run instead of retrying indefinitely.
I restarted all the pods.
Then, in the UI, I pressed the undeploy button.
Edit 1
I noticed that there were pods left that did not close.
I closed them like this:
kubectl -n foobar delete deployment foo-bar-v1
I'm new to Docker and am currently struggling with containerizing my Dropwizard application. Each time I build the container, run it, and check the logs, I get a MySQL connection failure. That makes sense, as the container runs in its own isolated environment, where the localhost URL does not point to my machine. What can I do to make my MySQL instance accessible from inside the Docker container? Thanks. This is what my config.yml currently looks like:
driverClass: com.mysql.cj.jdbc.Driver
# the username
user: root
# the password
# the JDBC URL
url: jdbc:mysql://localhost:3306/locations?useLegacyDatetimeCode=false&serverTimezone=UTC
# any properties specific to your JDBC driver:
charSet: UTF-8
# the maximum amount of time to wait on an empty pool before throwing an exception
maxWaitForConnection: 1s
# the SQL query to run when validating a connection's liveness
validationQuery: "/* MyService Health Check */ SELECT 1"
# the timeout before a connection validation queries fail
validationQueryTimeout: 3s
# the minimum number of connections to keep open
minSize: 8
# the maximum number of connections to keep open
maxSize: 32
# whether or not idle connections should be validated
checkConnectionWhileIdle: false
# the amount of time to sleep between runs of the idle connection validation, abandoned cleaner and idle pool resizing
evictionInterval: 10s
# the minimum amount of time an connection must sit idle in the pool before it is eligible for eviction
minIdleTime: 1 minute
# Logging settings.
# level: INFO
# loggers:
# io.dropwizard: DEBUG
# org.eclipse.jetty.servlets: DEBUG
# org.hibernate.SQL: ALL
# com.udemy.LocationsApplication:
# level: ALL,
# additive: false
# appenders:
# - type: conso
# logFormat: "%red(CDR) [%magenta(%date)] [%thread] [%cyan(%logger{0})]: %message%n"
# appenders:
# - type: console
# logFormat: "%highlight(%-5level) [%magenta(%date)] [%thread] [%cyan(%logger{0})]: %message%n"
This assumes you have a MySQL container exposing port 3306 alongside your Dropwizard container.
You can create a network for them:
docker network create <network_name>
And attach both containers to it:
docker run .... --network <network_name>
This should let both containers see each other. On a user-defined network, containers can reach one another by container name, so the host in the JDBC URL should be the MySQL container's name rather than localhost.
You can list your networks with docker network ls. If you use docker-compose, it will create the networks automatically.
You can also use the host network, which additionally lets the containers connect to ports on your machine (the virtual machine, if you are running Docker in one).
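To make the URL change concrete, the sketch below rewrites the host in the JDBC URL from the question and adds a small TCP check that could be run from inside the app container. It assumes a hypothetical MySQL container named `mysql-db`; both helper names are made up for this example.

```python
import socket
from urllib.parse import urlsplit, urlunsplit


def jdbc_url_with_host(jdbc_url, new_host):
    """Return the JDBC URL with its host replaced, keeping port, path and query."""
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix):
        raise ValueError("not a JDBC URL: " + jdbc_url)
    parts = urlsplit(jdbc_url[len(prefix):])
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    return prefix + urlunsplit(parts._replace(netloc=netloc))


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    url = "jdbc:mysql://localhost:3306/locations?useLegacyDatetimeCode=false&serverTimezone=UTC"
    # "mysql-db" is an assumed container name, not one taken from the question.
    print(jdbc_url_with_host(url, "mysql-db"))
    print("reachable:", can_connect("mysql-db", 3306))
```

If `can_connect` returns False from inside the app container, the two containers are most likely not on the same network, or the MySQL container is not running under the assumed name.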
I have the following deployment config. The test-worker-health and health endpoints are both unreachable because the application is failing with an error. The startup probe keeps restarting the container after failing, because restartPolicy is Always. The pods enter the CrashLoopBackOff state. Is there a way to fail such a startup probe?
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8080
      periodSeconds: 20
      successThreshold: 1
      timeoutSeconds: 30
    startupProbe:
      httpGet:
        path: /test-worker-health
        port: 8080
      failureThreshold: 12
      periodSeconds: 10
The startup probe keeps restarting the container after failing
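Strictly speaking, a failed startupProbe does get the container killed: once failureThreshold is exhausted, the kubelet kills the container and restartPolicy decides what happens next. The livenessProbe only starts running after the startup probe has succeeded.
The pods enter CrashLoopBackoff state. Is there a way to fail such startup probe?
With restartPolicy: Always, a container that keeps failing its startup probe (or keeps crashing on its own) will keep being restarted, and removing the livenessProbe does not change that. If the goal is only to keep traffic away from an unhealthy pod, a readinessProbe does that without restarting the container.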
Is there a way to fail such startup probe?
What do you mean? It is already "failing" as you say. You want automatic rollback? That is provided by e.g. Canary Deployment, but is a more advanced topic.
According to your configuration, the startupProbe is tried for up to 120 seconds (failureThreshold 12 × periodSeconds 10); if it has not succeeded at least once in that window, it fails.
If your application needs more than 120 seconds to start, the startupProbe will keep restarting it before it has booted completely.
I'd suggest increasing failureThreshold to give your application enough time to boot.
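The arithmetic behind that 120-second window can be sketched as follows. The function names are hypothetical, and the calculation is an approximation: it multiplies failureThreshold by periodSeconds and ignores timeoutSeconds and scheduling jitter.

```python
import math


def startup_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Approximate time a container has to pass its startup probe at least once."""
    return initial_delay_seconds + failure_threshold * period_seconds


def min_failure_threshold(startup_seconds, period_seconds, initial_delay_seconds=0):
    """Smallest failureThreshold whose budget covers the given startup time."""
    remaining = max(0, startup_seconds - initial_delay_seconds)
    return max(1, math.ceil(remaining / period_seconds))


if __name__ == "__main__":
    # Values from the question: failureThreshold 12, periodSeconds 10.
    print(startup_budget_seconds(12, 10))
    # Hypothetical app that needs about five minutes to boot.
    print(min_failure_threshold(300, 10))
```

Picking a threshold somewhat above the computed minimum leaves headroom for slow starts on a busy node.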
I am unable to deregister a service from Consul.
The official Consul documentation says that services are deregistered automatically, but in my case that does not happen.
The Consul lifecycle documentation says:
To prevent an accumulation of dead nodes (nodes in either failed or left states), Consul will automatically remove dead nodes out of the catalog. This process is called reaping. This is currently done on a configurable interval of 72 hours (changing the reap interval is not recommended due to its consequences during outage situations). Reaping is similar to leaving, causing all associated services to be deregistered.
This is my bootstrap.yml file
port: 8089
name: ***-service
host: consul-ui
port: 8500
deregister: true
instance-id: ${spring.application.name}:${random.value}
enabled: true
register: true
health-check-interval: 20s
prefer-ip-address: true
enabled: true
prefix: configuration
defaultContext: shared
format: YAML
data-key: data
enabled: true
enabled: true
After removing the service with the purge command, it still shows up in the Consul UI, which means it has not been deregistered from Consul.
You need to configure this on Consul, as your app seems not to exit gracefully.
Check out the Consul config property deregister_critical_service_after:
In Consul 0.7 and later, checks that are associated with a service may
also contain an optional deregister_critical_service_after field,
which is a timeout in the same Go time format as interval and ttl. If
a check is in the critical state for more than this configured value,
then its associated service (and all of its associated checks) will
automatically be deregistered. The minimum timeout is 1 minute, and
the process that reaps critical services runs every 30 seconds, so it
may take slightly longer than the configured timeout to trigger the
deregistration. This should generally be configured with a timeout
that's much, much longer than any expected recoverable outage for the
given service.
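For services registered directly through the agent HTTP API, the field sits inside the check definition. The sketch below only builds the JSON body for `PUT /v1/agent/service/register` and sends nothing; the service name, ID, port, health path, and the `5m` timeout are hypothetical values. With Spring Cloud Consul, the corresponding setting should be `spring.cloud.consul.discovery.health-check-critical-timeout`, though that is worth confirming against the version in use.

```python
import json


def service_registration(name, service_id, port, health_path,
                         interval="20s", deregister_after="5m"):
    """Build a Consul agent service registration body with an HTTP health check."""
    return {
        "Name": name,
        "ID": service_id,
        "Port": port,
        "Check": {
            "HTTP": f"http://localhost:{port}{health_path}",
            "Interval": interval,
            # Deregister the service once its check has been critical this long.
            "DeregisterCriticalServiceAfter": deregister_after,
        },
    }


if __name__ == "__main__":
    # All values here are placeholders for illustration.
    body = service_registration("orders-service", "orders-service:1", 8089, "/actuator/health")
    print(json.dumps(body, indent=2))
```

As the quoted documentation notes, the timeout should be much longer than any outage the service is expected to recover from, since deregistration removes the service and its checks entirely.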
That documentation is about consul nodes, not service nodes.
How exactly are you terminating the application?
The README in this repo has been updated to demonstrate the solution in the accepted answer.
I'm working with a simple example of a Spring Boot Eureka service registration and discovery based on this guide.
If I start up one client instance, it registers properly, and it can see itself through the DiscoveryClient. If I start up a second instance with a different name, it works as well.
But if I start up two instances with the same name, the dashboard only shows 1 instance running, and the DiscoveryClient only shows the second instance.
When I kill the 2nd instance, the 1st one is visible again through the dashboard and the discovery client.
Here are some more details about the steps I'm taking and what I'm seeing:
Eureka Server
Start the server
cd eureka-server
mvn spring-boot:run
Visit the Eureka dashboard at http://localhost:8761
Note that there are no 'Instances' yet registered
Eureka Client
Start up a client
cd eureka-client
mvn spring-boot:run
Visit the client directly at http://localhost:8080/
The /whoami endpoint will show the client's self-knowledge of its application name and port
The /instances endpoint will take up to a minute to update, but should eventually show all the instances of eureka-client that have been registered with the Eureka Discovery Client.
You can also visit the Eureka dashboard again now and see it listed there.
Spin up another client with a different name
You can see that another client will be registered by doing the following:
cd eureka-client
mvn spring-boot:run -Dspring.application.name=foo -Dserver.port=8081
The /whoami endpoint will show the name foo and the port 8081.
In a minute or so, the /instances endpoint will show the information about this foo instance too.
On the Eureka dashboard, two clients will now be registered.
Spin up another client with the same name
Now try spinning up another instance of eureka-client by only over-riding the port parameter:
cd eureka-client
mvn spring-boot:run -Dserver.port=8082
The /whoami endpoint for http://localhost:8082 shows what we expect.
In a minute or so, the /instances endpoint now shows the instance running on port 8082 also, but for some reason, it doesn't show the instance running on port 8080.
And if we check the /instances endpoint on http://localhost:8080, we also now only see the instance running on 8082 (even though the one on 8080 is clearly running, since that's the one we're asking).
The Eureka dashboard only shows 1 instance of eureka-client running.
What's going on here?
Let's try killing the instance running on 8082 and see what happens.
When we query /instances on 8080, it still only shows the instance on 8082.
But a minute later, that goes away and we just see the instance on 8080 again.
The question is, why don't we see both instances of eureka-client when they are both running?
For local deployments, try configuring the {namespace}.instanceId property in eureka-client.properties (or eureka.instance.metadataMap.instanceId in the YAML file for a Spring Cloud based setup). The behaviour is deeply rooted in the way the Eureka server calculates application lists and compares InstanceInfo in PeerAwareInstanceRegistryImpl: when no more concrete data (e.g. instance metadata) is available, it tries to get the id from the hostname.
I wouldn't recommend it for AWS deployments though, because messing around with instanceId will make it harder to figure out which machine hosts a particular service. On the other hand, I doubt that you'll host two identical services on one machine, right?
To get all instances to show up in the admin portal, set a unique eureka.instance.hostname in your Eureka configuration file.
The hostname is used as key for storing the InstanceInfo in com.netflix.discovery.shared.Application (since no UniqueIdentifier is set). So you have to use unique hostnames. When you test ribbon in this scenario you would see that the load won't be balanced.
The following application.yml is an example:
    server:
      port: ${PORT:0}
    info:
      component: example.server
    logging:
      level:
        com.netflix.discovery: 'OFF'
        org.springframework.cloud: 'DEBUG'
    eureka:
      instance:
        leaseRenewalIntervalInSeconds: 1
        leaseExpirationDurationInSeconds: 1
        metadataMap:
          instanceId: ${spring.application.name}:${spring.application.instance_id:${random.value}}
        instanceId: ${spring.application.name}:${spring.application.instance_id:${random.value}}
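The overwrite described in this answer can be reproduced with a toy model. The sketch below is not Eureka's code; it only mimics a registry that stores one entry per key, using the two eureka-client instances from the question, to show why keying by hostname loses an instance while keying by a unique instance ID keeps both. The `register_all` helper and the dictionary layout are hypothetical.

```python
def register_all(instances, key):
    """Store instances in a dict; a later instance with the same key replaces the earlier one."""
    registry = {}
    for instance in instances:
        registry[key(instance)] = instance
    return registry


instances = [
    {"app": "eureka-client", "hostname": "localhost", "port": 8080},
    {"app": "eureka-client", "hostname": "localhost", "port": 8082},
]

# Keyed by hostname: both instances share "localhost", so the second replaces the first.
by_hostname = register_all(instances, key=lambda i: i["hostname"])

# Keyed by a unique instance ID (here app name plus port): both instances survive.
by_instance_id = register_all(instances, key=lambda i: f'{i["app"]}:{i["port"]}')

if __name__ == "__main__":
    print(len(by_hostname), [i["port"] for i in by_hostname.values()])
    print(len(by_instance_id), [i["port"] for i in by_instance_id.values()])
```

This matches what the question observes: only the instance on 8082 is listed while both are running, and the one on 8080 reappears after 8082 is stopped and its entry expires.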
This was a bug in Eureka; you can find further information at https://github.com/codecentric/spring-boot-admin/issues/134