Couldn't acquire semaphore - Zuul Configuration - java

After reading the documentation on Spring Cloud Zuul, I understood that with SEMAPHORE isolation and max concurrent requests of around 2, the Zuul server can easily handle about 5000 rps. With a value of around 2000 it throws the following exception and shows a 100% error rate when I invoke the service from JMeter.
com.netflix.hystrix.exception.HystrixRuntimeException:
Service1 could not acquire a semaphore for execution and no fallback available.
I then bumped the max concurrent requests up to 200000; it still throws the exception, but the error rate has dropped to 10%.
Can you please explain the reason for this? Is it because of a slow microservice, or a configuration issue in Spring Cloud Zuul? The following is the configuration:
ribbon:
  ConnectTimeout: 20000000
  ReadTimeout: 20000000
  MaxTotalHttpConnections: 5000
  MaxHttpConnectionsPerHost: 5000
  ActiveConnectionsLimit: 4000
hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds: 20000000
zuul.hystrix.command.default.execution.isolation.strategy: SEMAPHORE
zuul.hystrix.command.default.execution.isolation.semaphore.maxConcurrentRequests: 2000
zuul.hystrix.command.default.fallback.isolation.semaphore.maxConcurrentRequests: 2000
zuul.eureka.default.semaphore.maxSemaphores: 30000

After some tests, I found it should be:
zuul.semaphore.maxSemaphores: 30000
This is different from the issue on GitHub; maybe it is related to the version.
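
For reference, the limit the exception refers to is the Hystrix execution semaphore. With SEMAPHORE isolation there is no separate thread pool: the caller's thread runs the command, so the semaphore count is the only back-pressure, which is why a burst larger than the limit fails fast instead of queueing. Below is a minimal sketch of a plain Hystrix command using the same SEMAPHORE settings; the class Service1Command and its body are made up for illustration and are not Zuul's internal code.

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

public class Service1Command extends HystrixCommand<String> {

    public Service1Command() {
        super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("Service1"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionIsolationStrategy(
                                HystrixCommandProperties.ExecutionIsolationStrategy.SEMAPHORE)
                        // Calls beyond this number of concurrent executions are rejected
                        // immediately with "could not acquire a semaphore for execution".
                        .withExecutionIsolationSemaphoreMaxConcurrentRequests(2000)));
    }

    @Override
    protected String run() {
        return "forwarded"; // placeholder for the proxied call
    }
}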

Related

jetty threads increasing linearly

All, I have an Apache FUSION server and have configured Jetty for it.
Using New Relic, I can see that the thread count is increasing linearly. After a while the threads grow to a limit and cause an out-of-memory exception until I restart my proxy server.
Please find below the start.ini settings I used to regulate the number of threads.
--module=server
jetty.threadPool.minThreads=10
jetty.threadPool.maxThreads=150
jetty.threadPool.idleTimeout=5000
jetty.server.dumpAfterStart=false
jetty.server.dumpBeforeStop=false
jetty.httpConfig.requestHeaderSize=32768
etc/jetty-stop-timeout.xml
--module=continuation
--module=deploy
--module=jsp
--module=ext
--module=resources
--module=client
--module=annotations
--module=servlets
etc/jetty-logging.xml
--module=jmx
--module=stats
I tried adding the thread-enabled property too, but it didn't work. Can anyone help me limit these threads? With the same configuration on other servers, New Relic shows that the threads are not increasing and stay well within range.
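
For comparison, if Jetty were embedded rather than driven from start.ini, the same three settings map directly onto a QueuedThreadPool. A minimal sketch under that assumption (the port and values are illustrative):

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;
import org.eclipse.jetty.util.thread.QueuedThreadPool;

public class EmbeddedJettyThreadLimits {
    public static void main(String[] args) throws Exception {
        // maxThreads=150, minThreads=10, idleTimeout=5000 ms, mirroring the start.ini values
        QueuedThreadPool threadPool = new QueuedThreadPool(150, 10, 5000);
        Server server = new Server(threadPool);

        ServerConnector connector = new ServerConnector(server);
        connector.setPort(8080); // illustrative port
        server.addConnector(connector);

        server.start();
        server.join();
    }
}

The pool itself caps Jetty worker threads at maxThreads; threads that keep growing past that bound usually belong to other pools (schedulers, client libraries), which a thread dump would show.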

Prevent Spring Boot application closing until all current requests are finished

We have a Spring Boot (2.0.4) application exposing a number of endpoints, one of which enables clients to retrieve sometimes very large files (~200 GB). The application is exposed in a Pod via a Kubernetes deployment configured with the rolling-update strategy.
When we update our deployment by setting the image to the latest version, the pods get destroyed and new ones are spun up. Our service provision is seamless for new requests. However, current requests can and do get severed, and this can be annoying for clients in the middle of downloading very large files.
We can configure Container Lifecycle Pre-Stop hooks in our deployment spec to inject a pause before sending shutdown signals to the app via its PID. This helps prevent any new traffic going to pods which have been set to Terminate. Is there a way to then pause the application shutdown process until all current requests have been completed (this may take tens of minutes)?
Here's what we have tried from within the Spring Boot application:
Implementing a shutdown listener which intercepts ContextClosedEvents; unfortunately we can't reliably retrieve a list of active requests. Any Actuator metrics which might have been useful are unavailable at this stage of the shutdown process.
Counting active sessions by implementing an HttpSessionListener and overriding the sessionCreated/sessionDestroyed methods to update a counter. This fails because the methods are not invoked on a separate thread, so the counter always reports the same value in the shutdown listener.
Any other strategy we should try? From within the app itself, or the container, or directly through Kubernetes resource descriptors? Advice/Help/Pointers would be much appreciated.
Edit: We manage the cluster, so we're only trying to mitigate service outages to currently connected clients during a managed update of our deployment via a modified pod spec.
You could increase terminationGracePeriodSeconds; the default is 30 seconds. But unfortunately, there's nothing to prevent a cluster admin from force-deleting your pod, and there are all sorts of reasons the whole node could go away.
We did a combination of the above to resolve our problem.
increased the terminationGracePeriodSeconds to the absolute maximum we expect to see in production
added livenessProbe to prevent Traefik routing to our pod too soon
introduced a pre-stop hook injecting a pause and invoking a monitoring script:
Monitored netstat for ESTABLISHED connections to our process (pid 1) with a Foreign Address of our cluster Traefik service
sent TERM to pid 1
Note that because we send TERM to pid 1 from the monitoring script, the pod terminates at this point and terminationGracePeriodSeconds never gets hit (it's there as a precaution).
Here's the script:
#!/bin/sh
while [ "$(/bin/netstat -ap 2>/dev/null | /bin/grep http-alt.*ESTABLISHED.*1/java | grep -c traefik-ingress-service)" -gt 0 ]
do
sleep 1
done
kill -TERM 1
Here's the new pod spec:
containers:
- env:
  - name: spring_profiles_active
    value: dev
  image: container.registry.host/project/app:##version##
  imagePullPolicy: Always
  lifecycle:
    preStop:
      exec:
        command:
        - /bin/sh
        - -c
        - sleep 5 && /monitoring.sh
  livenessProbe:
    httpGet:
      path: /actuator/health
      port: 8080
    initialDelaySeconds: 60
    periodSeconds: 20
    timeoutSeconds: 3
  name: app
  ports:
  - containerPort: 8080
  readinessProbe:
    httpGet:
      path: /actuator/health
      port: 8080
    initialDelaySeconds: 60
  resources:
    limits:
      cpu: 2
      memory: 2Gi
    requests:
      cpu: 2
      memory: 2Gi
imagePullSecrets:
- name: app-secret
serviceAccountName: vault-auth
terminationGracePeriodSeconds: 86400
Try to gracefully shut down your Spring Boot application.
This might help:
https://dzone.com/articles/graceful-shutdown-spring-boot-applications
This implementation makes sure that none of your active connections are killed and that the application gracefully waits for them to finish before shutting down.
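For reference, the approach in that article (for Spring Boot 2.x with embedded Tomcat) boils down to pausing the connector on ContextClosedEvent and waiting for the worker pool to drain. The sketch below is a minimal illustration of that idea; the 30-minute wait is an arbitrary value chosen to cover long downloads.

import java.util.concurrent.Executor;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.apache.catalina.connector.Connector;
import org.springframework.boot.web.embedded.tomcat.TomcatConnectorCustomizer;
import org.springframework.boot.web.embedded.tomcat.TomcatServletWebServerFactory;
import org.springframework.context.ApplicationListener;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.event.ContextClosedEvent;

@Configuration
public class GracefulShutdownConfig {

    @Bean
    public GracefulShutdown gracefulShutdown() {
        return new GracefulShutdown();
    }

    @Bean
    public TomcatServletWebServerFactory webServerFactory(GracefulShutdown gracefulShutdown) {
        TomcatServletWebServerFactory factory = new TomcatServletWebServerFactory();
        factory.addConnectorCustomizers(gracefulShutdown);
        return factory;
    }

    static class GracefulShutdown implements TomcatConnectorCustomizer,
            ApplicationListener<ContextClosedEvent> {

        private volatile Connector connector;

        @Override
        public void customize(Connector connector) {
            this.connector = connector;
        }

        @Override
        public void onApplicationEvent(ContextClosedEvent event) {
            // Stop accepting new requests, then wait for in-flight ones to finish.
            connector.pause();
            Executor executor = connector.getProtocolHandler().getExecutor();
            if (executor instanceof ThreadPoolExecutor) {
                ThreadPoolExecutor threadPoolExecutor = (ThreadPoolExecutor) executor;
                threadPoolExecutor.shutdown();
                try {
                    // Adjust the wait to the longest download you expect to serve.
                    if (!threadPoolExecutor.awaitTermination(30, TimeUnit.MINUTES)) {
                        threadPoolExecutor.shutdownNow();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }
}

Combined with a preStop pause and a generous terminationGracePeriodSeconds, this keeps in-flight downloads alive while Kubernetes stops routing new traffic to the pod.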

ZuulException Forwarding error, ClientException null

I am facing the ZuulException below due to a SocketTimeoutException (Read timed out). I am trying to put my OAuth2 server behind the Zuul proxy.
Please see the log trace here, the gateway's application.yml entries here, and the application dependencies here. I am not using Hystrix or Eureka explicitly.
This issue is intermittent; sometimes it works and sometimes it doesn't. Has anyone faced this before?
Everything works well except the API gateway.
Try defining the properties below. It seems that you're using Zuul with Eureka; in this case, RibbonRoutingFilter will be used instead of SimpleHostRoutingFilter. If so, you need to define ReadTimeout and ConnectTimeout for Ribbon instead of the zuul.host properties.
ribbon:
  ReadTimeout: 10000
  ConnectTimeout: 10000
oauth2:
  ribbon:
    ReadTimeout: 10000
    ConnectTimeout: 10000

Debezium flush timeout and OutOfMemoryError errors with MySQL

I'm using Debezium 0.7 to read from MySQL but getting flush timeout and OutOfMemoryError errors in the initial snapshot phase. Looking at the logs below, it seems the connector is trying to write too many messages in one go:
WorkerSourceTask{id=accounts-connector-0} flushing 143706 outstanding messages for offset commit [org.apache.kafka.connect.runtime.WorkerSourceTask]
WorkerSourceTask{id=accounts-connector-0} Committing offsets [org.apache.kafka.connect.runtime.WorkerSourceTask]
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
WorkerSourceTask{id=accounts-connector-0} Failed to flush, timed out while waiting for producer to flush outstanding 143706 messages [org.apache.kafka.connect.runtime.WorkerSourceTask]
I wonder what the correct settings (http://debezium.io/docs/connectors/mysql/#connector-properties) are for sizeable databases (>50 GB). I didn't have this issue with smaller databases. Simply increasing the timeout doesn't seem like a good strategy. I'm currently using the default connector settings.
Update
Changed the settings as suggested below and it fixed the problem:
OFFSET_FLUSH_TIMEOUT_MS: 60000 # default 5000
OFFSET_FLUSH_INTERVAL_MS: 15000 # default 60000
MAX_BATCH_SIZE: 32768 # default 2048
MAX_QUEUE_SIZE: 131072 # default 8192
HEAP_OPTS: '-Xms2g -Xmx2g' # default '-Xms1g -Xmx1g'
This is a very complex question. First of all, the default memory settings for the Debezium Docker images are quite low, so if you are using them it might be necessary to increase them.
Next, there are multiple factors at play. I recommend the following steps.
Increase max.batch.size and max.queue.size - reduces number of commits
Increase offset.flush.timeout.ms - gives Connect time to process accumulated records
Decrease offset.flush.interval.ms - should reduce the amount of accumulated offsets
Unfortunately, there is an issue, KAFKA-6551, lurking in the background that can still play havoc.
I can confirm that the answer posted above by Jiri Pechanec solved my issues. These are the configurations I am using:
Kafka Connect worker configs, set in the worker.properties config file:
offset.flush.timeout.ms=60000
offset.flush.interval.ms=10000
max.request.size=10485760
Debezium configs passed through the curl request to initialize it:
max.queue.size = 81290
max.batch.size = 20480
We didn't run into this issue with our staging MySQL db (~8 GB) because the dataset is a lot smaller. For the production dataset (~80 GB), we had to adjust these configurations.
Hope this helps.
To add onto what Jiri said:
There is now an open issue in the Debezium bug tracker; if you have any more information about root causes, logs, or reproduction, feel free to provide it there.
For me, changing the values that Jiri mentioned in his comment did not solve the issue. The only working workaround was to create multiple connectors on the same worker that are responsible for a subset of all tables each. For this to work, you need to start connector 1, wait for the snapshot to complete, then start connector 2 and so on. In some cases, an earlier connector will fail to flush when a later connector starts to snapshot. In those cases, you can just restart the worker once all snapshots are completed and the connectors will pick up from the binlog again (make sure your snapshot mode is "when_needed"!).

How to perform a failover with Netflix/Eureka?

I use Eureka for service discovery and as a load balancer, and it works fine when I have two instances of a service "A". However, when I stop one of these instances, Eureka doesn't recognize that the instance is down and keeps showing me an error page every time the load balancer tries to use the dead instance.
I have set enableSelfPreservation to false to prevent that, but it still takes Eureka 3 to 5 minutes to unregister the service. However, I want high availability for my service and I want the failover to happen immediately, within a matter of seconds. Is this possible using Eureka? If not, how can I make sure only the live instances are used when the others are dead?
I am using Spring Boot; here is my configuration for the Eureka server.
server:
  port: 8761
eureka:
  instance:
    hostname: localhost
  client:
    registerWithEureka: false
    fetchRegistry: false
    serviceUrl:
      defaultZone: http://${eureka.instance.hostname}:${server.port}/eureka/
  server:
    enableSelfPreservation: false
You should add a Ribbon configuration to your application.yml. It is also recommended to set the Hystrix isolation strategy to THREAD with a timeout.
Note: This configuration should be on the client side (which usually means your gateway server), since Ribbon (and Spring Cloud in general) is a form of client-side load balancing.
Here's an example that I use:
hystrix:
  command:
    default:
      execution:
        isolation:
          strategy: THREAD
          thread:
            timeoutInMilliseconds: 40000 # timeout after this time in milliseconds
ribbon:
  ConnectTimeout: 5000 # try to connect to the endpoint for 5 seconds
  ReadTimeout: 5000 # try to get a response for 5 seconds after a successful connection
  # Max number of retries on the same server (excluding the first try)
  MaxAutoRetries: 1
  # Max number of next servers to retry (excluding the first server)
  MaxAutoRetriesNextServer: 2
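
If the immediate goal is to stop showing clients an error page during the eviction window, the call on the client/gateway side can also be wrapped in a Hystrix command with a fallback, so Ribbon retries another instance and a fallback response covers the rest. A hedged sketch only; the service name service-a, the endpoint path, and the @LoadBalanced RestTemplate bean are assumptions:

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class ServiceAClient {

    @Autowired
    private RestTemplate restTemplate; // assumed to be a @LoadBalanced (Ribbon-aware) bean

    // If the chosen instance is dead, Ribbon retries another server
    // (MaxAutoRetriesNextServer) and Hystrix falls back once the timeout above is hit.
    @HystrixCommand(fallbackMethod = "fallback")
    public String callServiceA() {
        return restTemplate.getForObject("http://service-a/endpoint", String.class);
    }

    public String fallback() {
        return "Service A is temporarily unavailable";
    }
}

This requires spring-cloud-starter-hystrix and @EnableCircuitBreaker on the application class; the Ribbon retry settings above decide how many alternative servers are tried before the fallback kicks in.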
