A Spring Boot + Spring Integration app is monitored by Prometheus through the built-in Micrometer (micrometer.io) support. The Spring Boot app exposes localhost:8080/actuator/prometheus, the monitoring data arrives in Prometheus and can be displayed as a graph. This is working fine.
My problem is that I get some gaps in the Prometheus data. These gaps happen when the app is under heavy load. It is normal that when the app is very busy the response time of localhost:8080/actuator/prometheus gets longer; in my case it is less than 1 second without load but around 1 minute under load. The target is then shown as offline under Prometheus Status -> Targets. One possibility would be to set scrape_interval = 2m, but it would be important to see more detailed information.
My question: is there a solution for this scenario? (Setting a priority for the monitoring URL? Temporarily storing the data in the Spring Boot app and sending it later?)
Update: I am trying to monitor the Spring Integration metrics, but for this question it is not important which metric; it could be anything, e.g. JVM heap.
Under normal circumstances, querying the metrics endpoint is quite fast.
There are three scenarios that come to mind that could explain why it is getting slower:
a) Your app is under such heavy load that it takes too long until it even accepts the HTTP request. This means your app is being asked to serve more requests than it can handle. In that case give it more resources, threads or whatever the bottleneck is (see here).
b) You have custom gauges registered that need a lot of time to calculate or fetch their value. E.g. a DB query in a gauge's getter function is a killer: every time the metrics endpoint is queried, your app has to query the database before it can render the metrics. It is even worse if you have several of these (they are handled sequentially) and their performance depends on your application's load (e.g. when the DB server gets slower while your app is under heavy load, this makes it worse).
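To make this concrete, here is a minimal sketch for scenario b) (the OrderRepository type, metric name and refresh period are made up for illustration): instead of running the expensive query in the gauge's value function, cache the value and refresh it in the background so that a scrape only reads a number.

```java
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class OrderMetrics {

    private final AtomicLong pendingOrders = new AtomicLong();

    public OrderMetrics(MeterRegistry registry, OrderRepository orderRepository) {
        // Anti-pattern (commented out): the value function runs a DB query on
        // every scrape, so /actuator/prometheus blocks on the database under load.
        // Gauge.builder("orders.pending", orderRepository, OrderRepository::countPending)
        //      .register(registry);

        // Safer: the scrape only reads a cached value ...
        Gauge.builder("orders.pending", pendingOrders, AtomicLong::get)
             .register(registry);

        // ... while the expensive query runs in the background on its own schedule.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> pendingOrders.set(orderRepository.countPending()),
                0, 30, TimeUnit.SECONDS);
    }
}
```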
c) The cardinality of your metric labels depends on your application usage (which is a bad practice). E.g. having a label per user or per session increases the number of metrics when your application is under heavy usage. This not only stresses your application (each metric needs some memory), it also stresses your Prometheus server, which creates files for each unique label-value combination.
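For scenario c), a minimal Micrometer sketch of the difference (metric and tag names are made up): the first variant creates one time series per user, the second keeps the label values to a small, bounded set.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

public class LoginMetrics {

    private final MeterRegistry registry;

    public LoginMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    // Anti-pattern: one counter (and one Prometheus time series) per user id;
    // under heavy usage the number of series grows without bound.
    public void recordLoginPerUser(String userId) {
        Counter.builder("app.logins")
               .tag("user", userId)
               .register(registry)
               .increment();
    }

    // Better: label values come from a small, fixed set (here: the outcome).
    public void recordLogin(boolean successful) {
        Counter.builder("app.logins.total")
               .tag("outcome", successful ? "success" : "failure")
               .register(registry)
               .increment();
    }
}
```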
What you could do, although it will not address the root cause of your issues, is increase the value of scrape_timeout (see here).
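A minimal prometheus.yml sketch of that change (the job name and concrete values are assumptions; note that scrape_timeout must not be larger than scrape_interval):

```yaml
scrape_configs:
  - job_name: 'spring-boot-app'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 90s   # keeps more resolution than 2m
    scrape_timeout: 75s    # must be <= scrape_interval; covers the ~1 min worst case
    static_configs:
      - targets: ['localhost:8080']
```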
Related
I have an API (Spring Boot/Spring Web using Swagger) that has a throughput (TPS?) of 9.05 (I am not sure how this is calculated, but it is displayed on a metrics page). The API gets hit thousands of times per hour, sometimes peaking at 9,000 calls. The average response time is roughly 2000-3000 ms. This is a simple API that accepts a POST request, queries a Postgres database and returns the data as an HTTP response to the client. The API is containerized via Docker and runs on an ECS cluster on AWS (m5.large2 instance).
Instance Size | vCPU | Memory (GiB) | Instance Storage (GiB) | Network Bandwidth (Gbps) | EBS Bandwidth (Mbps)
m5a.2xlarge | 8 | 32 | EBS-Only | Up to 10 | Up to 2,880
I have Apache JMeter installed and I am trying to mimic the production API calls in lower environments so I can fine-tune the CPU and memory configuration of our Docker containers running in AWS Elastic Container Service (ECS).
I am currently running 5 threads, with a 1-second ramp-up and a 900-second duration.
Is there a systematic way to replicate the production traffic load in the lower environments so I can correctly fine-tune CPU and memory?
As per Performance Testing in Scaled Down Environments. Part One: The Challenges article:
An application’s underlying infrastructure is constructed of many different components such as caches, web servers, application servers and disks (I/O). Bandwidth and CDNs also play a role in its function and therefore have to be taken into consideration during scaling. Each component behaves differently in the application according to how it was configured and scaled. However, the tiered structure makes it difficult to calculate how each should be tested and scaled.
Furthermore, there are two ways to scale the application. Scaling-up adds supplementary resources, like CPUs and memory, to a single computer. Scaling-out clusters additional computers together as one system to generate combined computing power. All of these options make it almost impossible to estimate actual data from performance testing in a smaller environment.
So there is no formula for extrapolating the behaviour of a "lower environment" to a production-like environment; I would say you're quite limited in what you can do, for example:
Run a soak test; this way you will be able to detect memory leaks
Run a test with profiler telemetry enabled and inspect the longest-running functions, the largest objects, garbage collection activity, etc.
Monitor database slow queries and inspect their query plans for optimization in case of high cardinality/cost
I'm trying to set up autoscaling in Kubernetes (hosted in Google Kubernetes Engine) for my Java Spring application. I have faced two problems:
The Spring application uses a lot of CPU at startup (something like 250 mCPU*, sometimes even 500 mCPU), which really breaks autoscaling, because after roughly a minute (Spring context startup etc.) the same instances use only 50 mCPU.
Because in some environments the application uses a small amount of mCPU (and in almost every environment at night), I would like to set the requested CPU to at most 200 mCPU (= 80% of the CPU limit), or even less. Then autoscaling would make much more sense. But I can't really do that because of the heavy Spring startup, which won't finish if I give it too little CPU.
When the application starts receiving traffic (when a new pod is created by an autoscaling event), its CPU usage can initially jump to something like 200% of the standard usage and then drop back to 100%. It does not look like too many requests are being pushed to the new pod; it looks more like the JVM is simply slower at the start and receives too much traffic too early. It seems the JVM needs something like a warm-up (i.e. don't suddenly push 1/n of the traffic to the new pod, but shift traffic to it gradually). Because of this behaviour, autoscaling sometimes goes crazy: when it really needs just one more pod, it can scale up a lot of them and then scale down again...
* in GKE 1000mCPU = 1 core
The uploaded images show CPU charts.
In the first, we can see that CPU usage after startup is much smaller than at the beginning. In the second, we can spot both problems: high CPU usage at startup, then a grace period (the readiness probe initial* delay has not finished yet), and then a high peak when the pod starts receiving traffic.
* I have set the readiness probe initial delay to be longer than the context loading time.
Chart 1 Chart 2
The only thing that I've found on the internet is to add a container to the pod that does nothing but "sleep x" and then dies, and to set that container's requested mCPU to the amount used during the Spring app startup (I would then have to increase the CPU limit of the Spring app container, but that shouldn't hurt, because autoscaling should prevent the Spring app from starving other apps on the node).
I would really appreciate any advice.
It is true that Spring applications are not the most container-friendly thing out there, but there are a few things you can try:
On startup, Spring autowires the beans and performs dependency injection, creates objects in memory, etc. All of those things are CPU intensive. If you assign less CPU to your pod, it will logically increase the startup time.
Things you can do here are:
Use a startupProbe and give your application time to start. How to calculate the delays and thresholds is explained pretty well here (see also the Deployment sketch after this list).
Adjust the maxSurge and maxUnavailable in your deployment strategy to whatever fits your case best (for example, with 10 replicas and a maxSurge/maxUnavailable of 10%, your pods will roll out slowly, one by one). This helps reduce traffic spikes on the overall set of application replicas (docs are here).
If your use case allows it, you can look into lazy loading your Spring application, meaning that it will not create all objects on startup but wait until they are first used. This can be somewhat dangerous, because in some cases you may not discover issues at startup.
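A minimal Deployment sketch combining the first two points (image, port, probe path, replica count and thresholds are illustrative assumptions, not recommendations):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spring-app
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 10%          # roll out roughly one pod at a time
      maxUnavailable: 10%
  selector:
    matchLabels:
      app: spring-app
  template:
    metadata:
      labels:
        app: spring-app
    spec:
      containers:
        - name: spring-app
          image: example/spring-app:latest
          resources:
            requests:
              cpu: 200m      # low steady-state request, as discussed above
            limits:
              cpu: "1"       # headroom for the CPU-hungry startup
          startupProbe:
            httpGet:
              path: /actuator/health
              port: 8080
            periodSeconds: 5
            failureThreshold: 30   # up to 30 * 5s = 150s to finish starting
```

For the lazy-loading point, note that recent Spring Boot versions (2.2 and later) expose it as a single switch, spring.main.lazy-initialization=true, which makes it easy to enable per environment.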
If you have HPA enabled plus a replicas value defined in the deployment, you might experience issues when deploying. I can't find the relevant GitHub issue at the moment, but you might want to run some tests on how it behaves (scaling more than it should, etc.). Things you can do here are:
Tweak the autoscaling thresholds and times (the default is 3 min, afaik) to allow your deployments to roll out smoothly without triggering the autoscaler (see the HPA sketch after this list).
Write a custom autoscaling metric instead of scaling by CPU. This one requires some work but might solve your scaling issues for good (relevant docs).
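And, for the first point, a sketch of an autoscaling/v2 HorizontalPodAutoscaler with explicit stabilization windows (all values are illustrative assumptions):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: spring-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: spring-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120   # don't react to the short startup CPU spike
    scaleDown:
      stabilizationWindowSeconds: 300   # avoid flapping right after a scale-up
```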
Lastly, what you are suggesting with a sidecar looks like a hack :)
Haven't tried it though so can't really tell the pros and cons.
Unfortunately, there is no silver bullet for Spring Boot (or Java) + K8s, but things are getting better than they were a few years back. If I find some helpful resources, I will come back and link them here.
Hope the above helps.
Cheers
As you can see in the following snapshot, the load on some of the dynamic instances is huge (more than 20k requests) while on others it is very small.
Why is this happening? Shouldn't GAE distribute the load uniformly?
If the load were balanced across the active dynamic instances, they would rarely become idle (only when the entire app's traffic dropped to almost nothing), and it would therefore be difficult to dynamically shut them down.
More info here:
https://cloud.google.com/appengine/docs/scaling#scaling_dynamic_instances
https://cloud.google.com/appengine/docs/managing-resources#instances
This is what I got from a Google App Engine expert:
App Engine request scheduling uses several heuristics for routing requests to application instances. At low QPS it stays in affinity scheduling mode and routes the majority of requests to instances that have most recently responded to the health check and handled requests successfully. That would explain why you see this variation in the number of requests per instance. As you ramp up the application traffic, the load should even out across all instances.
I also asked what policy GAE follows to shut down instances, since I see that many of them stay up even if they are not receiving any requests.
Dynamic instances that are not serving requests get garbage collected eventually. However, you only get billed for 15 additional minutes after they serve the last request. Please refer to this doc for additional information on instance billing.
https://cloud.google.com/appengine/kb/billing#different_on_demand_instance_resident
I have a legacy product in the financial domain, running on Tomcat 6. We get millions of requests, around 10k requests an hour. I am wondering, at a high level:
Should I go for a distributed application where my MVC component is on one system and the service/DAO layer on another box (using Spring Remoting/EJB)?
The reason I am planning to go in this direction is that the load would be distributed and performance would improve; it would also make the system more scalable.
I only see the positive side of it, but somehow I am not able to figure out the negative aspects.
If some expert can help: what criteria should I consider when deciding on a distributed model, and what are its pros and cons? I also tried googling for stats on how much load a given web server (Tomcat in my case) can handle efficiently on given hardware (16 GB RAM, Windows 7, a given processor).
Yes, I am going to do a PoC where I will measure performance with and without the distributed model, but high-level input would be highly appreciated.
It is impossible to answer this question without more details: how long does it take to reply to one request on the current server? How many resources are allocated for one request?
Having 10k requests per hour means ~3 requests per second. If performing the necessary operations and replying to a request takes ~300 ms on one CPU, a single simple machine is totally fine. This is simple math and doesn't always hold; I guess you still have peaks within those 10k requests per hour and they aren't evenly distributed.
If we assume one reply can take up to 1 second, then you can handle as many replies per second as your system has CPUs (given that the CPU is the bottleneck). If the CPU isn't the bottleneck for your application server, there's probably something wrong. You should set up the database(s) on a different machine and only perform computation tasks on the application server machine.
Especially in the financial sector with legacy software, I wouldn't try splitting a running product. How old is the current server? I believe a new server should be cheaper than rewriting the application. Unless you expect 50-100k requests per hour very soon, I don't think splitting it into such small parts makes sense.
Instead, run it on up-to-date server hardware, split the application server from the data storage, and you should be fine.
I am wondering at a high level if I should go for a distributed application where my MVC component is on one system and the service/DAO layer on another box (using Spring Remoting/EJB).
I'm not sure what you mean by "system" in this context, but if it means that you are planning to run your application on two servers, one dedicated to the presentation layer and the other to the business layer, keep in mind that a simpler approach (and probably a more suitable one for your app) is to build a co-located architecture.
Basically, the idea is to replicate your app on several servers (at least two) and put a load balancer in front of them that routes the incoming requests among the available servers. All servers share the same database instance. This gives you horizontal scalability and also improves the availability of your system.
I only see the positive side of it, but somehow I am not able to figure out the negative aspects.
Distributing your business logic will probably involve refactoring your application code, and even if the system is working well today, you will almost certainly introduce some bugs.
The necessary remote calls add latency, and the fact that you execute your business logic on several servers does not solve performance problems in the presentation tier.
In Expert One-on-One J2EE Development Without EJB (p. 65) you can find a good discussion of why not to distribute your business logic.
My Java web application pulls some data from external systems (JSON over HTTP), both live whenever the users of my application request it and in batch (nightly updates for cases where no user has requested it). The data changes, so caching options are likely exhausted.
The external systems have some throttling in place, the exact parameters of which I don't know and which likely change depending on system load (e.g., at peak times 10 requests per second from one IP address, at off-peak times 100 requests per second from one IP address). If the requests are too frequent, they time out or return HTTP 503.
Right now I retry each request up to 5 times with a 2000 ms delay between attempts, giving up if an error is received every time. This is not optimal, as at peak times nearly all requests fail; I could avoid making those requests and perhaps get at least some to succeed instead.
My goals are a reasonably simple, reliable design with enough flexibility so that I can both pull some metrics from the throttler to understand how well the external systems are responding (and thus adjust how often they are invoked), and auto-adjust the interval with which I call them (individually per system) so that it is optimal at both off-peak and peak hours.
My infrastructure is Java with RabbitMQ over MongoDB over Linux.
I'm thinking of three main options:
Since I already use RabbitMQ for batch processing, I could introduce a queue to which the web processes send the requests they have for external systems; worker processes would then read from that queue, throttle themselves as needed, and return the results. This would allow running multiple parallel worker processes on more servers if needed. My main concerns are that it isn't a very simple solution, and how to handle the web processes waiting a long time when peak-hour throughput is low. It also turns RabbitMQ into a critical single point of failure: if it dies, the whole system stops (as opposed to the nightly batch processes just not running any more, which is less critical). I suppose RPC is the correct RabbitMQ usage pattern here, but I'm not sure. Edit: I've posted a related question, How to properly implement RabbitMQ RPC from Java servlet web container?, on how to implement this.
Introduce nginx (e.g. ngx_http_limit_req_module), HAProxy (link) or other proxy software into the mix (as reverse proxies?) and have them take care of the throttling through some configuration magic. The pro is that I don't have to make code changes. The con is that it is more technology, and technology I've not used before, so the chances of misconfiguring something are quite high. It would also likely not be easy to do dynamic throttling depending on external server load, to prioritize live requests over batch requests, or to get statistics on how well the throttling is doing. Also, most documentation and examples will likely cover throttling incoming requests, not outgoing ones.
Do a pure-Java solution (e.g., a leaky bucket implementation). It would be simple in the sense that it is "just code", but the devil is in the details; debugging all the deadlocks, starvation and race conditions isn't always fun.
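For the pure-Java option, a minimal sketch of a token-bucket style throttle (a close cousin of the leaky bucket; all class and method names are made up, and an existing library such as Guava's RateLimiter or Resilience4j would cover the same ground with fewer concurrency pitfalls):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/**
 * Callers take a permit before each call to the external system;
 * permits leak back at an adjustable rate.
 */
public class ExternalSystemThrottle {

    private final Semaphore permits;
    private final int bucketSize;
    private volatile long refillIntervalMillis;

    private final ScheduledExecutorService refiller =
            Executors.newSingleThreadScheduledExecutor();

    public ExternalSystemThrottle(int bucketSize, long initialRefillIntervalMillis) {
        this.bucketSize = bucketSize;
        this.refillIntervalMillis = initialRefillIntervalMillis;
        this.permits = new Semaphore(bucketSize);
        scheduleNextRefill();
    }

    private void scheduleNextRefill() {
        // One-shot task that re-schedules itself, so an adjusted interval
        // takes effect on the next tick.
        refiller.schedule(() -> {
            if (permits.availablePermits() < bucketSize) {
                permits.release();
            }
            scheduleNextRefill();
        }, refillIntervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Blocks until the external system may be called, or the timeout expires. */
    public boolean acquire(long timeout, TimeUnit unit) throws InterruptedException {
        return permits.tryAcquire(timeout, unit);
    }

    /** Call on HTTP 503 / timeouts: slow the leak down (capped at 60 s per permit). */
    public synchronized void slowDown() {
        refillIntervalMillis = Math.min(refillIntervalMillis * 2, 60_000);
    }

    /** Call after a streak of successes: speed the leak back up (floor of 100 ms). */
    public synchronized void speedUp() {
        refillIntervalMillis = Math.max(refillIntervalMillis / 2, 100);
    }

    /** Exposes the current rate so it can be reported as a metric. */
    public long getRefillIntervalMillis() {
        return refillIntervalMillis;
    }
}
```

Callers (live requests and nightly batch alike) would call acquire(...) before each outgoing HTTP request and slowDown() whenever a 503 or timeout comes back; the current refill interval doubles as the throttling metric mentioned in the goals above.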
What am I missing here?
Which is the best solution in this case?
P.S. A somewhat related question: what is the proper approach to logging all the external system invocations, so that statistics are collected on how often I invoke them and what the success rate is?
E.g., after every invocation I'd call something like .logExternalSystemInvocation(externalSystemName, wasSuccessful, elapsedTimeMillis), and then get some aggregate data out of it whenever needed.
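(For illustration only: such a wrapper could be a thin layer over an existing metrics library rather than fully hand-rolled; below is a minimal sketch using a Micrometer Timer, with made-up metric and tag names.)

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.util.concurrent.TimeUnit;

public class ExternalSystemInvocationLogger {

    private final MeterRegistry registry;

    public ExternalSystemInvocationLogger(MeterRegistry registry) {
        this.registry = registry;
    }

    // Records one external call: duration plus success/failure as a tag,
    // so counts, rates and latency can be aggregated per system.
    public void logExternalSystemInvocation(String externalSystemName,
                                            boolean wasSuccessful,
                                            long elapsedTimeMillis) {
        Timer.builder("external.system.calls")
             .tag("system", externalSystemName)
             .tag("outcome", wasSuccessful ? "success" : "failure")
             .register(registry)
             .record(elapsedTimeMillis, TimeUnit.MILLISECONDS);
    }
}
```

Counts, success rates and timings could then be aggregated by the registry or the monitoring backend instead of custom code.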
Is there a standard library/tool to use, or do I have to roll my own?
If I use option 1 with RabbitMQ, is there a way to organize the flow so that I get this out of the box from the RabbitMQ console? I wouldn't want to send all failed messages to a poison queue; it would fill up too quickly, and in most cases there is no need to re-process those failed requests, as the user has sadly already moved on.
Perhaps this open source system can help you a little: http://code.google.com/p/valogato/