HikariCP - Load testing degrades performance to a halt - java

I have been using HikariCP in my Spring Boot application and I am starting to run some load tests with JMeter.
I noticed that the first time I run my tests, it goes well, and each request takes like 30ms or so.
But each time I run my tests again, against the same application instance, the response time gets worse, until it freezes and I get a whole lot of
Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30019ms.
at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:583)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:186)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:145)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:112)
at sun.reflect.GeneratedMethodAccessor501.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at net.bull.javamelody.JdbcWrapper$3.invoke(JdbcWrapper.java:805)
at net.bull.javamelody.JdbcWrapper$DelegatingInvocationHandler.invoke(JdbcWrapper.java:286)
at com.sun.proxy.$Proxy102.getConnection(Unknown Source)
at org.springframework.jdbc.datasource.DataSourceTransactionManager.doBegin(DataSourceTransactionManager.java:246)
... 108 common frames omitted
I even left the application idle for a day and tried again, but the tests show degraded performance and the same errors.
Only if I shut down the application can I run my tests again, and then only for one load (1200+ requests).
When I was developing the tests I was running my local app against an H2 database and didn't notice any degradation until I deployed the application on a server running PostgreSQL.
So, to take that variable out of the way, I left JMeter running against my local H2 app and the degradation showed up there as well.
Here is a test scenario I ran on my local app (H2 database), with the default HikariCP pool size (10), using 10 threads. I managed to run 25000+ requests before the application stopped responding.
I plotted the requests:
Also, the tests consist of a request to a Spring Boot @RestController.
My controller calls a service that is annotated with @Transactional at the entry point (I call some legacy APIs that require a transaction to exist, so I open it right away).
So let's say I have my tests requesting this endpoint 10 times in parallel. Let's also say that my code might have other points annotated with @Transactional. Would a pool size of 10 be enough?
Also, should any pool size be enough, despite giving poor performance, or is it "normal" to have this kind of scenario where the pool just gets too busy and "locks up"?
I also tried increasing the pool size to 50, but the problem persists. It gets close to the 25000 requests from the previous tests (with pool size 10) and then fails as described above.
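For reference, here is roughly how the pool could be configured programmatically (a simplified sketch rather than my exact setup: the JDBC URL and credentials are placeholders, the pool size and connection timeout mirror the defaults mentioned above, and the leak-detection threshold is just a knob that might help spot connections that are never returned):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

class PoolFactory {
    static HikariDataSource createDataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://db-host:5432/app"); // placeholder URL
        config.setUsername("app");                               // placeholder credentials
        config.setPassword("secret");
        config.setMaximumPoolSize(10);            // the default pool size used in the tests
        config.setConnectionTimeout(30_000);      // 30s, matching the timeout in the stack trace
        config.setLeakDetectionThreshold(60_000); // logs a warning if a connection is held > 60s
        return new HikariDataSource(config);
    }
}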

HikariCP suggests using a small, constant-size pool saturated with threads waiting for connections. As per the docs, the suggested pool size is:
connections = ((core_count * 2) + effective_spindle_count)
A formula which has held up pretty well across a lot of benchmarks for years is that for optimal throughput the number of active connections should be somewhere near ((core_count * 2) + effective_spindle_count). Core count should not include HT threads, even if hyperthreading is enabled. Effective spindle count is zero if the active data set is fully cached, and approaches the actual number of spindles as the cache hit rate falls. ... There hasn't been any analysis so far regarding how well the formula works with SSDs.
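For example, on a hypothetical 4-core database server whose active data set is fully cached (effective_spindle_count = 0), the formula works out to (4 * 2) + 0 = 8 connections, which is close to HikariCP's default pool size of 10.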
An in-memory H2 with a small dataset will be faster than a standalone database running on a different server. Even if you are running in the same datacenter the round-trip between servers is usually around 0.5-1ms.
Try to find the current bottleneck first. If the application server doesn't run out of CPU, then the problem is somewhere else, e.g. the database server. If you can't figure out where the current bottleneck is, you may end up optimising in the wrong place.

So, it was a memory leak after all. Nothing to do with HikariCP.
We have some Groovy scripts using @Memoized with some really bad cache keys (huge objects), and that cache kept getting bigger until there was no memory left.
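To illustrate the pattern (a hypothetical Java equivalent, not our actual Groovy scripts): memoizing on a large object keeps every distinct argument and its result strongly reachable for the lifetime of the cache, so the heap grows on every call with a new argument.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ReportService {
    // Leaky: the whole (huge) Customer object is retained forever as the cache key.
    private final Map<Customer, Report> cacheByObject = new ConcurrentHashMap<>();

    Report reportFor(Customer customer) {
        return cacheByObject.computeIfAbsent(customer, this::buildReport);
    }

    // Safer: key on a small, stable identifier (and ideally bound the cache or add eviction).
    private final Map<Long, Report> cacheById = new ConcurrentHashMap<>();

    Report reportForId(Customer customer) {
        return cacheById.computeIfAbsent(customer.getId(), id -> buildReport(customer));
    }

    private Report buildReport(Customer customer) {
        return new Report(); // stands in for the expensive legacy call
    }
}

class Customer { private long id; long getId() { return id; } /* plus many large fields */ }
class Report { }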

Related

Performance hit migrating from MongoDB Java Rx driver to reactive streams driver

We're trying to upgrade from the old RxJava-based Mongo driver mongodb-driver-rx (v1.5.0) to the newer mongodb-driver-reactivestreams (v1.13.1) - not the newest one because of dependencies, but certainly a lot newer. The old RxJava one has been end-of-life for years. Everything works correctly with the new driver, but under high load the performance is taking too big a hit and we can't explain why.
Some background info about our app:
Our (Java) app runs on AWS EC2 (at peak times around 30 m5.xlarge instances), and is based on a Vert.x and RxJava stack. We are running a Mongo cluster (m5.12xlarge) with 1 primary and 2 secondaries. The typical number of simultaneous connections to Mongo at peak times is a few thousand. We have a Gatling-based load test in place which typically runs for 1 hour with 60 AWS EC2 instances, 1 primary Mongo and 2 secondaries like in production, and with 100k simultaneous users.
A few observations:
Microbenchmarking a simple piece of integration testing code (which does a few common db operations) indicates no significant performance difference between the old and new driver.
With the old driver we're seeing good performance overall in the load test: 20ms average response time and 200ms response time at the 99th percentile.
With the new driver, running the same load test, things explode (over 2000ms avg response time, and eventually over 60% failed requests due to waiting queues getting full).
If we run the load test with only 1 EC2 instance and 1.6k simultaneous users (which is the same load per instance), there is no significant performance difference between the old and new driver, and things run relatively smoothly.
MongoDB driver settings:
clusterSettings = "{hosts=[localhost:27017], mode=MULTIPLE, requiredClusterType=UNKNOWN, requiredReplicaSetName='null', serverSelector='LatencyMinimizingServerSelector{acceptableLatencyDifference=15 ms}', clusterListeners='[]', serverSelectionTimeout='30000 ms', localThreshold='30000 ms', maxWaitQueueSize=500, description='null'}"
connectionPoolSettings = "ConnectionPoolSettings{maxSize=100, minSize=0, maxWaitQueueSize=50000, maxWaitTimeMS=5000, maxConnectionLifeTimeMS=0, maxConnectionIdleTimeMS=300000, maintenanceInitialDelayMS=0, maintenanceFrequencyMS=60000, connectionPoolListeners=[]}"
heartbeatSocketSettings = "SocketSettings{connectTimeoutMS=10000, readTimeoutMS=10000, keepAlive=true, receiveBufferSize=0, sendBufferSize=0}"
readPreference = "primary"
serverSettings = "ServerSettings{heartbeatFrequencyMS=10000, minHeartbeatFrequencyMS=500, serverListeners='[]', serverMonitorListeners='[]'}"
socketSettings = "SocketSettings{connectTimeoutMS=10000, readTimeoutMS=0, keepAlive=true, receiveBufferSize=0, sendBufferSize=0}"
sslSettings = "SslSettings{enabled=false, invalidHostNameAllowed=true, context=null}"
writeConcern = "WriteConcern{w=null, wTimeout=null ms, fsync=null, journal=null"
Things we've tried:
(all to no avail)
Switching Mongo db version (we are currently still on 3.6, but we've tried 4.0 too);
Adding a Vert.x-based RxJava scheduler around every db operation (we've tried Schedulers.io(), and RxHelper.scheduler(vertx));
Configuring Mongo settings with an AsynchronousSocketChannelStreamFactoryFactory containing an AsynchronousChannelGroup with a fixed thread pool of size 100;
Configuring Mongo settings with a NettyStreamFactoryFactory containing a NioEventLoopGroup (roughly as in the sketch after this list);
Playing around with the maximum Mongo connection pool per instance (varying from 100 to 500);
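A rough sketch of the kind of Netty wiring we tried (assuming the 1.13-era reactive streams driver API; hosts are placeholders, and the pool values mirror the settings above):

import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.connection.netty.NettyStreamFactoryFactory;
import com.mongodb.reactivestreams.client.MongoClient;
import com.mongodb.reactivestreams.client.MongoClients;
import io.netty.channel.nio.NioEventLoopGroup;
import java.util.concurrent.TimeUnit;

class MongoFactory {
    static MongoClient create() {
        NioEventLoopGroup eventLoopGroup = new NioEventLoopGroup(); // shared Netty event loop group
        MongoClientSettings settings = MongoClientSettings.builder()
                .applyConnectionString(new ConnectionString("mongodb://host1:27017,host2:27017,host3:27017"))
                .streamFactoryFactory(NettyStreamFactoryFactory.builder()
                        .eventLoopGroup(eventLoopGroup)
                        .build())
                .applyToConnectionPoolSettings(pool -> pool
                        .maxSize(100)                              // matches maxSize=100 above
                        .maxWaitTime(5000, TimeUnit.MILLISECONDS)) // matches maxWaitTimeMS=5000 above
                .build();
        return MongoClients.create(settings);
    }
}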
Things that cannot help us for now:
(we know these, some of these are on our roadmap, but they would be too time consuming for now)
Better index management (we've already optimized this, there are no queries that use an inefficient collscan)
Splitting up the app into smaller services
Easing the load on Mongo by employing in-memory JVM caching (Guava) or remote caching (Redis) - we already do this to some extent
Getting rid of Vertx in favor of, for instance, Spring Boot
It seems like it is some kind of pooling or threading issue, but we can't pinpoint the exact problem, and profiling this kind of problem is also very hard.
Any thoughts on what may cause the problem and how to fix it?
This is probably not the answer you are looking for, but why don't you use the official Vert.x client for MongoDB?
(Cannot comment because of low reputation)

Postgres vacuum/demon partially working when issued from JDBC

First of all I know it's odd to rely on a manual vacuum from the application layer, but this is how we decided to run it.
I have the following stack :
HikariCP
JDBC
Postgres 11 in AWS
Now here is the problem. When we start fresh with brand new tables and autovacuum=off, the manual vacuum works fine. I can see the number of dead tuples grow up to the threshold and then drop back to 0. The tables are being updated heavily from parallel connections (HOT is being used as well). At some point the number of dead rows sits at around 100k, jumps up to the threshold, and drops back to 100k, while n_dead_tuples slowly creeps up.
Now the worst part: when you issue a vacuum from a pg console, ALL the dead tuples are cleaned, but oddly enough, when the application issues the vacuum it completes successfully yet only cleans a "threshold amount" of records, not all of them. Why?
Now I am pretty sure about the following:
Analyze is not running, nor auto-vacuum
There are no long running transactions
No replication is going on
These tables are "private"
What is the difference between issuing a vacuum from the console with auto-commit on versus from JDBC? Why does the vacuum issued from the console clean ALL the tuples, whereas the vacuum issued from JDBC cleans them only partially?
The JDBC vacuum is run on a fresh connection from the pool with the default isolation level; yes, there are updates going on in parallel, but that is also the case when a vacuum is executed from the console.
Is the connection from the pool somehow corrupted and unable to see the updates? Is isolation the problem?
Visibility Map corruption?
Index referencing old tuples?
Side note: I have observed the same behavior with autovacuum on and the cost limit through the roof (like 4000-8000), with the threshold at default + 5%. At first n_dead_tuples stays close to 0 for 4-5 hours... The next day the table is 86 GB with millions of dead tuples. All the other tables are vacuumed and OK...
PS: I will try to log a VACUUM VERBOSE from the JDBC side.
PS2: Because we are running in AWS, could it be a backup that is causing it to stop cleaning?
PS3: When referring to vacuum I mean plain VACUUM, not VACUUM FULL. We are not issuing VACUUM FULL.
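In case it helps, this is roughly how such a vacuum can be issued over JDBC (a simplified sketch of one way to do it, not our exact code; the table name is a placeholder). VACUUM cannot run inside a transaction block, so autocommit is switched on for the statement, and as far as I understand the VERBOSE output comes back as SQL warnings:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.SQLWarning;
import java.sql.Statement;
import javax.sql.DataSource;

class VacuumRunner {
    static void vacuum(DataSource dataSource) throws SQLException {
        try (Connection connection = dataSource.getConnection();
             Statement statement = connection.createStatement()) {
            connection.setAutoCommit(true);                  // VACUUM cannot run inside a transaction block
            statement.execute("VACUUM (VERBOSE) my_table");  // placeholder table name
            for (SQLWarning w = statement.getWarnings(); w != null; w = w.getNextWarning()) {
                System.out.println(w.getMessage());          // log the VERBOSE output
            }
        }
    }
}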
The main problem was that the vacuum was being run by another user. The vacuuming that I was seeing was actually the HOT updates plus the selects running over that data, which results in on-the-fly cleanup of the pages.
Next: vacuuming is affected by long-running transactions ACROSS ALL schemas and tables. Yes, ALL schemas and tables. Changing to the correct user fixed the vacuum, but it will still be held back if there is a transaction left open against any other schema.table.
More maintenance work memory (maintenance_work_mem) helps, but in the end, when the system is under heavy load, all vacuuming is effectively paused.
So we upgraded the DB's resources a bit and added a monitor to help us if there are any issues.
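Something like the following check can spot the long-open transactions that hold back cleanup cluster-wide (a sketch, not our exact monitor; the 5-minute threshold is arbitrary):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.sql.DataSource;

class LongTransactionCheck {
    static void logLongTransactions(DataSource dataSource) throws SQLException {
        String sql = "SELECT pid, usename, state, now() - xact_start AS xact_age "
                   + "FROM pg_stat_activity "
                   + "WHERE xact_start IS NOT NULL AND now() - xact_start > interval '5 minutes' "
                   + "ORDER BY xact_start";
        try (Connection connection = dataSource.getConnection();
             Statement statement = connection.createStatement();
             ResultSet rs = statement.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("pid=%d user=%s state=%s open for %s%n",
                        rs.getInt("pid"), rs.getString("usename"),
                        rs.getString("state"), rs.getString("xact_age"));
            }
        }
    }
}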

Hazelcast IMap#get periodic extreme latency

We are experiencing very inconsistent performance when doing an IMap.get() on a particular Hazelcast map.
Our Hazelcast cluster is running version 3.8, has 8 members, and we connect to the cluster as a Hazelcast client. The map we are experiencing problems with has a backup count of 1.
We've isolated the slow operation to a single IMap.get operation with logging on both sides of that line of code. The get normally takes milliseconds, but for a few keys it takes between 30 and 50 seconds. We can do numerous get operations on the same map and they all return quickly except for those same few keys. The particular map is relatively small, only about 2000 entries, and is of type <String,String>.
If we restart a member in the cluster, we still experience the same latency but with different keys. This seems to indicate an issue with a particular member as the cluster re-balances when we stop/start a member. We've tried stopping each member individually and testing but experience the same symptoms with each member stopped in isolation. We’ve also tried reducing and increasing the number of members in the cluster but experience the same symptoms regardless.
We've confirmed with thread dumps that the generic operation threads are not blocked and have tried increasing the number of operation threads as well as enabling parallization but see no change in behavior. We've also enabled diagnostic logging in the cluster and don't see any obvious issues (no slow operations reported).
Looking at Hazelcast JMX MBeans, the maxGetLatency on the particular map is only about 1 second, much lower than what we are actually experiencing. This seems to indicate an issue with the client connection or underlying network. However, the number of slow keys is only about 1% of the total keys, so unless we are way out of balance, the issue again doesn't seem to be with a single member as you would expect about 1 in 8 keys to be slow. We've also confirmed from the Hazelcast logs that the cluster is stable. Members are not dropping out and rejoining.
Interestingly, if we stop and restart the whole cluster, we get good performance initially but after a few minutes it degrades back to the same state where a few specific IMap.get operations take 30+ seconds.
This exact code is not new and has been running just fine for quite a while. However, once this behavior started, it is consistently reproducible here. As far as we know, there have been no environmental changes.
Is there any diagnostic logging we can enable to get insight about the Hazelcast client? Are there any other diagnostic options available to track down where this latency is coming from? Unfortunately we are not able to reproduce this in any other environment which does seem to point at something either environmental or something unique to the cluster in this environment.
One other potentially interesting thing is that we see the following log statement every 6 seconds in each of the cluster members. The "backup-timeouts:1" is concerning but we aren't sure what it means.
INFO: [IP]:[PORT] [CLUSTER_NAME] [3.8] Invocations:1 timeouts:0 backup-timeouts:1
Any ideas or suggestions on how to debug this further would be very much appreciated.
Copy-pasted from https://github.com/hazelcast/hazelcast/issues/7689
InvocationFuture.get() has a built-in timeout for when the remote node doesn't respond at all. It doesn't wait forever. That timeout is defined by the system property hazelcast.operation.call.timeout.millis.
When the remote node doesn't respond in time, the invocation fails with an OperationTimeoutException.
Those timeouts are generally caused by a network problem between the caller and the remote node, or by system pauses due to high load (GC pauses, OS freezes, IO latency, etc.).
You can decrease hazelcast.operation.call.timeout.millis to a lower value and enable diagnostics reports to see detailed metrics of the system.
-Dhazelcast.diagnostics.enabled=true
-Dhazelcast.diagnostics.metric.level=info
-Dhazelcast.diagnostics.invocation.sample.period.seconds=30
-Dhazelcast.diagnostics.pending.invocations.period.seconds=30
-Dhazelcast.diagnostics.slowoperations.period.seconds=30
http://docs.hazelcast.org/docs/latest/manual/html-single/index.html#diagnostics
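The same flags can also be set programmatically before starting a member, which may be easier than editing JVM arguments (a minimal sketch, assuming the Hazelcast 3.8 Config API; the lowered call timeout is only an example value):

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

class DiagnosticsEnabledMember {
    static HazelcastInstance start() {
        Config config = new Config();
        config.setProperty("hazelcast.diagnostics.enabled", "true");
        config.setProperty("hazelcast.diagnostics.metric.level", "info");
        config.setProperty("hazelcast.diagnostics.invocation.sample.period.seconds", "30");
        config.setProperty("hazelcast.diagnostics.pending.invocations.period.seconds", "30");
        config.setProperty("hazelcast.diagnostics.slowoperations.period.seconds", "30");
        // example only: lower the invocation call timeout so unresponsive calls fail faster
        config.setProperty("hazelcast.operation.call.timeout.millis", "30000");
        return Hazelcast.newHazelcastInstance(config);
    }
}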
In my case the cause was waiting on incoming TCP connections on the host.

Weblogic 12c data source high count connection

So, we have already deployed an application, which contains the heavy business logic that my company uses. After some time, performance became noticeably slower than before. In the WebLogic data source configuration we set the maximum connection count to only 100, but recently it keeps increasing until it hits that limit.
We reconfigured the data source to 200, but it keeps on increasing. This is not ideal, because 100 is the maximum connection count we want in production.
Meanwhile, there were also some stuck threads in the server, but I don't think that is the problem. Does anyone know why this is occurring so suddenly (after deployment of a newer, supposedly stable, version)?
From the screenshot attached I can see that the Active Connection Count is ~80. I can also see that connections are being leaked.
Enable Inactive Connection Timeout by setting it to some value (based on the average time taken to execute a statement).
Make sure that all JDBC connections are closed in your code after use.
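On that second point, the usual pattern is try-with-resources, which returns the connection to the pool even when the statement throws (a generic sketch; the query and DataSource are placeholders):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

class OrderDao {
    private final DataSource dataSource;

    OrderDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    int countOrders() throws SQLException {
        // Connection, statement and result set are all closed automatically,
        // so the connection always goes back to the WebLogic pool.
        try (Connection connection = dataSource.getConnection();
             PreparedStatement statement = connection.prepareStatement("SELECT COUNT(*) FROM orders");
             ResultSet rs = statement.executeQuery()) {
            return rs.next() ? rs.getInt(1) : 0;
        }
    }
}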

Tomcat org.apache.catalina.connector.RequestFacade.getSession() takes more than 44.7% of CPU resources

I have built a stateless Java servlet web application and the requirement is to accept at least 5000 transactions per second (with 150 concurrent threads). I am using Ehcache together with SQL Server 2005 to avoid writing to the slow hard disk.
In the performance test (with JMeter, 150 threads), I only manage to reach roughly 2800 transactions per second (less than half of what is expected). When I take a sample in JVisualVM, I notice that:
org.apache.catalina.connector.RequestFacade.getSession() <-- takes more than 44.7% of CPU time
Any idea what RequestFacade.getSession() is doing, and is there a way to speed it up? While I must optimise my own code, I still need to figure out what the above line is doing, otherwise 5000 per second is practically impossible.
Tomcat conf:
- single Tomcat instance (6.0.23)
- using a Connector executor, with 150 maxThreads
Server conf:
- Windows 2008
- Xeon quad core
- 8 GB RAM
- 1 TB RAID 5 HDD
Any help is much appreciated!
If your servlet is truly stateless, why is it accessing the session?
:-)
If you're working in a stateless fashion, see about configuring Tomcat not to create a session by default.
Also if you are using JSP, make sure it is set to not create a session.
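In servlet code the difference is between request.getSession(), which creates a session on demand, and request.getSession(false), which only returns an existing one; for JSPs the page directive <%@ page session="false" %> disables the implicit session. A minimal sketch of the session-free style (servlet name and response body are placeholders):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

public class StatelessServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // getSession(false) never creates a session; it only returns an existing one (or null).
        HttpSession existing = request.getSession(false);
        // Avoid request.getSession() / getSession(true) on the hot path,
        // otherwise Tomcat creates and tracks a session for every client.
        response.setContentType("text/plain");
        response.getWriter().write(existing == null ? "no session" : "session exists");
    }
}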
