I have a service which deletes and inserts records into tables from other tables based on changes in the data. When I run the service, it starts transactions and seems to work fine, but after some time it gets stuck, as if it were waiting for some resource, and after waiting for hours it throws a connection timeout exception. I checked with the DBAs and they cleared the index fragmentation on the tables, and I also reduced the number of transactions processed at a time from 50K to 10K, but had no luck with either change. I am trying to process around 3.8 million records in total.
Note: it was working fine with 2 CPU cores, but a run used to take many hours, so we added 2 more cores. After increasing the cores it worked fine the first time; since then, every run ends with the connection timeout exception.
Please check the number of allowed active connections in SQL Server.
Also make sure you are closing your connection properly after every call.
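(A minimal sketch of what "closing properly" looks like with try-with-resources; the DataSource, query, and table name are placeholders.)

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class CloseExample {
    // try-with-resources guarantees the connection is returned to the pool,
    // even when the query throws; the query and table are just placeholders.
    static void readRows(DataSource ds) throws SQLException {
        try (Connection con = ds.getConnection();
             PreparedStatement ps = con.prepareStatement("SELECT id FROM some_table");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                long id = rs.getLong(1); // process the row
            }
        } // connection, statement and result set are all closed here, success or failure
    }
}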
First of all, I know it's odd to rely on a manual vacuum from the application layer, but this is how we decided to run it.
I have the following stack:
HikariCP
JDBC
Postgres 11 in AWS
Now here is the problem. When we start fresh with brand new tables and autovacuum=off, the manual vacuum works fine: I can see the number of dead tuples growing up to the threshold and then going back to 0. The tables are updated heavily by parallel connections (HOT is being used as well). At some point the number of dead rows sits around 100K, jumps up to the threshold, and drops back to 100K, and that n_dead_tuples floor slowly creeps up.
Worst of all, when you issue a vacuum from a psql console ALL the dead tuples are cleaned, but oddly enough, when the application issues the vacuum it succeeds yet only cleans roughly a "threshold amount" of records, not all of them?
Now I am pretty sure about the following:
Neither ANALYZE nor autovacuum is running
There are no long running transactions
No replication is going on
These tables are "private"
What is the difference between issuing a vacuum from the console with autocommit on and issuing it through JDBC? Why does the vacuum issued from the console clean ALL the tuples, whereas the vacuum issued from JDBC only cleans them partially?
The JDBC vacuum is run on a fresh connection from the pool with the default isolation level. Yes, there are updates going on in parallel, but that is also the case when the vacuum is executed from the console.
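(For reference, a minimal sketch of how a vacuum can be issued through JDBC, assuming a pooled DataSource; the table name is a placeholder. VACUUM cannot run inside a transaction block, so autocommit has to be on, and with the PostgreSQL driver the VERBOSE output comes back as warnings on the statement.)

import java.sql.Connection;
import java.sql.SQLWarning;
import java.sql.Statement;
import javax.sql.DataSource;

public class VacuumFromJdbc {
    // Sketch only: "my_table" is a placeholder. Autocommit must be enabled,
    // because VACUUM cannot run inside a transaction block.
    static void vacuum(DataSource ds) throws Exception {
        try (Connection con = ds.getConnection();
             Statement st = con.createStatement()) {
            con.setAutoCommit(true);
            st.execute("VACUUM (VERBOSE) my_table");
            // The VERBOSE output arrives as server notices, chained as warnings.
            for (SQLWarning w = st.getWarnings(); w != null; w = w.getNextWarning()) {
                System.out.println(w.getMessage());
            }
        }
    }
}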
Is the connection from the pool somehow corrupted and unable to see the updates? Is the isolation level the problem?
Visibility Map corruption?
Index referencing old tuples?
Side note: I have observed the same behavior with autovacuum on and the cost limit through the roof (4000-8000), with the default threshold + 5%. At first n_dead_tuples stays close to 0 for about 4-5 hours... The next day the table is 86 GB with millions of dead tuples. All the other tables are vacuumed and fine.
PS: I will try to log a VACUUM VERBOSE from the JDBC side.
PS2: Since we are running in AWS, could a backup be causing it to stop cleaning?
PS3: When referring to vacuum I mean a plain VACUUM, not VACUUM FULL. We are not issuing VACUUM FULL.
The main problem was that the vacuum was being run as a different user. The vacuuming I was seeing was actually the HOT updates plus the selects running over that data, resulting in on-the-fly cleanup of the pages.
Next: vacuuming is affected by long-running transactions ACROSS ALL schemas and tables. Yes, ALL schemas and tables. Changing to the correct user fixed the vacuum, but it will still be held back if there is an open (idle-in-transaction) session against any other schema.table.
maintenance_work_mem helps, but in the end, when the system is under heavy load, all vacuuming is effectively paused.
So we upgraded the DB's resources a bit and added a monitor to alert us if there are any issues.
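(A sketch of the kind of query such a monitor might run against pg_stat_activity to spot transactions old enough to hold back vacuum; the 10-minute threshold and the logging are illustrative.)

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.sql.DataSource;

public class VacuumBlockerMonitor {
    // List sessions whose transaction has been open for more than 10 minutes
    // (threshold is arbitrary); any of them can hold back cleanup cluster-wide.
    static void report(DataSource ds) throws Exception {
        String sql = "SELECT pid, usename, state, xact_start, query "
                   + "FROM pg_stat_activity "
                   + "WHERE xact_start < now() - interval '10 minutes' "
                   + "ORDER BY xact_start";
        try (Connection con = ds.getConnection();
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("pid=%d user=%s state=%s since=%s%n",
                        rs.getInt("pid"), rs.getString("usename"),
                        rs.getString("state"), rs.getTimestamp("xact_start"));
            }
        }
    }
}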
I have been using HikariCP in my Spring Boot application and I am starting to run some load tests with JMeter.
I noticed that the first time I run my tests it goes well, and each request takes around 30 ms.
But each time I run my tests again, against the same application instance, the response time gets worse, until it freezes and I get a whole lot of
Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30019ms.
at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:583)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:186)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:145)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:112)
at sun.reflect.GeneratedMethodAccessor501.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at net.bull.javamelody.JdbcWrapper$3.invoke(JdbcWrapper.java:805)
at net.bull.javamelody.JdbcWrapper$DelegatingInvocationHandler.invoke(JdbcWrapper.java:286)
at com.sun.proxy.$Proxy102.getConnection(Unknown Source)
at org.springframework.jdbc.datasource.DataSourceTransactionManager.doBegin(DataSourceTransactionManager.java:246)
... 108 common frames omitted
I even left the application idle for a day and tried again, but the tests show degraded performance and the same errors.
Only after I restart the application can I run my tests again, and even then only for one load (1200+ requests).
When I was developing the tests I was running my local app with an H2 database and didn't notice any degradation until I deployed my application to a server running PostgreSQL.
So, to take that variable out of the equation, I ran JMeter against my local H2 app and the degradation showed up there too.
Here is a test scenario I ran on my local app (H2 database) with the default HikariCP pool size (10), using 10 threads. I managed to run 25,000+ requests before the application stopped responding.
I plotted the requests:
Also, the tests consist of requests to a Spring Boot @RestController.
My controller calls a service that has @Transactional at the start (I call some legacy APIs that require a transaction to exist, so I open one right away).
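(Roughly the shape involved, with made-up names, assuming a standard Spring Boot setup.)

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Made-up names; only the structure (controller -> @Transactional service) matters.
@RestController
class LegacyController {
    private final LegacyService service;

    LegacyController(LegacyService service) {
        this.service = service;
    }

    @GetMapping("/legacy")
    String endpoint() {
        return service.doWork();
    }
}

@Service
class LegacyService {
    @Transactional // a pooled connection is held for the duration of this call
    public String doWork() {
        // ... calls into the legacy APIs that require an active transaction ...
        return "ok";
    }
}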
So let's say my tests request this endpoint 10 times in parallel. Let's also say that my code might have other points annotated with @Transactional. Would a pool size of 10 be enough?
Also, should any pool size eventually be enough, even with poor performance, or is it "normal" to have this kind of scenario where the pool just gets too busy and "locks up"?
I also tried increasing the pool size to 50, but the problem persists. It gets close to the 25,000 requests from the previous tests (with pool size 10) and then fails as described above.
HikariCP suggests using a small, constant-size pool saturated with threads waiting for connections. Per the docs, the suggested pool size is:
connections = ((core_count * 2) + effective_spindle_count)
A formula which has held up pretty well across a lot of benchmarks for years is
that for optimal throughput the number of active connections should be somewhere
near ((core_count * 2) + effective_spindle_count). Core count should not include
HT threads, even if hyperthreading is enabled. Effective spindle count is zero if
the active data set is fully cached, and approaches the actual number of spindles
as the cache hit rate falls. ... There hasn't been any analysis so far regarding
how well the formula works with SSDs.
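As an illustration of applying that formula, a small fixed-size HikariCP setup could look like the sketch below; the 4-core/fully-cached assumption, the JDBC URL, and the credentials are all placeholders.

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import javax.sql.DataSource;

public class PoolFactory {
    // Example only: a 4-core DB server with a fully cached working set gives
    // (4 * 2) + 0 = 8 connections. URL and credentials are placeholders.
    static DataSource create() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://db-host:5432/mydb");
        config.setUsername("app");
        config.setPassword("secret");
        config.setMaximumPoolSize(8);
        config.setMinimumIdle(8);            // fixed-size pool: min == max
        config.setConnectionTimeout(30_000); // fail after 30 s instead of queueing forever
        return new HikariDataSource(config);
    }
}

With Spring Boot the same values can also be set through the spring.datasource.hikari.* properties instead of building the pool by hand.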
An in-memory H2 with a small dataset will be faster than a standalone database running on a different server. Even if you are running in the same datacenter the round-trip between servers is usually around 0.5-1ms.
Try to find the current bottleneck first. If the application server doesn't run out of CPU, then the problem is somewhere else, e.g. the database server. If you can't figure out where the current bottleneck is, you may end up optimising in the wrong place.
So, it was a memory leak after all. Nothing to do with HikariCP.
We have some Groovy scripts using @Memoized with some really bad cache keys (huge objects), and that cache kept getting bigger until there was no memory left.
I have a problem with my Oracle DB network speed.
First of all, here is the essence of the problem: there is a Java application on my computer and an Oracle DB on a remote server, and the connection speed between them is about 2.5 MB/s. In my Java app I execute a very simple query like "select id, name from table_name"; the result set contains ~60K rows (about 1.5 MB) and takes ~80 seconds to transfer to my app. According to the profiler, the application spends most of that time in the oracle.net.Packet.receive method.
For comparison, the same query executes in SQL Developer in 0.5-0.7 seconds for 5,000 rows. Extrapolating to 60K rows gives about 6-8 seconds.
Running tcpdump against my application shows the data arriving in chunks of about 200 bytes. For SQL Developer, on the other hand, tcpdump shows packet sizes of more than 2000 bytes.
The official Oracle documentation suggests increasing the SDU and TDU parameters. Unfortunately I can't change the configuration of the database, so I tried to set them on the client side like this:
jdbc:oracle:thin:@(DESCRIPTION=(SDU=11280)(TDU=11280)(ADDRESS=(PROTOCOL=tcp)(HOST=<host>)(PORT=1521)(SEND_BUF_SIZE=11784)(RECV_BUF_SIZE=11784))(CONNECT_DATA=(SERVICE_NAME=<db>)))
But this didn't change anything. Can the database or the ojdbc driver ignore these parameters? Or am I on the wrong track entirely?
As it turned out, the reason was the fetch size. Increasing it cut the execution time by a factor of roughly 100.
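(For illustration, a minimal sketch of raising the fetch size on a plain JDBC statement; the value 1000 is just an example. The Oracle thin driver defaults to fetching 10 rows per round trip, which matches the tiny packets seen in tcpdump.)

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

public class FetchSizeExample {
    // Larger fetch size = more rows per network round trip. 1000 is an example.
    static void readAll(DataSource ds) throws Exception {
        try (Connection con = ds.getConnection();
             PreparedStatement ps = con.prepareStatement("select id, name from table_name")) {
            ps.setFetchSize(1000);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process rs.getLong(1), rs.getString(2) ...
                }
            }
        }
    }
}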
I have a Java batch job which does a select with a large result set (I process the elements using a Spring callback handler).
The callback handler puts a task in a fixed thread pool to process each row.
My pool size is fixed at 16 threads.
The result set contains about 100K elements.
All DB access code goes through JdbcTemplate or through Hibernate/Spring; there is no manual connection management.
I have tried with Atomikos and with Commons DBCP as connection pool.
Now, I would think that a maximum of 17 connections in my connection pool would be enough for this batch to finish: one for the select and 16 for the threads in the thread pool which update some rows. However, that seems to be too naive, as I have to specify a max pool size an order of magnitude larger (I haven't tried to find an exact value). First I tried 50, which worked on my local Windows machine but doesn't seem to be enough in our Unix test environment; there I have to specify 128 to make it work (again, I didn't even try a value between 50 and 128, I went straight to 128).
Is this normal? Is there some fundamental mechanism in connection pooling I'm missing? I find this hard to debug because I don't know how to see what is happening with the open connections. I tried various log4j settings but didn't get any satisfactory result.
Edit, additional info: when the connection pool size seems to be too low, the batch appears to hang. If I take a thread dump of the process I can see all the threads waiting for a new connection. At first I didn't specify the maxWait property on the DBCP connection pool, which causes threads to wait indefinitely for a new connection, and I noticed the batch kept hanging, so no connections were being released. However, that only happened after processing roughly 70K rows, which dismissed my initial hunch of a connection leak.
Edit 2: I forgot to mention that I already rewrote the update part of my tasks. I queue my updates in a ConcurrentLinkedQueue and flush it every 1000 elements, so I actually only issue about 100 batched updates.
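(A sketch of what that queue-and-flush could look like with JdbcTemplate.batchUpdate; the SQL, the column types, and the class name are illustrative.)

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import org.springframework.jdbc.core.JdbcTemplate;

public class UpdateBuffer {
    private final ConcurrentLinkedQueue<Object[]> pending = new ConcurrentLinkedQueue<>();
    private final JdbcTemplate jdbcTemplate;

    UpdateBuffer(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Tasks enqueue their update parameters; every 1000 entries one batched
    // statement flushes them, so ~100K rows turn into ~100 batch updates.
    void add(long id, String status) {
        pending.add(new Object[] { status, id });
        if (pending.size() >= 1000) {
            flush();
        }
    }

    synchronized void flush() {
        List<Object[]> batch = new ArrayList<>();
        Object[] args;
        while ((args = pending.poll()) != null) {
            batch.add(args);
        }
        if (!batch.isEmpty()) {
            jdbcTemplate.batchUpdate("UPDATE items SET status = ? WHERE id = ?", batch);
        }
    }
}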
Edit 3: I'm using Oracle and the java.util.concurrent utilities. I have an executor configured with a fixed pool size of 16 and I submit my tasks to it. I don't use connections manually in my tasks; I use JdbcTemplate, which is thread-safe and obtains its connections from the connection pool. I assume Spring/DBCP handles the connection/thread issue.
If you are using Linux, you can try MySQL Administrator to monitor your connection status graphically, provided you are using MySQL.
Irrespective of that, even 100 connections is not uncommon for large enterprise applications handling a few thousand requests per minute.
But if the request volume is low, or each request doesn't need its own transaction, then I would recommend tuning how the work is done inside your threads.
That is, how are you distributing the 100K elements across the 16 threads?
If you try to acquire a connection every time you read a row from the shared location (or buffer), it is bound to take time.
See whether this helps.
1. getConnection
2. for each element, until the buffer size becomes zero:
       process it
       if you need to update:
           open a transaction
           update
           commit/rollback the transaction
       go to step 2
3. release the connection
You can synchronize access to the buffer by using the java.util.concurrent collections.
Don't use one Runnable/Callable per element; that will degrade performance.
Also, how are you creating threads? Use Executors to run your Runnables/Callables. And remember that DB connections are NOT meant to be shared across threads, so use 1 connection in 1 thread at a time.
For example, create an Executor and submit 16 runnables, each having its own connection.
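(Putting those steps together, a rough sketch; the element type, the SQL, and the update check are placeholders, and error handling is kept to a minimum.)

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.sql.DataSource;

public class BatchRunner {

    interface Element {
        long id();
        String status();
        boolean needsUpdate();
    }

    // 16 workers, each holding one connection for its whole run and draining
    // a shared buffer of elements; one short transaction per update.
    static void run(DataSource ds, ConcurrentLinkedQueue<Element> buffer) {
        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (int i = 0; i < 16; i++) {
            pool.submit(() -> {
                try (Connection con = ds.getConnection()) { // one connection per worker
                    con.setAutoCommit(false);
                    Element e;
                    while ((e = buffer.poll()) != null) {    // until the buffer is empty
                        process(e);
                        if (e.needsUpdate()) {
                            try (PreparedStatement ps = con.prepareStatement(
                                    "UPDATE items SET status = ? WHERE id = ?")) {
                                ps.setString(1, e.status());
                                ps.setLong(2, e.id());
                                ps.executeUpdate();
                                con.commit();
                            } catch (Exception ex) {
                                con.rollback();
                            }
                        }
                    }
                } catch (Exception ex) {
                    throw new RuntimeException(ex);
                }
            });
        }
        pool.shutdown();
    }

    static void process(Element e) {
        // row-level work goes here
    }
}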
I switched to c3p0 instead of DBCP. In c3p0 you can specify a number of helper threads. I noticed that if I set that number as high as the number of threads I'm using, the number of connections stays really low (I used the handy JMX bean of c3p0 to inspect the active connections). Also, I have several dependencies, each with its own entity manager. Apparently a new connection is needed for each entity manager, so I have about 4 entity managers per thread, which would explain the high number of connections. I think my tasks are all so short-lived that DBCP couldn't keep up with closing/releasing connections; since c3p0 works more asynchronously and lets you specify the number of helper threads, it is able to release my connections in time.
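(For reference, a rough c3p0 setup along those lines; driver, URL, credentials, and sizes are placeholders.)

import com.mchange.v2.c3p0.ComboPooledDataSource;

public class C3p0Setup {
    // Values are placeholders; the point is numHelperThreads sized to match
    // the worker threads so closes/releases are processed in time.
    static ComboPooledDataSource create() throws Exception {
        ComboPooledDataSource ds = new ComboPooledDataSource();
        ds.setDriverClass("oracle.jdbc.OracleDriver");
        ds.setJdbcUrl("jdbc:oracle:thin:@//db-host:1521/SERVICE");
        ds.setUser("app");
        ds.setPassword("secret");
        ds.setMaxPoolSize(64);
        ds.setNumHelperThreads(16);
        return ds;
    }
}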
Edit: but the batch keeps hanging when deployed to the test environment. All threads block when releasing the connection; the lock is on the pool. Just the same as with DBCP :(
Edit: all my problems disappeared when I switched to BoneCP, and I got a huge performance increase as a bonus too.
Is there any way to guarantee that an application won't fail to release row locks in Oracle? If I make sure to put commit statements in finally blocks, that handles the case of unexpected errors, but what if the app process just suddenly dies before it commits (or someone kicks out the power cord / LAN cable)?
Is there a way to have Oracle automatically roll back idle sessions after X amount of time? Or roll back when it somehow detects that the connection was lost?
From the experiments I've done, if I terminate an app process before it commits, the row locks stay forever until I log into the database and manually kill the session.
Thanks.
Try setting SQLNET.EXPIRE_TIME in your sqlnet.ora.
SQLNET.EXPIRE_TIME=10
From the documentation:
Purpose
To specify a time interval, in minutes, to send a check to verify that client/server connections are active.
COMMIT inside finally is probably the last thing you should do since you should (almost) never commit anything that threw an exception.
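(In JDBC terms, a minimal sketch of that: commit only on the success path, roll back on failure, and close the connection in finally so its locks are released either way.)

import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

public class TxExample {
    // Commit only when the work succeeded; roll back if it threw.
    // Closing the connection in finally releases its locks either way.
    static void doWork(DataSource ds) throws SQLException {
        Connection con = ds.getConnection();
        try {
            con.setAutoCommit(false);
            // ... updates ...
            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        } finally {
            con.close();
        }
    }
}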
I am not a DBA, so I am sure you can find a better solution...
but there are certain deadlock conditions that seem to happen that will not roll back on their own. My last DBA had a process that ran every minute and killed anything that had been running for more than 10 minutes.