We have already deployed an application containing heavy business logic that my company uses. After some time, performance became noticeably slower than before. In the WebLogic data source configuration we set the maximum connections to only 100, but recently the active connection count keeps increasing until it hits that limit.
We reconfigured the data source to 200, but it keeps on increasing. This is not ideal, because 100 is the maximum connection count we want in this deployment.
Meanwhile, there were also some stuck threads on the server, but I don't think they are the problem. Does anyone know why this is suddenly occurring? (It started after deployment of a newer, supposedly stable version.)
From the screenshot attached I can see that the Active Connection Count is ~80, and that connections are being leaked.
Enable the Inactive Connection Timeout by setting it to some value (based on the average time taken to execute a statement).
Make sure that all JDBC connections in your code are closed after use.
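A common source of such leaks is a code path that returns or throws before close() runs. One way to rule that out is try-with-resources, which closes everything in reverse order of acquisition even when an exception is thrown. The sketch below uses a stand-in class instead of real java.sql objects so it runs without a database; real code would declare the Connection, Statement, and ResultSet in the same shape.

```java
import java.util.ArrayList;
import java.util.List;

public class CloseOrderDemo {
    // Stand-in for Connection/Statement/ResultSet so the sketch runs
    // without a live database; real code would use the java.sql types
    // in exactly the same try-with-resources shape.
    static class Resource implements AutoCloseable {
        final String name;
        final List<String> log;
        Resource(String name, List<String> log) { this.name = name; this.log = log; }
        @Override public void close() { log.add("closed " + name); }
    }

    static List<String> run() {
        List<String> log = new ArrayList<>();
        // try-with-resources closes in reverse declaration order, even if
        // the body throws -- which is exactly what leaking JDBC code misses.
        try (Resource conn = new Resource("connection", log);
             Resource stmt = new Resource("statement", log);
             Resource rs = new Resource("resultset", log)) {
            // ... execute the query and read the result set here ...
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [closed resultset, closed statement, closed connection]
    }
}
```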
First of all, I know it's odd to rely on a manual vacuum from the application layer, but this is how we decided to run it.
I have the following stack:
HikariCP
JDBC
Postgres 11 in AWS
Now here is the problem. When we start fresh with brand-new tables and autovacuum=off, the manual vacuum works fine: I can see the number of dead tuples growing up to the threshold and then going back to 0. The tables are updated heavily in parallel connections (HOT is being used as well). At some point the number of dead rows sits around 100k, jumps up to the threshold, and drops back to 100k, and n_dead_tuples slowly creeps upward.
Now the worst of all: when you issue VACUUM from a pg console, ALL the dead tuples are cleaned. Oddly enough, when the application issues VACUUM it also succeeds, but it only partially cleans, about a threshold's worth of records, not all of them. Why?
Now I am pretty sure about the following:
Neither ANALYZE nor autovacuum is running
There are no long running transactions
No replication is going on
These tables are "private"
What is the difference between issuing a VACUUM from the console with auto-commit on vs. via JDBC? Why does the VACUUM issued from the console clean ALL the tuples, whereas the VACUUM from JDBC cleans them only partially?
The JDBC vacuum is run on a fresh connection from the pool with the default isolation level. Yes, there are updates going on in parallel, but that is also the case when a vacuum is executed from the console.
Is the connection from the pool somehow corrupted and can not see the updates? Is the ISOLATION the problem?
Visibility Map corruption?
Index referencing old tuples?
Side note: I have observed the same behavior with autovacuum on and the cost limit through the roof, like 4000-8000, with the default threshold + 5%. At first n_dead_tuples stays close to 0 for some 4-5 hours... The next day the table is 86 GB with millions of dead tuples. All the other tables are vacuumed and fine...
PS: I will try to log a VACUUM VERBOSE from JDBC.
PS2: Because we are running in AWS, could a backup be causing it to stop cleaning?
PS3: When referring to vacuum I mean a plain VACUUM, not VACUUM FULL. We are not issuing VACUUM FULL.
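For anyone comparing the two paths: PostgreSQL refuses to run VACUUM inside a transaction block, so the JDBC connection must be in auto-commit mode when issuing it. A minimal sketch of doing that from the application (URL, credentials, and table name are all hypothetical):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ManualVacuum {
    // Builds the command text; the identifier is assumed to be a trusted,
    // application-known table name (never user input).
    static String vacuumSql(String table) {
        return "VACUUM (VERBOSE) " + table;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical connection details.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "app_user", "secret")) {
            conn.setAutoCommit(true); // VACUUM cannot run inside a transaction block
            try (Statement st = conn.createStatement()) {
                st.execute(vacuumSql("my_table"));
            }
        }
    }
}
```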
The main problem was that the vacuum was being run as a different user (VACUUM simply skips tables the connecting user does not own). The "vacuuming" I was seeing was actually the HOT updates plus the selects running over that data, resulting in on-the-fly cleanup of the pages.
Next: vacuuming is affected by long-running transactions ACROSS ALL schemas and tables. Yes, ALL schemas and tables. Changing to the correct user fixed the vacuum, but dead rows will still be retained if there is an idle-in-transaction session on any other schema.table.
maintenance_work_mem helps, but in the end, when the system is under heavy load, all vacuuming is effectively paused.
So we upgraded the DB's resources a bit and added a monitor to help us if there are any issues.
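A monitor along those lines only needs one query against pg_stat_activity to spot old transactions that hold back vacuum. The 10-minute threshold and the connection details below are assumptions, not recommendations:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VacuumBlockerMonitor {
    // Sessions with a transaction open longer than the interval below
    // hold back vacuum for the whole cluster, not just their own tables.
    static final String BLOCKER_SQL =
        "SELECT pid, usename, state, now() - xact_start AS xact_age " +
        "FROM pg_stat_activity " +
        "WHERE xact_start IS NOT NULL " +
        "  AND now() - xact_start > interval '10 minutes'";

    public static void main(String[] args) throws Exception {
        // Hypothetical connection details.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/mydb", "monitor", "secret");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(BLOCKER_SQL)) {
            while (rs.next()) {
                System.out.printf("pid=%d user=%s state=%s age=%s%n",
                    rs.getInt("pid"), rs.getString("usename"),
                    rs.getString("state"), rs.getString("xact_age"));
            }
        }
    }
}
```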
So I've been tracking a bug for a day or two now, which happens on a remote server that I have little control over. The ins and outs of my code: I provide a jar file to our UI team which wraps Postgres and provides storage for data that users import. The import process is very slow for multiple reasons, one of which is that users import unpredictable, large amounts of data (which we can't really cut down on). This has led to a whole plethora of timeout issues.
After some preliminary investigation, I've narrowed it down to the JDBC connection to the Postgres database timing out. I had a lot of trouble replicating this on my local test setup, but finally managed it by reducing the 'socketTimeout' connection property to 10s (there's more than 10s between each call made on the connection).
My question now is: what is the best way to keep this connection alive? I've set 'tcpKeepAlive' to true, but this doesn't seem to have an effect. Do I need to poll the connection manually or something? From what I've read, I'm assuming the keep-alive probing is automatic and controlled by the OS. If that is true, I don't really have control over the OS settings in the runtime environment; what would be the best way to handle this?
I was considering testing the connection each time it is used and, if it has timed out, just creating a new one. Would this be the correct course of action, or is there a better way to keep the connection alive? I've just taken a look at this post, where people suggest that you should open and close a connection per query:
When my app loses connection, how should I recover it?
In my situation I have a series of sequential inserts which take place on a single thread; if a single one fails, they all fail. To achieve this I've used transactions:
m_Connection.setAutoCommit(false);               // start a transaction
m_TransactionSave = m_Connection.setSavepoint(); // savepoint to roll back to on failure

// ... perform the sequential inserts ...

m_Connection.commit();                           // all inserts succeeded
m_TransactionSave = null;
m_Connection.setAutoCommit(true);                // back to auto-commit mode
If I do keep reconnecting, or use a connection pool like PGBouncer (like someone suggested in comments), how do I persist this transaction across them?
JDBC connections to PostgreSQL can be configured with a keep-alive setting. An issue was raised against this functionality here: JDBC keep alive issue. Additionally, there's the parameter help page.
From the notes on that, you can add the following to your connection parameters for the JDBC connection:
tcpKeepAlive=true
Reducing the socketTimeout should make things worse, not better. socketTimeout is a measure of how long a connection should wait when it expects data to arrive but none has. My instinct would be to make it longer, not shorter.
Is it possible that you are using PGBouncer? That process will actively kill connections from the server side if there is no activity.
Finally, if you are running on Linux, you can change the TCP keep-alive behavior at the OS level with: keep alive settings. I am sure something similar exists for Windows.
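Putting the driver-side pieces together, both settings can be passed as connection properties to the PostgreSQL JDBC driver. The credentials and numeric values below are only illustrative, not recommendations:

```java
import java.util.Properties;

public class PgConnectionProps {
    static Properties keepAliveProps() {
        Properties props = new Properties();
        props.setProperty("user", "app_user");     // hypothetical credentials
        props.setProperty("password", "secret");
        props.setProperty("tcpKeepAlive", "true"); // enable TCP keep-alive probes
        props.setProperty("socketTimeout", "300"); // seconds; 0 disables the read timeout
        return props;
    }

    public static void main(String[] args) {
        // With the PostgreSQL JDBC driver on the classpath these would be passed as:
        // DriverManager.getConnection("jdbc:postgresql://host/db", keepAliveProps());
        System.out.println(keepAliveProps());
    }
}
```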
I have a memory leak in two apps in Tomcat 6.0.35 server that appeared "out of nowhere". One app is Solr and the other is our own software. I'm hoping someone has seen this before as it's been happening to me for the last few weeks and I have to keep restarting Tomcat in a production environment.
It appeared on our original server even though none of the code related to threads or DB connection handling had been touched. As the old server this app ran on was due to be retired, I migrated the site to a new server and a "cleaner" environment, with the idea that this would clear out any legacy cruft. But it continues to happen.
Just before Tomcat shuts down the catalina.out log is filled with errors like:
2012-04-25 21:46:00,300 [main] ERROR org.apache.catalina.loader.WebappClassLoader- The web application [/AppName] appears to have started a thread named [MultiThreadedHttpConnectionManager cleanup] but has failed to stop it. This is very likely to create a memory leak.
2012-04-25 21:46:00,339 [main] ERROR org.apache.catalina.loader.WebappClassLoader- The web application [/AppName] appears to have started a thread named [com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#2] but has failed to stop it. This is very likely to create a memory leak.
2012-04-25 21:46:00,470 [main] ERROR org.apache.catalina.loader.WebappClassLoader- The web application [/AppName] is still processing a request that has yet to finish. This is very likely to create a memory leak. You can control the time allowed for requests to finish by using the unloadDelay attribute of the standard Context implementation.
During that migration we went from Solr 1.4 to Solr 3.6 in an attempt to fix the problem. When the errors above start filling the log, the Solr error below follows right behind, repeated 10-15 times, and then Tomcat stops responding and I have to shut it down and start it up again to get it to respond.
2012-04-25 21:46:00,527 [main] ERROR org.apache.catalina.loader.WebappClassLoader- The web application [/solr] created a ThreadLocal with key of type [org.apache.solr.schema.DateField.ThreadLocalDateFormat] (value [org.apache.solr.schema.DateField$ThreadLocalDateFormat#1f1e90ac]) and a value of type [org.apache.solr.schema.DateField.ISO8601CanonicalDateFormat] (value [org.apache.solr.schema.DateField$ISO8601CanonicalDateFormat#6b2ed43a]) but failed to remove it when the web application was stopped. This is very likely to create a memory leak.
My research has turned up a lot of suggestions about changing the code that manages threads to make sure it kills off pooled DB connections etc., but this code has not been changed in nearly 12 months. Also, the Solr application is crashing too, and that's third party, so my thinking is that this is environmental (a jar conflict, versioning, a fat-fingered config?).
My last change was updating the MySQL connector for Java to the latest version, as some memory-leak bugs existed around pooling in earlier releases, but the server just crashed again only a few hours later.
One thing I just noticed is I'm seeing thousands of sessions in the Tomcat web manager but that could be a red herring.
If anyone has seen this any help is very much appreciated.
[Edit]
I think I found the source of the problem; it wasn't a memory leak after all. I've taken over an application from another development team that uses c3p0 for database pooling via Hibernate. c3p0 has a bug/feature: if you don't release DB connections, c3p0 can go into a waiting state once all the connections (MaxPoolSize, default 15) are in use. It will wait indefinitely for a connection to become available. Hence my stall.
I upped MaxPoolSize first from 25 to 100, and my application ran for several days without a hang; then from 100 to 1000, and it's been running steadily ever since (over 2 weeks).
This isn't the complete solution, as I still need to find out why it's running out of pooled connections, so I also set c3p0's unreturnedConnectionTimeout to 4 hours, which enforces a 4-hour time limit on all connections regardless of whether they're active or not. If a connection is active, it will be closed and re-opened.
Not pretty, and c3p0's authors don't recommend it, but it gives me some breathing space to find the source of the problem.
Note: when using c3p0 with Hibernate, the settings are stored in your persistence.xml file, but not all settings can be put there. Some (e.g. unreturnedConnectionTimeout) must go in c3p0.properties.
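For reference, a minimal c3p0.properties along those lines might look like this (the values are illustrative, not recommendations; unreturnedConnectionTimeout is in seconds, so 14400 is 4 hours):

```properties
# c3p0.properties -- must be on the classpath; these keys cannot go in persistence.xml
c3p0.unreturnedConnectionTimeout=14400
# Log a stack trace of where each unreturned connection was checked out,
# which helps locate the leak itself rather than just papering over it.
c3p0.debugUnreturnedConnectionStackTraces=true
```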
You state that the sequence of events is:
errors appear
Tomcat stops responding
restart is required
However, the memory-leak error messages only get reported when a web application is stopped. Therefore, something is triggering your web applications to stop (or reload). You need to figure out what is triggering this and stop it.
Regarding the actual leaks, you may find this useful:
http://people.apache.org/~markt/presentations/2010-11-04-Memory-Leaks-60mins.pdf
It looks like both your app and Solr have some leaks that need to be fixed. The presentation will provide you with some pointers. I would also consider an upgrade to the latest 7.0.x: the memory-leak detection has been improved, and not all improvements have made it into 6.0.x yet.
I have a Java batch job which does a select with a large result set (I process the elements using a Spring callback handler).
The callback handler puts a task in a fixed thread pool to process each row.
My pool size is fixed at 16 threads.
The result set contains about 100k elements.
All DB access code goes through a JdbcTemplate or through Hibernate/Spring; no manual connection management is present.
I have tried with Atomikos and with Commons DBCP as connection pool.
Now, I would think that 17 max connections in my connection pool would be enough for this batch to finish: one for the select and 16 for the threads that update some rows. However, that seems to be too naive, as I have to specify a max pool size an order of magnitude larger (I haven't tried to find an exact value). First I tried 50, which worked on my local Windows machine but doesn't seem to be enough in our Unix test environment. There I have to specify 128 to make it work (again, I didn't even try a value between 50 and 128, I went straight to 128).
Is this normal? Is there some fundamental mechanism in connection pooling I'm missing? I find this hard to debug, as I don't know how to see what happens to the opened connections. I tried various log4j settings but didn't get any satisfactory result.
edit, additional info: when the connection pool size appears to be too low, the batch seems to hang. If I take a thread dump (jstack) of the process, I can see all threads waiting for a new connection. At first I didn't specify the maxWait property on the DBCP connection pool, which causes threads to wait indefinitely for a new connection, and I noticed the batch kept hanging. So no connections were being released. However, that only happened after processing roughly 70k rows, which dismissed my initial hunch of a connection leak.
edit2: I forgot to mention I already rewrote the update part in my tasks. I queue my updates in a ConcurrentLinkedQueue and drain it every 1000 elements, so I actually only do about 100 batched updates.
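That drain-every-1000 pattern can be sketched without touching a database; here the flush just counts batches in place of executing a multi-row UPDATE, so the numbers line up with the description (100k rows, about 100 flushes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class BatchedUpdates {
    static final int BATCH_SIZE = 1000;
    final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
    final AtomicInteger flushes = new AtomicInteger();

    // Producers call this from the worker threads.
    void enqueue(String update) {
        pending.add(update);
        if (pending.size() >= BATCH_SIZE) {
            flush();
        }
    }

    // Drains up to BATCH_SIZE queued updates; in the real batch this would
    // execute one multi-row UPDATE instead of just counting.
    synchronized void flush() {
        List<String> batch = new ArrayList<>(BATCH_SIZE);
        String u;
        while (batch.size() < BATCH_SIZE && (u = pending.poll()) != null) {
            batch.add(u);
        }
        if (!batch.isEmpty()) {
            flushes.incrementAndGet();
        }
    }

    public static void main(String[] args) {
        BatchedUpdates b = new BatchedUpdates();
        for (int i = 0; i < 100_000; i++) {
            b.enqueue("row-" + i);
        }
        b.flush(); // drain any remainder
        System.out.println(b.flushes.get() + " batched updates"); // prints "100 batched updates"
    }
}
```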
edit3: I'm using Oracle, and I am using the java.util.concurrent utilities: I have an executor configured with a fixed pool size of 16, and I submit my tasks to it. I don't use connections manually in my tasks; I use JdbcTemplate, which is thread-safe and gets its connections from the connection pool. I suppose Spring/DBCP handles the connection/thread issue.
If you are on Linux, you can try MySQL Administrator to monitor your connection status graphically, provided you are using MySQL.
Irrespective of that, even 100 connections is not uncommon for large enterprise applications, handling a few thousand requests per minute.
But if the request volume is low, or each request doesn't need a unique transaction, then I would recommend tuning the operations inside your threads.
That is, how are you distributing the 100k elements to 16 threads?
If you try to acquire a connection every time you read a row from the shared location (or buffer), it is bound to take time.
See whether this helps.
getConnection
while the buffer is not empty:
    take the next element and process it
    if you need to update:
        open a transaction
        update
        commit/rollback the transaction
release the connection
You can synchronize the buffer by using the java.util.concurrent collections.
Don't use one Runnable/Callable per element; that will degrade performance.
Also, how are you creating threads? Use Executors to run your Runnables/Callables. Remember too that DB connections are NOT meant to be shared across threads, so use one connection in one thread at a time.
For example, create an Executor and submit 16 runnables, each having its own connection.
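That suggestion can be sketched as follows; the Db class here is a stand-in for a real per-worker java.sql.Connection so the example runs anywhere, and the per-row work is simulated by a counter:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PerThreadConnectionBatch {
    // Stand-in for one JDBC connection; real code would open a
    // java.sql.Connection per worker and close it when the worker finishes.
    static class Db {
        final AtomicInteger processed = new AtomicInteger();
        void update(int element) { processed.incrementAndGet(); }
    }

    static int run(int elements, int workers) throws InterruptedException {
        BlockingQueue<Integer> buffer = new LinkedBlockingQueue<>();
        for (int i = 0; i < elements; i++) buffer.add(i);

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger total = new AtomicInteger();
        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                Db db = new Db(); // one "connection" per worker thread, never shared
                Integer element;
                while ((element = buffer.poll()) != null) {
                    db.update(element); // process + update, as in the steps above
                }
                total.addAndGet(db.processed.get());
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return total.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(100_000, 16)); // every element handled exactly once
    }
}
```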
I switched to c3p0 instead of DBCP. In c3p0 you can specify a number of helper threads, and I noticed that if I set that number as high as the number of threads I'm using, the number of connections stays really low (using c3p0's handy JMX bean to inspect the active connections). Also, I have several dependencies, each with its own entity manager. Apparently a new connection is needed for each entity manager, so I have about 4 entity managers per thread, which would explain the high number of connections. I think my tasks are all so short-lived that DBCP couldn't keep up with closing/releasing connections; since c3p0 works more asynchronously and lets you specify the number of helper threads, it is able to release my connections in time.
edit: but the batch keeps hanging when deployed to the test environment; all threads block when releasing a connection, with the lock held on the pool. Just the same as with DBCP :(
edit: all my problems disappeared when I switched to BoneCP, and I got a huge performance increase as a bonus too
We are running a Java EE web application on JBoss that uses PostgreSQL 8.0.9 as the database.
One page in the application runs a big, complicated query when it is loaded. We had a problem that manifested if a user requested this page and closed their browser window before the page was returned to the client. The closing of the window would spawn a new PostgreSQL thread/process (viewable via top), and that thread/process would take a long time to switch from SELECT to idle in the top output. If approximately 5 or more users did this within a small window of time, the spawned threads/processes kept accumulating, stayed in SELECT instead of switching to idle, and consumed a lot of CPU, causing major performance problems. It is important to mention that if the users who had closed the browser window logged out, the associated threads/processes would switch to idle and the CPU use would decrease. It is also important to mention that if JBoss was restarted, the applicable threads/processes would switch to idle (as all the users would be logged out by the restart).
The problem of the hanging threads/processes seems to have been resolved by a database backup and RESTORE. Now the new threads/processes that are spawned are switched from SELECT to idle in a generally short period of time and the CPU is not burdened by them as much. Also, performance on large complicated queries in general seems to have improved significantly since the RESTORE.
We run VACUUM every 24 hours on the database. We do not run REINDEX on the database because of data corruption risks. We do tend to have rather high await numbers on iostat outputs, especially in the performance problem cases described above.
What happens to a database when it is dumped and restored (e.g. an implicit REINDEX, etc.)? Which of these effects seems to be the key to our solution?
Is there a setting that manages the number of threads/processes spawned when browser windows are closed before a page with a large, complicated query is returned to the client? Is there a setting to manage the transition of such threads/processes from SELECT to idle? Is there a way to manage either of these at the application level?
Version 8.0 is already EOL, and version 8.0.9 hasn't been patched in a long time either: 8.0.26 was the last release. You are missing many patches and should at least update to the latest 8.0 version, but also start a migration to a version that is still supported. Since versions 8.2 and 8.3, performance has become much better.
Question: why do you think REINDEX corrupts your data? Corruption of data would make that claim pretty self-defeating... REINDEX is not something you do every day, but sometimes you need it.