We are running a Java EE web application in JBoss that is using PostgreSQL 8.0.9 as the database.
One page in the application runs a big, complicated query when it is loaded. We had a problem that manifested when a user requested this page and then closed the browser window before the page was returned to the client. Closing the window would spawn a new PostgreSQL thread/process (viewable via top), and that thread/process would take a long time to switch from SELECT to idle in the top output. If approximately 5 or more users did this (closed the browser window before the page with the large, complicated query returned to the client) within a small window of time, the spawned threads/processes accumulated, stayed in SELECT instead of switching to idle, and consumed a lot of CPU, causing major performance problems. It is important to mention that if the users who had closed their browser windows logged out, the associated threads/processes would switch to idle and CPU use would decrease. It is also important to mention that if JBoss was restarted, the threads/processes in question would switch to idle (as all users are logged out by the restart).
The problem of the hanging threads/processes seems to have been resolved by a database backup and restore. Now the newly spawned threads/processes switch from SELECT to idle in a generally short period of time and do not burden the CPU nearly as much. Performance on large, complicated queries in general also seems to have improved significantly since the restore.
We run VACUUM every 24 hours on the database. We do not run REINDEX on the database because of data corruption risks. We do tend to have rather high await numbers on iostat outputs, especially in the performance problem cases described above.
What happens to a database when it is dumped and restored (e.g. an effective REINDEX, etc.)? Which of these effects seems to be the key to our solution?
Is there a setting that manages the number of threads/processes spawned when browser windows are closed before a page with a large, complicated query is returned to the client? Is there a setting to manage the transition of such threads/processes from SELECT to idle? Is there a way to manage either of these at the application level?
Version 8.0 is already EOL, and 8.0.9 hasn't been patched in a long time either: 8.0.26 was the last release in that branch. You are missing many patches and should at least update to the latest 8.0 version, but you should also start a migration to a version that is still supported. Performance has improved considerably since versions 8.2 and 8.3.
Question: why do you think REINDEX corrupts your data? If it actually corrupted data, the command would be pretty useless... REINDEX is not something you would do every day, but sometimes you need it.
First of all, I know it's odd to rely on a manual VACUUM from the application layer, but this is how we decided to run it.
I have the following stack:
HikariCP
JDBC
Postgres 11 in AWS
Now here is the problem. When we start fresh with brand-new tables and autovacuum=off, the manual VACUUM works fine: I can see the number of dead tuples growing up to the threshold and then dropping back to 0. The tables are being updated heavily by parallel connections (HOT updates are happening as well). At some point, though, the number of dead rows sits at around 100k, jumps up to the threshold, and only drops back to 100k, and n_dead_tuples slowly creeps upward.
The worst part of all is that when you issue VACUUM from a pg console, ALL the dead tuples are cleaned. Oddly enough, when the application issues VACUUM it completes successfully but only cleans a "threshold amount" of records, not all of them.
Now I am pretty sure about the following:
Neither ANALYZE nor autovacuum is running
There are no long running transactions
No replication is going on
These tables are "private"
What is the difference between issuing a VACUUM from the console with auto-commit on versus from JDBC? Why does the VACUUM issued from the console clean ALL the tuples, whereas the VACUUM from JDBC cleans them only partially?
The JDBC VACUUM is run on a fresh connection from the pool with the default isolation level. Yes, there are updates going on in parallel, but that is also the case when a VACUUM is executed from the console.
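For reference, the application-side VACUUM looks roughly like the sketch below. This is illustrative only (the table name, pool wiring and class name are assumptions, not our actual code); it just shows the relevant detail that the statement runs on a pooled connection with autocommit on, since VACUUM cannot run inside a transaction block:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;
    import javax.sql.DataSource;

    public class ManualVacuum {

        private final DataSource dataSource; // e.g. the HikariCP pool, injected elsewhere

        public ManualVacuum(DataSource dataSource) {
            this.dataSource = dataSource;
        }

        /** Issues a plain VACUUM (not FULL) on one table, outside any transaction block. */
        public void vacuum(String tableName) throws SQLException {
            try (Connection conn = dataSource.getConnection();
                 Statement stmt = conn.createStatement()) {
                conn.setAutoCommit(true); // VACUUM cannot run inside a transaction block
                stmt.execute("VACUUM " + tableName); // table name comes from a fixed whitelist, not user input
            }
        }
    }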
Is the connection from the pool somehow corrupted and cannot see the updates? Is the isolation level the problem?
Visibility Map corruption?
Index referencing old tuples?
Side note: I have observed the same behavior with autovacuum on and the cost limit through the roof (4000-8000), with the default threshold + 5%. At first n_dead_tuples stays close to 0 for 4-5 hours... The next day the table is 86 GB with millions of dead tuples. All the other tables are vacuumed and fine...
PS: I will try to log a VACUUM VERBOSE from the JDBC side.
PS2: Since we are running in AWS, could it be a backup that is causing it to stop cleaning?
PS3: When referring to vacuum I mean a plain VACUUM, not VACUUM FULL. We are not issuing VACUUM FULL.
The main problem was that the VACUUM was being run by another user. The vacuuming I was seeing was actually the HOT updates plus the selects running over that data, which result in on-the-fly cleanup of the page.
Next: vacuuming is affected by long-running transactions ACROSS ALL schemas and tables. Yes, ALL schemas and tables. Changing to the correct user fixed the VACUUM, but its cleanup will still be held back if there is an open transaction against any other schema.table.
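A quick way to spot this kind of thing is to list every backend that currently has an open transaction, since any of them holds back the cleanup horizon regardless of which table it touches. This is a hedged sketch (the connection details and class name are placeholders), using the standard pg_stat_activity view:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class OpenTransactionCheck {
        public static void main(String[] args) throws Exception {
            // Connection details are placeholders for this sketch.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/mydb", "monitor", "secret");
                 Statement stmt = conn.createStatement();
                 // Any backend with a non-null xact_start has an open transaction and
                 // can hold back VACUUM's cleanup, no matter which schema it touches.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT pid, usename, state, now() - xact_start AS xact_age, query " +
                     "FROM pg_stat_activity WHERE xact_start IS NOT NULL ORDER BY xact_start")) {
                while (rs.next()) {
                    System.out.printf("pid=%d user=%s state=%s age=%s%n",
                        rs.getInt("pid"), rs.getString("usename"),
                        rs.getString("state"), rs.getString("xact_age"));
                }
            }
        }
    }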
maintenance_work_mem helps, but in the end, when the system is under heavy load, all vacuuming is paused.
So we upgraded the DB's resources a bit and added a monitor to help us if there are any issues.
Problem Description: We have a web application which is used by 200-300 people per day. The application slows down two or three times a day at certain hours, with its home page load time going from 6-7 seconds to 11-13 seconds. The application is deployed on JBoss AS 7.2. There are 4-5 other applications deployed on the same JBoss instance (same port). These applications are web services (REST & SOAP web services used by other applications of the same company, which I am not aware of) that use the same database as the main application that is having the slowness issues. The application is built with the following technology stack:
Frontend: Angular JS, Angular UI, JqueryUI, JSON
Backend: Spring REST controllers, Java 7, JDBC
Database: Oracle 11g, PL/SQL
It's only been 4 months since the application's response time soared. We had a production release 4 months ago in which a lot of data filtering based on certain parameters was added. That code is implemented in PL/SQL, and some filtering of data is also done in the front end. The response time has increased since this release. (Note: during this period the number of users and the amount of data have also increased significantly.)
So far I have tried to improve the performance by minimising JavaScript files, reducing the downloaded DOM content from 2.8 MB to 1.2 MB. I have also optimised some of the queries used for data filtering. With that I have been able to bring the home page load time down to 9-10 seconds on average, which is still considerably more than the client's expectation.
I would like to know how to tackle this kind of issue and what things I should bear in mind that might be causing this problem.
At present the production JVM configuration is -Xms 64 MB, -Xmx 256 MB. Will increasing the memory help?
Should I remove the PL/SQL code, rewrite it in Java, and use multithreading?
During peak time CPU usage gets quite high, around 85-95 percent. The main tables are also used by many other applications (for example a cron job that calls a Java program to send email notifications). What can be done about that?
I have fixed this issue now. As per the comments and suggestions, I timed queries against the database, checked the database logs and monitored daily CPU usage. I did the same for the application server and analysed it using jvisualvm.
I did a bit of everything: minimising static content, optimising queries, removing unnecessary logging. The significant change, however, came from JVM tuning (heap size -Xms1024m, -Xmx1536m, PermGen 512m, among other things). Performance has now improved a lot, bringing the average home page load time (after login) down to 4-5 seconds from 10-13 seconds.
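For anyone verifying the same kind of change, a quick way to confirm the heap the JVM actually received (independent of what is written in the JBoss configuration) is to log it from inside the application; a minimal sketch, nothing application-specific assumed:

    public class HeapReport {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            long mb = 1024L * 1024L;
            // maxMemory() reflects -Xmx; totalMemory() is the currently committed heap (starts near -Xms).
            System.out.println("Max heap (-Xmx)  : " + rt.maxMemory() / mb + " MB");
            System.out.println("Committed heap   : " + rt.totalMemory() / mb + " MB");
            System.out.println("Free in committed: " + rt.freeMemory() / mb + " MB");
        }
    }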
There is still room for improvement on the database side. Some queries and PL/SQL blocks have to be optimized, but it is nonetheless much better than before.
I have been fighting with this issue for ages and I cannot for the life of me figure out what the problem is. Let me set the stage for the stack we are using:
Web-based Java 8 application
GWT
Hibernate 4.3.11
MySQL
MongoDB
Spring
Tomcat 8 (using Tomcat connection pooling instead of, for example, C3P0)
Hibernate Search / Lucene
Terracotta and EhCache
The problem is that every couple of days (sometimes every second day, sometimes once every 10 days, it varies) in the early hours of the morning, our application "locks up". To clarify, it does not crash, you just cannot log in or do anything for that matter. All background tasks - everything - just halt. If we attempt to login when it is in this state, we can see in our log file that it is authenticating us as a valid user, but no response is ever sent so the application just "spins".
The only pattern we have found to date related to when these "lock ups" occur is that they happen while our morning scheduled tasks or SAP imports are running. It is not always the same process that is running, though: sometimes the lock up happens during one of our SAP imports and sometimes during internal scheduled task execution. All that these things have in common is that they run outside of business hours (between 1am and 6am) and that they are quite intensive processes.
We are using JavaMelody for monitoring, and what we see every time is that, starting at different times in this 1-6am window, the number of used JDBC connections just starts to spike (as per the attached image). Once that starts, it is just a matter of time before the lock up occurs, and the only way to resolve it is to bounce Tomcat, thereby restarting the application.
As far as I can tell, memory, CPU, etc. are all fine when the lock up occurs; the only thing that looks like it has an issue is the constantly increasing number of used JDBC connections.
I have checked the code for our transaction management so many times to ensure that transactions are being closed off correctly (the transaction management code is quite old-fashioned: explicit begin and commit in the try block, rollback in the catch blocks, and an entity manager close in a finally block). It all seems correct to me, so I am really, really stumped. In addition, I have also recently explicitly set the Hibernate connection release mode to after_transaction, but the issue still occurs.
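To be concrete, the pattern in our code looks roughly like the sketch below. This is a simplified illustration (the class name, factory wiring and the work inside the transaction are placeholders, not our actual code):

    import javax.persistence.EntityManager;
    import javax.persistence.EntityManagerFactory;
    import javax.persistence.EntityTransaction;

    public class OldFashionedTransactionTemplate {

        private final EntityManagerFactory emf; // created once at application startup

        public OldFashionedTransactionTemplate(EntityManagerFactory emf) {
            this.emf = emf;
        }

        public void doWork() {
            EntityManager em = emf.createEntityManager();
            EntityTransaction tx = em.getTransaction();
            try {
                tx.begin();
                // ... load, modify and persist entities here ...
                tx.commit();
            } catch (RuntimeException e) {
                if (tx.isActive()) {
                    tx.rollback();
                }
                throw e;
            } finally {
                em.close(); // closing the EntityManager should release the pooled JDBC connection
            }
        }
    }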
The other weird thing is that we run several instances of the same application for different clients, and this issue only happens regularly for one client. They are the client with by far the most data to be processed, though, and although all clients run these scheduled tasks, this big client is the only one with SAP imports. That is why I originally thought the SAP imports were the issue, but it locked up just after 1am this morning, a couple of hours before the imports even start running. In that case it locked up during an internal scheduled task.
Does anyone have any idea what could be causing this strange behavior? I have looked into everything I can think of but to no avail.
After some time and a lot of trial and error, my team and I managed to sort out this issue. It turns out that the spike in JDBC connections was not the cause of the lock-ups but a consequence of them. Terracotta was the culprit: it was simply becoming unresponsive, it seems. It might have been a resource allocation issue, but I don't think so, since this was happening on low-usage servers as well and they had more than enough resources available.
Fortunately we no longer actually needed Terracotta, so I removed it. As I said in the question, we were getting these lock-ups every couple of days, at least once per week, every week. Since removing it we have had no such lock-ups for 4 months and counting. So if anyone else experiences the same issue and is using Terracotta, try dropping it; things might come right, as they did in my case.
As coladict said, you need to look at the "Opened jdbc connections" page in the JavaMelody monitoring report, before the server "locks up".
Sorry if you have to do that at 2 or 3 in the morning, but perhaps you can run a wget command automatically during the night.
We have already deployed an application containing the heavy business logic that my company uses. After some time, performance became noticeably slower than before. In the WebLogic data source configuration we had set the maximum number of connections to only 100, but recently the connection count keeps increasing until it hits that limit.
We reconfigured the data source to 200, but the count keeps on increasing. This is not ideal, because 100 is the maximum number of connections we want the deployment to use.
Meanwhile, there were also some stuck threads in the server, but I don't think that is the problem. Does anyone know why this is suddenly occurring? (It started after deployment of a newer, supposedly stable version, they said.)
From the attached screenshot I can see that the Active Connection Count is ~80. I can also see that connections are being leaked.
Enable the Inactive Connection Timeout by setting it to some value (based on the average time taken to execute a statement).
Make sure that all JDBC connections are closed in your code after use.
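For example, a try-with-resources block guarantees the connection goes back to the pool even when a query throws. A minimal sketch, where the DAO class, DataSource lookup and query are purely illustrative:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    public class CustomerDao {

        private final DataSource dataSource; // injected or looked up from JNDI elsewhere

        public CustomerDao(DataSource dataSource) {
            this.dataSource = dataSource;
        }

        public String findName(long id) throws SQLException {
            // try-with-resources closes ResultSet, PreparedStatement and Connection
            // in reverse order even if an exception is thrown, so nothing leaks.
            try (Connection conn = dataSource.getConnection();
                 PreparedStatement ps = conn.prepareStatement(
                     "SELECT name FROM customer WHERE id = ?")) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getString("name") : null;
                }
            }
        }
    }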
I'm loading about 1 million records into Oracle using a custom Java utility. The Java utility is multi-threaded and has worked numerous times in the past with no problem. My issue is that when I start the load for the very first time it is lightning fast, around 150K objects per hour. After about an hour or two, performance drops dramatically to around 6,000 objects per hour. I'm almost certain the performance hit has something to do with Oracle, but I can't figure out what it is. The Oracle machine has 16 GB of RAM and 8 CPUs. I set the following system parameters, which have worked for me in the past:
optimizer_mode=ALL_ROWS
optimizer_index_cost_adj=10
query_rewrite_integrity=ENFORCED
pga_aggregate_target=300M
sga_target=5000M
sga_max_size=5000M
Does anyone with Oracle knowledge have an idea why my performance is great initially but drops off drastically? One additional note: if I stop the load, restart the machine, and then start the load again, I still see the 6,000 objects per hour performance. So it is only the very first load after cloning our production database that has the best performance. Hopefully someone has an idea, thanks in advance!
I assume that the load consists only of inserts and that the distribution of the data changes over time.
Or is it continuous inserts into the same table, like continuously loading Call Detail Records from a phone system?
In principle, Oracle does not easily get slower with increasing and prolonged use. But there are some ways to make it run slower:
Locks / latches
I would recommend checking that concurrent use by other Oracle sessions is not causing the problem through short-lived locks or latches. Given that these are inserts, it could be the other threads trying to insert into the same data blocks, since the distribution of the data may become different after some time.
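One way to check is to look at what the loader's sessions are waiting on while the slowdown is happening. A hedged sketch (connection details and class name are placeholders; it assumes a 10g-or-later database, where v$session exposes wait_class and blocking_session):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SessionWaitCheck {
        public static void main(String[] args) throws Exception {
            // Connection details are placeholders for this sketch.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/ORCL", "monitor", "secret");
                 Statement stmt = conn.createStatement();
                 // Non-idle wait events such as buffer busy waits or enqueue waits on the
                 // loaded table point at sessions contending for the same blocks or locks.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT sid, username, event, wait_class, blocking_session " +
                     "FROM v$session WHERE wait_class <> 'Idle' AND username IS NOT NULL")) {
                while (rs.next()) {
                    System.out.printf("sid=%d user=%s event=%s class=%s blocker=%s%n",
                        rs.getInt("sid"), rs.getString("username"), rs.getString("event"),
                        rs.getString("wait_class"), rs.getString("blocking_session"));
                }
            }
        }
    }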
Restricted inserts per block
Please check that MAXTRANS on the tables is not restricted to 1 or 2. I've seen that once, and it was really funny to see how Oracle ground down to a crawl when only one session could do something in a block.
SGA and kernel problems
With older Oracle releases (Oracle 7 and 8) I've seen numerous occasions on large systems where Oracle started to kill itself. This especially holds for multiprocessor systems, because locking/latching on an MP system is implemented differently: the other processor might get its work done shortly, so an Oracle thread first just spins a little and then tries again. Problems with SGA fragmentation or even bad locking of the SGA can also cause trouble.
Please check that the insert statements use bind variables, batches, or bypass SQL completely. You might also want to try running the load in a single thread. Is one thread stable over time (although slower)? If so, you have a locking issue somewhere. Google for locks/latches/spins and follow the scenarios listed.
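For the first point, this is roughly what bind variables plus JDBC batching look like; the table and columns are made up for illustration and the batch size is just a reasonable default:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.List;

    public class BatchLoader {

        private static final int BATCH_SIZE = 500;

        /** Inserts all records with one prepared statement (bind variables) and JDBC batching. */
        public static void load(Connection conn, List<String[]> records) throws Exception {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO loaded_objects (id, name) VALUES (?, ?)")) {
                int count = 0;
                for (String[] rec : records) {
                    ps.setString(1, rec[0]);
                    ps.setString(2, rec[1]);
                    ps.addBatch();
                    if (++count % BATCH_SIZE == 0) {
                        ps.executeBatch();   // send a full batch in one round trip
                        conn.commit();       // keep transactions (and undo) reasonably small
                    }
                }
                ps.executeBatch();           // flush the remainder
                conn.commit();
            }
        }
    }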