I ran into a problem with Solr going OutOfMemory. The situation is as follows. We had 2 Amazon EC2 small instances (3.5 GB) each running a Spring/BlazeDS backend in Tomcat 6 (behind a load balancer). Each instance has its own local Solr instance. The index size on disk is about 500 MB. The JVM settings had been unchanged for months (-Xms512m, -Xmx768m). We use Solr to find people based on properties they entered in their profile and documents they uploaded. We're not using the Solr update handler, only select. Updates are done using delta imports. The Spring app in each Tomcat instance has a job that triggers the /dataimport?command=delta-import handler every 30 seconds.
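For illustration, such a trigger job can be as small as the following sketch (assuming Spring 3's @Scheduled and RestTemplate, and that scheduling is enabled in the Spring config; the class name and URL are placeholders, not the actual code):

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;

@Component
public class SolrDeltaImportJob {

    private final RestTemplate restTemplate = new RestTemplate();

    // Illustrative URL; each Tomcat talks to its local Solr instance.
    private static final String DELTA_IMPORT_URL =
            "http://localhost:8983/solr/dataimport?command=delta-import";

    // Fire the DataImportHandler delta-import every 30 seconds.
    @Scheduled(fixedDelay = 30000)
    public void triggerDeltaImport() {
        restTemplate.getForObject(DELTA_IMPORT_URL, String.class);
    }
}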
This worked well for months, possibly for over a year if I'm correct (I haven't been on the project that long). CPU load was minimal, with only occasional peaks.
The past week we suddenly had OutOfMemory crashes of Solr on both machines. I reviewed my changes over the past few weeks, but none of them seemed related to Solr. Bug fixes in the UI, something email related, but again: nothing in the Solr schema or queries.
Today, we changed the EC2 instances to m1.large (7.5 GB) and the Solr JVM settings to -Xms2048m / -Xmx3072m. This helped a bit; they run for 3 or 4 hours, but eventually they crash too.
Oh, and the dataset (number of rows, documents, entities, etc.) did not change significantly. There is constant growth, but it doesn't make sense to me that it still crashes when I triple the JVM memory...
The question: do you have any directions to point me in?
Measure, don't guess. Instead of guessing what has changed and what could be causing your problems, you would be better off attaching a memory leak detection tool, e.g. Plumbr. Run your Solr instance with the tool attached and see whether it tells you the exact cause of the memory leak.
Take a look at your Solr cache settings. Reducing the size of the document cache helped us stabilise a Solr 3.6 server that was also experiencing OutOfMemory errors. The query result cache size may also be relevant in your case; it was not in mine. There is a configuration sketch after the links below.
You can see your Solr cache usage on the admin page for your core:
http://localhost:8983/solr/core0/admin/stats.jsp#cache
(Replace core0 with the name of your Solr core)
documentCache
https://wiki.apache.org/solr/SolrCaching#documentCache
queryResultCache
https://wiki.apache.org/solr/SolrCaching#queryResultCache
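If you want to experiment with smaller caches, the sizes are configured inside the <query> section of solrconfig.xml. A minimal sketch (the numbers are placeholders to show the knobs, not recommendations; tune them against the hit ratios you see on the stats page):

<!-- solrconfig.xml, inside the <query> section -->
<documentCache class="solr.LRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>

<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="128"/>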
We have a major challenge which has been stumping us for months now.
A couple of months ago, we took over the maintenance of a legacy application; the last developer to touch the code left the company several years ago.
This application needs to be more or less always online. It was developed many years ago without staging or test environments, and without a redundant infrastructure setup.
We're dealing with a legacy Java EJB application running on Payara application server (Glassfish derivative) on an Ubuntu server.
Within the last year or two, it has been necessary to restart Payara approximately once a week, and the Ubuntu server once a month.
This is due to a memory leak which slows down the application over a period of around a week. The GUI becomes almost entirely non-responsive, but a restart of Payara fixes this, at least for a while.
However, after each Payara restart there is still some kind of residual memory use. The baseline memory usage increases, thereby reducing the time between Payara restarts. Roughly every month we therefore do a full Ubuntu reboot, which fixes the issue.
Naturally we want to find the memory leak, but we are unable to run a profiler on the server because it's resource intensive, and would need to run for several days in order to capture the memory leak.
We have also tried several times to dump the heap using the "gcore" command, but it always results in a segfault and then we need to reboot the Ubuntu server.
What other options / approaches do we have to figure out which objects in the heap are not being garbage collected?
I would try to clone the server in some way to another system where you can perform tests without clients being affected. It could even be a system with fewer resources, if you want to trigger a resource-based problem.
To be able to observe the memory leak without having to wait for days, I would create a load test, maybe with Apache JMeter, to simulate a week's worth of accesses within a day, or even within hours or minutes (I don't know whether the base load is at a level where that is feasible for the server and network infrastructure).
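Once a test plan exists, JMeter can replay it unattended from the command line, which makes it easy to rerun the compressed "week" as often as needed (file names are placeholders):

jmeter -n -t weekly-load.jmx -l results.jtl
(-n = non-GUI mode, -t = test plan, -l = results file)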
First you could set up the load test to act as a "regular" mix of requests, like those seen in the wild. Once you can trigger the loss of responsiveness, you can try to find out whether specific requests are more likely to cause the leak than others. (It could also be that some basic component reused in nearly every call contains the leak, in which case you cannot pin down "the" call with the leak.)
Then you can instrument this test server with a profiler.
As another approach (which you could pursue in parallel), you can use a static code analysis tool like SonarQube to scan the source code for typical memory leak patterns.
One other idea comes to mind, but it comes with many preconditions: if you have recorded typical scenarios for the backend calls, if you have enough development resources, and if it is a stateless web application where each call can be inspected more or less individually, then you could set up partial integration tests in which you simulate the incoming web calls, with database and file access but, if possible, without the application server, and record the increase in heap usage after each call. Statistically you might be able to find the "bad" call this way. (So this is something I would try only as a last resort.)
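A rough sketch of such a per-call heap probe, using only the standard java.lang.management API (class and method names are made up; the idea is just to run one simulated call and report how much heap it leaves behind):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapGrowthProbe {

    private final MemoryMXBean memory = ManagementFactory.getMemoryMXBean();

    /** Runs one simulated backend call and returns the heap delta it left behind (bytes). */
    public long measure(Runnable simulatedCall) {
        memory.gc();                                   // request a GC so we compare "settled" heap sizes
        long before = memory.getHeapMemoryUsage().getUsed();
        simulatedCall.run();
        memory.gc();
        long after = memory.getHeapMemoryUsage().getUsed();
        return after - before;                         // persistent growth hints at a leaking call
    }
}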
Apart from heap dumps, have you tried any real-time application performance monitoring (APM) tool, such as AppDynamics, or an open-source alternative like https://github.com/scouter-project/scouter?
An alternative approach would be to look at known issues in the existing stack, e.g. Payara issues like https://github.com/payara/Payara/issues/4098, or perhaps the Ubuntu patch level you are currently running the app on.
You can use jmap, an executable bundled with the JDK, to check the memory. From the documentation:
jmap prints shared object memory maps or heap memory details of a given process or core file or a remote debug server.
For more information you can see the documentation, or the Stack Overflow question How to analyse the heap dump using jmap in java.
There is also a tool called jhat which can be used to analyse the Java heap.
From the documentation:
The jhat command parses a java heap dump file and launches a webserver. jhat enables you to browse heap dumps using your favorite webbrowser. jhat supports pre-designed queries (such as 'show all instances of a known class "Foo"') as well as OQL (Object Query Language) - a SQL-like query language to query heap dumps. Help on OQL is available from the OQL help page shown by jhat. With the default port, OQL help is available at http://localhost:7000/oqlhelp/
See the JHat documentation, or How to analyze the heap dump using jhat.
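The typical workflow combining the two tools looks roughly like this (replace <pid> with the JVM process id; the dump file name is arbitrary):

jmap -dump:live,format=b,file=heap.hprof <pid>
jhat -port 7000 heap.hprof

The first command writes a binary dump of the live objects; the second parses it and serves the browsable view at http://localhost:7000/.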
Problem description: We have a web application which is used by 200-300 people per day. The application slows down two or three times a day at certain hours, with the home page load time going from 6-7 seconds to 11-13 seconds. This application is deployed on JBoss AS 7.2. There are 4-5 other applications deployed on the same JBoss instance (same port number). These applications are web services (REST & SOAP web services used by other applications of the same company, which I'm not familiar with) and they use the same database as the main application that is having the slowness issues. The application is built with the following technology stack:
Frontend: Angular JS, Angular UI, JqueryUI, JSON
Backend: Spring REST controllers, Java 7, JDBC
Database: Oracle 11g, PL/SQL
It's been only 4 months since the application response time has soared. We had a production release 4 months ago, in which lots of data filtering is done on the basis of certain parameters. This code is implemented in PL/SQL. Some filtering of the data is also done in the front end. The response time has increased since this release. (Note: during this period the number of users and the amount of data have also grown by a significant amount.)
So far I have tried to improve performance by minifying JavaScript files, reducing the downloaded content from 2.8 MB to 1.2 MB. I have also optimised some of the queries used for data filtering. I have been able to bring the home page load time down to an average of 9-10 seconds, which is still well above the client's expectation.
I would like to know how to tackle this kind of issue, and what things I should bear in mind that might be causing this problem.
At present the production JVM configuration is -Xms 64 MB, -Xmx 256 MB. Will increasing the memory help?
Should I remove the PL/SQL code and write Java code with multithreading instead?
During peak times CPU usage gets quite high, around 85-95 percent. The main tables are used by many applications (e.g. a cron job that calls a Java program to send email notifications). What can be done about that?
I have fixed this issue now. As per the comments and suggestions, I timed queries to the database, checked the database logs and monitored daily CPU usage. I did the same for the application server and analysed it using jvisualvm.
I did a bit of everything: minimising static content, optimising queries, removing unnecessary logging. The significant change, however, came from JVM tuning (heap size -Xms1024m, -Xmx1536m, PermGen 512 MB, among other things). Performance has now improved a lot, bringing the average home page load time (after login) down to 4-5 seconds (from 10-13 seconds).
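For reference, on JBoss AS 7 these flags normally live in bin/standalone.conf as part of JAVA_OPTS; a sketch with the values mentioned above (adjust to your own sizing):

# bin/standalone.conf -- heap and PermGen sizing used after the fix
JAVA_OPTS="-Xms1024m -Xmx1536m -XX:MaxPermSize=512m $JAVA_OPTS"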
There is still room for improvement on the database side. Some queries and PL/SQL blocks have to be optimised. But it's nonetheless much better than before.
I'm loading about 1 million records into Oracle using a custom Java utility. The Java utility is multi-threaded and has worked numerous times in the past with no problem. My issue is that when I start the load for the very first time, it is lightning fast, around 150K objects per hour. After about an hour or two the performance drops sharply to around 6,000 objects per hour. I'm almost certain that my performance hit has something to do with Oracle, but I can't figure out what it is. The Oracle machine has 16 GB of RAM and 8 CPUs. I set the following system parameters, which have worked for me in the past:
optimizer_mode=ALL_ROWS
optimizer_index_cost_adj=10
query_rewrite_integrity=ENFORCED
pga_aggregate_target=300M
sga_target=5000M
sga_max_size=5000M
Does anyone have enough Oracle knowledge to know why my performance is great initially but drops off drastically? One additional note: if I stop the load, restart the machine, then start the load again, I still see the 6,000 objects per hour performance. So it's always the very first load after cloning our production database that has the best performance. Hopefully someone has an idea, thanks in advance!
I assume that the load is only inserts and that the distribution of the data changes over time.
Or is it continuous inserts into the same table, like continuously loading Call Detail Records from a phone system?
In principle Oracle does not easily get slower with increasing and lasting use. But there are some ways to make it run slower:
Locks / latches
I would recommend checking that concurrent use by other Oracle sessions is not causing the problems through short locks or latches. Given that these are inserts, it could be the other threads trying to insert into the same data blocks, since the distribution of the data might become different after some time.
Restricted inserts per block
Please check that MAXTRANS on the tables is not restricted to 1 or 2. I've seen that once, and it was remarkable to watch Oracle slow to a crawl when only one session at a time could do something in a block.
SGA and kernel problems
With older Oracle releases (Oracle 7 and 8) I've seen numerous occasions on large systems where Oracle started to kill itself. This especially holds for multiprocessor systems, because locking/latching on an MP system is implemented differently: the other processor might get its work done, so an Oracle thread first just spins a little and then tries again. Also, problems with SGA fragmentation or even bad locking of the SGA can cause problems.
Please check that the insert statements use bind variables and batches, or bypass SQL completely. You might also want to try running it in one thread. Is one thread stable over time (although slower)? If so, you have a locking issue somewhere. Google for locks/latches/spins and follow the scenarios listed.
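For comparison, a minimal sketch of what "bind variables plus batching" looks like from the Java side with plain JDBC (table and column names are made up; batch size is illustrative):

import java.sql.Connection;
import java.sql.PreparedStatement;

public class BatchLoader {

    private static final int BATCH_SIZE = 500;

    public void load(Connection conn, Iterable<String[]> rows) throws Exception {
        conn.setAutoCommit(false);
        // One parsed statement reused for every row: the ? bind variables avoid repeated hard parses.
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO loaded_objects (id, name) VALUES (?, ?)")) {
            int count = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch();   // one round trip per batch instead of per row
                    conn.commit();
                }
            }
            ps.executeBatch();           // flush the remaining rows
            conn.commit();
        }
    }
}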
We have a Spring/Hibernate JPA web application in production. There is a suspected memory leak in session objects. We are uploading Excel records into MySQL using Apache POI. We commit every 10 records, but each commit pauses for 5 to 10 seconds and CPU stays at almost 100% throughout the import process. Is there any way to profile the Hibernate sessions in my application and find out what is causing such high CPU usage? I was looking at the Hibernating Rhinos Hibernate Profiler, but it seems confusing to configure and needs changes to the code. Since we need to profile a production or staging instance, is there any JPA/Hibernate session profiler that doesn't require many changes to the application configuration or code?
Use Visual VM, with all the plug-ins installed. Attach it to your app's JVM PID when you start it up. It'll show you memory, threads, and lots more.
I don't think it'll be a good idea to profile on a production server. Put that code on another box and run a significant load for a long period of time. That'll show the problem.
Is this the JConsole plugin you're looking for?
http://hibernate-jcons.sourceforge.net/
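Whichever tool you end up with, Hibernate's own statistics are a cheap first look that needs no changes to your entity code: set hibernate.generate_statistics=true in the configuration and read the Statistics object from the SessionFactory. A minimal sketch (the logging below is only illustrative):

import org.hibernate.SessionFactory;
import org.hibernate.stat.Statistics;

public class HibernateStatsLogger {

    public void dump(SessionFactory sessionFactory) {
        Statistics stats = sessionFactory.getStatistics();
        // A growing gap between opened and closed sessions hints at a session leak.
        System.out.println("Sessions opened:  " + stats.getSessionOpenCount());
        System.out.println("Sessions closed:  " + stats.getSessionCloseCount());
        System.out.println("Queries executed: " + stats.getQueryExecutionCount());
        System.out.println("Slowest query:    " + stats.getQueryExecutionMaxTime() + " ms");
    }
}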
On 32-bit systems the JVM has a memory limit of roughly 1.5 to 2 GB. What is a good value for JVM memory on 64-bit Linux? And how can that be mapped to the maximum number of threads and the maximum number of requests in Tomcat?
I am using JDK 6+ and Tomcat 7. The available RAM will be 12 GB on a quad-core processor.
MRD
I don't think there's an out-of-the-box answer to this question. It depends heavily on what kind of applications you are going to host and how much load there is going to be on your system. I administer a small server with 3-4 applications on a 64-bit Linux system, and 4 GB is more than enough for me.
My advice: make a rough guess at how much RAM your applications require. Then start Tomcat with a monitoring tool attached and watch how much load there is on your Tomcat. You might have allocated too many resources for Tomcat, or maybe too few; you never know.
Please read this article on simultaneous users, and also the article about load balancing in Tomcat.
Basically you have to differentiate between users and requests. You might have 5,000 users browsing your site, but only 100 requesting a new page at any given moment. By default Tomcat supports a limited number of concurrent requests (the documented default for the HTTP connector's maxThreads in Tomcat 7 is 200), and this number can be changed in your Tomcat configuration. Obviously you might need more hardware. In the second article, a maximum of 200 requests per Tomcat instance is recommended. Simple calculation rules, as mentioned in the second article, together with some monitoring will help.
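The knob for this is the maxThreads attribute on the connector in Tomcat's conf/server.xml; a sketch (the numbers are just examples, not recommendations):

<!-- conf/server.xml: cap on concurrent request-processing threads -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           maxThreads="200"
           acceptCount="100"
           redirectPort="8443" />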
There's even a load balancer manager for tomcat. Check it out
load balancer for tomcat
One more thing to consider: even if you have the hardware and the right load balancing to support 5,000 users, you also need enough bandwidth to do so. This is again explained in the second article, "load balancing in Tomcat".
Good luck
It depends on how many users will visit your application simultaneously.
Sometimes the app will run very slowly at one particular point in time; for instance, at 8:00 AM a burst of login actions can bring the app to its knees.
I suggest you estimate the average memory per user and multiply it by the peak number of simultaneous users; that should get you to a nearly right memory setting.
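For example (purely illustrative numbers): if at most 500 users are active at the same time and each active session holds around 2 MB of state, the sessions alone need roughly 1 GB of heap; add the application's own baseline plus some headroom, and you end up with an -Xmx somewhere in the 1.5-2 GB range.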