JMeter 5.3 java.lang.OutOfMemoryError during JMeter execution

I have configured a Test Plan in JMeter, shown below in the image, and have been using the CLI to run my parallel load tests. I am on a Mac.
I have configured a connection to my AWS Redshift database, but when I check the query monitoring, all of the queries get stuck in a Running state.
After some time, on my terminal, I get the following error: JMeter 5.3 java.lang.OutOfMemoryError.
I have gone into my bin/jmeter file and have made the memory changes, but I am still facing the same issue.
When I run the same queries from DBeaver, they run to completion and can be seen in the Redshift query monitoring.
How can I solve the memory problem so that the queries run without getting stuck in a Running state?
Below is the error I am getting even after increasing the heap size to 5 gigabytes.
WARNING: package sun.awt.X11 not in java.desktop
Creating summariser <summary>
Created the tree successfully using //Users/mbyousaf/Desktop/redshit-test/test-redhsift.jmx
Starting standalone test # Wed Dec 02 14:53:17 GMT 2020 (1606920797442)
Waiting for possible Shutdown/StopTestNow/HeapDump/ThreadDump message on port 4445
Warning: Nashorn engine is planned to be removed from a future JDK release
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid35596.hprof ...
Heap dump file created [3071802740 bytes in 3.747 secs]

Which exact OutOfMemoryError? There are several possible reasons:
Lack of heap space: if this is the case, you're looking in the right place, just make sure that your changes are actually applied (see the HEAP sketch after this list)
GC Overhead Limit Exceeded: occurs when the GC is executing almost 100% of the time, not leaving the program any chance to do its job
Requested array size exceeds VM limit: the program tries to create an object that is too large
Unable to Create New Native Thread: the program cannot create a new thread because the operating system doesn't allow it
and so on
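As a sketch of what the heap setting in the bin/jmeter startup script (bin/jmeter.bat on Windows) typically looks like after raising the limit to 5 GB (the exact values here are assumptions, not taken from your setup):
HEAP="-Xms1g -Xmx5g -XX:MaxMetaspaceSize=256m"
After editing, restart JMeter and double-check the running java process arguments (for example with ps or jcmd) to confirm the new -Xmx value is actually in effect.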
It's not possible to state what's wrong without seeing your full test plan (at least a screenshot), as it might be the case that you added tons of Listeners and each of them stores the large DB query responses in memory. Please also share your jmeter.log file (definitely not in the form of a screenshot), which in the majority of cases contains either the cause of the problem or at least a clue.

Related

Profiling memory leak in a non-redundant uptime-critical application

We have a major challenge which has been stumping us for months now.
A couple of months ago, we took over the maintenance of a legacy application where the last developer to touch the code left the company several years ago.
This application needs to be more or less always online. It was developed many years ago, without staging or test environments and without a redundant infrastructure setup.
We're dealing with a legacy Java EJB application running on Payara application server (Glassfish derivative) on an Ubuntu server.
Within the last year or two, it has been necessary to restart Payara approximately once a week, and the Ubuntu server once a month.
This is due to a memory leak which slows down the application over a period of around a week. The GUI becomes almost entirely non-responsive, but a restart of Payara fixes this, at least for a while.
However after each Payara restart, there is still some kind of residual memory use. The baseline memory usage increases, thereby reducing the time between Payara restarts. Around every month, we thus do a full Ubuntu reboot, which fixes the issue.
Naturally we want to find the memory leak, but we are unable to run a profiler on the server because it's resource intensive, and would need to run for several days in order to capture the memory leak.
We have also tried several times to dump the heap using the "gcore" command, but it always results in a segfault, after which we need to reboot the Ubuntu server.
What other options / approaches do we have to figure out which objects in the heap are not being garbage collected?
I would try to clone the server in some way to another system where you can perform tests without clients being affected. It could even be a system with fewer resources, if you want to trigger a resource-based problem.
To be able to observe the memory leak without having to wait for days, I would create a load test, maybe with Apache JMeter, to simulate accesses of a week within a day or even hours or minutes (don't know if the base load is at a level where that is feasible from the server and network infrastructure).
First you could set up the load test to produce a "regular" mix of requests like those seen in the wild. Once you can trigger the loss of responsiveness, you can try to find out whether there are specific requests that are more likely to cause the leak than others. (It could also be that some basic component reused in nearly every call contains the leak, in which case you cannot single out "the" call with the leak.)
Then you can instrument this test server with a profiler.
To get another approach (you could do it in parallel) you also can use a static code inspection tool like SonarQube to analyze the source code for typical patterns of memory leaks.
And one other idea comes to mind, but it comes with many preconditions: if you have recorded typical scenarios for the backend calls, if you have enough development resources, and if it is a stateless web application where each call can be inspected more or less individually, then you could try to set up partial integration tests in which you simulate the incoming web calls, with database and file access but, if possible, without the application server, and record the increase in heap usage after each of the calls. Statistically you might be able to find the "bad" call this way. (So this is something I would try only as a very last option.)
Apart from heap dumps, have you tried any real-time application performance monitoring (APM), like AppDynamics or an open-source alternative such as https://github.com/scouter-project/scouter?
An alternative approach would be to look for known issues in the existing stack, e.g. Payara issues like https://github.com/payara/Payara/issues/4098, or maybe in the Ubuntu patch level you are currently running the app on.
You can use jmap, a tool bundled with the JDK, to check the memory. From the documentation:
jmap prints shared object memory maps or heap memory details of a given process or core file or a remote debug server.
For more information you can see the documentation or the Stack Overflow question How to analyse the heap dump using jmap in java.
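For example, a hedged sketch of dumping the heap of the running Payara JVM (replace <pid> with the actual process id from jps or ps; the output file name is a placeholder):
jmap -dump:live,format=b,file=payara_heap.hprof <pid>
Note that, like gcore, a full heap dump pauses the JVM while it is being written, so it is best done in a low-traffic window.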
There is also a tool called jhat which can be used to analyse the Java heap.
From the documentation:
The jhat command parses a java heap dump file and launches a webserver. jhat enables you to browse heap dumps using your favorite webbrowser. jhat supports pre-designed queries (such as 'show all instances of a known class "Foo"') as well as OQL (Object Query Language) - a SQL-like query language to query heap dumps. Help on OQL is available from the OQL help page shown by jhat. With the default port, OQL help is available at http://localhost:7000/oqlhelp/
See the jhat documentation, or How to analyze the heap dump using jhat.
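A minimal sketch of using it on the dump produced above (jhat ships with JDK 8 and earlier; the file name is a placeholder):
jhat -port 7000 payara_heap.hprof
Then browse to http://localhost:7000 to inspect instance counts and run OQL queries against the dump.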

How to check why job gets killed on Google Dataflow (possible OOM)

I've got a simple task. I've got a bunch of files (~100 GB in total), where each line represents one entity. I have to send each entity to a JanusGraph server.
2018-07-07_05_10_46-8497016571919684639 <- job id
After a while, I am getting an OOM; the logs say that Java got killed.
From the Dataflow view, I can see the following logs:
Workflow failed. Causes: S01:TextIO.Read/Read+ParDo(Anonymous)+ParDo(JanusVertexConsumer) failed., A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
From stackdriver view, I can see: https://www.dropbox.com/s/zvny7qwhl7hbwyw/Screenshot%202018-07-08%2010.05.33.png?dl=0
Logs are saying:
E Out of memory: Kill process 1180 (java) score 1100 or sacrifice child
E Killed process 1180 (java) total-vm:4838044kB, anon-rss:383132kB, file-rss:0kB
More here: https://pastebin.com/raw/MftBwUxs
How can I debug what's going on?
There is too little information to debug the issue right now, so I am providing general information about Dataflow.
The most intuitive way for me to find the logs is going to Google Cloud Console -> Dataflow -> Select name of interest -> upper right corner (errors + logs).
More detailed information about monitoring is described here (in beta phase).
Some basic clues to troubleshoot the pipeline, as well as the most common error messages, are described here.
If you are not able to fix the issue, update the post with the error information please.
UPDATE
Based on the deadline-exceeded error and the information you shared, I think your job is "shuffle-bound" and running into memory exhaustion. According to this guide:
Consider one of, or a combination of, the following courses of action:
Add more workers. Try setting --numWorkers with a higher value when you run your pipeline.
Increase the size of the attached disk for workers. Try setting --diskSizeGb with a higher value when you run your pipeline.
Use an SSD-backed persistent disk. Try setting --workerDiskType="compute.googleapis.com/projects//zones//diskTypes/pd-ssd" when you run your pipeline.
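As an illustrative sketch of passing those flags when launching the pipeline (the main class, jar name, project and zone are placeholders, not values from your job):
java -cp my-pipeline.jar com.example.MyPipeline --runner=DataflowRunner --numWorkers=10 --diskSizeGb=100 --workerDiskType=compute.googleapis.com/projects/my-project/zones/us-central1-b/diskTypes/pd-ssd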
UPDATE 2
For specific OOM errors you can use:
--dumpHeapOnOOM will cause a heap dump to be saved locally when the JVM crashes due to OOM.
--saveHeapDumpsToGcsPath=gs://<path_to_a_gcs_bucket> will cause the heap dump to be uploaded to the configured GCS path on next worker restart. This makes it easy to download the dump file for inspection. Make sure that the account the job is running under has write permissions on the bucket.
Please take into account that heap dump support has some overhead cost and dumps can be very large. These flags should only be used for debugging purposes and always disabled for production jobs.
Find other references on DataflowPipelineDebugOptions methods.
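A hedged sketch of setting the same options programmatically through DataflowPipelineDebugOptions; the package paths assume the Apache Beam Java SDK (the older Dataflow 1.x SDK has the same interface under a different package), and the bucket name is a placeholder:
import org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DebugOptionsExample {
    public static void main(String[] args) {
        // Parse any --flags passed on the command line, then narrow to the debug options view.
        DataflowPipelineDebugOptions options =
                PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineDebugOptions.class);
        options.setDumpHeapOnOOM(true);                                  // dump the heap when the JVM OOMs
        options.setSaveHeapDumpsToGcsPath("gs://my-debug-bucket/dumps"); // placeholder GCS bucket
        // Build and run the pipeline with these options as usual, e.g. Pipeline.create(options).
    }
}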
UPDATE 3
I did not find public documentation about this, but I tested that Dataflow scales the JVM heap size with the machine type (workerMachineType), which could also fix your issue. I am with GCP Support, so I filed two documentation requests (one for a description page and another one for a Dataflow troubleshooting page) to update the documents to introduce this information.
On the other hand, there is this related feature request which you might find useful. Star it to make it more visible.

Java job gives OOM error inconsistently

I have scheduled (via cron) a jar file on a Linux box. The jar connects to a Hive server over JDBC and runs a select query, after which I write the selected data to a CSV file. The daily data volume is around 150 million records, and the CSV file is approximately 30 GB in size.
Now, this job does not complete every time it is invoked, and it results in only part of the data being written. I checked the PID for errors with dmesg | grep -E 31866 and I can see:
[1208443.268977] Out of memory: Kill process 31866 (java) score 178 or sacrifice child
[1208443.270552] Killed process 31866 (java) total-vm:25522888kB, anon-rss:11498464kB, file-rss:104kB, shmem-rss:0kB
I am invoking my jar with memory options like:
java -Xms5g -Xmx20g -XX:+UseG1GC -cp jarFile
I want to know what exactly the error text means, and whether there is any solution I can apply to ensure my job will not run out of memory. The weird thing is that the job does not fail every time; its behaviour is inconsistent.
That message is actually from the Linux kernel, not your job. It means that your system ran out of memory and the kernel killed your job to resolve the problem (otherwise you'd probably get a kernel panic).
You could try modifying your app to lower its memory requirements, e.g. load and write your data incrementally, or write a distributed job that performs the needed transformations on the cluster rather than on a single machine.
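As an illustration of the incremental approach, here is a minimal sketch, assuming the Hive JDBC driver is on the classpath; the connection URL, credentials, query and file name are placeholders. The key point is setting a fetch size and writing rows as they arrive instead of materialising the whole result set in memory:
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveToCsv {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "user", "password"); // placeholder URL/credentials
             Statement stmt = conn.createStatement();
             BufferedWriter out = new BufferedWriter(new FileWriter("daily_export.csv"))) {
            stmt.setFetchSize(10_000); // ask the driver to stream rows in batches instead of buffering everything
            try (ResultSet rs = stmt.executeQuery("SELECT col1, col2 FROM my_table")) { // placeholder query
                while (rs.next()) {
                    out.write(rs.getString(1) + "," + rs.getString(2));
                    out.newLine(); // each row is written immediately, so heap usage stays flat
                }
            }
        }
    }
}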

Elasticsearch log file huge size performance degradation

I am using RoR to develop an application, together with a gem called searchkick; this gem internally uses Elasticsearch. Everything works fine, but in production we faced a weird issue: after some time the site goes down. The reason, we discovered, was that memory on the server was being overused. We deleted some Elasticsearch log files from the previous week and found that memory use dropped from 92% to 47%. We use rolling logging, and logs are backed up each day. The problem we are now facing is that, with only one log file from the previous day, memory still grows higher. The log files are taking up a lot of space; even the current one takes 4 GB! How can I prevent that?
The messages are almost all at WARN level.
[00:14:11,744][WARN ][cluster.action.shard ] [Abdul Alhazred] [?][0] sending failed shard for [?][0], node[V52W2IH5R3SwhZ0mTFjodg], [P], s[INITIALIZING], indexUUID [4fhSWoV8RbGLj5jo8PVoxQ], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[?][0] failed recovery]; nested: EngineCreationFailureException[[?][0] failed to create engine]; nested: LockReleaseFailedException[Cannot forcefully unlock a NativeFSLock which is held by another indexer component: /usr/lib64/elasticsearch-1.1.0/data/elasticsearch/nodes/0/indices/?/0/index/write.lock]; ]]
Looking at some of the SO questions, I'm trying to increase the ulimit or create a new node, so that the problem is solved and the size is reduced. My limits.conf has 65535 for hard and soft nofile. Also, in sysctl.conf, fs.file-max is more than 100000. Is there any other step I could take to reduce the file size? Moreover, I'm not able to get insight into which Elasticsearch config changes to make.
If anyone could help, thanks.
I suggest an upgrade to at least 1.2.4, because of some file locking issues reported in Lucene: http://issues.apache.org/jira/browse/LUCENE-5612, http://issues.apache.org/jira/browse/LUCENE-5544.
Yes, Elasticsearch and Lucene are both resource-intensive. I did the following to rectify my system:
If you start Elasticsearch from the command line (bin/elasticsearch), stop it and specify the heap size when starting it again. For example, I use a 16 GB box, so my command is:
a. bin/elasticsearch -Xmx8g -Xms8g
b. Go to the config (elasticsearch/config/elasticsearch.yml) and ensure that
bootstrap.mlockall: true
c. Increase ulimit -Hn and ulimit -Sn to more than 200000
If you start it as a service, then do the following:
a. export ES_HEAP_SIZE=10g
b. Go to the config (/etc/elasticsearch/elasticsearch.yml) and ensure that
bootstrap.mlockall: true
c. Increase ulimit -Hn and ulimit -Sn to more than 200000
Make sure that the heap size you set is not more than 50% of the machine's RAM, whether you start it as a service or from the command line.
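As a quick sketch of step c (run in the shell that starts Elasticsearch, as that user, before launching the process):
ulimit -Hn 200000
ulimit -Sn 200000
You can then check whether mlockall actually took effect with curl http://localhost:9200/_nodes/process?pretty (assuming the default HTTP port) and looking for "mlockall": true in the response.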

Java out of memory errors along with HTMLUnit errors in Jmeter

I'm really new here and to Java in general, since I am only a QA tester, so I will try to make this short. I found a solution to one of the problems we have had testing a website: Testing an AJAX website using JMeter. I managed to get JMeter to run a JUnit request recorded with Selenium using HtmlUnit, after compiling the needed jar file with Maven, in order to try to load test a website.
Everything runs fine until I get to a somewhat "load worthy" number of threads (we are a large enough company and have multiple machines, so this shouldn't be an issue if each could handle 200-300 requests), but after setting the threads to 50 I am getting some errors in the JMeter cmd along with CSS errors in the JMeter error console (they may not be related, but they appear at the same time):
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid6832.hprof ...
Heap dump file created [817281280 bytes in 13.116 secs]
Logging Error: Unknown error writing event.
java.lang.OutOfMemoryError: GC overhead limit exceeded
Logging Error: Unknown error writing event.
java.lang.OutOfMemoryError: GC overhead limit exceeded
Logging Error: Unknown error writing event.
java.lang.OutOfMemoryError: GC overhead limit exceeded
Logging Error: Unknown error writing event.
java.lang.OutOfMemoryError: GC overhead limit exceeded
Logging Error: Unknown error writing event.
java.lang.OutOfMemoryError: GC overhead limit exceeded
2013/02/04 14:01:10 WARN - com.gargoylesoftware.htmlunit.DefaultCssErrorHandler: CSS warning: 'Website URL+CSS PATH' [2447:70] Ignoring the whole rule.
2013/02/04 14:01:10 WARN - com.gargoylesoftware.htmlunit.DefaultCssErrorHandler: CSS error: 'Website URL+CSS PATH' [1684:57] Error in pseudo class or element. (Invalid token "2". Was expecting one of: <S>, <IDENT>, ")".)
These are the errors I get. I am guessing it is having a hard time getting the requests through (since I know HtmlUnit acts as if it were a browser, and sending 100 requests at a time using that browser may be demanding too much of my current resources).
So, the question is: I just wanted to make sure I was correct in assessing the problems I am seeing, and I am also wondering whether anyone has done something like this before and has found a better way of handling these kinds of tests.
Most probably you're using JMeter in GUI mode for your load test, which is an anti-pattern.
I guess you have a View Results Tree listener in your plan; this component ends up accumulating all results and eventually runs out of memory.
Since recent versions of JMeter, only the last 500 sample results are kept.
So upgrade to JMeter 4.0 or 5.0 AND use non-GUI mode:
jmeter -n -t testplan.jmx -l results.csv -e -o report_folder
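For reference, the 500-result retention mentioned above is controlled by a JMeter property that can be tuned in user.properties; this line is a sketch showing the default value:
view.results.tree.max_results=500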
