We just did a rolling restart of our cluster, but now every few hours it stops responding to API calls. When we make a call, we get a response like this:
curl -XGET 'http://localhost:9200/_cluster/health?pretty'
{
"error" : "OutOfMemoryError[unable to create new native thread]",
"status" : 500
}
I noticed that we can still index data fine, it seems, but cannot search or call any other API functions. This happens every few hours, and the most recent time it happened there were no logs in any of the nodes' log files.
Our cluster has 8 nodes across 5 servers (3 servers run 2 Elasticsearch processes each, 2 run 1), all on RHEL 6u5. We are running Elasticsearch 1.3.4.
This can be due to the OS not allowing more threads to be created for the user running Elasticsearch. Increasing the number of threads allowed per user can solve the problem: set the ulimit -u value higher.
Although the above helps, an even better solution is to configure Elasticsearch to use a thread pool rather than creating threads on demand, since thread creation and destruction are expensive. In fact, some (or all?) JVMs can't clean up terminated threads except during a full GC.
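For the ulimit route, a minimal sketch (assuming Elasticsearch runs under a dedicated elasticsearch user; the limit value is only an example and should be sized to your nodes):
# Check the current per-user process/thread limit for the Elasticsearch user
sudo -u elasticsearch bash -c 'ulimit -u'
# Raise it persistently by adding lines like these to /etc/security/limits.conf
# (on RHEL 6, also check /etc/security/limits.d/90-nproc.conf, which can cap nproc)
elasticsearch soft nproc 4096
elasticsearch hard nproc 4096
The new limit only applies to sessions started after the change, so the Elasticsearch processes need to be restarted from a fresh login session.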
Related
TL;DR:
JMeter crawls with just 1 test user if I select "Retrieve all embedded resources".
JMeter throws Uncaught Exception java.lang.OutOfMemoryError: Java heap space in thread Thread if I test with 100 users
How to eventually scale to test with thousands of users?
More details:
I'm recording the .jmx script using the BlazeMeter Chrome extension. The user logs in and completes a course and a test, with many AJAX requests made along the way. The script has around 350-400 steps, mostly AJAX POST requests with JSON responses.
When I click through manually in Chrome, the site loads quickly. Maybe a page load takes 2 seconds tops.
But when I import that script into JMeter with the "Retrieve all embedded resources" and "Parallel downloads" set to 6, and run it (in the GUI initially with just 1 user), it will get through, say, 7 steps quickly, and then just hang, sometimes for 10+ minutes before advancing to the next step. This doesn't happen if I uncheck "Retrieve all embedded resources", but I don't want to do that since that wouldn't be a realistic test.
If I take that same test and run it with 100 users (from the command line using JVM_ARGS='-Xms4096m -Xmx4096m' sh jmeter -n -t myfolder/mytest.jmx -l myfolder/testresults.jtl), I get Uncaught Exception java.lang.OutOfMemoryError: Java heap space in thread Thread and my computer fan goes nuts.
I have an HTTP Cache Manager configured with only "Use Cache-Control/Expires header when processing GET requests" checked, and I've lowered the "Max Number of elements in cache" down to 10, since that's the only way I can get the test running at all.
Ideally, I'd like to test thousands of users at once, but if I can't reliably get 100 users tested, I don't understand how I'm supposed to scale to thousands.
I see that there are other cloud-based testing options, and I've tried a few out now, however I always come to a halt when it comes to configuring how to test logged in users. Seems like most solutions don't support this.
Feels like the kind of thing that lots of people should have come across in the past, but I find almost no one having these issues. What's the right way to load test thousands of logged-in users on a web application?
If JMeter "hangs" for 10 minutes retrieving an embedded resource there is a problem with this specific resource, you can see the response times for each resource using i.e. View Results Tree listener and either raise a bug for this particular "slow" component or exclude it from the scope using HTTP Request Defaults. There you can also specify the timeout, if the request won't return the response within the given timeframe - it will be marked as failed so you won't have to wait for 10 minutes:
Your way of increasing the heap is correct; it looks like 4 GB is not sufficient, so you will either have to provide more or consider switching to distributed testing.
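For reference, a distributed run can be started from the same non-GUI command line by pointing the controller at one or more remote load generators (the host names below are placeholders, and each load generator needs its own heap settings):
# Start jmeter-server on each load generator first, then on the controller:
JVM_ARGS='-Xms4096m -Xmx4096m' sh jmeter -n -t myfolder/mytest.jmx \
  -R loadgen1.example.com,loadgen2.example.com \
  -l myfolder/testresults.jtl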
Make sure to follow JMeter Best Practices in any case. If your "computer fan goes nuts", CPU usage is very high, and JMeter most probably won't be able to send requests fast enough, so you will get false-negative results.
I have a simple JMeter experiment with a single Thread Group with 16 threads, running for 500s, hitting the same URL every 2 seconds on each thread, generating 8 requests/second. I'm running in non-GUI (Command Line) mode. Here is the .jmx file:
https://www.dropbox.com/s/l66ksukyabovghk/TestPlan_025.jmx?dl=0
Here is a plot of the result, running on an AWS m5ad.2xlarge / 8 cores / 32GB RAM (I get the same behavior on VirtualBox Debian on my PC, very large Hetzner server, Neocortix Cloud Services instances):
https://www.dropbox.com/s/gtp6oqy0xtuybty/aws.png?dl=0
At the beginning of the Thread Group, all 16 threads report a long response time (0.33s), then settle into a normal short response time (<0.1s). I call this the "Start of Run" problem.
Then about 220s later there is another burst of 16 long response times, and yet another burst at about 440s. I call those the "Start of Run Echo" problem, because they look like echoes of the "Start of Run" problem. The same thing occurs if I introduce another Thread Group with a delay of, say, 60s: that Thread Group gets its own "Start of Run" problem at t=60s, and then its own echoes at 280s and 500s.
These two previous posts seem related, but no conclusive cause was given for the "Start of Run" problem, and the "Start of Run Echo" problem was not mentioned.
Jmeter - The time taken by first iteration of http sampler is large
First HTTP Request taking a long time in JMeter
I can work around the "Start of Run" problem by hitting a non-existent page with the first HTTP request in each thread, getting a 404 Error, and filtering out the 404's. But that is a hack, and it doesn't solve the "Start of Run Echo" problem, which is not guaranteed to hit the non-existent pages. And it introduces "holes" in the delivered load to the real target pages.
Update: After suggestion from Dmitri T, I have installed JMeter 5.3. It has default value httpclient4.time_to_live=60000 (60s), and its output matches that:
https://www.dropbox.com/s/gfcqhlfq2h5asnz/hetzner_60.png?dl=0
But if I increase httpclient4.time_to_live to 600000 (600s), it does not push the "echoes" out past the end of the run. It still shows echoes at about 220s and 440s, i.e. the same original behavior I am trying to eliminate.
https://www.dropbox.com/s/if3q652iyiyu69b/hetzner_600.png?dl=0
I am wondering if httpclient4.time_to_live has an effective maximum value of 220000 (220s) or so.
Thank you,
Lloyd
The first request will be slow due to initial connection establishment and the SSL handshake.
Going forward, JMeter will act according to its network properties, in particular:
httpclient4.time_to_live - TTL (in milliseconds) represents an absolute value. No matter what, the connection will not be re-used beyond its TTL.
httpclient.reset_state_on_thread_group_iteration - resets HTTP state when starting a new Thread Group iteration, which means closing open connections and resetting SSL state
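For example, both properties can be overridden per run from the non-GUI command line with -J (the values below are only illustrative), or set permanently in user.properties:
# Keep connections alive for 10 minutes and do not reset HTTP state between iterations
jmeter -n -t mytest.jmx -l results.jtl \
  -Jhttpclient4.time_to_live=600000 \
  -Jhttpclient.reset_state_on_thread_group_iteration=false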
Also, it seems that you're using a rather outdated JMeter version which is 5 years old. According to JMeter Best Practices you should always be using the latest version of JMeter, so consider upgrading to JMeter 5.3 (or whatever is the latest stable version available from the JMeter Downloads page), as you might be suffering from a JMeter bug which has already been resolved.
It might also be the case that you need to perform OS and JMeter tuning; see Concurrent, High Throughput Performance Testing with JMeter for example problems and solutions.
I am running a Tomcat/Spring server on my local machine, and I'm running another program which sends POST requests to the server one after the other in quick succession. Each request causes an update on a remote Postgresql database, and finishes (flushes a response code) before the next one is sent.
After about 100 requests, the server starts taking longer to respond. After about 200 requests, it stops responding and consumes all available processing power until I manually kill it, either in Eclipse or the Windows Task Manager.
At one point the server spat out an error about garbage collection, but I haven't seen it the last few times I've tried this. When I saw this garbage collection error, I added a 5-second pause every 20 requests to try to give the server time for garbage collection, but it doesn't seem to have helped.
How can I track down and resolve the cause of this server overload?
Did you monitor memory usage? Here are some ways to track down the problem:
Track the server's memory usage over time in Task Manager.
Take thread dumps from time to time.
Use VisualVM to track CPU usage.
See if DB connections are getting closed.
There could be many other causes, but VisualVM is a good tool to start diagnosing these issues.
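For example, a minimal command-line sketch with the standard JDK tools (replace 12345 with the pid reported by jps):
jps -l                       # find the Tomcat/Spring process id
jstack 12345 > threads.txt   # thread dump: look for threads blocked on the Postgresql calls
jstat -gcutil 12345 5s       # heap/GC utilization sampled every 5 seconds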
We have a cron job that runs every hour on a backend module and creates tasks. The cron job runs queries on the Cloud SQL database, and the tasks make HTTP calls to other servers and also update the database. Normally they run great, even when thousands of tasks are created, but sometimes the job gets "stuck" and there is nothing in the logs that sheds light on the situation.
For example, yesterday we monitored the cron job while it created a few tens of tasks and then stopped, along with 8 of the tasks that also got stuck in the queue. When it was obvious that nothing was happening, we ran the process a few more times, and each time it completed successfully.
After a day the original task was killed with a DeadlineExceededException and then the 8 other tasks, that were apparently running in the same instance, were killed with the following message:
A problem was encountered with the process that handled this request, causing it to exit. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may be throwing exceptions during the initialization of your application. (Error code 104)
Until the processes were killed we saw absolutely no record of them in the logs, and now that we see them there are no log records before the time of the DeadlineExceededException, so we have no idea at what point they got stuck.
We suspected that there is some lock in the database, but we see in the following link that there is a 10 minute limit for queries, so that would cause the process to fail much sooner than one day: https://cloud.google.com/appengine/docs/java/cloud-sql/#Java_Size_and_access_limits
Our module's class and scaling configuration is:
<instance-class>B4</instance-class>
<basic-scaling>
<max-instances>11</max-instances>
<idle-timeout>10m</idle-timeout>
</basic-scaling>
The configuration of the queue is:
<rate>5/s</rate>
<max-concurrent-requests>100</max-concurrent-requests>
<mode>push</mode>
<retry-parameters>
<task-retry-limit>5</task-retry-limit>
<min-backoff-seconds>10</min-backoff-seconds>
<max-backoff-seconds>200</max-backoff-seconds>
</retry-parameters>
I uploaded some images of the trace data for the cron job:
http://imgur.com/a/H5wGG
This includes the trace summary, and the beginning/ending of the timeline.
There is no trace data for the 8 terminated tasks.
What could be the cause of this and how can we investigate it further?
We eventually managed to solve the problem with the following steps:
We split the module into two: one module to run the cron job and one module to handle the generated tasks. This allowed us to see that the problem was with handling the tasks, as that was the only module that kept getting stuck.
We limited the number of concurrent tasks to 2, which seems to be the maximum number that can be processed simultaneously without the system getting stuck.
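In the queue configuration shown above, that corresponds to lowering the concurrency limit in queue.xml (only this element changed):
<max-concurrent-requests>2</max-concurrent-requests>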
I have a Tomcat Java application running on OpenShift (1 small gear) that consists of two main parts: A cron job that runs every minute, parses information from the web and saves it into a MongoDB database, and some servlets to access that data.
After deploying the app, it runs fine, but sooner or later the server will stop and I cannot access the servlets anymore (the HTTP request takes very long, and if it finishes, it returns a Proxy Error). I can only force stop the app using the rhc command line and restart it.
When I look at the jbossews.log file, I see multiple occurrences of this error:
Exception in thread "http-bio-127.5.35.129-8080-Acceptor-0" java.lang.OutOfMemoryError:
unable to create new native thread
Is there anything I can do to prevent this error without needing to upgrade to a larger gear with more memory?
From your description I understand that there is some memory leak issue with your app. That may be because you are not stopping your threads.
Sometimes a thread will not stop automatically, and then you need to stop it explicitly.
I guess it's not a heap memory problem but an OS resource problem: you are running out of native threads, i.e. the maximum number of threads the OS lets your JVM create.
You can raise the per-user limit this way:
ulimit -u newvalue
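A sketch of what can be checked and tuned from the gear's shell (the values are examples; whether you can change them depends on the gear):
ulimit -u                          # current per-user process/thread limit on the gear
cat /proc/sys/kernel/threads-max   # system-wide thread ceiling
# If the limit cannot be raised, each thread can be made cheaper by shrinking its
# stack (the exact variable/hook for passing JVM options depends on the cartridge):
export JAVA_OPTS="$JAVA_OPTS -Xss256k"
# Capping Tomcat's pool (maxThreads in server.xml) also keeps the limit from being hit.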