Elasticsearch log file huge size performance degradation

I am using RoR to develop an application with a gem called searchkick, which uses Elasticsearch internally. Everything works fine, but in production we hit a weird issue: after some time the site goes down. We discovered that the memory on the server was being overused. We deleted some Elasticsearch log files from the previous week, and memory use dropped from 92% to 47%. We use rolling logs, which are backed up each day. The problem now is that even with only one log file from the previous day, memory use keeps growing. The log files take up a lot of space; even the current one is 4GB. How can I prevent that?
Almost all of the messages are at WARN level.
[00:14:11,744][WARN ][cluster.action.shard ] [Abdul Alhazred] [?][0] sending failed shard for [?][0], node[V52W2IH5R3SwhZ0mTFjodg], [P], s[INITIALIZING], indexUUID [4fhSWoV8RbGLj5jo8PVoxQ], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[?][0] failed recovery]; nested: EngineCreationFailureException[[?][0] failed to create engine]; nested: LockReleaseFailedException[Cannot forcefully unlock a NativeFSLock which is held by another indexer component: /usr/lib64/elasticsearch-1.1.0/data/elasticsearch/nodes/0/indices/?/0/index/write.lock]; ]]
Looking at some of the SO questions, I'm trying to increase the ulimit or create a new node, so that the problem is solved and the log size shrinks. My limits.conf has 65535 for both hard and soft nofile, and fs.file-max in sysctl.conf is more than 100000. Is there any other step I could take to reduce the file size? I'm also not able to get any insight into Elasticsearch config changes.
Any help would be appreciated. Thanks.

I suggest an upgrade to at least 1.2.4, because of some file locking issues reported in Lucene: http://issues.apache.org/jira/browse/LUCENE-5612, http://issues.apache.org/jira/browse/LUCENE-5544.

Yes, Elasticsearch and Lucene are both resource intensive. I did the following to rectify my system:
1. Stop Elasticsearch. If you start it from the command line (bin/elasticsearch), set the heap size when starting. For example, I use a 16GB box, so my command is
a. bin/elasticsearch -Xmx8g -Xms8g
b. Go to the config (elasticsearch/config/elasticsearch.yml) and ensure that
bootstrap.mlockall: true
c. Increase ulimit -Hn and ulimit -Sn to more than 200000
2. If you start it as a service, do the following:
a. export ES_HEAP_SIZE=10g
b. Go to the config (/etc/elasticsearch/elasticsearch.yml) and ensure that
bootstrap.mlockall: true
c. Increase ulimit -Hn and ulimit -Sn to more than 200000
3. Make sure that the heap size you enter is not more than 50% of the machine's RAM, whether you start it as a service or from the command line.
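As a rough sketch of the sizing step above, the following shell helpers compute half of the machine's physical memory for use as the heap size. The function names half_ram_mb and mem_total_kb are hypothetical, and reading /proc/meminfo assumes a Linux host; ES_HEAP_SIZE and the ulimit checks are the ones mentioned in the steps.

```shell
#!/bin/sh
# Hypothetical helper: given a memory size in kB, print half of it in whole MB.
half_ram_mb() {
    echo $(( $1 / 2 / 1024 ))
}

# Hypothetical helper: total physical memory in kB (Linux only, reads /proc/meminfo).
mem_total_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# Example use (left commented out so that sourcing this file has no side effects):
#   export ES_HEAP_SIZE="$(half_ram_mb "$(mem_total_kb)")m"
#   ulimit -Hn   # hard limit on open files
#   ulimit -Sn   # soft limit on open files
```

On a 16GB box (16777216 kB), half_ram_mb gives 8192, which matches the 8g used in the command-line example.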

Related

JMETER: JMeter 5.3 java.lang.OutOfMemoryError. During Jmeter execution

I have configured a test plan in JMeter (shown in the image below) and have been running my parallel load tests from the CLI on a Mac.
I have configured a connection to my AWS Redshift database. When I check query monitoring, all of the queries are stuck in a Running state.
After some time, I get the following error in my terminal: JMeter 5.3 java.lang.OutOfMemoryError.
I have edited my bin/jmeter file to change the memory settings, but I still face the same issue.
When I run the same queries from DBeaver, they run to completion and show up in Redshift query monitoring.
How can I solve the memory problem so that the queries run without getting stuck in a Running state?
Below is the error I am getting even after increasing the heap size to 5 gigabytes.
WARNING: package sun.awt.X11 not in java.desktop
Creating summariser <summary>
Created the tree successfully using //Users/mbyousaf/Desktop/redshit-test/test-redhsift.jmx
Starting standalone test # Wed Dec 02 14:53:17 GMT 2020 (1606920797442)
Waiting for possible Shutdown/StopTestNow/HeapDump/ThreadDump message on port 4445
Warning: Nashorn engine is planned to be removed from a future JDK release
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid35596.hprof ...
Heap dump file created [3071802740 bytes in 3.747 secs]

Which exact OutOfMemoryError? There are several possible causes:
Lack of heap space: if this is the case, you're looking in the right place; just make sure that your changes are actually applied
GC Overhead Limit Exceeded: occurs when the GC is executing almost 100% of the time, leaving the program no chance to do its job
Requested array size exceeds VM limit: occurs when the program tries to create an object that is too large
Unable to Create New Native Thread: occurs when the program cannot create a new thread because the operating system doesn't allow it
and so on
It's not possible to say what's wrong without seeing your full test plan (at least a screenshot), as it might be that you added tons of Listeners and each of them stores large DB query responses in memory. The jmeter.log file (as text, definitely not as a screenshot) would also help: in the majority of cases it contains either the cause of the problem or at least a clue.
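To check that a heap change was actually applied, one option is to look at the -Xmx value on the running JVM's command line. The sketch below assumes a POSIX shell; the helper name extract_xmx is hypothetical, and the process ID in the usage comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the last -Xmx option found in a
# command line string passed as the first argument.
extract_xmx() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^-Xmx//p' | tail -n 1
}

# Example use against a running JVM (replace <pid> with the real process ID):
#   extract_xmx "$(ps -o args= -p <pid>)"
```

If the printed value is not the one you configured, the edit was probably made in a file or variable that the launch script does not read.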

How to check why job gets killed on Google Dataflow ( possible OOM )

I have a simple task: a bunch of files (~100GB in total), where each line represents one entity, and I have to send each entity to a JanusGraph server.
2018-07-07_05_10_46-8497016571919684639 <- job id
After a while I get an OOM; the logs say that Java gets killed.
In the Dataflow view, I can see the following logs:
Workflow failed. Causes: S01:TextIO.Read/Read+ParDo(Anonymous)+ParDo(JanusVertexConsumer) failed., A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
From stackdriver view, I can see: https://www.dropbox.com/s/zvny7qwhl7hbwyw/Screenshot%202018-07-08%2010.05.33.png?dl=0
Logs are saying:
E Out of memory: Kill process 1180 (java) score 1100 or sacrifice child
E Killed process 1180 (java) total-vm:4838044kB, anon-rss:383132kB, file-rss:0kB
More here: https://pastebin.com/raw/MftBwUxs
How can I debug what's going on?

There is too little information to debug the issue right now, so I am providing general information about Dataflow.
The most intuitive way for me to find the logs is going to Google Cloud Console -> Dataflow -> Select name of interest -> upper right corner (errors + logs).
More detailed information about monitoring is described here (in beta phase).
Some basic clues to troubleshoot the pipeline, as well as the most common error messages, are described here.
If you are not able to fix the issue, update the post with the error information please.
UPDATE
Based on the deadline exceeded error and the information you shared, I think your job is "shuffle-bound" and running into memory exhaustion. According to this guide:
Consider one of, or a combination of, the following courses of action:
Add more workers. Try setting --numWorkers with a higher value when you run your pipeline.
Increase the size of the attached disk for workers. Try setting --diskSizeGb with a higher value when you run your pipeline.
Use an SSD-backed persistent disk. Try setting --workerDiskType="compute.googleapis.com/projects//zones//diskTypes/pd-ssd" when you run your pipeline.
UPDATE 2
For specific OOM errors you can use:
--dumpHeapOnOOM will cause a heap dump to be saved locally when the JVM crashes due to OOM.
--saveHeapDumpsToGcsPath=gs://<path_to_a_gcs_bucket> will cause the heap dump to be uploaded to the configured GCS path on next worker restart. This makes it easy to download the dump file for inspection. Make sure that the account the job is running under has write permissions on the bucket.
Please take into account that heap dump support has some overhead cost and dumps can be very large. These flags should only be used for debugging purposes and always disabled for production jobs.
Find other references on DataflowPipelineDebugOptions methods.
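As an illustration, the two heap-dump flags above could be assembled like this. The helper name dataflow_heap_dump_args and the bucket path are hypothetical; only the two flag names come from the answer, and the =true form for the boolean flag is an assumption about how the pipeline options are parsed.

```shell
#!/bin/sh
# Hypothetical helper: build the heap-dump debugging flags for a given GCS path.
dataflow_heap_dump_args() {
    printf '%s %s\n' "--dumpHeapOnOOM=true" "--saveHeapDumpsToGcsPath=$1"
}

# Example use: append the output to the arguments you already pass to your pipeline.
#   dataflow_heap_dump_args gs://my-bucket/heap-dumps
```

Keeping the flags in one helper makes it easy to drop them again once debugging is done, in line with the advice not to leave them enabled for production jobs.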
UPDATE 3
I did not find public documentation about this, but I tested and confirmed that Dataflow scales the JVM heap size with the machine type (workerMachineType), so choosing a larger machine type could also fix your issue. I am with GCP Support, so I filed two documentation requests (one for a description page and another for a Dataflow troubleshooting page) to add this information.
On the other hand, there is this related feature request which you might find useful. Star it to make it more visible.

SOLR 6.2 ignores heap settings (SOLR_JAVA_MEM param)

I've followed instructions from here and here to increase my SOLR memory allocation. I've done this because the SOLR server has shut down periodically during some high-frequency, high-volume indexing activity.
I'm a little new to using SOLR and Ubuntu, so bear with me, but I've found several locations where the SOLR_JAVA_MEM parameter exists:
/opt/solr-6.2.0/bin/solr.in.sh
/opt/solr-6.2.0/bin/solr.in.cmd
/opt/solr-6.2.0/bin/solr.cmd
The same set of files in this directory: /home/deploy/.rbenv/versions/2.2.4/lib/ruby/gems/2.2.0/gems/sunspot_solr-2.2.5/solr/bin
And this directory:
/home/deploy/solr-6.2.0/bin
And finally, in this file: /etc/default/solr.in.sh
Anywhere I've seen a SOLR_JAVA_MEM or SOLR_HEAP param with a number, I've replaced it with a larger value, for example in /opt/solr-6.2.0/bin/solr.in.sh:
# Increase Java Heap as needed to support your indexing / query needs
SOLR_HEAP="1500m"
# Expert: If you want finer control over memory options, specify them directly
# Comment out SOLR_HEAP if you are using this though, that takes precedence
#SOLR_JAVA_MEM="-Xms1512m -Xmx1512m"
If I'm measuring it correctly, I still only see about 500MB of memory allocated to SOLR, as seen by the following command:
root@ip-xxx:~# service solr status
Found 1 Solr nodes:
Solr process 15259 running on port 8989
{
"solr_home":"/var/solr/data",
"version":"6.2.0 764d0f19151dbff6f5fcd9fc4b2682cf934590c5 - mike - 2016-08-20 05:41:37",
"startTime":"2016-09-28T15:01:18.001Z",
"uptime":"0 days, 0 hours, 12 minutes, 28 seconds",
"memory":"100 MB (%20.4) of 490.7 MB"}
Am I doing something wrong? Or am I just measuring the memory incorrectly? Please let me know if I can provide add'l info. Thanks!

I'll answer my own question. It turned out that I had to edit the /etc/default/solr.in.sh file. I changed the SOLR_HEAP="512M" to SOLR_HEAP="1500m" and ran sudo service solr status and saw the memory showing 1.5G!
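With several copies of solr.in.sh on the machine, it can help to print the SOLR_HEAP value from each one and compare. A minimal sketch, assuming a POSIX shell; the helper name solr_heap_setting is hypothetical, and the file paths in the usage comment are the ones listed in the question.

```shell
#!/bin/sh
# Hypothetical helper: print the last uncommented SOLR_HEAP value in an include file.
# Lines starting with '#' are ignored because the pattern only allows leading whitespace.
solr_heap_setting() {
    sed -n 's/^[[:space:]]*SOLR_HEAP="\{0,1\}\([^"]*\)"\{0,1\}[[:space:]]*$/\1/p' "$1" | tail -n 1
}

# Example use:
#   for f in /etc/default/solr.in.sh /opt/solr-6.2.0/bin/solr.in.sh; do
#       printf '%s: %s\n' "$f" "$(solr_heap_setting "$f")"
#   done
```

A file that still reports the old value, or nothing at all, is a candidate for the one the service actually reads.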

Webstart application fails to start with -Xmx2G on Java 8u60

I have a Java Webstart application that starts successfully with -Xmx1G, but fails to start with -Xmx2G. Some of my users really need 2G of heap.
This seems to be a problem with Java 8u60 only, because I have a report of someone launching successfully with Java 8u51.
The failure looks like this: I see the blue 'Java...' splash screen, and then after a few seconds, poof it's gone, before displaying the Java console and without producing any trace information in the expected place.
The failure occurs only on those clients with less than 2G of memory available. But, I am a little surprised that requesting a 'maximum' heap size could cause the application to fail so early and without any diagnostic information. We are dealing with a 'maximum' value, after all, not an 'initial' value. I read in multiple places that the JVM is not supposed to do this.
But I also remembered reading that the 'initial', if unspecified, is based on the maximum. So, along with passing -Xmx2G, I tried passing -Xms512M, -Xms256M, and -Xms128M. But, this attempt to shrink the initial heap size did not help. I cannot get this thing to start with -Xmx2G!
Does anyone have any light to shed on this situation? A solution? A workaround? In the short term, I'll change to -Xmx1G, but, as I said at the beginning, I have some users that really need -Xmx2G. I'd like to avoid having two separate *.jnlp files, which would also entail having two separate *.jar files!

Turns out that this is exactly what Webstart on Java8u60 does if the client machine does not have enough memory to satisfy -Xmx. It attempts to start, and then poof, it disappears without any indication as to what went wrong.
So, I will end up having to build my application in different configurations if I want to let users with more memory allocate that memory to my application. This is because signing requires the *.jnlp file to go into the *.jar file itself, and this *.jnlp file must be an exact match with the *.jnlp file used to launch the application.

Out of Memory on Tomcat Shutdown

Short description of my problem: I start up Tomcat with my deployed Wicket application. When I want to shut down tomcat I get this error message:
Error occurred during initialization of VM
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:640)
at java.lang.ref.Reference.<clinit>(Reference.java:145)
I am running the following setup:
Ubuntu Linux: 10.04 (lucid) with a 2.6.18-028stab094.3 kernel
Java Version: "1.6.0_26" Java HotSpot(TM) 64-Bit Server VM
Tomcat Version: 7.0.23
jvm_args: -Xms512m -Xmx512m -XX:MaxPermSize=205m (these are added via CATALINA_OPTS, nothing else)
Wicket 1.5.1
Tomcat is configured with two virtual hosts on subdomains with ModProxy
My application is deployed as ROOT.war in the appbase directory (it makes no difference if I deploy one or both applications)
With no application deployed, there is no OOM on shutdown, unless I mess around with the JVM args
The size of the war is about 500k, all libraries are deployed in tomcat/common/lib (directory which I added to common.loader in conf/catalina.properties)
ulimit -u -> unlimited
When I check the Tomcat manager app it says the following about the JVM memory:
Free memory: 470.70 MB Total memory: 490.68 MB Max memory: 490.68 MB
(http connector) Max threads: 200 Current thread count: 6 Current thread busy: 1
'top' or 'free -m' is similar:
Mem: 2097152k total, 1326772k used, 770380k free, 0k buffers
20029 myuser 18 0 805m 240m 11m S 0 11.7 0:19.24 java
I tried to start jmap to get a dump of the heap, but it also fails with an OutOfMemoryError. In fact, as long as one or both of my applications are deployed, any other Java process fails with the same OOM error (see top).
The problem occurs while the application is deployed, so something is seriously wrong with it. However, the application actually runs smoothly for quite a while. But I have seen OOMs in the application as well, so I don't trust the calm.
My application uses a custom filter class. Could that be it?
For completeness (hopefully), here's the list of libraries in my common/lib:
activation-1.1.jar
antlr-2.7.6.jar
antlr-runtime-3.3.jar
asm-3.1.jar
asm-commons-3.1.jar
asm-tree-3.1.jar
c3p0-0.9.1.1.jar
commons-collections-3.1.jar
commons-email-1.2.jar
dependencies-provided.tgz
dom4j-1.6.1.jar
ejb3-persistence-1.0.2.GA.jar
geronimo-annotation_1.0_spec-1.1.1.jar
geronimo-jaspic_1.0_spec-1.0.jar
geronimo-jta_1.1_spec-1.1.1.jar
hibernate-annotations-3.4.0.GA.jar
hibernate-commons-annotations-3.1.0.GA.jar
hibernate-core-3.3.0.SP1.jar
hibernate-entitymanager-3.4.0.GA.jar
hibernate-search-3.1.0.GA.jar
javassist-3.4.GA.jar
joda-time-1.6.2.jar
jta-1.1.jar
log4j-1.2.16.jar
lombok-0.9.3.jar
lucene-core-2.4.0.jar
mail-1.4.1.jar
mysql-connector-java-5.1.14.jar
persistence-api-1.0.jar
quartz-2.1.1.jar
servlet-api-2.5.jar
slf4j-api-1.6.1.jar
slf4j-log4j12-1.6.1.jar
stringtemplate-4.0.2.jar
wicket-auth-roles-1.5.1.jar
wicket-core-1.5.1.jar
wicket-datetime-1.5.1.jar
wicket-extensions-1.5.1.jar
wicket-request-1.5.1.jar
wicket-util-1.5.1.jar
xml-apis-1.0.b2.jar
I appreciate any hint or even speculation that gives me additional ideas what to try.
Update: I tested some more and found that this behaviour only occurs while one or both of my applications are deployed. The behaviour does not occur on "empty" tomcat (that was a mistake on my part messing with jvm args)
Update2: I am currently trying to reproduce this behaviour in VirtualBox, because I want to debug it with a profiler. I am still not convinced that it should be impossible to run my setup on 2GB RAM.
Update3 (10/01/12): I am trying to run jenkins instead of my own application. Same behaviour, so it is definitely server configuration issues. Jenkins jobs fail when maven is called, so I need not even try the shutdown hack suggested below because I need a second java process while running Jenkins. It was suggested to me that because this is a Virtual Server ulimits may be imposed from outside and I would not be able to see them. I think I'll ask a new question regarding this. Thx all.
Update4 (02/05/12): see below for the answer that contains the hint. To clarify: I am now 95% sure that the errors occur because I am reaching my thread limit. However, because this is a virtual server, the method described below does not work for checking this value, since it is not visible with ulimit. That is what was confusing me. Only today did I find out that this is the "numproc" value shown in the Parallels Power Panel that I can log into for my virtual server. There were Resource Alerts for numproc, but I did not see those until just now either. The value has a hard limit of 96, which of course I cannot change. The current value of numproc corresponds to the number of processes I see with "top" after toggling "H" to show threads. I had a very hard time finding this because the numproc value is hidden deep inside the panel. Sadly, 96 is a rather low number if you want to run Tomcat with Apache and MySQL. I am also very sad that I cannot find this value even in the small print of my hosting contract, and it is rather relevant to my application, I dare say. So I guess I'll need a server upgrade.
Thanks all for your helpful answers. In the end, everyone helped me a bit to find out what the problem was.

The Tomcat shutdown procedure consists of sending a command word via a TCP port to the running Tomcat VM. This port is configured in server.xml (if I remember correctly; I'm writing on my phone right now). So far so good.
Unfortunately, the shutdown script does this by starting a second VM using the same Java options used for Tomcat. Your system simply does not have enough memory for this.
As a solution, you could write your own stop script using telnet or something similar.
I could help with that later if needed.
Hope that helps.
Best regards, Bert
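A minimal sketch of such a stop script, which avoids starting a second JVM. It assumes that nc (netcat) is installed, that the <Server> element in server.xml has its port and shutdown attributes on one line (as in the stock file, where the defaults are 8005 and SHUTDOWN), and that Tomcat listens on localhost. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the shutdown port from a server.xml file.
# Assumes <Server port="..." shutdown="..."> is written on a single line.
shutdown_port() {
    sed -n 's/.*<Server[^>]*port="\([0-9-]*\)".*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: print the shutdown command word from a server.xml file.
shutdown_word() {
    sed -n 's/.*<Server[^>]*shutdown="\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Send the shutdown word to the running Tomcat over TCP using nc.
# The trailing newline marks the end of the command.
tomcat_stop() {
    conf="$1"
    printf '%s\n' "$(shutdown_word "$conf")" | nc localhost "$(shutdown_port "$conf")"
}

# Example use (path is a placeholder):
#   tomcat_stop /path/to/tomcat/conf/server.xml
```

Reading the port and word from server.xml keeps the script in step with the configuration instead of hard-coding the defaults.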

It seems you have too many threads open.
Use this command:
ulimit -u
What is the result?
It should be something like:
max user processes (-u) 100
If this is correct, you can edit this file:
/etc/security/limits.conf
and make the following modifications:
#<domain> <type> <item> <value>
user soft nproc 10000
user hard nproc 10000
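To compare the limit against what a process actually uses, the thread count of a running process can be read on Linux from its status file under /proc. A minimal sketch; the helper name threads_in_status is hypothetical, and the PID in the usage comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: print the "Threads:" value from a /proc/<pid>/status file.
threads_in_status() {
    awk '/^Threads:/ { print $2 }' "$1"
}

# Example use (replace <pid> with the Tomcat process ID):
#   threads_in_status /proc/<pid>/status
#   ulimit -u    # per-user process limit as seen by the current shell
```

If the thread count is close to the reported limit, new threads (and new JVMs) will fail to start with the "unable to create new native thread" error.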

You can probably survive for a while like this. All you need to do is kill the Tomcat process whenever you need to restart it. It is not a nice approach, but the main concern is that your application runs correctly.
It seems to me, though, that in the long run you might need to order a hosting plan with more RAM available.

I was having a similar problem with a Tomcat installation just last week. I managed to fix it by giving Tomcat a smaller heap. Setting something like this before starting Tomcat may help:
export CATALINA_OPTS="-Xms256m -Xmx512m"
In the meantime you'll have to kill it the old-fashioned way, with a kill -9 ;)
EDIT: you could also take a look here; it appears Tomcat automatically creates a bunch of "spare" threads, but you can limit those, as well as your max thread count, in the config. Hope it helps.
