JVM options -XX:+SafepointTimeout -XX:SafepointTimeoutDelay don't seem to work - java

I detected long safepoints (>10 sec!) on a server in the JVM safepoint.log:
6534.953: no vm operation [ 353 0 4 ] [ 0 0 14179 0 0 ] 0
7241.410: RevokeBias [ 357 0 1 ] [ 0 0 14621 0 0 ] 0
8501.278: BulkRevokeBias [ 356 0 6 ] [ 0 0 13440 0 2 ] 0
9667.681: no vm operation [ 349 0 8 ] [ 0 0 15236 0 0 ] 0
12094.170: G1IncCollectionPause [ 350 0 4 ] [ 0 0 15144 1 24 ] 0
13383.412: no vm operation [ 348 0 4 ] [ 0 0 15783 0 0 ] 0
13444.109: RevokeBias [ 349 0 2 ] [ 0 0 16084 0 0 ] 0
On my laptop I've played with -XX:SafepointTimeoutDelay=2
and it works well, printing the offending threads:
# SafepointSynchronize::begin: Timeout detected:
...
# SafepointSynchronize::begin: (End of list)
<writer thread='11267'/>
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
567.766: BulkRevokeBias [ 78 1 2 ] [ 0 6 6 0 0 ] 0
So I added these options on the server:
-XX:+SafepointTimeout -XX:SafepointTimeoutDelay=1000
to see which threads cause the problem, but I don't see any such output, while I still see long safepoint times.
Why isn't it applied on the server?
Here is the actual server config (taken from safepoint.log):
Java HotSpot(TM) 64-Bit Server VM (25.202-b08) for linux-amd64 JRE (1.8.0_202-b08), built on Dec 15 2018 12:40:22 by "java_re" with gcc 7.3.0
...
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=10
-XX:+UnlockDiagnosticVMOptions
-XX:+LogVMOutput
-XX:LogFile=/opt/pprb/card-pro/pci-pprb-eip57/logs/safepoint.log
-XX:+SafepointTimeout
-XX:SafepointTimeoutDelay=1000
...

In safepoint, "Total time for which application threads were stopped: 18.0049752 seconds, Stopping threads took: 18.0036770 seconds" maybe caused by a thread wait for lock, and maybe not.
When SafepointTimeoutDelay=1000, if more than one thread wait for 1s, there will invoke SafepointSynchronize::print_safepoint_timeout method in safepoint.cpp to print certain ThreadSafepointState.
But when all thread come to safepoint and other reason to hold in 18s, the method will be not called and no logs for it.
We can set safepoint=trace in jdk9+ to know all thread state in gc log.
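For reference, here is a minimal, self-contained sketch (my own illustration, not from the original post) of the kind of code that produces a long time-to-safepoint and therefore the timeout printout shown above: on JDK 8 the JIT emits no safepoint poll inside a counted int loop, so a thread running such a loop cannot stop until the loop finishes. The class name, loop bound and flag values below are arbitrary; run it with something like -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=500 -XX:+UnlockDiagnosticVMOptions -XX:+PrintSafepointStatistics and adjust as needed.

public class SlowSafepointDemo {
    static volatile long sink;               // prevents the JIT from optimizing the loop away

    // Long counted int loop: on JDK 8 the JIT emits no safepoint poll inside it,
    // so the thread running it cannot reach a safepoint until the loop finishes.
    static long hotLoop() {
        long sum = 0;
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        Thread spinner = new Thread(() -> {
            while (true) {
                sink = hotLoop();            // keeps this thread away from safepoints
            }
        });
        spinner.setDaemon(true);
        spinner.start();

        Thread.sleep(2000);                  // give the JIT time to compile hotLoop()

        // Thread.getAllStackTraces() requires a global safepoint; with
        // -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=500 the VM should report
        // the spinner thread as the one delaying the safepoint (tune the loop bound
        // if it completes too quickly on your hardware).
        Thread.getAllStackTraces();
    }
}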

Related

A list of Hotspot VM Operations with descriptions

Java HotSpot VM can perform a number of different VM operations. When debugging safepoint times it's useful to know the purpose of each safepoint. Some of them are obvious: G1IncCollectionPause or FindDeadlocks, but some are not: CGC_Operation, no vm operation. There is VMOps.java, but it only lists the possible values, not what they mean.
Currently, I need to know what CGC_Operation does in the context of G1 GC. I suspect that it is related to ConcurrentGCThread and old-gen collection, but I would like to confirm that and also have some references to look up other operations.
Example:
-XX:+PrintSafepointStatistics
...
128959.961: G1IncCollectionPause [ 2636 0 1 ] [ 0 0 0 15 52 ] 0
129986.695: G1IncCollectionPause [ 2637 0 0 ] [ 0 0 0 12 51 ] 0
137019.250: G1IncCollectionPause [ 2636 0 0 ] [ 0 0 0 13 50 ] 0
138693.219: CGC_Operation [ 2636 0 0 ] [ 0 0 0 13 338 ] 0
138726.672: G1IncCollectionPause [ 2636 0 0 ] [ 0 0 0 13 50 ] 0
138733.984: G1IncCollectionPause [ 2636 0 1 ] [ 0 0 0 13 50 ] 0
138738.750: G1IncCollectionPause [ 2636 0 0 ] [ 0 0 0 13 62 ] 0
The best (probably the only) documentation is the source code. Fortunately, the HotSpot JVM sources are very well commented.
See src/share/vm/gc_implementation/g1/vm_operations_g1.hpp:
// Concurrent GC stop-the-world operations such as remark and cleanup;
// consider sharing these with CMS's counterparts.
class VM_CGC_Operation: public VM_Operation {
no vm operation denotes a special type of periodic safepoint used for various cleanup activities; see this answer.
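If you want to see a CGC_Operation in your own safepoint statistics, here is a hedged sketch (my own illustration, not from the answer; class name, sizes and sleeps are arbitrary). Run it with -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1; the remark/cleanup pauses of the concurrent cycle should appear as CGC_Operation entries.

import java.util.ArrayList;
import java.util.List;

public class CgcOperationDemo {
    public static void main(String[] args) throws Exception {
        // Keep some live data around so the concurrent cycle has marking work to do.
        List<byte[]> retained = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            retained.add(new byte[64 * 1024]);
        }

        // With -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent, System.gc() starts a
        // G1 concurrent cycle instead of a full GC; the stop-the-world remark and
        // cleanup pauses of that cycle are what show up as CGC_Operation.
        System.gc();

        Thread.sleep(5000);   // give the concurrent cycle time to finish
        System.out.println("retained " + retained.size() + " blocks");
    }
}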

Enable Debug mode for JVM or get debug build for jdk8?

I see intermittent pauses in our application where the sync time for a safepoint is high.
2015-07-27T03:05:18.478-0600: 734.948: Application time: 4.1685198 seconds
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
733.780: no vm operation [ 343 0 0 ] [ 0 0 1168 0 0 ] 0 0
2015-07-27T03:05:19.647-0600: 734.949: : Total time for which application threads were stopped: 0.0012189 seconds
This leads to a few timeouts as the latency desired is ~100ms.
I want to know how to debug the high sync time. Can I get a debug build for JDK 8, or is there another way?

Java Safepoint: RevokeBias

We are facing an issue with a Java application where a lot of safepoints are being triggered (almost 1/sec). I have enabled GC logging with the "-XX:+PrintGCApplicationStoppedTime
-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1" flags
and found out that almost every second a safepoint is triggered because of "RevokeBias". It is happening very frequently and causing issues with the application.
Can someone tell me what could cause so many "RevokeBias" operations in the application, and what should we do to improve this behavior?
From GC logs:
3.039: Total time for which application threads were stopped: 0.0001610 seconds
24.039: Total time for which application threads were stopped: 0.0001640 seconds
From console logs:
7475.858: RevokeBias [ 161 0 0 ] [ 0 0 0 0 0 ] 0
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
#400000005548c3710bf0250c
7475.859: RevokeBias [ 161 0 0 ] [ 0 0 0 0 0 ] 0
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
7475.859: RevokeBias [ 161 0 0 ] [ 0 0 0 0 0 ] 0
Thanks,
Anuj
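Not part of the original question, but to illustrate the usual trigger: a RevokeBias safepoint is typically requested when an object whose monitor is biased towards one thread is later locked by a different thread. A minimal sketch, assuming JDK 8 defaults (class name and sleep values are arbitrary); run it with -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 and the RevokeBias entries should show up in the output.

public class RevokeBiasDemo {
    public static void main(String[] args) throws Exception {
        // Biased locking is only enabled a few seconds after startup on JDK 8
        // (-XX:BiasedLockingStartupDelay, 4000 ms by default), so wait first and
        // only then allocate the object we are going to lock.
        Thread.sleep(5000);
        final Object lock = new Object();

        // The main thread locks the object first, so the object becomes biased towards it.
        synchronized (lock) {
            System.out.println("main thread acquired the lock");
        }

        // A different thread locking the same object forces the bias to be revoked;
        // on JDK 8 that revocation happens at a safepoint and shows up as RevokeBias
        // (or BulkRevokeBias when a whole class's biases are revoked at once).
        Thread other = new Thread(() -> {
            synchronized (lock) {
                System.out.println("second thread acquired the lock");
            }
        });
        other.start();
        other.join();
    }
}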

Solr Replication leaking some memory?

Lately we discovered that the JBoss process on our Linux server was shut down by the OS due to high memory consumption (about 2.3 GB). Here is the dump:
RPC: fragment too large: 0x00800103
RPC: multiple fragments per record not supported
RPC: fragment too large: 0x00800103
RPC: multiple fragments per record not supported
RPC: fragment too large: 0x00800103
RPC: multiple fragments per record not supported
RPC: fragment too large: 0x00800103
RPC: multiple fragments per record not supported
RPC: fragment too large: 0x00800103
RPC: multiple fragments per record not supported
RPC: fragment too large: 0x00800103
RPC: multiple fragments per record not supported
RPC: fragment too large: 0x00800103
RPC: multiple fragments per record not supported
java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
java cpuset=/ mems_allowed=0
Pid: 11445, comm: java Not tainted 2.6.32-431.el6.x86_64 #1
Call Trace:
[<ffffffff810d05b1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff81122960>] ? dump_header+0x90/0x1b0
[<ffffffff8122798c>] ? security_real_capable_noaudit+0x3c/0x70
[<ffffffff81122de2>] ? oom_kill_process+0x82/0x2a0
[<ffffffff81122d21>] ? select_bad_process+0xe1/0x120
[<ffffffff81123220>] ? out_of_memory+0x220/0x3c0
[<ffffffff8112fb3c>] ? __alloc_pages_nodemask+0x8ac/0x8d0
[<ffffffff81167a9a>] ? alloc_pages_current+0xaa/0x110
[<ffffffff8111fd57>] ? __page_cache_alloc+0x87/0x90
[<ffffffff8111f73e>] ? find_get_page+0x1e/0xa0
[<ffffffff81120cf7>] ? filemap_fault+0x1a7/0x500
[<ffffffff8114a084>] ? __do_fault+0x54/0x530
[<ffffffff810afa17>] ? futex_wait+0x227/0x380
[<ffffffff8114a657>] ? handle_pte_fault+0xf7/0xb00
[<ffffffff8114b28a>] ? handle_mm_fault+0x22a/0x300
[<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
[<ffffffff81527910>] ? thread_return+0x4e/0x76e
[<ffffffff8152d45e>] ? do_page_fault+0x3e/0xa0
[<ffffffff8152a815>] ? page_fault+0x25/0x30
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 178
CPU 1: hi: 186, btch: 31 usd: 30
Node 0 Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 174
CPU 1: hi: 186, btch: 31 usd: 194
active_anon:113513 inactive_anon:184789 isolated_anon:0
active_file:21 inactive_file:0 isolated_file:0
unevictable:0 dirty:10 writeback:0 unstable:0
free:17533 slab_reclaimable:4706 slab_unreclaimable:8059
mapped:64 shmem:4 pagetables:3064 bounce:0
Node 0 DMA free:15696kB min:248kB low:308kB high:372kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15300kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 3000 4010 4010
Node 0 DMA32 free:41740kB min:50372kB low:62964kB high:75556kB active_anon:200648kB inactive_anon:216504kB active_file:20kB inactive_file:52kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072160kB mlocked:0kB dirty:8kB writeback:0kB mapped:168kB shmem:0kB slab_reclaimable:3720kB slab_unreclaimable:2476kB kernel_stack:512kB pagetables:516kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:108 all_unreclaimable? yes
lowmem_reserve[]: 0 0 1010 1010
Node 0 Normal free:12696kB min:16956kB low:21192kB high:25432kB active_anon:253404kB inactive_anon:522652kB active_file:64kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1034240kB mlocked:0kB dirty:32kB writeback:0kB mapped:88kB shmem:16kB slab_reclaimable:15104kB slab_unreclaimable:29760kB kernel_stack:3704kB pagetables:11740kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:146 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 4*4kB 2*8kB 3*16kB 4*32kB 2*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15696kB
Node 0 DMA32: 341*4kB 277*8kB 209*16kB 128*32kB 104*64kB 54*128kB 33*256kB 13*512kB 0*1024kB 1*2048kB 0*4096kB = 41740kB
Node 0 Normal: 2662*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 12696kB
64603 total pagecache pages
64549 pages in swap cache
Swap cache stats: add 3763837, delete 3699288, find 1606527/1870160
Free swap = 0kB
Total swap = 1048568kB
1048560 pages RAM
67449 pages reserved
1061 pages shared
958817 pages non-shared
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 419] 0 419 2662 1 1 -17 -1000 udevd
[ 726] 0 726 2697 1 1 -17 -1000 udevd
[ 1021] 0 1021 4210 40 1 0 0 vmware-guestd
[ 1238] 0 1238 23294 28 1 -17 -1000 auditd
[ 1254] 65 1254 112744 203 1 0 0 nslcd
[ 1267] 0 1267 62271 123 1 0 0 rsyslogd
[ 1279] 0 1279 2705 32 1 0 0 irqbalance
[ 1293] 32 1293 4744 16 1 0 0 rpcbind
[ 1311] 29 1311 5837 2 0 0 0 rpc.statd
[ 1422] 81 1422 5874 36 0 0 0 dbus-daemon
[ 1451] 0 1451 1020 1 0 0 0 acpid
[ 1460] 68 1460 9995 129 0 0 0 hald
[ 1461] 0 1461 5082 2 1 0 0 hald-runner
[ 1490] 0 1490 5612 2 1 0 0 hald-addon-inpu
[ 1503] 68 1503 4484 2 0 0 0 hald-addon-acpi
[ 1523] 0 1523 134268 53 0 0 0 automount
[ 1540] 0 1540 1566 1 0 0 0 mcelog
[ 1552] 0 1552 16651 27 1 -17 -1000 sshd
[ 1560] 0 1560 5545 26 0 0 0 xinetd
[ 1568] 38 1568 8202 33 0 0 0 ntpd
[ 1584] 0 1584 21795 56 0 0 0 sendmail
[ 1592] 51 1592 19658 32 0 0 0 sendmail
[ 1601] 0 1601 29324 21 1 0 0 crond
[ 1612] 0 1612 5385 5 1 0 0 atd
[ 1638] 0 1638 1016 2 0 0 0 mingetty
[ 1640] 0 1640 1016 2 1 0 0 mingetty
[ 1642] 0 1642 1016 2 0 0 0 mingetty
[ 1644] 0 1644 2661 1 1 -17 -1000 udevd
[ 1645] 0 1645 1016 2 0 0 0 mingetty
[ 1647] 0 1647 1016 2 1 0 0 mingetty
[ 1649] 0 1649 1016 2 1 0 0 mingetty
[25003] 0 25003 26827 1 1 0 0 rpc.rquotad
[25007] 0 25007 5440 2 1 0 0 rpc.mountd
[25045] 0 25045 5773 2 1 0 0 rpc.idmapd
[31756] 0 31756 43994 12 0 0 0 httpd
[31758] 48 31758 45035 205 0 0 0 httpd
[31759] 48 31759 45035 210 1 0 0 httpd
[31760] 48 31760 45035 201 1 0 0 httpd
[31761] 48 31761 45068 211 1 0 0 httpd
[31762] 48 31762 45068 199 0 0 0 httpd
[31763] 48 31763 45035 196 0 0 0 httpd
[31764] 48 31764 45068 191 1 0 0 httpd
[31765] 48 31765 45035 206 1 0 0 httpd
[ 1893] 0 1893 41344 2 0 0 0 su
[ 1896] 500 1896 26525 2 0 0 0 standalone.sh
[ 1957] 500 1957 570217 81589 0 0 0 java
[10739] 0 10739 41344 2 0 0 0 su
[10742] 500 10742 26525 2 0 0 0 standalone.sh
[10805] 500 10805 576358 77163 0 0 0 java
[13378] 0 13378 41344 2 0 0 0 su
[13381] 500 13381 26525 2 1 0 0 standalone.sh
[13442] 500 13442 561881 73430 1 0 0 java
Out of memory: Kill process 10805 (java) score 141 or sacrifice child
Killed process 10805, UID 500, (java) total-vm:2305432kB, anon-rss:308648kB, file-rss:4kB
It was shut down at about 04:00 in the morning, when there were no users and no activity on the server besides Solr replication. It was the master node that failed, and our slave pings it every minute. Here is the replication config:
<requestHandler name="/replication" class="solr.ReplicationHandler" >
<lst name="master">
<str name="enable">${solr.enable.master:false}</str>
<str name="replicateAfter">commit</str>
<str name="replicateAfter">startup</str>
<str name="confFiles">schema.xml,stopwords.txt</str>
</lst>
<lst name="slave">
<str name="enable">${solr.enable.slave:false}</str>
<str name="masterUrl">${solr.master.url:http://localhost:8080/solr/cstb}</str>
<str name="pollInterval">00:00:60</str>
</lst>
</requestHandler>
Since there was no user activity, there were no changes to the indexes, and thus Solr should not actually have been doing anything (I assume).
Some other values from config file:
<indexDefaults>
<useCompoundFile>false</useCompoundFile>
<mergeFactor>10</mergeFactor>
<ramBufferSizeMB>32</ramBufferSizeMB>
<maxFieldLength>10000</maxFieldLength>
<writeLockTimeout>1000</writeLockTimeout>
<lockType>native</lockType>
</indexDefaults>
<mainIndex>
<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>32</ramBufferSizeMB>
<mergeFactor>10</mergeFactor>
<unlockOnStartup>false</unlockOnStartup>
<reopenReaders>true</reopenReaders>
<deletionPolicy class="solr.SolrDeletionPolicy">
<str name="maxCommitsToKeep">1</str>
<str name="maxOptimizedCommitsToKeep">0</str>
</deletionPolicy>
<infoStream file="INFOSTREAM.txt">false</infoStream>
</mainIndex>
<queryResultWindowSize>20</queryResultWindowSize>
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
So, has anybody experienced a similar situation or have any thoughts about it? We are using Solr 3.5.
You are running into a low memory condition that is causing Linux to kill off a high memory usage process:
Out of memory: Kill process 10805 (java) score 141 or sacrifice child
This is known as the out-of-memory killer, or OOM killer. Given that you are only using 512 MB of heap for the JVM (way too low in my opinion for a production Solr instance of any significant capacity), you don't have a lot of options, as you cannot reduce the heap to free up more OS memory.
Things you can try:
Upgrade to a larger server with more memory. This would be my number one recommendation - you simply don't have enough memory available.
Move any other production code to another system. You did not mention whether you have anything else running on this server, but I would move anything I could elsewhere. There is not a lot to gain here, as I suspect your system is quite small to begin with, but every little bit helps.
Try tuning the OOM killer to be less strict - it's not that easy to do, and I don't know how much you will gain given the overall small server size, but you can always experiment:
https://unix.stackexchange.com/questions/58872/how-to-set-oom-killer-adjustments-for-daemons-permanently
http://backdrift.org/how-to-create-oom-killer-exceptions
http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html

Preventing Linux Kernel from killing Java process with really large heap [closed]

Running Ubuntu 12.04.3 LTS with 32 cores and 244 GB of RAM. It's the big Amazon EC2 memory instance, with Java 1.7u25.
My Java process is running with -Xmx226g.
I'm trying to create a really large local cache using CQEngine, and so far it's blazingly fast with 30,000,000 records. Of course I will add an eviction policy that will allow garbage collection to clean up evicted old objects, but I'm really trying to push the limits here :)
When looking at jvisualvm, the total heap is at about 180 GB, which dies 40 GB too soon. I should be able to squeeze out a bit more.
Not that I don't want the kernel to kill a process if it runs out of resources, but I think it's killing it too early, and I want to squeeze the memory usage as much as possible.
The ulimit output is as follows...
ubuntu#ip-10-156-243-111:/var/log$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1967992
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 1967992
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
The kern.log output is...
340 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
63999984 pages RAM
1022168 pages reserved
649 pages shared
62830686 pages non-shared
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 505] 0 505 4342 93 9 0 0 upstart-udev-br
[ 507] 0 507 5456 198 2 -17 -1000 udevd
[ 642] 0 642 5402 155 28 -17 -1000 udevd
[ 643] 0 643 5402 155 29 -17 -1000 udevd
[ 739] 0 739 3798 49 10 0 0 upstart-socket-
[ 775] 0 775 1817 124 25 0 0 dhclient3
[ 897] 0 897 12509 152 10 -17 -1000 sshd
[ 949] 101 949 63430 91 9 0 0 rsyslogd
[ 990] 102 990 5985 90 8 0 0 dbus-daemon
[ 1017] 0 1017 3627 40 9 0 0 getty
[ 1024] 0 1024 3627 41 10 0 0 getty
[ 1029] 0 1029 3627 43 6 0 0 getty
[ 1030] 0 1030 3627 41 3 0 0 getty
[ 1032] 0 1032 3627 41 1 0 0 getty
[ 1035] 0 1035 1083 34 1 0 0 acpid
[ 1036] 0 1036 4779 49 5 0 0 cron
[ 1037] 0 1037 4228 40 8 0 0 atd
[ 1045] 0 1045 3996 57 3 0 0 irqbalance
[ 1084] 0 1084 3627 43 2 0 0 getty
[ 1085] 0 1085 3189 39 11 0 0 getty
[ 1087] 103 1087 46916 300 0 0 0 whoopsie
[ 1159] 0 1159 20490 215 0 0 0 sshd
[ 1162] 0 1162 1063575 263 15 0 0 console-kit-dae
[ 1229] 0 1229 46648 153 4 0 0 polkitd
[ 1318] 1000 1318 20490 211 10 0 0 sshd
[ 1319] 1000 1319 6240 1448 1 0 0 bash
[ 1816] 1000 1816 70102543 62010032 4 0 0 java
[ 1947] 0 1947 20490 214 6 0 0 sshd
[ 2035] 1000 2035 20490 210 0 0 0 sshd
[ 2036] 1000 2036 6238 1444 13 0 0 bash
[ 2179] 1000 2179 13262 463 2 0 0 vi
Out of memory: Kill process 1816 (java) score 987 or sacrifice child
Killed process 1816 (java) total-vm:280410172kB, anon-rss:248040128kB, file-rss:0kB
The kern.log clearly states that it killed my process because it ran out of memory. But like I said, I think I can squeeze it a bit more. Are there any settings I need to change to allow me to use the 226 GB allocated to Java?
