We're using ZGC on JDK 17.0.1. The application runs without unexpected GC stalls for a while, after which it gets into a state where ZGC frees very little memory when a GC cycle runs. From then on, the application hits periodic Allocation Stalls.
The application allocates heavily on a periodic cycle of around 35 milliseconds, roughly 2GB/s at peak load. On a typical morning start there are about 30 minutes of low load until external data starts arriving, after which load stays high, so the load profile changes while the application is running. Because the allocation during load is predictable, we schedule collections on a fixed interval via -XX:ZCollectionInterval.
We cannot consistently reproduce the Allocation Stalls, even between different machines where the load is very similar. Oddly, if the application is restarted while the load is already high, then it does not typically run into issues.
We previously experimented with -XX:+UseDynamicNumberOfGCThreads as well, but it tends to settle on only 2 GC threads. The attached chart uses -XX:SoftMaxHeapSize=1G, which is admittedly aggressive, but we saw the same issue without setting SoftMaxHeapSize at all.
We also noticed that the Concurrent Mark phase grows longer in the cycles leading up to the issue. We didn't see anything ZGC-related of note in the release notes of JDK versions up to 17.0.5.
Has anyone run into similar issues before? We're unsure why this happens, or whether it's expected behavior with these flags. Given how little we've tuned and how rarely it occurs, we were hoping someone could give us pointers on whether this is a known issue with an existing fix, or whether we're simply configuring it wrong.
These are the relevant flags.
-XX:-OmitStackTraceInFastThrow
-XX:CICompilerCount=2
-XX:+DisableExplicitGC
-XX:ParallelGCThreads=4
-XX:ConcGCThreads=4
-Xmx8192m
-Xms8192m
-Xss8m
-Xlog:gc*=info:file=${LT_APP_LOG_PATH_FORMAT}.gc:time,level,tags:filecount=32,filesize=64M
-XX:+UseZGC
-XX:+AlwaysPreTouch
-XX:+UseLargePages
-XX:+UseTransparentHugePages
-XX:ZCollectionInterval=4
-Djdk.nio.maxCachedBufferSize=262144
-XX:+HeapDumpOnOutOfMemoryError
-XX:SoftMaxHeapSize=1G
[2022-11-24T09:21:12.378+0530][info][gc,phases ] GC(669) Concurrent Select Relocation Set 8.713ms
[2022-11-24T09:21:12.378+0530][info][gc,phases ] GC(669) Pause Relocate Start 0.008ms
[2022-11-24T09:21:12.420+0530][info][gc,phases ] GC(669) Concurrent Relocate 41.564ms
[2022-11-24T09:21:12.420+0530][info][gc,load ] GC(669) Load: 36.38/31.86/25.10
[2022-11-24T09:21:12.420+0530][info][gc,mmu ] GC(669) MMU: 2ms/99.1%, 5ms/99.6%, 10ms/99.8%, 20ms/99.9%, 50ms/99.9%, 100ms/100.0%
[2022-11-24T09:21:12.420+0530][info][gc,marking ] GC(669) Mark: 4 stripe(s), 2 proactive flush(es), 1 terminate flush(es), 0 completion(s), 0 continuation(s)
[2022-11-24T09:21:12.420+0530][info][gc,marking ] GC(669) Mark Stack Usage: 32M
[2022-11-24T09:21:12.420+0530][info][gc,nmethod ] GC(669) NMethods: 10353 registered, 328 unregistered
[2022-11-24T09:21:12.420+0530][info][gc,metaspace] GC(669) Metaspace: 44M used, 46M committed, 1072M reserved
[2022-11-24T09:21:12.420+0530][info][gc,ref ] GC(669) Soft: 2311 encountered, 98 discovered, 0 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,ref ] GC(669) Weak: 2559 encountered, 652 discovered, 0 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,ref ] GC(669) Final: 56 encountered, 7 discovered, 0 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,ref ] GC(669) Phantom: 27425 encountered, 27313 discovered, 27030 enqueued
[2022-11-24T09:21:12.420+0530][info][gc,reloc ] GC(669) Small Pages: 5764 / 11528M, Empty: 4752M, Relocated: 58M, In-Place: 0
[2022-11-24T09:21:12.420+0530][info][gc,reloc ] GC(669) Medium Pages: 3 / 96M, Empty: 64M, Relocated: 0M, In-Place: 0
[2022-11-24T09:21:12.420+0530][info][gc,reloc ] GC(669) Large Pages: 0 / 0M, Empty: 0M, Relocated: 0M, In-Place: 0
[2022-11-24T09:21:12.420+0530][info][gc,reloc ] GC(669) Forwarding Usage: 29M
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Min Capacity: 16384M(100%)
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Max Capacity: 16384M(100%)
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Soft Max Capacity: 1024M(6%)
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Mark Start Mark End Relocate Start Relocate End High Low
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Capacity: 16384M (100%) 16384M (100%) 16384M (100%) 16384M (100%) 16384M (100%) 16384M (100%)
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Free: 4760M (29%) 0M (0%) 4788M (29%) 5392M (33%) 5394M (33%) 0M (0%)
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Used: 11624M (71%) 16384M (100%) 11596M (71%) 10992M (67%) 16384M (100%) 10990M (67%)
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Live: - 5860M (36%) 5860M (36%) 5860M (36%) - -
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Allocated: - 4760M (29%) 4788M (29%) 4858M (30%) - -
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Garbage: - 5763M (35%) 947M (6%) 273M (2%) - -
[2022-11-24T09:21:12.420+0530][info][gc,heap ] GC(669) Reclaimed: - - 4816M (29%) 5490M (34%) - -
[2022-11-24T09:21:12.420+0530][info][gc ] GC(669) Garbage Collection (Allocation Stall) 11624M(71%)->10992M(67%)
[2022-11-24T09:21:12.479+0530][info][gc,start ] GC(670) Garbage Collection (Allocation Stall)
[2022-11-24T09:21:12.479+0530][info][gc,ref ] GC(670) Clearing All SoftReferences
[2022-11-24T09:21:12.479+0530][info][gc,task ] GC(670) Using 4 workers
[2022-11-24T09:21:12.479+0530][info][gc,phases ] GC(670) Pause Mark Start 0.009ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] === Garbage Collection Statistics =======================================================================================================================
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Last 10s Last 10m Last 10h Total
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Avg / Max Avg / Max Avg / Max Avg / Max
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Collector: Garbage Collection Cycle 3410.686 / 3486.475 844.670 / 3486.475 464.952 / 3486.475 464.952 / 3486.475 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Contention: Mark Segment Reset Contention 7 / 15 3 / 30 1 / 30 1 / 30 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Contention: Mark SeqNum Reset Contention 0 / 1 0 / 3 0 / 3 0 / 3 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Critical: Allocation Stall 1 / 5 0 / 7 0 / 7 0 / 7 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Critical: Allocation Stall 677.602 / 900.858 303.367 / 900.858 303.367 / 900.858 303.367 / 900.858 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Critical: GC Locker Stall 0 / 0 0 / 0 0 / 1 0 / 1 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Critical: GC Locker Stall 0.000 / 0.000 0.000 / 0.000 0.033 / 0.058 0.033 / 0.058 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Critical: Relocation Stall 0 / 0 0 / 0 0 / 0 0 / 0 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Critical: Relocation Stall 0.000 / 0.000 0.000 / 0.000 0.000 / 0.000 0.000 / 0.000 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Allocation Rate 1479 / 1860 1003 / 2264 329 / 2264 329 / 2264 MB/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Out Of Memory 0 / 0 0 / 0 0 / 0 0 / 0 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Page Cache Flush 0 / 0 0 / 32 0 / 32 0 / 32 MB/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Page Cache Hit L1 198 / 827 177 / 901 59 / 901 59 / 901 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Page Cache Hit L2 541 / 930 316 / 1052 103 / 1052 103 / 1052 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Page Cache Hit L3 1 / 15 12 / 657 4 / 657 4 / 657 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Page Cache Miss 0 / 0 0 / 1 0 / 1 0 / 1 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Uncommit 0 / 0 0 / 0 0 / 0 0 / 0 MB/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Undo Object Allocation Failed 0 / 0 0 / 1 0 / 1 0 / 1 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Undo Object Allocation Succeeded 0 / 3 4 / 855 1 / 855 1 / 855 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Memory: Undo Page Allocation 0 / 3 0 / 4 0 / 4 0 / 4 ops/s
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Concurrent Mark 3356.120 / 3437.684 812.574 / 3437.684 445.012 / 3437.684 445.012 / 3437.684 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Concurrent Mark Continue 0.000 / 0.000 0.000 / 0.000 0.000 / 0.000 0.000 / 0.000 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Concurrent Mark Free 0.001 / 0.001 0.001 / 0.002 0.001 / 0.002 0.001 / 0.002 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Concurrent Process Non-Strong References 8.650 / 8.848 12.746 / 47.124 8.067 / 47.124 8.067 / 47.124 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Concurrent Relocate 36.550 / 41.564 14.436 / 55.995 8.357 / 55.995 8.357 / 55.995 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Concurrent Reset Relocation Set 0.067 / 0.080 0.023 / 0.080 0.013 / 0.080 0.013 / 0.080 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Concurrent Select Relocation Set 7.214 / 8.713 3.382 / 29.365 2.579 / 29.365 2.579 / 29.365 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Pause Mark End 0.011 / 0.011 0.011 / 0.018 0.009 / 0.018 0.009 / 0.018 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Pause Mark Start 0.008 / 0.009 0.008 / 0.016 0.008 / 0.016 0.008 / 0.016 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Phase: Pause Relocate Start 0.008 / 0.008 0.007 / 0.013 0.006 / 0.013 0.006 / 0.013 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Classes Purge 0.266 / 0.497 0.411 / 5.255 0.251 / 5.255 0.251 / 5.255 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Classes Unlink 5.018 / 5.358 9.216 / 28.725 5.991 / 28.725 5.991 / 28.725 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Mark 3352.212 / 3437.123 855.661 / 3437.123 454.341 / 3437.123 454.341 / 3437.123 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Mark Try Flush 0.030 / 0.055 0.760 / 42.368 0.442 / 42.368 0.442 / 42.368 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Mark Try Terminate 0.703 / 1.059 1.393 / 27.549 1.126 / 27.549 1.126 / 27.549 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent References Enqueue 0.003 / 0.004 0.003 / 0.013 0.002 / 0.013 0.002 / 0.013 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent References Process 3.055 / 3.524 1.183 / 9.454 0.655 / 9.454 0.655 / 9.454 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Roots ClassLoaderDataGraph 0.090 / 0.223 0.135 / 12.400 0.105 / 12.400 0.105 / 12.400 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Roots CodeCache 0.000 / 0.000 0.000 / 0.000 0.000 / 0.000 0.000 / 0.000 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Roots JavaThreads 0.411 / 1.162 0.520 / 12.391 0.407 / 12.391 0.407 / 12.391 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Roots OopStorageSet 0.018 / 0.038 0.051 / 12.043 0.030 / 12.043 0.030 / 12.043 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Concurrent Weak Roots OopStorageSet 0.259 / 0.263 0.449 / 4.890 0.331 / 4.890 0.331 / 4.890 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] Subphase: Pause Mark Try Complete 0.000 / 0.000 0.000 / 0.000 0.000 / 0.000 0.000 / 0.000 ms
[2022-11-24T09:21:13.180+0530][info][gc,stats ] System: Java Threads 28 / 28 26 / 29 21 / 29 21 / 29 threads
[2022-11-24T09:21:13.180+0530][info][gc,stats ] =========================================================================================================================================================
[2022-11-24T09:21:15.944+0530][info][gc,phases ] GC(670) Concurrent Mark 3465.071ms
[2022-11-24T09:21:15.944+0530][info][gc,phases ] GC(670) Pause Mark End 0.011ms
[2022-11-24T09:21:15.944+0530][info][gc,phases ] GC(670) Concurrent Mark Free 0.001ms
[2022-11-24T09:21:15.953+0530][info][gc,phases ] GC(670) Concurrent Process Non-Strong References 8.537ms
[2022-11-24T09:21:15.953+0530][info][gc,phases ] GC(670) Concurrent Reset Relocation Set 0.062ms
[2022-11-24T09:21:15.953+0530][info][gc ] Allocation Stall (lt-primary-pool-2) 704.524ms
[2022-11-24T09:21:15.953+0530][info][gc ] Allocation Stall (lt-primary-pool-0) 704.168ms
[2022-11-24T09:21:15.953+0530][info][gc ] Allocation Stall (lt-secondary-pool-0) 711.668ms
[2022-11-24T09:21:15.955+0530][info][gc ] Allocation Stall (lt-primary-pool-1) 707.109ms
[2022-11-24T09:21:15.955+0530][info][gc ] Allocation Stall (lt-secondary-pool-1) 713.196ms
[2022-11-24T09:21:15.957+0530][info][gc,phases ] GC(670) Concurrent Select Relocation Set 4.082ms
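As a rough sanity check on the numbers above (our own back-of-the-envelope arithmetic, not anything ZGC reports directly): at Relocate End of GC(669) there was about 5392M free, the last-10s average allocation rate was roughly 1.5GB/s (peaking at ~1.9GB/s), and the following Concurrent Mark took ~3.5s, so at the observed rates the free headroom runs out before marking can finish, which lines up with the ~700ms Allocation Stalls reported during GC(670). A minimal sketch of that calculation:

// Back-of-the-envelope check using the figures from the GC(669)/GC(670) log above.
// All numbers are copied from the log; nothing here is measured by ZGC itself.
public class StallHeadroomCheck {
    public static void main(String[] args) {
        double freeAtRelocateEndMb = 5392;   // GC(669) "Free" at Relocate End
        double avgAllocRateMbPerSec = 1479;  // "Memory: Allocation Rate", last 10s avg
        double maxAllocRateMbPerSec = 1860;  // "Memory: Allocation Rate", last 10s max
        double concurrentMarkSec = 3.465;    // GC(670) Concurrent Mark duration

        double secondsUntilOomAvg = freeAtRelocateEndMb / avgAllocRateMbPerSec; // ~3.6s
        double secondsUntilOomMax = freeAtRelocateEndMb / maxAllocRateMbPerSec; // ~2.9s

        System.out.printf("Headroom lasts %.1fs (avg rate) to %.1fs (peak rate); "
                + "Concurrent Mark alone takes %.1fs%n",
                secondsUntilOomAvg, secondsUntilOomMax, concurrentMarkSec);
        // At the peak rate the headroom is gone ~0.5-0.6s before marking finishes,
        // which is in the same ballpark as the ~700ms Allocation Stalls in the log.
    }
}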
java version "17.0.1" 2021-10-19 LTS
Java(TM) SE Runtime Environment (build 17.0.1+12-LTS-39)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.1+12-LTS-39, mixed mode, sharing)
Graphs created with GCViewer on the GC logs.
Graph 1: ZGC cleanup issues.
Graph 2: No allocation issues when everything is fine.
We never got to a definitive root cause, but we did make some progress in examining the issue more closely.
We looked at the ZGC director code in OpenJDK on GitHub:
https://github.com/openjdk/zgc/blob/1f8813495e8184b6c38319df1c2cb70de7811a76/src/hotspot/share/gc/z/zDirector.cpp#L184
It contains the logic that estimates the allocation rate. We also enabled ZGC debug logging:
-Xlog:gc*=debug:file=gc.log:time,level,tags
That logging (see the example lines at the end of this post) shows the predicted allocation rate staying at 0 for extended periods after application start. We suspect this is one of the causes. However, looking at the GC graphs, it doesn't fully explain why ZGC gets into a panic state when there is still enough memory during normal operation.
We set -XX:ZAllocationSpikeTolerance=10 so that the estimated allocation rate gets close to our normal peak of 2000MB/s, and ZGC's predicted memory usage appeared correct from that point on. We've been running these settings on our services for a while now without issues. We also removed -XX:SoftMaxHeapSize=, since the required GC frequency seems to be estimated correctly with just -XX:ZAllocationSpikeTolerance.
Example debug logging for the predicted allocation rate (a rough sketch of how we read the director rule follows the log lines):
[2022-12-02T12:45:29.825+0530][debug][gc,director ] Rule: Allocation Rate (Dynamic GC Workers), MaxAllocRate: 9624.6MB/s (+/-0.6%), Free: 3412MB, GCCPUTime: 0.136, GCDuration: 0.136s, TimeUntilOOM: 0.352s, TimeUntilGC: 0.116s, GCWorkers: 1 -> 1
[2022-12-02T12:45:29.825+0530][debug][gc,director ] Rule: High Usage, Free: 3412MB(83.3%)
[2022-12-02T12:45:29.825+0530][debug][gc,director ] Rule: Proactive, AcceptableGCInterval: 3.394s, TimeSinceLastGC: 0.473s, TimeUntilGC: 2.921s
[2022-12-02T12:45:29.925+0530][debug][gc,alloc ] Allocation Rate: 958.0MB/s, Predicted: 960.9MB/s, Avg: 954.4(+/-5.7)MB/s
[2022-12-02T12:45:29.925+0530][debug][gc,director ] Rule: Timer, Interval: 4.000s, TimeUntilGC: 3.427s
[2022-12-02T12:45:29.925+0530][debug][gc,director ] Select GC Workers (Normal), AvoidLongGCWorkers: 0.013, AvoidOOMGCWorkers: 0.393, LastGCWorkers: 1.000, GCWorkers: 0.393
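Our (possibly simplified) reading of the rule in zDirector.cpp linked above, reconstructed from the fields in the debug lines: the predicted allocation rate is scaled by ZAllocationSpikeTolerance to get MaxAllocRate, TimeUntilOOM is roughly Free / MaxAllocRate, and TimeUntilGC subtracts the expected GC duration plus the sampling interval. A small sketch with the numbers from the 12:45:29 lines above; this is not the actual HotSpot code (the real rule also folds in the variance of the rate), just our approximation:

// Approximate model of ZGC's "Allocation Rate" director rule, reconstructed from
// the gc,director / gc,alloc debug lines above. NOT the HotSpot code, only our reading.
public class ZDirectorRuleSketch {
    public static void main(String[] args) {
        double predictedAllocRateMbPerSec = 960.9; // "Predicted" from the gc,alloc line
        double spikeTolerance = 10.0;              // -XX:ZAllocationSpikeTolerance=10
        double freeMb = 3412;                      // "Free" from the gc,director line
        double gcDurationSec = 0.136;              // "GCDuration" from the gc,director line
        double sampleIntervalSec = 0.1;            // assumption: rate is sampled ~every 100ms
                                                   // (this makes the numbers line up)

        double maxAllocRate = predictedAllocRateMbPerSec * spikeTolerance;      // ~9609 (log: 9624.6)
        double timeUntilOom = freeMb / maxAllocRate;                            // ~0.355 (log: 0.352)
        double timeUntilGc = timeUntilOom - gcDurationSec - sampleIntervalSec;  // ~0.119 (log: 0.116)

        System.out.printf("MaxAllocRate ~%.0f MB/s, TimeUntilOOM ~%.3fs, TimeUntilGC ~%.3fs%n",
                maxAllocRate, timeUntilOom, timeUntilGc);
        // The point of ZAllocationSpikeTolerance=10 for us: during the low-load morning
        // period the sampled rate (and therefore MaxAllocRate) is near zero, so this rule
        // never fires; multiplying by 10 keeps MaxAllocRate close to our ~2000MB/s peak.
    }
}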
The Java HotSpot VM can perform a number of different VM operations. When debugging safepoint times it's useful to know what the purpose of a safepoint was. Some operation names are obvious (G1IncCollectionPause, FindDeadlocks), but some are not (CGC_Operation, no vm operation). There is VMOps.java, but it only lists the possible values, not what they mean.
Right now I need to know what CGC_Operation does in the context of G1 GC. I suspect it is related to ConcurrentGCThread and old-gen collection, but I would like to confirm that, and also get some references for looking up other operations.
Example:
-XX:+PrintSafepointStatistics
...
128959.961: G1IncCollectionPause [ 2636 0 1 ] [ 0 0 0 15 52 ] 0
129986.695: G1IncCollectionPause [ 2637 0 0 ] [ 0 0 0 12 51 ] 0
137019.250: G1IncCollectionPause [ 2636 0 0 ] [ 0 0 0 13 50 ] 0
138693.219: CGC_Operation [ 2636 0 0 ] [ 0 0 0 13 338 ] 0
138726.672: G1IncCollectionPause [ 2636 0 0 ] [ 0 0 0 13 50 ] 0
138733.984: G1IncCollectionPause [ 2636 0 1 ] [ 0 0 0 13 50 ] 0
138738.750: G1IncCollectionPause [ 2636 0 0 ] [ 0 0 0 13 62 ] 0
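For reference, the column layout for these lines (the header that -XX:+PrintSafepointStatistics prints periodically, which also appears in a later excerpt in this thread) is:

vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count

Read that way, the CGC_Operation above spent about 338ms in the vmop itself, versus roughly 50-62ms for the surrounding G1IncCollectionPause entries.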
The best (probably the only) documentation is the source code. Fortunately, the HotSpot JVM sources are very well commented.
See src/share/vm/gc_implementation/g1/vm_operations_g1.hpp:
// Concurrent GC stop-the-world operations such as remark and cleanup;
// consider sharing these with CMS's counterparts.
class VM_CGC_Operation: public VM_Operation {
no vm operation denotes a special type of periodic safepoint used for various cleanup activities; see this answer
I detected long safepoints (>10 sec!) on a server in the JVM safepoint.log:
6534.953: no vm operation [ 353 0 4 ] [ 0 0 14179 0 0 ] 0
7241.410: RevokeBias [ 357 0 1 ] [ 0 0 14621 0 0 ] 0
8501.278: BulkRevokeBias [ 356 0 6 ] [ 0 0 13440 0 2 ] 0
9667.681: no vm operation [ 349 0 8 ] [ 0 0 15236 0 0 ] 0
12094.170: G1IncCollectionPause [ 350 0 4 ] [ 0 0 15144 1 24 ] 0
13383.412: no vm operation [ 348 0 4 ] [ 0 0 15783 0 0 ] 0
13444.109: RevokeBias [ 349 0 2 ] [ 0 0 16084 0 0 ] 0
On my laptop I've played with -XX:SafepointTimeoutDelay=2 and it works well, printing the offending threads:
# SafepointSynchronize::begin: Timeout detected:
...
# SafepointSynchronize::begin: (End of list)
<writer thread='11267'/>
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
567.766: BulkRevokeBias [ 78 1 2 ] [ 0 6 6 0 0 ] 0
So I've added two options to the server:
-XX:+SafepointTimeout -XX:SafepointTimeoutDelay=1000
to see which threads cause the problem, but I don't see any such prints, while I still see long safepoint times.
Why doesn't it take effect on the server?
Here is the actual server config (taken from safepoint.log):
Java HotSpot(TM) 64-Bit Server VM (25.202-b08) for linux-amd64 JRE (1.8.0_202-b08), built on Dec 15 2018 12:40:22 by "java_re" with gcc 7.3.0
...
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=10
-XX:+UnlockDiagnosticVMOptions
-XX:+LogVMOutput
-XX:LogFile=/opt/pprb/card-pro/pci-pprb-eip57/logs/safepoint.log
-XX:+SafepointTimeout
-XX:SafepointTimeoutDelay=1000
...
A safepoint line like "Total time for which application threads were stopped: 18.0049752 seconds, Stopping threads took: 18.0036770 seconds" may or may not be caused by a thread waiting on a lock.
With SafepointTimeoutDelay=1000, if one or more threads still haven't reached the safepoint after 1s, SafepointSynchronize::print_safepoint_timeout in safepoint.cpp is invoked to print the ThreadSafepointState of the offending threads.
But if all threads did reach the safepoint and something else held it for 18s, that method is never called and nothing is logged for it.
On JDK 9+ you can set safepoint logging to trace level to see the state of all threads in the GC log.
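For example (JDK 9+ unified logging; the file path is just an illustration), something along these lines should give more per-thread safepoint detail:

-Xlog:safepoint*=trace:file=safepoint.log:time,level,tags

On JDK 11+, as far as I know, the old -XX:+PrintSafepointStatistics output is available via the safepoint+stats log tags instead.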
Running Ubuntu 12.04.3 LTS with 32 cores and 244GB of RAM (the large Amazon EC2 memory-optimized instance) and Java 1.7u25.
My java process is running with -Xmx226g
I'm trying to create a really large local cache using CQEngine and so far it's blazingly fast with 30,000,000 records. Of course I will add an eviction policy so garbage collection can clean up evicted objects, but I'm really trying to push the limits here :)
Looking at jvisualvm, the total heap is only at about 180GB when the process dies, roughly 40GB short of the maximum. I should be able to squeeze out a bit more.
Not that I don't want the kernel to kill a process when it runs out of resources, but I think it's killing it too early, and I want to push memory usage as far as possible.
The ulimit output is as follows...
ubuntu#ip-10-156-243-111:/var/log$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1967992
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 1967992
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
The kern.log output is...
340 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
63999984 pages RAM
1022168 pages reserved
649 pages shared
62830686 pages non-shared
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 505] 0 505 4342 93 9 0 0 upstart-udev-br
[ 507] 0 507 5456 198 2 -17 -1000 udevd
[ 642] 0 642 5402 155 28 -17 -1000 udevd
[ 643] 0 643 5402 155 29 -17 -1000 udevd
[ 739] 0 739 3798 49 10 0 0 upstart-socket-
[ 775] 0 775 1817 124 25 0 0 dhclient3
[ 897] 0 897 12509 152 10 -17 -1000 sshd
[ 949] 101 949 63430 91 9 0 0 rsyslogd
[ 990] 102 990 5985 90 8 0 0 dbus-daemon
[ 1017] 0 1017 3627 40 9 0 0 getty
[ 1024] 0 1024 3627 41 10 0 0 getty
[ 1029] 0 1029 3627 43 6 0 0 getty
[ 1030] 0 1030 3627 41 3 0 0 getty
[ 1032] 0 1032 3627 41 1 0 0 getty
[ 1035] 0 1035 1083 34 1 0 0 acpid
[ 1036] 0 1036 4779 49 5 0 0 cron
[ 1037] 0 1037 4228 40 8 0 0 atd
[ 1045] 0 1045 3996 57 3 0 0 irqbalance
[ 1084] 0 1084 3627 43 2 0 0 getty
[ 1085] 0 1085 3189 39 11 0 0 getty
[ 1087] 103 1087 46916 300 0 0 0 whoopsie
[ 1159] 0 1159 20490 215 0 0 0 sshd
[ 1162] 0 1162 1063575 263 15 0 0 console-kit-dae
[ 1229] 0 1229 46648 153 4 0 0 polkitd
[ 1318] 1000 1318 20490 211 10 0 0 sshd
[ 1319] 1000 1319 6240 1448 1 0 0 bash
[ 1816] 1000 1816 70102543 62010032 4 0 0 java
[ 1947] 0 1947 20490 214 6 0 0 sshd
[ 2035] 1000 2035 20490 210 0 0 0 sshd
[ 2036] 1000 2036 6238 1444 13 0 0 bash
[ 2179] 1000 2179 13262 463 2 0 0 vi
Out of memory: Kill process 1816 (java) score 987 or sacrifice child
Killed process 1816 (java) total-vm:280410172kB, anon-rss:248040128kB, file-rss:0kB
The kern.log clearly states that it killed my process because the machine ran out of memory. But like I said, I think I can squeeze out a bit more. Are there any settings I need to change to allow me to use the 226GB allocated to Java?
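A rough check of the kern.log figures above (our own arithmetic, assuming 4 KiB pages), sketched out:

// Rough check of the kern.log figures above (4 KiB pages; our arithmetic only).
public class OomKillArithmetic {
    public static void main(String[] args) {
        long pageKib = 4;
        long ramPages = 63_999_984L;       // "63999984 pages RAM"
        long reservedPages = 1_022_168L;   // "1022168 pages reserved"
        long javaRssPages = 62_010_032L;   // rss of pid 1816 (java)

        double usableGib = (ramPages - reservedPages) * pageKib / (1024.0 * 1024.0);
        double javaRssGib = javaRssPages * pageKib / (1024.0 * 1024.0);

        System.out.printf("Usable RAM ~%.1f GiB, java RSS ~%.1f GiB, swap 0%n",
                usableGib, javaRssGib);
        // ~236.5 GiB resident out of ~240 GiB usable, with no swap configured: the
        // 226 GiB heap plus JVM overhead (thread stacks at -Xss8m, metaspace, GC and
        // JIT data structures) leaves very little for the rest of the system, which
        // is consistent with the OOM killer picking the java process.
    }
}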