RocksDB out of memory - Java

I'm trying to find out why my Kafka Streams application runs out of memory.
I have already found out that RocksDB is consuming lots of native memory, and I tried to restrict it with the following configuration:
# put index and filter blocks in the block cache to avoid letting them grow unbounded (https://github.com/facebook/rocksdb/wiki/Block-Cache#caching-index-and-filter-blocks)
cache_index_and_filter_blocks = true
# avoid evicting the L0 filter and index blocks from the cache to reduce the performance impact of putting them in the block cache (https://github.com/facebook/rocksdb/wiki/Block-Cache#caching-index-and-filter-blocks)
pinL0FilterAndIndexBlocksInCache = true
# blockCacheSize should be 1/3 of the total available memory (https://github.com/facebook/rocksdb/wiki/Setup-Options-and-Basic-Tuning#block-cache-size)
blockCacheSize = 1350 * 1024 * 1024
# use a larger blockSize to reduce the index block size (https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#difference-of-spinning-disk)
blockSize = 256 * 1024
Still, the memory usage seems to grow unbounded, and my container eventually gets OOMKilled.
I used jemalloc to profile the memory usage (as described here), and the result clearly shows that RocksDB is responsible, but I have no clue how to restrict its memory usage any further.
I don't know if it is helpful, but for completeness, here are statistics gathered from a running RocksDB instance:
I'd be grateful for any hints.

I found out what was causing this.
I thought that my Kafka Streams application would have only one RocksDB instance.
But there is one instance per stream partition. So this configuration:
blockCacheSize = 1350 * 1024 * 1024
does not necessarily mean that the RocksDB memory is restricted to 1350 MB. If the application has, e.g., 8 stream partitions assigned, it also has 8 block caches and can thus take up to 1350 * 8 MB = ~11 GB of memory.
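For anyone hitting the same issue: one way to cap the total is to share a single block cache across all RocksDB instances via a custom RocksDBConfigSetter. The following is only a minimal sketch; the exact class and option names depend on the Kafka Streams / RocksDB versions you are on, and the 1350 MB value is just the figure from above:

import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;

public class BoundedBlockCacheConfigSetter implements RocksDBConfigSetter {

    // one cache shared by every RocksDB instance in this JVM, so the block cache
    // is bounded to ~1350 MB in total instead of 1350 MB per partition
    private static final LRUCache SHARED_CACHE = new LRUCache(1350L * 1024 * 1024);

    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        final BlockBasedTableConfig tableConfig = new BlockBasedTableConfig();
        tableConfig.setBlockCache(SHARED_CACHE);
        tableConfig.setCacheIndexAndFilterBlocks(true);
        tableConfig.setPinL0FilterAndIndexBlocksInCache(true);
        tableConfig.setBlockSize(256 * 1024L);
        options.setTableFormatConfig(tableConfig);
    }
}

The setter is registered via the rocksdb.config.setter property (StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG). Note that this only caps the block caches; memtables and other per-instance allocations still add up per partition.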

Are you seeing the memory usage grow quickly or over a longer period of time?
We have found and fixed a few RocksDB resource leaks that would cause memory leaks:
BloomFilters can leak (https://issues.apache.org/jira/browse/KAFKA-8323). This was fixed in 2.2.1 and is pending for 2.3.0.
Custom RocksDB configs are doomed to create leaks (https://issues.apache.org/jira/browse/KAFKA-8324). This will be fixed in 2.3.0.
There are some indications that there may be others (https://issues.apache.org/jira/browse/KAFKA-8367), either in our usage of RocksDB or in RocksDB itself.
Oh, one other idea is that if you're using iterators from the state stores, either in your processors or in Interactive Query, you have to close them.
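For example (a sketch; the store name, the types, and the streams instance are made up), with Interactive Queries the iterator is best closed via try-with-resources:

final ReadOnlyKeyValueStore<String, Long> store =
        streams.store("counts-store", QueryableStoreTypes.keyValueStore());

// KeyValueIterator is Closeable; try-with-resources releases the underlying
// RocksDB resources even if the loop throws
try (final KeyValueIterator<String, Long> iter = store.all()) {
    while (iter.hasNext()) {
        final KeyValue<String, Long> entry = iter.next();
        // ... use entry.key / entry.value ...
    }
}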
Beyond looking for leaks, I'm afraid I don't have too much insight into diagnosing RocksDB's memory usage. You could also restrict the Memtable size, but I don't think we set it very large by default anyway.
Hope this helps,
-John

Related

What is overhead of Java Native Memory Tracking in "summary" mode?

I'm wondering what the real/typical overhead is when NMT is enabled via -XX:NativeMemoryTracking=summary (the full set of options I'm after is -XX:+UnlockDiagnosticVMOptions -XX:NativeMemoryTracking=summary -XX:+PrintNMTStatistics).
I could not find much information anywhere, whether on SO, in blog posts, or in the official docs.
The docs say:
Note: Enabling NMT causes a 5%-10% performance overhead.
But they do not say which mode is expected to have this performance overhead (both summary and detail?), nor what this overhead really is (CPU, memory, ...).
In the Native Memory Tracking guide they claim:
Enabling NMT will result in a 5-10 percent JVM performance drop, and memory usage for NMT adds 2 machine words to all malloc memory as a malloc header. NMT memory usage is also tracked by NMT.
But again, is this true for both summary and detail mode?
What I'm after is basically whether it's safe to add -XX:NativeMemoryTracking=summary permanently for a production app (similar to continuous JFR recording) and what the potential costs are.
So far, when testing this on our app, I didn't spot a difference, but it's difficult to draw firm conclusions from a limited test.
Is there an authoritative source of information containing more details about this performance overhead?
Does somebody have experience with enabling this permanently for production apps?
Disclaimer: I base my answer on JDK 18. Most of what I write should be valid for older releases. When in doubt, you need to measure for yourself.
Background:
NMT tracks hotspot VM memory usage and memory usage done via Direct Byte Buffers. Basically, it hooks into malloc/free and mmap/munmap calls and does accounting.
It does not track other JDK memory usage (outside hotspot VM) or usage by third-party libraries. That matters here, since it makes NMT behavior somewhat predictable. Hotspot tries to avoid fine-grained allocations via malloc. Instead, it relies on custom memory managers like Arenas, code heap or Metaspace.
Therefore, for most applications, mallocs/mmaps from hotspot are not that "hot" and the number of mallocs is not that large.
I'll focus on malloc/free for the rest of this write-up, since they far outnumber mmaps/munmaps:
Memory cost:
1. (detail+summary): 2 words per malloc()
2. (detail only): the malloc site table
3. (detail+summary): mapping lists for VM mappings and thread stacks
Here, (1) completely dwarfs (2) and (3). Therefore the memory overhead between summary and detail mode is not significant.
Note that even (1) may not matter that much, since the underlying libc allocator already dictates a minimal allocation size that may be larger than (pure allocation size + 16 byte malloc header). So, how much of the NMT memory overhead actually translates into RSS increase needs to be measured.
How much total memory overhead this means cannot be answered in general, since JVM memory costs are usually dominated by the heap size. So, comparing RSS with the NMT cost is almost meaningless. But just to give an example: for spring petclinic with a 1 GB pre-touched heap, the NMT memory overhead is about 0.5%.
In my experience, NMT memory overhead only matters in pathological situations or in corner cases that cause the JVM to do tons of fine-grained allocations, e.g. if one does an insane amount of class loading. But often these are exactly the cases where you want NMT enabled, to see what's going on.
Performance cost:
NMT does need some synchronization. It atomically increases counters on each malloc/free in summary mode.
In detail mode, it does a lot more:
capture call stack on malloc site
look for call stack in hash map
increase hash map entry counters
This requires more cycles. The hash map is lock-free, but it is still modified with atomic operations. It looks expensive, especially if hotspot does many mallocs from different threads. How bad is it really?
Worst case example
Micro-benchmark: 64 million malloc allocations (via Unsafe.allocateMemory()) done by 100 concurrent threads, on a 24-core machine:
NMT off: 6 secs
NMT summary: 34 secs
NMT detail: 46 secs
That looks insane. However, it may not matter in practice, since this is not a real-life example.
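For reference, the benchmark was essentially of this shape (a rough reconstruction, not the exact code; the allocation size and the way work is split across threads are arbitrary here):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class NmtMallocBench {
    public static void main(String[] args) throws Exception {
        // obtain sun.misc.Unsafe reflectively; Unsafe.allocateMemory ends up in the
        // JVM's os::malloc, which is exactly what NMT instruments
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        int threads = 100;
        long allocsPerThread = 64_000_000L / threads;

        long start = System.nanoTime();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (long j = 0; j < allocsPerThread; j++) {
                    long addr = unsafe.allocateMemory(64); // tracked malloc
                    unsafe.freeMemory(addr);               // tracked free
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.printf("took %.1f s%n", (System.nanoTime() - start) / 1e9);
    }
}

Run it three times, with NMT off, with -XX:NativeMemoryTracking=summary, and with -XX:NativeMemoryTracking=detail, to see the difference.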
Spring petclinic bootup, average of ten runs:
NMT off: 3.79 secs
NMT summary: 3.79 secs (+0%)
NMT detail: 3.91 secs (+3%)
So, here, not so bad. The cost of summary mode actually disappears in the test noise.
Renaissance, philosophers benchmark
This is an interesting example, since it does a lot of synchronization, which leads to many object monitors being inflated, and those are malloc'ed:
Average benchmark scores:
NMT off: 4697
NMT summary: 4599 (-2%)
NMT detail: 4190 (-11%)
Somewhat in the middle between the two other examples.
Conclusion
There is no clear answer.
Both memory- and performance cost depend on how many allocations the JVM does.
This number is small for normal well-behaved applications, but it may be large in pathological situations (e.g. JVM bugs) as well as in some corner case scenarios caused by user programs, e.g., lots of class loading or synchronization. To be really sure, you need to measure yourself.
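If you do measure, note that NMT reports its own footprint (as quoted above, NMT memory usage is also tracked by NMT). With the JVM started with -XX:NativeMemoryTracking=summary you can read the counters at runtime via jcmd, for example:

jcmd <pid> VM.native_memory summary        # current totals per category, including NMT itself
jcmd <pid> VM.native_memory baseline       # remember the current state
jcmd <pid> VM.native_memory summary.diff   # what changed since the baseline

<pid> is the JVM's process id; the diff form is handy for watching growth over time.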
The overhead of Native Memory Tracking obviously depends on how often the application allocates native memory. Usually, this is not something too frequent in a Java application, but cases may differ. Since you've already tried and didn't notice performance difference, your application is apparently not an exception.
In the summary mode, Native Memory Tracking roughly does the following things:
increases every malloc request in the JVM by 2 machine words (16 bytes);
records the allocation size and flags in these 2 words;
atomically increments (or decrements on free) the counter corresponding to the given memory type;
besides malloc and free, it also handles changes in virtual memory reservation and allocations of new arenas, but these are even less frequent than malloc/free calls.
So, to me, the overhead is quite small; 5-10% is definitely a large overestimation. Those numbers would make sense for detail mode, which collects and stores stack traces (an expensive operation), but summary mode doesn't do that.
When many threads concurrently allocate/free native memory, the update of an atomic counter could become a bottleneck, but again, that's more like an extreme case. In short, if you measured a real application and didn't notice any degradation, you're likely safe to enable NMT summary in production.

How to increase Flink Memory size

I'm trying to run a job on a Flink task manager and I'm getting this exception:
Initializing the input processing failed: Too little memory provided to sorter to perform task. Required are at least 12 pages. Current page size is 32768 bytes.
I've set the heap size in both the task and job managers via flink-conf.yml; is there anything else I should change to increase the memory?
taskmanager.heap.size: 4096m
taskmanager.memory.size: 4096m
jobmanager.heap.size: 2048m
The error message indicates that the sorter does not get enough memory pages. The reason is that the available managed memory is not sufficient. There are multiple ways to solve this problem:
Increase the available memory for a TaskManager via taskmanager.heap.size
Increase the fraction of the managed memory which is taken from taskmanager.heap.size via taskmanager.memory.fraction (by default it is 0.7)
Decrease the page size via taskmanager.memory.segment-size
Decrease the number of slots on a TaskManager since a reduced parallelism per TM will decrease the number of memory consumers on the TM (operators get a bigger share of the available memory)
If you are running exclusively batch loads, then you should also activate taskmanager.memory.preallocate: true which will enable the memory allocation at start-up time. This is usually faster because it reduces the garbage collection pressure.
Another comment concerning taskmanager.memory.size: this value always needs to be smaller than or equal to taskmanager.heap.size, since it specifies how much memory from the overall heap space will be used for managed memory. If this parameter is not specified, then Flink will take a fraction of the available heap memory for the managed memory (specified via taskmanager.memory.fraction).
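Putting these options together, the relevant part of flink-conf.yaml could look roughly like this (the concrete values are only illustrative, and the keys refer to the pre-1.10 memory model used above):

taskmanager.heap.size: 4096m
# fraction of the heap used as managed memory (default 0.7)
taskmanager.memory.fraction: 0.8
# smaller pages give the sorter more pages to work with (default 32kb)
taskmanager.memory.segment-size: 16kb
# fewer slots per TaskManager means fewer memory consumers per TM
taskmanager.numberOfTaskSlots: 2
# for pure batch workloads: allocate managed memory eagerly at start-up
taskmanager.memory.preallocate: true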

Sensible Xmx/GC defaults for a microservice with a small heap

At my company we are trying an approach with JVM based microservices. They are designed to be scaled horizontally and so we run multiple instances of each using rather small containers (up to 2G heap, usually 1-1.5G). The JVM we use is 1.8.0_40-b25.
Each of such instances typically handles up to 100 RPS with max memory allocation rate around 250 MB/s.
The question is: what kind of GC would be a safe, sensible default to start off with? So far we are using CMS with Xms = Xmx (to avoid pauses during heap resizing) and Xms = Xmx = 1.5G. Results are decent: we hardly ever see any major GC performed.
I know that G1 could give me smaller pauses (at the cost of total throughput) but AFAIK it requires a bit more "breathing" space and at least 3-4G heap to perform properly.
Any hints (besides going for Azul's Zing :D)?
Hint #1: Do experiments!
Assuming that your microservice is deployed on at least two nodes, run one on CMS and another on G1, and see what the response times are.
It's not very likely, but what if you find that performance with G1 is so good that you only need half of the original cluster size?
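For the experiment, the flags could look roughly like this (1.5G heap as in the question; the G1 pause target and the jar name are only placeholders):

# node A: the current setup, CMS with a fixed-size heap
java -Xms1536m -Xmx1536m -XX:+UseConcMarkSweepGC -jar service.jar

# node B: the same heap, G1 with an explicit pause-time goal
java -Xms1536m -Xmx1536m -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -jar service.jar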
Side notes:
re: "250Mb/s" -> if all of this is stack memory (alternatively, if it's young gen) then G1 would provide little benefit since collection form these areas is free.
re: "100 RPS" -> in many cases on our production we found that reducing concurrent requests in system (either via proxy config, or at application container level) improves throughput. Given small heap it's very likely that you have small cpu number as well (2 to 4).
Additionally there are official Oracle Hints on tuning for a small memory footprint. It might not reflect latest config available on 1.8_40, but it's good read anyway.
Measure how much memory is retained after a full GC. Add to this the amount of memory allocated per second, multiplied by 2 to 10 depending on how often you would like to have a minor GC, e.g. every 2 seconds or every 10 seconds.
E.g. say you have up to 500 MB retained after a full GC and GCing every couple of seconds is fine, you can have 500 MB + 2 * 250 MB, or a heap of around 1 GB.
The number of RPS is not important.

Calculating Infinispan cache memory size

I need to get a rough estimate of the memory usage of my Infinispan cache (implemented using version 5.3.0), for learning purposes.
Since there is no easy way to do this, I came up with the following procedure:
Add a cache listener that listens for cache put/remove events and logs the size of the inserted entry using the jamm library, which uses java.lang.instrument.Instrumentation.getObjectSize. But I'm a little skeptical about whether it returns the right memory usage for the cache. Am I doing this measurement correctly? Am I missing something here, or do I need to consider more factors?
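To make the idea concrete, the listener I have in mind looks roughly like this (a sketch; jamm has to be registered as a -javaagent for exact measurements, and the event type that carries the value can differ between Infinispan versions):

import org.github.jamm.MemoryMeter;
import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachelistener.annotation.CacheEntryModified;
import org.infinispan.notifications.cachelistener.event.CacheEntryModifiedEvent;

@Listener
public class EntrySizeLogger {

    private final MemoryMeter meter = new MemoryMeter();

    @CacheEntryModified
    public void onEntryModified(CacheEntryModifiedEvent<Object, Object> event) {
        if (!event.isPre()) { // only measure once the new value is actually stored
            long bytes = meter.measureDeep(event.getKey()) + meter.measureDeep(event.getValue());
            System.out.printf("entry %s: ~%d bytes (key + value, deep)%n", event.getKey(), bytes);
        }
    }
}

It would be registered with cache.addListener(new EntrySizeLogger()).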
If you don't need real-time information, the easiest way to find the actual memory usage is to take a heap dump and look at the cache objects' retained sizes (e.g. with Eclipse MAT).
Your method is going to ignore the overhead of Infinispan's internal structures. Normally, the per-entry overhead should be somewhere around 150 bytes, but sometimes it can be quite big - e.g. when you enable eviction and Infinispan allocates structures based on the configured size (https://issues.jboss.org/browse/ISPN-4126).

Setting a smaller JVM heap size within a JNI client application

I'm attempting to debug a problem with pl/java, a procedural language for PostgreSQL. I'm running this stack on a Linux server.
Essentially, each Postgres backend (connection process) must start its own JVM, and does so using the JNI. This is generally a major limitation of pl/java, but it has one particularly nasty manifestation.
If native memory runs out (I realise that this may not actually be due to malloc() returning NULL, but the effect is about the same), this failure is handled rather poorly. It results in an OutOfMemoryError due to "native memory exhaustion", a segfault of the Postgres backend originating from within libjvm.so, and a javacore file that says something like:
0SECTION TITLE subcomponent dump routine
NULL ===============================
1TISIGINFO Dump Event "systhrow" (00040000) Detail "java/lang/OutOfMemoryError" "Failed to create a thread: retVal -1073741830, errno 11" received
1TIDATETIME Date: 2012/09/13 at 16:36:01
1TIFILENAME Javacore filename: /var/lib/PostgreSQL/9.1/data/javacore.20120913.104611.24742.0002.txt
***SNIP***
Now, there are reasonably well-defined ways of ameliorating these types of problems with Java, described here:
http://www.ibm.com/developerworks/java/library/j-nativememory-linux/
I think that it would be particularly effective if I could set the maximum heap size to a value that is far lower than the default. Ordinarily, it is possible to do something along these lines:
The heap's size is controlled from the Java command line using the -Xmx and -Xms options (mx is the maximum size of the heap, ms is the initial size). Although the logical heap (the area of memory that is actively used) can grow and shrink according to the number of objects on the heap and the amount of time spent in GC, the amount of native memory used remains constant and is dictated by the -Xmx value: the maximum heap size. Most GC algorithms rely on the heap being allocated as a contiguous slab of memory, so it's impossible to allocate more native memory when the heap needs to expand. All heap memory must be reserved up front.
However, it is not apparent how I can follow these steps such that pl/java's JNI initialisation initialises a JVM with a smaller heap; I can't very well pass these command line arguments to Postgres. So, my question is, how can I set the maximum heap size or otherwise control these problems in this context specifically? This appears to be a general problem with pl/java, so I expect to be able to share whatever solution I eventually arrive at with the Postgres community.
Please note that I am not experienced with JVM internals, and am not generally familiar with Java.
Thanks
According to slide 19 in this presentation, postgresql.conf can have the parameter pljava.vmoptions, where you can pass arguments to the JVM.
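So something along these lines in postgresql.conf should do it (the values are only illustrative and need tuning; remember that this heap is allocated per backend):

pljava.vmoptions = '-Xmx64m -Xms16m'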
