I have a loop with 701 iterations of similar complex calculations. I measured the execution time of each iteration for three runs. As you can see in the chart I'm getting strange peaks. Is there any common aproach which is able to explain these peaks without analyzing the code inside the loop.
Execution Time
Is it possible that the gc is starting at these points and slow down the other parts?
It depends.
If you don't want to analyze the code inside the loop you have to analyze at least the usage of memory during excution. For example if algorithm does not create many garbage and you have configured enough heap and you have chosen the right gc algorithm then the gc does not require any "stop the world".
As first thing you have to activate gc logging (see here: http://www.oracle.com/technetwork/articles/javase/gcportal-136937.html) and check for gc peaks in correspondance to your time peaks.
If you want to analyze the gc log you can use this tool http://www.tagtraum.com/gcviewer.html.
Then follow the link posted from Turing85 about micro-benchmarks. It is more complete.
Related
There are some articles available online where they mention some of the cons of using Stream-s over old loop-s:
https://blog.jooq.org/2015/12/08/3-reasons-why-you-shouldnt-replace-your-for-loops-by-stream-foreach/
https://jaxenter.com/java-performance-tutorial-how-fast-are-the-java-8-streams-118830.html
But is there any impact from the GC perspective? As I assume (is it correct?) that every stream call creates some short-lived objects underneath. If the particular code fragment which uses streams is called frequently by the underlying system could it cause eventually some performance issue from the GC perspective or put extra pressure on GC? Or the impact is minimal and could be ignored most of the time?
Are there any articles covering this more in detail?
To be fair, it's very complicated to give an answer when Holger already linked the main idea via his answer; still I will try to.
Extra pressure on GC - may be. Extra time for a GC cycle to execute - most probably not. Ignorable? I'd say totally. In the end what you care from a GC - that it takes little time to reclaim lots of space, preferably with super tiny stop-the-world events.
Let's talk about the potential overhead in the GC main two phases : mark and evacuation/realocation (Shenandoah/ZGC). First mark phase, where GC finds out what is garbage (by actually identifying what is alive).
If objects that were created by the Stream internals are not reachable, they will never be scanned (zero overhead here) and if they are reachable, scanning them will be extremely fast. The other side of the story is: when you create an Object and GC might touch it while it's running in the mark phase, the slow path of a LoadBarrier (in case of Shenandoah) will be active. This will add some tens of ns I assume to the total time of that particular phase of the GC as well as some space in the SATB queues. Aleksey Shipilev in one talk said that he tried to measure the overhead from executing a single barrier and could not, so he measured 3 and the time was in the region of tens of ns. I don't know the exact details of ZGC, but a LoadBarrier is there in place too.
The main point is that this mark phase is done in a concurrent fashion, while the application is running, so you application will still run perfectly fine. And even if some GC code will be triggered to do something specific work (Load Barrier), it will be extremely fast and completely transparent to you.
The second phase is "compactation", or making space for future allocations. What a GC does is move live objects from regions with the most garbage (Shenandoah for sure) to regions that are empty. But only live objects. So if a certain region has 100 objects and only 1 is alive, only 1 will be moved, then that entire region is going to be marked as free. So potentially if the Stream implementation generated only garbage (i.e.: not currently alive), it is "free lunch" for GC, it will not even know it existed.
The better picture here is that this phase is still done concurrently. To keep the "concurrency" active, you need to know how much was allocated from start to end of a GC cycle. This amount is the minimum "extra" space you need to have on top of the java process in order for a GC to be happy.
So overall, you are looking at a super tiny impact; if any at all.
I am performing analysis of different sort algorithms. Currently I am analysing the Insertion Sort and Quick Sort. And as part of the analysis, I need to measure the memory consumption.
I am using Visual VM for profiling. However when I execute the Insertion Sort for a random data set of, let's say 70,000, I get different range of Heap Memory usage. For example, in the first run the heap memory consumption was 75 kbytes and then in the next round it drops to 35 kbytes. And if I execute it few more times then this value fluctuates randomly.
Is this normal or am I missing something here ? I have plot a graph of data size versus the memory consumption and with this fluctuation I won't be able to draw a chart.
java version "1.8.0_65"
This is Java's garbage collector at work, it kicks in at its own pace and does its job. Perhaps, it would be best for you to measure the amount of memory used after explicitly calling System.gc(), so that you're not taking notes of the garbage.
EDIT:
System.gc() should be called after you perform your tests, to explicitly request that garbage collector kicks in. While it is true that System.gc() is treated only as a request and it is not mathematically 100% sure that JVM will respect your request, it is most probably safe for your analysis, especially if you perform several runs of it.
With regards to measuring memory usage, it is quite tricky, especially for low values. Please see this answer which contains some nice details:
You may find JMH useful for running benchmarks while isolating side effects from the JVM.
Read through the code samples to understand how to use it.
I implemented a heuristic in Java that solves an optimization problem for a given input. The heuristic can run for thousands of iterations and create lots of objects of varying complexity.
In order to test it, I have thousands of test inputs. My main method takes all inputs and sequentially starts the heuristic for each input in a loop. The results are stored in a separate file for each input.
When I run the program, it always stops after producing 218 or 219 and throws an "OutOfMemoryError". Once it says Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded and once Exception in thread "main" java.lang.OutOfMemoryError: Java heap space.
My guess is, the program creates too many objects over time until it runs out of memory when computing the 218th or 219th input. Every instance is computed in an independent run. Hence, it should solve the problem to clear the memory and getting rid of all created objects after the result for an input is stored and before the next input is parsed. Is that correct? I heard using System.gc() is bad practice, but what else would you recommend in my case?
Edit:
To specify what I want: Instead of pressing "start" for each input, I implemented the loop to do that for me. However, it seems like it doesn't behave the same way and it keeps old objects from previous runs. Can I change my java code in such a way that it behaves similar to starting the program anew for each input? Or do I have to use a shell skript that starts my heuristic for each input separatly to make it work?
I have never used any JVM parameters and it seems to me like they don't really tackle the problem.
Resolved: There was in fact a memory leak that I discovered and fixed. No System.gc() needed. Thanks for helping anyways!
Yes leave GC handling with JVM. You need to follow some of the steps mentioned below in order:
Increase your heap size using Xmx... parameter
Set proper GC algorithm and parameters. If you have already have GC parameters try to tune the parameters
Try using -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=<path for heap dump> option when you start your JVM, so you get heap dump when your jvm runs OOM. By using the heap dump, you could use profilers like jprofiler/yourkit/jvisualvm etc to investigate memory leaks and then rectify the same.
First, when you start a JVM to run your tests, disable the GC overhead limit:
-XX:-UseGCOverheadLimit
I recommend this because you already know you're purposefully stressing the garbage collector, and you don't want it to warn you about GC overhead.
Second, take a look at how you can break up your tests better, in such a way that you're allowing objects from the previous test to be garbage collected. Don't keep active pointers to large structures of objects after each test completes.
Third, if you still need more memory due to exceeding Java heap space, use:
-Xms<size> set initial Java heap size
-Xmx<size> set maximum Java heap size
If you know you'll be using the memory anyhow, it works best to set both of these to the same value, which prevents thrashing during execution.
Don't bother explicitly calling System.gc(), it's ultimately pointless because garbage collection is always going to happen when it's necessary.
Fourth, another JVM setting which could be useful in your circumstances:
-XX:NewRatio=<n> Ratio of old/new generation sizes. The default value is 2.
It's normally not recommended to set this lower than 2 (2/3 old, 1/3 new), but in your situation I might suggest you try setting this to 1 (1/2 old, 1/2 new).
See also GC overhead limit exceeded and check out Java HotSpot VM Options.
Give this a try:
http://javaandroidandrest.blogspot.de/2012/06/wait-for-jvm-garbage-collector.html
From the site:
Using functions like System.gc(); or Runtime.getRuntime().gc(); only suggest to the JVM that you want to run the garbage collector.
I found a way on the internet not to force the grabage collector but to wait until the garbage collector runs.
I have a Java application and one of the methods is performance-critical.
I created a loop to call this method 10 times and I am checking for performance issues by using the profiler for every iteration. It turned out that the execution time decreases by iterations. Thus, the 10th iteration has a smaller execution time than then 9th iteration.
Any idea why such case is happening?
Could it be due to the loop overheads?
You are warming the CPU caches, and the JVM thus the performance changes.
Profillers put the JVM into an unusual mode, and depending on what profiler approach you are using then it may only be sampling at a regular interval.
I find that profillers are good for giving you relative measurements and to improve your understanding of the code; but always take their reading with a pinch of salt.
Do not trust just a single measurement.
Outside of using profillers, microbenchmarking is a good way to go. Although it is a very tricky subject.
Note that Hotspot tends not to kick in and optimise the byte codes until the target code has been called 10,000 or more times.
http://java.dzone.com/articles/microbenchmarking-java, and How do I write a correct micro-benchmark in Java? may help to get you started. There is also a lot of good advice on the Mechanical Sympathy Forum.
A good microbenchmarking framework is here http://openjdk.java.net/projects/code-tools/jmh/, it helps keep GC, and other JVM stop-the-world events out of the timings. As well as some guidence on how to prevent Hotspot from optimising out the very code that you are trying to measure.
I have a Java program that operates on a (large) graph. Thus, it uses a significant amount of heap space (~50GB, which is about 25% of the physical memory on the host machine). At one point, the program (repeatedly) picks one node from the graph and does some computation with it. For some nodes, this computation takes much longer than anticipated (30-60 minutes, instead of an expected few seconds). In order to profile these opertations to find out what takes so much time, I have created a test program that creates only a very small part of the large graph and then runs the same operation on one of the nodes that took very long to compute in the original program. Thus, the test program obviously only uses very little heap space, compared to the original program.
It turns out that an operation that took 48 minutes in the original program can be done in 9 seconds in the test program. This really confuses me. The first thought might be that the larger program spends a lot of time on garbage collection. So I turned on the verbose mode of the VM's garbage collector. According to that, no full garbage collections are performed during the 48 minutes, and only about 20 collections in the young generation, which each take less than 1 second.
So my questions is what else could there be that explains such a huge difference in timing? I don't know much about how Java internally organizes the heap. Is there something that takes significantly longer for a large heap with a large number of live objects? Could it be that object allocation takes much longer in such a setting, because it takes longer to find an adequate place in the heap? Or does the VM do any internal reorganization of the heap that might take a lot of time (besides garbage collection, obviously).
I am using Oracle JDK 1.7, if that's of any importance.
While bigger memory might mean bigger problems, I'd say there's nothing (except the GC which you've excluded) what could extend 9 seconds to 48 minutes (factor 320).
A big heap makes seemingly worse spatial locality possible, but I don't think it matters. I disagree with Tim's answer w.r.t. "having to leave the cache for everything".
There's also the TLB which a cache for the virtual address translation, which could cause some problems with very large memory. But again, not factor 320.
I don't think there's anything in the JVM which could cause such problems.
The only reason I can imagine is that you have some swap space which gets used - despite the fact that you have enough physical memory. Even slight swapping can be the cause for a huge slowdown. Make sure it's off (and possibly check swappiness).
Even when things are in memory you have multiple levels of caching of data on modern CPUs. Every time you leave the cache to fetch data the slower that will go. Having 50GB of ram could well mean that it is having to leave the cache for everything.
The symptoms and differences you describe are just massive though and I don't see something as simple as cache coherency making that much difference.
The best advice I can five you is to try running a profiler against it both when it's running slow and when it's running fast and compare the difference.
You need solid numbers and timings. "In this environment doing X took Y time". From that you can start narrowing things down.