I have a Java server running on a beefy machine (many cores, 64 GB RAM, etc.) and I submit some workloads to it in a test scenario: I submit one workload, exactly the same, 10 times in a row in each test. On one particular workload, I observe that in the middle of the 10 runs it takes much longer to complete (e.g. runs 1-2: 10 sec, 3: 12 sec, 4: 25 sec, 5: 10 sec, etc.). In a YourKit profile with wall time from the server, I see no increase in I/O, GC, network, or much of anything during the slowdown; no particular methods increase their proportion of time spent - every method is just slower, roughly in proportion. What I do see is that average CPU usage decreases (presumably because the same work is spread over more time), but kernel CPU usage increases - from 0-2% on the faster runs to 9-12% on the slow one. Kernel usage crawls slowly up from the end of the previous workload (which is slightly slower), stays high, then drops between the slow workload and the next one (there's a pause). I cannot map this kernel CPU to any calls in YourKit.
Does anyone have an idea what this could be? Or can you suggest further avenues of investigation that might show where the kernel time goes?
I recently came across this question for an assessment:
ExecutorService threadpool = Executors.newFixedThreadPool(N);
for (Runnable task : tasks) {
    threadpool.submit(task);
}
Each task spends 25% of its time on computation and 75% doing I/O. Assuming we are working on a quad-core machine (no hyper-threading), what should the size N of the thread pool be to achieve maximum performance without wasting threads? (Assume we have infinite I/O capacity.)
I guessed 16: since the machine has infinite I/O capacity, we can concentrate fully on the CPUs. Each task uses one quarter of a CPU while running, so four tasks will saturate one CPU core, and that makes N = 16 on a quad-core machine.
Update: the options for this question were 2, 4, 5, 6, 7, 8, 12, and 16.
You are correct that you should be thinking about saturating your cores. The best answer will be more than 16, though. If you have only 16 threads, then the CPU demands aren't going to align perfectly so that all your cores are in use all the time.
So the best answer is > 16, but also small enough not to significantly increase individual task completion time, impose significant thread-switching costs, or waste a whole lot of memory.
If you learned this in class, then your prof probably gave you a multiplier to use as a "rule of thumb". They would be expecting you to remember it and apply it here.
I usually aim for average_demand = 2 * num_cores, so here I would pick 32 threads (each thread demands 0.25 of a core, and 32 * 0.25 = 8 = 2 * 4 cores). This works well in most cases: when the average CPU demand is twice the number of cores, core utilization will be pretty close to 100%.
Also, in this case, the CPU portion of each task only gets 1/2 core on average, so it takes twice as long... but it's only 25% of the work so the task completion time is only 13% more than optimal.
The 2-times default that I use is almost always higher than the optimal number, but it's also almost always low enough not to impose significant extra overhead. If you know that your tasks are very heavily CPU-bound, then you can confidently reduce this number.
If you really want to find the optimal value, you can measure it, but when you're in the right range it won't make a lot of difference.
--
P.S. NOTE: The 'average_demand' I used above is the expected CPU demand, measured in cores, at any given time with N threads running.
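To make the arithmetic concrete, here is a minimal sketch of that rule of thumb applied to this question; the class and variable names are just for illustration, not from the original post:

// Sketch: aim for an average CPU demand of about 2x the core count.
public class PoolSizeSketch {
    public static void main(String[] args) {
        int cores = 4;                      // quad-core machine, no hyper-threading
        double cpuFraction = 0.25;          // each task spends 25% of its time on CPU
        double targetDemand = 2.0 * cores;  // rule of thumb: ~2x cores of average demand

        int poolSize = (int) Math.ceil(targetDemand / cpuFraction); // 8 / 0.25 = 32
        System.out.println("Suggested pool size: " + poolSize);
    }
}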
Although this question has no strictly right or wrong answer, the subjectively good one would be:
32 Threads
You have to think in terms of probability.
For now let's just consider one CPU core and independent threads:
One thread has a 25% chance to be doing a computation at any given time.
If you have 2 independent threads (independent probability events), the probability of having at least one doing some CPU work is not 50% but 1 - 0.75^2 = 7/16 (43.75%). (If you are not sure about that, you should refresh those probability skills.)
You probably see where this is going. For the P to be 100%, the thread count would have to be infinite. So we have to make an educated guess:
4 threads have a P of ~68%, 8 threads ~90%. Going higher in count would now be really unproductive, so we settle at 8. That is for one core; we have 4 CPU cores, so we multiply by 4 and get the final answer: 32.
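To make those probabilities concrete, here is a small self-contained sketch (names are illustrative) that computes P(at least one of n threads is doing CPU work) = 1 - 0.75^n:

// Sketch of the probability argument: each thread is on the CPU 25% of the time.
public class CpuBusyProbability {
    public static void main(String[] args) {
        double p = 0.25; // chance a single thread is computing at any instant
        for (int n : new int[] {1, 2, 4, 8, 16}) {
            double atLeastOne = 1.0 - Math.pow(1.0 - p, n);
            System.out.printf("%2d threads -> P(core busy) = %.1f%%%n", n, atLeastOne * 100);
        }
    }
}

For 2, 4, and 8 threads this prints roughly 43.8%, 68.4%, and 90.0%, matching the figures above.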
At my company we are trying an approach with JVM-based microservices. They are designed to be scaled horizontally, so we run multiple instances of each using rather small containers (up to 2 GB heap, usually 1-1.5 GB). The JVM we use is 1.8.0_40-b25.
Each such instance typically handles up to 100 RPS with a maximum memory allocation rate around 250 MB/s.
The question is: what kind of GC would be a safe, sensible default to start off with? So far we are using CMS with Xms = Xmx (to avoid pauses during heap resizing) and Xms = Xmx = 1.5 GB. The results are decent - we hardly ever see any major GC performed.
I know that G1 could give me smaller pauses (at the cost of total throughput), but AFAIK it requires a bit more "breathing" space and at least a 3-4 GB heap to perform properly.
Any hints (besides going for Azul's Zing :D) ?
Hint #1: Do experiments!
Assuming that your microservice is deployed on at least two nodes, run one on CMS and another on G1 and see what the response times are.
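One way to set up such an A/B run, assuming the setup described in the question (the jar name is a placeholder, and the flag values are only a starting point, not a recommendation):

Node A (current CMS setup, fixed heap):
java -Xms1536m -Xmx1536m -XX:+UseConcMarkSweepGC -jar service.jar

Node B (same heap, G1 with a pause-time goal):
java -Xms1536m -Xmx1536m -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -jar service.jar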
Not very likely, but what if you find that with G1 the performance is so good that you need only half of the original cluster size?
Side notes:
re: "250Mb/s" -> if all of this is stack memory (alternatively, if it's young gen) then G1 would provide little benefit since collection form these areas is free.
re: "100 RPS" -> in many cases on our production we found that reducing concurrent requests in system (either via proxy config, or at application container level) improves throughput. Given small heap it's very likely that you have small cpu number as well (2 to 4).
Additionally, there are official Oracle hints on tuning for a small memory footprint. They might not reflect the latest configuration available in 1.8.0_40, but they're a good read anyway.
Measure how much memory is retained after a full GC. Add to this the amount of memory allocated per second, multiplied by 2-10 depending on how often you would like a minor GC to happen, e.g. every 2 seconds or every 10 seconds.
E.g. say you have up to 500 MB retained after a full GC and GCing every couple of seconds is fine; then you can have 500 MB + 2 * 250 MB, i.e. a heap of around 1 GB.
The number of RPS is not important.
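As a worked version of that arithmetic, using the 500 MB / 250 MB/s / 2-second numbers from above (purely illustrative values and names):

// Sketch: heap ~= retained-after-full-GC + seconds-between-minor-GCs * allocation rate.
public class HeapSizeSketch {
    public static void main(String[] args) {
        double retainedMb = 500;            // live set after a full GC
        double allocRateMbPerSec = 250;     // from the question
        double secondsBetweenMinorGcs = 2;  // "GCing every couple of seconds is fine"

        double heapMb = retainedMb + secondsBetweenMinorGcs * allocRateMbPerSec;
        System.out.println("Suggested heap: ~" + heapMb + " MB"); // ~1000 MB
    }
}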
I have a Java program that operates on a (large) graph. Thus, it uses a significant amount of heap space (~50 GB, which is about 25% of the physical memory on the host machine). At one point, the program (repeatedly) picks one node from the graph and does some computation with it. For some nodes, this computation takes much longer than anticipated (30-60 minutes, instead of an expected few seconds). In order to profile these operations and find out what takes so much time, I created a test program that builds only a very small part of the large graph and then runs the same operation on one of the nodes that took very long to compute in the original program. Thus, the test program obviously uses very little heap space compared to the original program.
It turns out that an operation that took 48 minutes in the original program can be done in 9 seconds in the test program. This really confuses me. The first thought might be that the larger program spends a lot of time on garbage collection, so I turned on the verbose mode of the VM's garbage collector. According to that, no full garbage collections are performed during the 48 minutes, and there are only about 20 collections in the young generation, each taking less than 1 second.
So my question is: what else could explain such a huge difference in timing? I don't know much about how Java internally organizes the heap. Is there something that takes significantly longer on a large heap with a large number of live objects? Could it be that object allocation takes much longer in such a setting, because it takes longer to find an adequate place in the heap? Or does the VM do any internal reorganization of the heap that might take a lot of time (besides garbage collection, obviously)?
I am using Oracle JDK 1.7, if that's of any importance.
While bigger memory can mean bigger problems, I'd say there's nothing (except the GC, which you've excluded) that could stretch 9 seconds into 48 minutes (a factor of 320).
A big heap can make spatial locality worse, but I don't think it matters here. I disagree with Tim's answer w.r.t. "having to leave the cache for everything".
There's also the TLB, which is a cache for virtual address translation and could cause some problems with very large memory. But again, not a factor of 320.
I don't think there's anything in the JVM which could cause such problems.
The only reason I can imagine is that you have some swap space getting used - despite the fact that you have enough physical memory. Even slight swapping can cause a huge slowdown. Make sure it's off (and possibly check swappiness).
Even when everything is in memory, you have multiple levels of data caching on modern CPUs. Every time you have to leave the cache to fetch data, things get slower. Having 50 GB of RAM could well mean that it is having to leave the cache for everything.
The symptoms and differences you describe are just massive, though, and I don't see something as simple as cache misses making that much difference.
The best advice I can give you is to try running a profiler against it both when it's running slow and when it's running fast, and compare the difference.
You need solid numbers and timings: "In this environment, doing X took Y time." From that you can start narrowing things down.
I have the following jHiccup result.
Obviously there are huge peaks of a few seconds in the graph. My app outputs logs every 100 ms or so, and when I read my logs I never see such huge pauses. I can also check the total time spent in GC from the JVM diagnostics, and it says the following:
Time: 2013-03-12 01:09:04
Used: 1,465,483 kbytes
Committed: 2,080,128 kbytes
Max: 2,080,128 kbytes
GC time:
  2 minutes on ParNew (4,329 collections)
  8.212 seconds on ConcurrentMarkSweep (72 collections)
The total big-GC time is around 8 seconds, spread over 72 separate collections, all of them below 200 ms per my JVM hint to limit the pauses.
On the other hand, I observed exactly one instance of a 5-second network response time in my independent network logs (Wireshark). That implies the pauses exist, but they are not GC, and they are not blocked threads or anything else that can be observed in a profiler or thread dumps.
My question is: what would be the best way to debug or tune this behavior?
Additionally, I'd like to understand how jHiccup does its measurement; obviously it is not reporting GC pause time.
Glad to see you are using jHiccup, and that it seems to show reality-based hiccups.
jHiccup observes "hiccups" that would also be seen by application threads running on the JVM. It does not glean the reason - it just reports the fact. The reason can be anything that would cause a process not to run perfectly ready-to-run code: GC pauses are a common cause, but a temporary ^Z at the keyboard, or one of those "live migration" things across virtualized hosts, would be observed just as well. There is a multitude of possible reasons, including scheduling pressure at the OS or hypervisor level (if one exists), power management craziness, swapping, and many others. I've seen Linux file system pressure and Transparent Huge Page "background" defragmentation cause multi-second hiccups as well...
A good first step at isolating the cause of a pause is to use the "-c" option in jHiccup: it launches a separate control process (with an otherwise idle workload). If both your application and the control process show hiccups that are roughly correlated in size and time, you'll know you are looking for a system-level (as opposed to process-local) reason. If they do not correlate, you'll know to suspect the insides of your JVM - which most likely indicates that your JVM paused for something big: either GC or something else, like lock debiasing or a class-loading-driven deoptimization, which can take a really long (and often unreported in logs) time on some JVMs if time-to-safepoint is long for some reason (and on most JVMs, there are many possible causes of a long time-to-safepoint).
jHiccup's measurement is so dirt-simple that it's hard to get wrong. The entire thing is less than 650 lines of Java code, so you can look at the logic for yourself. jHiccup's HiccupRecorder thread repeatedly goes to sleep for 1 msec, and when it wakes up it records any difference in time (from before the sleep) that is greater than 1 msec as a hiccup. The simple assumption is that if one ready-to-run thread (the HiccupRecorder) did not get to run for 5 seconds, other threads in the same process also saw a similarly sized hiccup.
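For readers who want the gist without opening the source, here is a stripped-down sketch of that measurement idea -- not jHiccup's actual code, just the principle described above:

// A thread sleeps 1 ms at a time; any delay noticeably beyond the requested
// sleep is reported as a "hiccup".
public class HiccupSketch {
    public static void main(String[] args) throws InterruptedException {
        final long resolutionMs = 1;
        while (true) {
            long before = System.nanoTime();
            Thread.sleep(resolutionMs);
            long elapsedMs = (System.nanoTime() - before) / 1_000_000;
            long hiccupMs = elapsedMs - resolutionMs;
            if (hiccupMs > 1) {
                System.out.println("Hiccup of ~" + hiccupMs + " ms observed");
            }
        }
    }
}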
As you note above, jHiccup's observations seem to be corroborated by your independent network logs, where you saw a 5-second response time. Note that not all hiccups would have been observed in the network logs, as only requests actually made during a hiccup would be seen by a network logger. In contrast, no hiccup larger than ~1 msec can hide from jHiccup, since it will attempt a wakeup 1,000 times per second even with no other activity.
This may not be GC, but before you rule out GC, I'd suggest you look into the GC logging a bit more. To start with, a JVM hint to limit pauses to 200 msec is useless on all known JVMs; a pause hint is the equivalent of saying "please". In addition, don't believe your GC logs unless you include -XX:+PrintGCApplicationStoppedTime in the options (and suspect them even then). There are pauses and parts of pauses that can be very long and go unreported unless you include this flag. E.g. I've seen pauses caused by the occasional long-running counted loop taking 15 seconds to reach a safepoint, where the GC log reported only the 0.08-second part of the pause in which it actually did some work. There are also plenty of pauses whose causes are not considered part of "GC" and can thereby go unreported by GC logging flags.
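As a hedged example, on a HotSpot 7/8-era JVM the kind of logging being referred to looks roughly like this (gc.log and myapp.jar are placeholders):

java -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -jar myapp.jar

The safepoint statistics flags are there to help spot long time-to-safepoint episodes of the kind described above.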
-- Gil. [jHiccup's author]
I currently have a program that benefits greatly from multithreading. It starts n threads, and each thread does 100M iterations. They all use shared memory, but there is no synchronization at all.
It approximates solutions to some equations, and the current benchmarks are:
1 thread: precision 1, time 150 s
4 threads: precision 4, time 150 s
16 threads: precision 16, time 150 s
32 threads: precision 32, time 210 s
64 threads: precision 64, time 420 s
(Higher precision is better.)
I use an Amazon EC2 'Cluster Compute Eight Extra Large' instance, which has 2 x Intel Xeon E5-2670.
As far as I understand, it has 16 real cores, so the program improves linearly up to 16 threads.
It also has 2x hyper-threading, and my program gains somewhat from this. Going beyond 32 threads obviously gives no improvement.
These benchmarks prove that access to RAM is not the bottleneck.
Also, I ran the program on an Intel Xeon E5645 machine, which has 12 real cores. The results are:
1 thread: precision 1, time 150 s
4 threads: precision 4, time 150 s
12 threads: precision 12, time 150 s
24 threads: precision 24, time 220 s
precision/(time * number of threads) is similar to that of the Amazon machine, which is not clear to me, because each core in the Xeon E5-2670 should be ~1.5x faster going by CPU MHz (~1600 vs ~2600) and by the per-core 'Passmark CPU Mark' numbers at http://www.cpubenchmark.net/cpu_list.php.
Why does using a faster processor not improve single-threaded performance, while increasing the number of threads does?
Is it possible to rent a server that has multiple CPUs more powerful than 2 x Intel Xeon E5-2670 while still using shared RAM, so that I can run my program without any changes and get better results?
Update:
13 threads on the Xeon E5645 take 196 seconds.
The algorithm randomly explores a tree that has 3500 nodes; the height of the tree is 7. Each node contains 250 doubles, which are also accessed randomly. It is very likely that almost no data is cached.
Specs on the two Intel CPUs you've listed:
E5-2670: 2.6 GHz minimum [8 active cores] (3.3 GHz turbo on a single core)
E5645: 2.4 GHz minimum [6 active cores] (2.8 GHz turbo on a single core)
So there is at least one important question to ask yourself here:
Why isn't your app faster when running single-threaded? There is much more of a speed drop scaling up from 1 core to 8 cores on the E5-2670 than there is from switching to the E5645. You shouldn't expect a linear progression from 1 to 16 threads, even if your app has zero inter-thread locks -- all current-gen CPUs drop their clock rate as more threads are added to their workload.
The answer is probably not RAM, at least not in a basic sense, but it might be the L1/L2 caches. The L1/L2 caches are much more important for application performance than RAM throughput. Modern Intel CPUs are designed around the idea that L1/L2 cache hit rates will likely be good (if not great). If the L1/L2 caches are rendered useless by an algorithm that churns through megabytes of memory without any frequent reuse pattern, then the CPU will essentially become bottlenecked on RAM latency.
RAM Latency is Not RAM Throughput
While the throughput of the RAM is probably more than enough to keep up with all your threads over time, the latency is not. Latency reading from RAM is 80-120 cycles, depending on the CPU clock multiplier. By comparison, latency reading from L1 is 3 cycles, and from L2 11-12 cycles. Therefore, if some portion of your algorithm always results in a fetch from RAM, then that portion will always take a very long time to execute, and approximately the same time on different CPUs, since RAM latency will be about the same. 100 cycles on a Xeon is long enough that even a single stall against RAM can become the dominant hot spot in an algorithm (consider that these chips average 3 instructions per cycle).
I do not know if this is the actual bottleneck in your application, since I don't know how much data it processes on each iteration or what RAM access patterns it uses. But it is one of the only explanations for a constant-time algorithm across many thread configurations and across different Xeon CPUs.
(Edit: There is also a shared L3 cache on these Xeon chips, but its helpfulness is pretty limited. The latency on L3 accesses is 50-60 cycles -- better than RAM, but not by much. And the chance of hitting L3 is pretty slim if both L1/L2 are already ineffective. As mentioned before, these chips are designed with high L1/L2 hit rates in mind: the L3 cache arrangement is built to complement occasional misses from L1/L2, and does not do well serving data as a primary cache itself.)
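To illustrate the latency gap described above, here is a small self-contained sketch (array sizes and names are arbitrary) comparing a prefetch-friendly sequential walk with a random-order walk over the same data; on typical hardware the random walk is several times slower because most reads pay RAM latency:

import java.util.Random;

// Rough illustration only: absolute numbers vary wildly between machines.
public class AccessPatternSketch {
    public static void main(String[] args) {
        int n = 1 << 24;                 // 16M ints = 64 MB, larger than L1/L2/L3
        int[] data = new int[n];
        int[] randomIndex = new int[n];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) {
            data[i] = i;
            randomIndex[i] = rnd.nextInt(n);
        }

        long sum = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += data[i];               // sequential, cache/prefetch friendly
        long seqMs = (System.nanoTime() - t0) / 1_000_000;

        t0 = System.nanoTime();
        for (int i = 0; i < n; i++) sum += data[randomIndex[i]];  // random, mostly RAM-latency bound
        long rndMs = (System.nanoTime() - t0) / 1_000_000;

        System.out.println("sum=" + sum + " sequential=" + seqMs + " ms, random=" + rndMs + " ms");
    }
}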
Two tips:
1) Set the number of threads to the number of cores + 1 (a minimal sketch follows after these tips).
2) CPU clock speed tells you little; the speed and size of the first- and second-level CPU caches matter too, and so does memory. (My quad-core is nominally 20% faster than my dual-core laptop, but in reality, with a single-threaded CPU-heavy application, it is 400-800% faster, thanks to faster memory, CPU design, caches, etc.)
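For tip 1, a minimal sketch (assuming a mostly CPU-bound workload; the class name is made up):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Size the pool from the core count visible to the JVM, plus one.
public class CoresPlusOnePool {
    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors() + 1;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // ... submit work here ...
        pool.shutdown();
    }
}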
Server processing power is often lower than that of a private PC because servers are designed more for robustness and 24-hour uptime.