Why does running multiple lambdas in loops suddenly slow down? - java

Consider the following code:
public class Playground {
private static final int MAX = 100_000_000;
public static void main(String... args) {
execute(() -> {});
execute(() -> {});
execute(() -> {});
execute(() -> {});
}
public static void execute(Runnable task) {
Stopwatch stopwatch = Stopwatch.createStarted();
for (int i = 0; i < MAX; i++) {
task.run();
}
System.out.println(stopwatch);
}
}
This currently prints the following on my Intel MBP on Temurin 17:
3.675 ms
1.948 ms
216.9 ms
243.3 ms
Notice the 100* slowdown for the third (and any subsequent) execution. Now, obviously, this is NOT how to write benchmarks in Java. The loop code doesn't do anything, so I'd expect it to be eliminated for all and any repetitions. Also I could not repeat this effect using JMH which tells me the reason is tricky and fragile.
So, why does this happen? Why would there suddenly be such a catastrophic slowdown, what's going on under the hood? An assumption is that C2 bails on us, but which limitation are we bumping into?
Things that don't change the behavior:
using anonymous inner classes instead of lambdas,
using 3+ different nested classes instead of lambdas.
Things that "fix" the behavior. Actually the third invocation and all subsequent appear to be much faster, hinting that compilation correctly eliminated the loops completely:
using 1-2 nested classes instead of lambdas,
using 1-2 lambda instances instead of 4 different ones,
not calling task.run() lambdas inside the loop,
inlining the execute() method, still maintaining 4 different lambdas.

You can actually replicate this with JMH SingleShot mode:
#BenchmarkMode(Mode.SingleShotTime)
#Warmup(iterations = 0)
#Measurement(iterations = 1)
#Fork(1)
public class Lambdas {
#Benchmark
public static void doOne() {
execute(() -> {});
}
#Benchmark
public static void doFour() {
execute(() -> {});
execute(() -> {});
execute(() -> {});
execute(() -> {});
}
public static void execute(Runnable task) {
for (int i = 0; i < 100_000_000; i++) {
task.run();
}
}
}
Benchmark Mode Cnt Score Error Units
Lambdas.doFour ss 0.446 s/op
Lambdas.doOne ss 0.006 s/op
If you look at -prof perfasm profile for doFour test, you would get a fat clue:
....[Hottest Methods (after inlining)]..............................................................
32.19% c2, level 4 org.openjdk.Lambdas$$Lambda$44.0x0000000800c258b8::run, version 664
26.16% c2, level 4 org.openjdk.Lambdas$$Lambda$43.0x0000000800c25698::run, version 658
There are at least two hot lambdas, and those are represented by different classes. So what you are seeing is likely monomorphic (one target), then bimorphic (two targets), then polymorphic virtual call at task.run.
Virtual call has to choose which class to call the implementation from. The more classes you have, the worse it gets for optimizer. JVM tries to adapt, but it gets worse and worse as the run progresses. Roughly like this:
execute(() -> {}); // compiles with single target, fast
execute(() -> {}); // recompiles with two targets, a bit slower
execute(() -> {}); // recompiles with three targets, slow
execute(() -> {}); // continues to be slow
Now, the elimination of the loop requires seeing through the task.run(). In monomorphic and bimorphic cases it is easy: one or both targets are inlined, their empty body is discovered, done. In both cases, you would have to do typechecks, which means it is not completely free, with bimorphic costing a bit extra. When you experience a polymorphic call, there is no such luck at all: it is opaque call.
You can add two more benchmarks in the mix to see it:
#Benchmark
public static void doFour_Same() {
Runnable l = () -> {};
execute(l);
execute(l);
execute(l);
execute(l);
}
#Benchmark
public static void doFour_Pair() {
Runnable l1 = () -> {};
Runnable l2 = () -> {};
execute(l1);
execute(l1);
execute(l2);
execute(l2);
}
Which would then yield:
Benchmark Mode Cnt Score Error Units
Lambdas.doFour ss 0.445 s/op ; polymorphic
Lambdas.doFour_Pair ss 0.016 s/op ; bimorphic
Lambdas.doFour_Same ss 0.008 s/op ; monomorphic
Lambdas.doOne ss 0.006 s/op
This also explains why your "fixes" help:
using 1-2 nested classes instead of lambdas,
Bimorphic inlining.
using 1-2 lambda instances instead of 4 different ones,
Bimorphic inlining.
not calling task.run() lambdas inside the loop,
Avoids polymorphic (opaque) call in the loop, allows loop elimination.
inlining the execute() method, still maintaining 4 different lambdas.
Avoids a single call site that experiences multiple call targets. In other words, turns a single polymorphic call site into a series of monomorphic call sites each with its own target.

Related

MethodHandles.lookup().lookupClass() vs getClass()

Can anyone tell me the (subtle) differences of
version 1:
protected final Logger log = Logger.getLogger(getClass());
vs
version 2:
protected final Logger log = Logger.getLogger(MethodHandles.lookup().lookupClass());
Is version 2 in general faster than version 1?
I guess version 1 uses reflection (on runtime) to determine the current class while version 2 does not need to use reflection, or (is the check done on build time)?
There is no reflection involved in you first case. Object#getClass() is mapped to JVM's native method.
Your second case is not drop-in replacement for Object#getClass(), it is used to lookup method handles.
So subtle difference is, they are used for completely different purposes.
These are entirely different things. The documentation of lookupClass, specifically says:
Tells which class is performing the lookup. It is this class against which checks are performed for visibility and access permissions
So it's the class which performs the lookup. It's not necessarily the class where you call MethodHandles.lookup(). What I mean by that is that this:
Class<?> c = MethodHandles.privateLookupIn(String.class, MethodHandles.lookup()).lookupClass();
System.out.println(c);
will print String.class and not the class where you define this code.
The only "advantage" (besides confusing every reader of this code), is that if you copy/paste that log creation line across various source files, it will use the proper class, if you, by accident, don't edit it (which probably happens).
Also notice that:
protected final Logger log = Logger.getLogger(getClass());
should be a static field, usually, and you can't call getClass if it is.
A JMH test shows that there is no performance gain to obfuscate your code that much:
#State(Scope.Benchmark)
#BenchmarkMode(Mode.AverageTime)
#OutputTimeUnit(TimeUnit.NANOSECONDS)
#Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
#Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
public class LookupTest {
private static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder().include(LookupTest.class.getSimpleName())
.verbosity(VerboseMode.EXTRA)
.build();
new Runner(opt).run();
}
#Benchmark
#Fork(3)
public Class<?> getClassCall() {
return getClass();
}
#Benchmark
#Fork(3)
public Class<?> methodHandlesInPlaceCall() {
return MethodHandles.lookup().lookupClass();
}
#Benchmark
#Fork(3)
public Class<?> methodHandlesCall() {
return LOOKUP.lookupClass();
}
}
results:
Benchmark Mode Cnt Score Error Units
LookupTest.getClassCall avgt 15 2.264 ± 0.044 ns/op
LookupTest.methodHandlesCall avgt 15 2.262 ± 0.030 ns/op
LookupTest.methodHandlesInPlaceCall avgt 15 4.890 ± 0.783 ns/op

Completion order of CompletableFuture

I'm curious to understand the completion order when a CompletableFuture has multiple dependents. I would have expected the dependents to complete in the order they were added, but that doesn't seem to be the case.
In particular, this behaviour is surprising:
import java.util.concurrent.CompletableFuture;
public class Main {
public static void main(String[] args) {
{
var x = new CompletableFuture<Void>();
x.thenRun(() -> System.out.println(1));
x.thenRun(() -> System.out.println(2));
x.thenRun(() -> System.out.println(3));
x.complete(null);
}
{
var x = new CompletableFuture<Void>();
var y = x.copy();
y.thenRun(() -> System.out.println(1));
y.thenRun(() -> System.out.println(2));
y.thenRun(() -> System.out.println(3));
x.complete(null);
}
}
}
...results in the the following output...
3
2
1
1
2
3
All 3 child futures generated by 'x.thenRun()' expression run in parallel, so you cannot expect any particular order of printed numbers.
The same for 'y.thenRun()'.
The expressed order is an implementation feature, can change in the future, and is not supported by the specification.
So you have not chained those thenRun, as such there are zero guarantees on the order, but you expect some order? Sorry, that's not how it works.
The documentation does not say anything about order of execution in such cases, so whatever you see now in the output:
is not a guarantee
can change from run to run (and from java version to version)
It is a little bit hard to prove via code with your current set-up, but just change it to :
var x = CompletableFuture.runAsync(() -> {
LockSupport.parkNanos(TimeUnit.SECONDS.toNanos(1));
});
x.thenRun(() -> System.out.println(1));
x.thenRun(() -> System.out.println(2));
x.thenRun(() -> System.out.println(3));
x.join();
And run this let's say 20 times, I bet that at least once, you will not see 3, 2, 1, but something different.

JMH - why JIT does not eliminate my dead-code

I wrote two benchmarks to demonstrate that JIT can be a problem with writing fine benchmark (Please skip that I doesnt use #State here):
#Fork(value = 1)
#Warmup(iterations = 2, time = 10)
#Measurement(iterations = 3, time = 2)
#BenchmarkMode(Mode.AverageTime)
public class DeadCodeTraps {
#Benchmark
#OutputTimeUnit(TimeUnit.MICROSECONDS)
public static void summaryStatistics_standardDeviationForFourNumbers() {
final SummaryStatistics summaryStatistics = new SummaryStatistics();
summaryStatistics.addValue(10.0);
summaryStatistics.addValue(20.0);
summaryStatistics.addValue(30.0);
summaryStatistics.addValue(40.0);
summaryStatistics.getStandardDeviation();
}
#Benchmark
#OutputTimeUnit(TimeUnit.MICROSECONDS)
public static void summaryStatistics_standardDeviationForTenNumbers() {
final SummaryStatistics summaryStatistics = new SummaryStatistics();
summaryStatistics.addValue(10.0);
summaryStatistics.addValue(20.0);
summaryStatistics.addValue(30.0);
summaryStatistics.addValue(40.0);
summaryStatistics.addValue(50.0);
summaryStatistics.addValue(60.0);
summaryStatistics.addValue(70.0);
summaryStatistics.addValue(80.0);
summaryStatistics.addValue(90.0);
summaryStatistics.addValue(100.0);
summaryStatistics.getStandardDeviation();
}
}
I thought that JIT will eliminate dead code, so two methods will be executed at the same time. But in the end, I have:
summaryStatistics_standardDeviationForFourNumbers 0.158 ± 0.046
DeadCodeTraps.summaryStatistics_standardDeviationForTenNumbers 0.359 ± 0.294
Why JIT does not optimize it? The result of summaryStatistics.getStandardDeviation(); is not used anywhere outside the method and it is not returned by it.
(I am using OpenJDK build 10.0.2+13-Ubuntu-1ubuntu0.18.04.4)
If you're talking about the Apache Commons Math SummaryStatistics class, then it's a massive class. Its construction will most certainly not be inlined. To see why, run with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -XX:-BackgroundCompilation
Dead code elimination happens after inlining.
Unused objects will back-propagate, but the non-inlined constructor will break the chain because the JIT optimizer can no longer be sure there are no side effects.
In other words, the code you expect to be eliminated is too big.

Why is CompletableFuture.supplyAsync succeeding a random number of times?

I'm new to both lambdas and asynchronous code in Java 8. I keep getting some weird results...
I have the following code:
import java.util.concurrent.CompletableFuture;
public class Program {
public static void main(String[] args) {
for (int i = 0; i < 100; i++) {
String test = "Test_" + i;
final int a = i;
CompletableFuture<Boolean> cf = CompletableFuture.supplyAsync(() -> doPost(test));
cf.thenRun(() -> System.out.println(a)) ;
}
}
private static boolean doPost(String t) {
System.out.println(t);
return true;
}
}
The actual code is a lot longer, as the doPost method will post some data to a web service. However, I'm able to replicate my issue with this bare-bones code.
I want to have the doPost method execute 100 times, but asynchronously for performance reasons (in order to push data to the web service faster than doing 100 synchronous calls would be).
In the code above, the ´doPost´ method is run a random amount of times, but always no more than 20-25 times. There are no exceptions thrown. It seems that either some thread handling mechanism is silently refusing to create new threads and execute their code, or the threads are silently crashing without crashing the program.
I also have an issue where, if I add more functionality to the doPost method than shown above, it reaches a point where the method simply silently breaks. I've tried adding a System.out.println("test") right before the return statement in that case, but it is never called. The loop which loops 100 times does run 100 iterations though.
This behaviour is confusing, to say the least.
What am I missing? Why is the function supplied as an argument to supplyAsync run a seemingly random number of times?
EDIT: Just wanted to point out that the situation is not exactly the same as in the question this was marked as a possible duplicate of, as that question dealt with arbitrarily deeply nested futures, and this one deals with parallell ones. However, the reason why they are failing is virtually identical. The cases seem distinct enough to merit separate questions to me, but others might disagree...
By default CompletableFuture uses own ForkJoinPool.commonPool() (see CompletableFuture implementation). And this default pool creates only daemon threads, e.g. they won't block the main application from terminating if they still alive.
You have the following choices:
Collect all CompletionStage to some array and then make java.util.concurrent.CompletableFuture#allOf().toCompletableFuture().join() - this will guarantee all the stages are completed before going after join()
Use *Async operations with your own thread pool which contains only non-daemon threads, like in the following example:
public static void main(String[] args) throws InterruptedException {
ExecutorService pool = Executors.newFixedThreadPool(10, r -> {
Thread t = new Thread(r);
t.setDaemon(false); // must be not daemon
return t;
});
for (int i = 0; i < 100; i++) {
final int a = i;
// the operation must be Async with our thread pool
CompletableFuture<Boolean> cf = CompletableFuture.supplyAsync(() -> doPost(a), pool);
cf.thenRun(() -> System.out.printf("%s: Run_%s%n", Thread.currentThread().getName(), a));
}
pool.shutdown(); // without this the main application will be blocked forever
}
private static boolean doPost(int t) {
System.out.printf("%s: Post_%s%n", Thread.currentThread().getName(), t);
return true;
}

Java Reflection Performance

Does creating an object using reflection rather than calling the class constructor result in any significant performance differences?
Yes - absolutely. Looking up a class via reflection is, by magnitude, more expensive.
Quoting Java's documentation on reflection:
Because reflection involves types that are dynamically resolved, certain Java virtual machine optimizations can not be performed. Consequently, reflective operations have slower performance than their non-reflective counterparts, and should be avoided in sections of code which are called frequently in performance-sensitive applications.
Here's a simple test I hacked up in 5 minutes on my machine, running Sun JRE 6u10:
public class Main {
public static void main(String[] args) throws Exception
{
doRegular();
doReflection();
}
public static void doRegular() throws Exception
{
long start = System.currentTimeMillis();
for (int i=0; i<1000000; i++)
{
A a = new A();
a.doSomeThing();
}
System.out.println(System.currentTimeMillis() - start);
}
public static void doReflection() throws Exception
{
long start = System.currentTimeMillis();
for (int i=0; i<1000000; i++)
{
A a = (A) Class.forName("misc.A").newInstance();
a.doSomeThing();
}
System.out.println(System.currentTimeMillis() - start);
}
}
With these results:
35 // no reflection
465 // using reflection
Bear in mind the lookup and the instantiation are done together, and in some cases the lookup can be refactored away, but this is just a basic example.
Even if you just instantiate, you still get a performance hit:
30 // no reflection
47 // reflection using one lookup, only instantiating
Again, YMMV.
Yes, it's slower.
But remember the damn #1 rule--PREMATURE OPTIMIZATION IS THE ROOT OF ALL EVIL
(Well, may be tied with #1 for DRY)
I swear, if someone came up to me at work and asked me this I'd be very watchful over their code for the next few months.
You must never optimize until you are sure you need it, until then, just write good, readable code.
Oh, and I don't mean write stupid code either. Just be thinking about the cleanest way you can possibly do it--no copy and paste, etc. (Still be wary of stuff like inner loops and using the collection that best fits your need--Ignoring these isn't "unoptimized" programming, it's "bad" programming)
It freaks me out when I hear questions like this, but then I forget that everyone has to go through learning all the rules themselves before they really get it. You'll get it after you've spent a man-month debugging something someone "Optimized".
EDIT:
An interesting thing happened in this thread. Check the #1 answer, it's an example of how powerful the compiler is at optimizing things. The test is completely invalid because the non-reflective instantiation can be completely factored out.
Lesson? Don't EVER optimize until you've written a clean, neatly coded solution and proven it to be too slow.
You may find that A a = new A() is being optimised out by the JVM.
If you put the objects into an array, they don't perform so well. ;)
The following prints...
new A(), 141 ns
A.class.newInstance(), 266 ns
new A(), 103 ns
A.class.newInstance(), 261 ns
public class Run {
private static final int RUNS = 3000000;
public static class A {
}
public static void main(String[] args) throws Exception {
doRegular();
doReflection();
doRegular();
doReflection();
}
public static void doRegular() throws Exception {
A[] as = new A[RUNS];
long start = System.nanoTime();
for (int i = 0; i < RUNS; i++) {
as[i] = new A();
}
System.out.printf("new A(), %,d ns%n", (System.nanoTime() - start)/RUNS);
}
public static void doReflection() throws Exception {
A[] as = new A[RUNS];
long start = System.nanoTime();
for (int i = 0; i < RUNS; i++) {
as[i] = A.class.newInstance();
}
System.out.printf("A.class.newInstance(), %,d ns%n", (System.nanoTime() - start)/RUNS);
}
}
This suggest the difference is about 150 ns on my machine.
If there really is need for something faster than reflection, and it's not just a premature optimization, then bytecode generation with ASM or a higher level library is an option. Generating the bytecode the first time is slower than just using reflection, but once the bytecode has been generated, it is as fast as normal Java code and will be optimized by the JIT compiler.
Some examples of applications which use code generation:
Invoking methods on proxies generated by CGLIB is slightly faster than Java's dynamic proxies, because CGLIB generates bytecode for its proxies, but dynamic proxies use only reflection (I measured CGLIB to be about 10x faster in method calls, but creating the proxies was slower).
JSerial generates bytecode for reading/writing the fields of serialized objects, instead of using reflection. There are some benchmarks on JSerial's site.
I'm not 100% sure (and I don't feel like reading the source now), but I think Guice generates bytecode to do dependency injection. Correct me if I'm wrong.
"Significant" is entirely dependent on context.
If you're using reflection to create a single handler object based on some configuration file, and then spending the rest of your time running database queries, then it's insignificant. If you're creating large numbers of objects via reflection in a tight loop, then yes, it's significant.
In general, design flexibility (where needed!) should drive your use of reflection, not performance. However, to determine whether performance is an issue, you need to profile rather than get arbitrary responses from a discussion forum.
There is some overhead with reflection, but it's a lot smaller on modern VMs than it used to be.
If you're using reflection to create every simple object in your program then something is wrong. Using it occasionally, when you have good reason, shouldn't be a problem at all.
Yes there is a performance hit when using Reflection but a possible workaround for optimization is caching the method:
Method md = null; // Call while looking up the method at each iteration.
millis = System.currentTimeMillis( );
for (idx = 0; idx < CALL_AMOUNT; idx++) {
md = ri.getClass( ).getMethod("getValue", null);
md.invoke(ri, null);
}
System.out.println("Calling method " + CALL_AMOUNT+ " times reflexively with lookup took " + (System.currentTimeMillis( ) - millis) + " millis");
// Call using a cache of the method.
md = ri.getClass( ).getMethod("getValue", null);
millis = System.currentTimeMillis( );
for (idx = 0; idx < CALL_AMOUNT; idx++) {
md.invoke(ri, null);
}
System.out.println("Calling method " + CALL_AMOUNT + " times reflexively with cache took " + (System.currentTimeMillis( ) - millis) + " millis");
will result in:
[java] Calling method 1000000 times reflexively with lookup took 5618 millis
[java] Calling method 1000000 times reflexively with cache took 270 millis
Interestingly enough, settting setAccessible(true), which skips the security checks, has a 20% reduction in cost.
Without setAccessible(true)
new A(), 70 ns
A.class.newInstance(), 214 ns
new A(), 84 ns
A.class.newInstance(), 229 ns
With setAccessible(true)
new A(), 69 ns
A.class.newInstance(), 159 ns
new A(), 85 ns
A.class.newInstance(), 171 ns
Reflection is slow, though object allocation is not as hopeless as other aspects of reflection. Achieving equivalent performance with reflection-based instantiation requires you to write your code so the jit can tell which class is being instantiated. If the identity of the class can't be determined, then the allocation code can't be inlined. Worse, escape analysis fails, and the object can't be stack-allocated. If you're lucky, the JVM's run-time profiling may come to the rescue if this code gets hot, and may determine dynamically which class predominates and may optimize for that one.
Be aware the microbenchmarks in this thread are deeply flawed, so take them with a grain of salt. The least flawed by far is Peter Lawrey's: it does warmup runs to get the methods jitted, and it (consciously) defeats escape analysis to ensure the allocations are actually occurring. Even that one has its problems, though: for example, the tremendous number of array stores can be expected to defeat caches and store buffers, so this will wind up being mostly a memory benchmark if your allocations are very fast. (Kudos to Peter on getting the conclusion right though: that the difference is "150ns" rather than "2.5x". I suspect he does this kind of thing for a living.)
Yes, it is significantly slower. We were running some code that did that, and while I don't have the metrics available at the moment, the end result was that we had to refactor that code to not use reflection. If you know what the class is, just call the constructor directly.
In the doReflection() is the overhead because of Class.forName("misc.A") (that would require a class lookup, potentially scanning the class path on the filsystem), rather than the newInstance() called on the class. I am wondering what the stats would look like if the Class.forName("misc.A") is done only once outside the for-loop, it doesn't really have to be done for every invocation of the loop.
Yes, always will be slower create an object by reflection because the JVM cannot optimize the code on compilation time. See the Sun/Java Reflection tutorials for more details.
See this simple test:
public class TestSpeed {
public static void main(String[] args) {
long startTime = System.nanoTime();
Object instance = new TestSpeed();
long endTime = System.nanoTime();
System.out.println(endTime - startTime + "ns");
startTime = System.nanoTime();
try {
Object reflectionInstance = Class.forName("TestSpeed").newInstance();
} catch (InstantiationException e) {
e.printStackTrace();
} catch (IllegalAccessException e) {
e.printStackTrace();
} catch (ClassNotFoundException e) {
e.printStackTrace();
}
endTime = System.nanoTime();
System.out.println(endTime - startTime + "ns");
}
}
Often you can use Apache commons BeanUtils or PropertyUtils which introspection (basically they cache the meta data about the classes so they don't always need to use reflection).
I think it depends on how light/heavy the target method is. if the target method is very light(e.g. getter/setter), It could be 1 ~ 3 times slower. if the target method takes about 1 millisecond or above, then the performance will be very close. here is the test I did with Java 8 and reflectasm :
public class ReflectionTest extends TestCase {
#Test
public void test_perf() {
Profiler.run(3, 100000, 3, "m_01 by refelct", () -> Reflection.on(X.class)._new().invoke("m_01")).printResult();
Profiler.run(3, 100000, 3, "m_01 direct call", () -> new X().m_01()).printResult();
Profiler.run(3, 100000, 3, "m_02 by refelct", () -> Reflection.on(X.class)._new().invoke("m_02")).printResult();
Profiler.run(3, 100000, 3, "m_02 direct call", () -> new X().m_02()).printResult();
Profiler.run(3, 100000, 3, "m_11 by refelct", () -> Reflection.on(X.class)._new().invoke("m_11")).printResult();
Profiler.run(3, 100000, 3, "m_11 direct call", () -> X.m_11()).printResult();
Profiler.run(3, 100000, 3, "m_12 by refelct", () -> Reflection.on(X.class)._new().invoke("m_12")).printResult();
Profiler.run(3, 100000, 3, "m_12 direct call", () -> X.m_12()).printResult();
}
public static class X {
public long m_01() {
return m_11();
}
public long m_02() {
return m_12();
}
public static long m_11() {
long sum = IntStream.range(0, 10).sum();
assertEquals(45, sum);
return sum;
}
public static long m_12() {
long sum = IntStream.range(0, 10000).sum();
assertEquals(49995000, sum);
return sum;
}
}
}
The complete test code is available at GitHub:ReflectionTest.java

Categories

Resources