newInstance() vs new - java

Is there a penalty for calling newInstance() or is it the same mechanism underneath?
How much overhead, if any, does newInstance() have over the new keyword*?
*: Discounting the fact that newInstance() implies using reflection.

In a real-world test, creating 18,129 instances of a class via Constructor.newInstance (passing in 10 arguments) versus creating the instances via new, the program showed no measurable difference in time.
This wasn't any sort of microbenchmark.
This is with JDK 1.6.0_12 on Windows 7 x86 beta.
Given that Constructor.newInstance is going to be very similar to Class.forName().newInstance(), I would say the overhead is hardly anything, given the functionality you get from newInstance over new.
As always you should test it yourself to see.
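To test it yourself, here is a minimal sketch (not a rigorous benchmark; the class, loop count, and argument are arbitrary choices of mine) that times plain new against Constructor.newInstance, with the Constructor looked up once outside the loop:

```java
import java.lang.reflect.Constructor;

public class NewVsNewInstance {
    public static void main(String[] args) throws Exception {
        int n = 100_000;

        // Plain `new`, keeping a reference so the allocation is not trivially dead
        StringBuilder last = null;
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            last = new StringBuilder(16);
        }
        long direct = System.nanoTime() - t0;

        // Reflective creation: look the Constructor up once, reuse it in the loop
        Constructor<StringBuilder> ctor = StringBuilder.class.getConstructor(int.class);
        t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            last = ctor.newInstance(16);
        }
        long reflective = System.nanoTime() - t0;

        System.out.println("created " + (2 * n) + " instances, capacity " + last.capacity());
        System.out.println("new: " + direct + " ns, newInstance: " + reflective + " ns");
    }
}
```

Note that looking the Constructor up inside the loop would add lookup cost on every iteration, which is a separate overhead from the instantiation itself.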

Be wary of microbenchmarks, but I found this blog entry where someone found that new was about twice as fast as newInstance with JDK 1.4. It sounds like there is a difference, and as expected, reflection is slower. However, it sounds like the difference may not be a dealbreaker, depending on the frequency of new object instances vs how much computation is being done.

There's a difference: roughly a factor of 10. The absolute difference is tiny, but if you write a low-latency application, the cumulative CPU time lost might be significant.
long start, end;
int X = 10000000;
ArrayList l = new ArrayList(X);

start = System.nanoTime();
for (int i = 0; i < X; i++) {
    String.class.newInstance();
}
end = System.nanoTime();
log("T: ", (end - start) / X);

l.clear();
start = System.nanoTime();
for (int i = 0; i < X; i++) {
    new String();
}
end = System.nanoTime();
log("T: ", (end - start) / X);
Output:
T: 105
T: 11
Tested on a Xeon W3565 @ 3.2 GHz, Java 1.6

Related

Basic (arithmetic) operations and their dependence on JVM and CPU

In Java I want to measure time for
1000 integer comparisons (the "<" operator),
1000 integer additions (a + b, each time for different a and b),
and other simple operations.
I know I can do it in the following way:
Random rand = new Random();
long elapsedTime = 0;
for (int i = 0; i < 1000; i++) {
    int a = Integer.MIN_VALUE + rand.nextInt(Integer.MAX_VALUE);
    int b = Integer.MIN_VALUE + rand.nextInt(Integer.MAX_VALUE);
    long start = System.currentTimeMillis();
    if (a < b) {}
    long stop = System.currentTimeMillis();
    elapsedTime += (stop - start);
}
System.out.println(elapsedTime);
I know that this question may seem somehow not clear.
How those values depend on my processor (i.e. relation between time for those operations and my processor) and JVM? Any suggestions?
I'm looking for understandable readings...
How those values depend on my processor (i.e. relation between time for those operations and my processor) and JVM? Any suggestions?
It is not dependent on your processor, at least not directly.
Normally, when you run code enough, the JVM will compile it to native code. When it does this, it removes code which doesn't do anything, so what you would actually be measuring here is the time it takes to call System.currentTimeMillis(), which is typically about 0.00003 ms. This means you will get 0 about 99.997% of the time and see a 1 very rarely.
I say normally, but in this case your code won't be compiled to native code, as the default threshold is 10,000 iterations; i.e., you would be testing how long it takes the interpreter to execute the bytecode. This is much slower, but still a fraction of a millisecond, so you have a higher chance of seeing a 1 but it is still unlikely.
If you want to learn more about low-level benchmarking in Java, I suggest you read about JMH and the author's blog http://shipilev.net/
If you want to see what machine code is generated from Java code I suggest you try JITWatch
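Short of full JMH, the two fixes above (warm up well past the compile threshold, and keep the result live so the comparisons are not dead code) can be sketched in plain Java; the loop counts, the seed, and the volatile sink are my own choices, not anything prescribed:

```java
import java.util.Random;

public class CompareBench {
    // volatile sink: the JIT cannot prove writes to it are dead, so the work survives
    static volatile int sink;

    public static void main(String[] args) {
        Random rand = new Random(42);
        int[] a = new int[1000], b = new int[1000];
        for (int i = 0; i < 1000; i++) {
            a[i] = rand.nextInt();
            b[i] = rand.nextInt();
        }

        // warm-up: run well past the ~10,000-invocation compile threshold
        for (int w = 0; w < 20_000; w++) {
            sink = run(a, b);
        }

        // measure the whole batch of 1000 comparisons, not one comparison at a time
        long start = System.nanoTime();
        int less = run(a, b);
        long elapsed = System.nanoTime() - start;
        System.out.println(less + " comparisons were true");
        System.out.println("total ns for 1000 comparisons: " + elapsed);
    }

    static int run(int[] a, int[] b) {
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] < b[i]) count++;
        }
        return count;
    }
}
```

Timing the whole batch matters because a single comparison is far below the resolution and overhead of the clock call itself.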

The costs of streams and closures in Java 8

I'm rewriting an application that deals with objects on the order of 10 million, using Java 8, and I noticed that streams can slow down the application by up to 25%. Interestingly, this happens even when my collections are empty, so it's the constant initialization cost of the stream. To reproduce the problem, consider the following code:
long start = System.nanoTime();
for (int i = 0; i < 10_000_000; i++) {
    Set<String> set = Collections.emptySet();
    set.stream().forEach(s -> System.out.println(s));
}
long end = System.nanoTime();
System.out.println((end - start) / 1000_000);

start = System.nanoTime();
for (int i = 0; i < 10_000_000; i++) {
    Set<String> set = Collections.emptySet();
    for (String s : set) {
        System.out.println(s);
    }
}
end = System.nanoTime();
System.out.println((end - start) / 1000_000);
The result is as follows: 224 vs. 5 ms.
If I use forEach on set directly, i.e., set.forEach(), the result will be: 12 vs 5ms.
Finally, if I create the closure outside once as
Consumer<? super String> consumer = s -> System.out.println(s);
and use set.forEach(consumer), the result will be 7 vs 5 ms.
Of course, the numbers are small and my benchmarking is very primitive, but does this example show that there is an overhead in initializing streams and closures?
(Actually, since the set is empty, the initialization cost of closures should not matter in this case; but nevertheless, should I consider creating closures beforehand instead of on the fly?)
The cost you see here is not associated with the "closures" at all but with the cost of Stream initialization.
Let's take your three sample codes:
for (int i = 0; i < 10_000_000; i++) {
    Set<String> set = Collections.emptySet();
    set.stream().forEach(s -> System.out.println(s));
}
This one creates a new Stream instance at each loop; at least for the first 10k iterations, see below. After those 10k iterations, well, the JIT is probably smart enough to see that it's a no-op anyway.
for (int i = 0; i < 10_000_000; i++) {
    Set<String> set = Collections.emptySet();
    for (String s : set) {
        System.out.println(s);
    }
}
Here the JIT kicks in again: empty set? Well, that's a no-op, end of story.
set.forEach(System.out::println);
An Iterator is created for the set, which is always empty? Same story, the JIT kicks in.
The problem with your code to start with is that you fail to account for the JIT; for realistic measurements, run at least 10k loops before measuring, since 10k executions is what the JIT requires to kick in (at least, HotSpot acts this way).
Now, lambdas: they are call sites, and they are linked only once; but the cost of the initial linkage is still there, of course, and in your loops, you include this cost. Try and run only one loop before doing your measurements so that this cost is out of the way.
All in all, this is not a valid microbenchmark. Use caliper, or jmh, to really measure the performance.
There is an excellent video on how lambdas work. It is a little old now, and the JVM handles lambdas much better than it did at the time.
If you want to know more, look for literature about invokedynamic.
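Putting the answer's suggestions together (warm up first, link the lambda call site once and reuse it), a rough re-run of the empty-set comparison might look like this; the iteration counts are arbitrary and this is still not a substitute for JMH or Caliper:

```java
import java.util.Collections;
import java.util.Set;
import java.util.function.Consumer;

public class StreamOverheadDemo {
    public static void main(String[] args) {
        Set<String> set = Collections.emptySet();
        Consumer<String> consumer = System.out::println; // linked once, reused below

        // warm-up: link the call sites and let the JIT compile both loops
        for (int i = 0; i < 20_000; i++) {
            set.stream().forEach(consumer);
            set.forEach(consumer);
        }

        // each iteration still allocates a fresh Stream over the empty set
        long t0 = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            set.stream().forEach(consumer);
        }
        long streamNs = System.nanoTime() - t0;

        // forEach on the collection itself: no Stream object is created
        t0 = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            set.forEach(consumer);
        }
        long forEachNs = System.nanoTime() - t0;

        System.out.println("stream().forEach ns: " + streamNs);
        System.out.println("forEach ns: " + forEachNs);
        System.out.println("done");
    }
}
```

Any remaining gap between the two timings then reflects Stream setup cost rather than lambda linkage.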

Java faster than C [duplicate]

This question already has answers here:
How do I write a correct micro-benchmark in Java?
(11 answers)
Closed 9 years ago.
Today I made a simple test to compare the speed between java and c - a simple loop that makes an integer "i" increment from 0 to two billion.
I really expected the C program to be faster than the Java one. I was surprised by the outcome:
the time it takes in seconds for java: approx. 1.8 seconds
the time it takes in seconds for c: approx. 3.6 seconds.
I DO NOT think that Java is a faster language at all, but I also DO NOT understand why the loop is twice as fast as C in my simple programs.
Did I make a crucial mistake in the program? Or is the MinGW compiler badly configured or something?
public class Jrand {
    public static void main(String[] args) {
        long startTime = System.currentTimeMillis();
        int i;
        for (i = 0; i < 2000000000; i++) {
            // Do nothing!
        }
        long endTime = System.currentTimeMillis();
        float totalTime = (endTime - startTime);
        System.out.println("time: " + totalTime / 1000);
    }
}
THE C-PROGRAM
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
    clock_t startTime = clock();
    int i;
    for (i = 0; i <= 2000000000; i++) {
        // Do nothing
    }
    clock_t endTime = clock();
    float totalTime = endTime - startTime;
    printf("%f", totalTime / 1000);
    return 0;
}
Rebuild your C version with any optimization level other than -O0 (e.g. -O2) and you will find it runs in 0 seconds. So the Java version takes 1.8 seconds to do nothing, and the C version takes 0.0 seconds (really, around 0.00005 seconds) to do nothing.
Java is more aggressive at eliminating code which doesn't do anything. It is less likely to assume the developer knows what they are doing. You are not timing the loop but how long it takes java to detect and eliminate the loop.
In short, Java is often faster at doing nothing useful.
Also, you may find that if you optimise the C code and remove debugging information, it will do the same thing, most likely in even less time.
If you want to benchmark this, instead of doing nothing, try doing something useful on each iteration. For example, count the loops in some other variable, and make sure you use it at the end (by printing it, for example), so that it will not be optimized out.
Alternate simple tests could be accessing an array linearly (reading only), copying elements from one array to another (read + write), or doing some operations on the data. Some of these cases might be interesting as they open up several very simple compiler optimizations that you can later see in the resulting binary/bytecode, such as loop unrolling, register allocation, and maybe even more complicated things like vectorization or code motion. Java, on the other hand, may employ nastier tricks such as JIT compilation (dynamically recompiling on the fly).
The scope of compiler optimization is huge, you've just encountered the most basic one - eliminating useless code :)
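Following that advice, a version of the Java loop that cannot be eliminated keeps a running sum and prints it; the sum of 0 through 1,999,999,999 fits comfortably in a long:

```java
public class LoopSum {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        long sum = 0;
        for (int i = 0; i < 2000000000; i++) {
            sum += i; // the result is printed below, so the loop cannot be removed
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("sum = " + sum); // n*(n-1)/2 for n = 2,000,000,000
        System.out.println("time: " + elapsed / 1000.0 + " s");
    }
}
```

With the sum live, both the Java and the (optimized) C versions now have to do two billion additions, and the comparison becomes meaningful.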

What is the actual execution time of a method in Java, and what does it depend on?

I have one problem that I can't explain. Here is the code in main function:
String numberStr = "3151312423412354315";
System.out.println(numberStr + "\n");
System.out.println("Lehman method: ");
long beginTime = System.currentTimeMillis();
System.out.println(Lehman.getFullFactorization(numberStr));
long finishTime = System.currentTimeMillis();
System.out.println((finishTime-beginTime)/1000. + " sec.");
System.out.println();
System.out.println("Lehman method: ");
beginTime = System.currentTimeMillis();
System.out.println(Lehman.getFullFactorization(numberStr));
finishTime = System.currentTimeMillis();
System.out.println((finishTime-beginTime)/1000. + " sec.");
If it is relevant: the method Lehman.getFullFactorization(...) returns an ArrayList of prime divisors as Strings.
Here is the output:
3151312423412354315
Lehman method:
[5, 67, 24473, 384378815693]
0.149 sec.
Lehman method:
[5, 67, 24473, 384378815693]
0.016 sec.
I was surprised when I saw it. Why is the second execution of the same method much faster than the first? At first, I thought the first run included JVM startup time, but that's impossible, because the JVM obviously starts before the main method executes.
In some cases, Java's JIT compiler (see http://java.sun.com/developer/onlineTraining/Programming/JDCBook/perf2.html#jit) kicks in on the first execution of a method and performs optimizations of that methods code. That is supposed to make all subsequent executions faster. I think this might be what happens in your case.
Try doing it more than 10,000 times and it will be much faster. This is because the code first has to be loaded (expensive), then runs in interpreted mode (ok speed), and is finally compiled to native code (much faster).
Can you try this?
int runs = 100 * 1000;
for (int i = -20000 /* warmup */; i < runs; i++) {
    if (i == 0)
        beginTime = System.nanoTime();
    Lehman.getFullFactorization(numberStr);
}
finishTime = System.nanoTime();
System.out.println("Average time was " + (finishTime - beginTime) / 1e9 / runs + " sec.");
I suppose the JVM has cached the results (maybe partially) of the first calculation, and you observe the faster second calculation. JIT in action.
There are two things that make the second run faster.
The first time, the class containing the method must be loaded. The second time, it is already in memory.
Most importantly, the JIT optimizes code that is often executed: during the first call, the JVM starts by interpreting the byte code and then compiles it into machine code and continues the execution. The second time, the code is already compiled.
That's why micro-benchmarks in Java are often hard to validate.
My guess is that the code is kept in the CPU's L1/L2 cache for optimization.
Or Java doesn't have to interpret it again and recalls it from memory as already-compiled code.

C++ and Java performance

This question is just speculative.
I have the following implementation in C++:
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

void testvector(int x)
{
    vector<string> v;
    char aux[20];
    int a = x * 2000;
    int z = a + 2000;
    string s("X-");
    for (int i = a; i < z; i++)
    {
        sprintf(aux, "%d", i);
        v.push_back(s + aux);
    }
}

int main()
{
    for (int i = 0; i < 10000; i++)
    {
        if (i % 1000 == 0) cout << i << endl;
        testvector(i);
    }
}
On my box, this program executes in approx. 12 seconds; amazingly, I have a similar implementation in Java [using String and ArrayList] and it runs a lot faster than my C++ application (approx. 2 seconds).
I know the Java HotSpot performs a lot of optimizations when translating to native, but I think if such performance can be done in Java, it could be implemented in C++ too...
So, what do you think should be modified in the program above, or, I dunno, in the libraries used or in the memory allocator, to reach similar performance? (Writing actual code for these things could take very long, so discussing it would be great.)
Thank you.
You have to be careful with performance tests because it's very easy to deceive yourself or not compare like with like.
However, I've seen similar results comparing C# with C++, and there are a number of well-known blog posts about the astonishment of native coders when confronted with this kind of evidence. Basically a good modern generational compacting GC is very much more optimised for lots of small allocations.
In C++'s default allocator, every block is treated the same, and so are averagely expensive to allocate and free. In a generational GC, all blocks are very, very cheap to allocate (nearly as cheap as stack allocation) and if they turn out to be short-lived then they are also very cheap to clean up.
This is why the "fast performance" of C++ compared with more modern languages is - for the most part - mythical. You have to hand tune your C++ program out of all recognition before it can compete with the performance of an equivalent naively written C# or Java program.
All your program does is print the numbers 0..9000 in steps of 1000. The calls to testvector() do nothing and can be eliminated. I suspect that your JVM notices this, and is essentially optimising the whole function away.
You can achieve a similar effect in your C++ version by just commenting out the call to testvector()!
Well, this is a pretty useless test that only measures allocation of small objects.
That said, simple changes made me get the running time down from about 15 secs to about 4 secs. New version:
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>
#include <boost/pool/pool_alloc.hpp>

using namespace std;

typedef vector<string, boost::pool_allocator<string> > str_vector;

void testvector(int x, str_vector::iterator it, str_vector::iterator end)
{
    char aux[25] = "X-";
    int a = x * 2000;
    for (; it != end; ++a)
    {
        sprintf(aux + 2, "%d", a);
        *it++ = aux;
    }
}

int main(int argc, char** argv)
{
    str_vector v(2000);
    for (int i = 0; i < 10000; i++)
    {
        if (i % 1000 == 0) cout << i << endl;
        testvector(i, v.begin(), v.begin() + 2000);
    }
    return 0;
}
real 0m4.089s
user 0m3.686s
sys 0m0.000s
Java version has the times:
real 0m2.923s
user 0m2.490s
sys 0m0.063s
(This is my direct java port of your original program, except it passes the ArrayList as a parameter to cut down on useless allocations).
So, to sum up, small allocations are faster on java, and memory management is a bit more hassle in C++. But we knew that already :)
HotSpot optimises the hot spots in code. Typically, it tries to optimise anything that gets executed around 10,000 times.
For this code, after enough iterations it will try to optimise the inner loop that adds the strings to the vector. The optimisation it performs will more than likely include escape analysis of the variables in the method. As the vector is a local variable that never escapes the local context, it is very likely that HotSpot will remove all of the code in the method and turn it into a no-op. To test this, try returning the result from the method. Even then, be careful to do something meaningful with the result - just getting its length, for example, can be optimised away, since HotSpot can see the result is always the same as the number of iterations in the loop.
All of this points to the key benefit of a dynamic compiler like hotspot - using runtime analysis you can optimise what is actually being done at runtime and get rid of redundant code. After all, it doesn't matter how efficient your custom C++ memory allocator is - not executing any code is always going to be faster.
In my box, this program gets executed in approx. 12 seconds; amazingly, I have a similar implementation in Java [using String and ArrayList] and it runs lot faster than my C++ application (approx. 2 seconds).
I cannot reproduce that result.
To account for the optimization mentioned by Alex, I've modified the code so that both the Java and the C++ versions print the last element of the v vector at the end of the testvector method.
Now the C++ code (compiled with -O3) runs about as fast as yours (12 sec). The Java code (straightforward; it uses ArrayList instead of Vector, although I doubt that impacts the performance, thanks to escape analysis) takes about twice that time.
I did not do a lot of testing so this result is by no means significant. It just shows how easy it is to get these tests completely wrong, and how little single tests can say about real performance.
Just for the record, the tests were run on the following configuration:
$ uname -ms
Darwin i386
$ java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03-226)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02-92, mixed mode)
$ g++ --version
i686-apple-darwin9-g++-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5490)
It should help if you use vector::reserve to reserve space for the 2000 elements in v before the loop (however, the same thing should also speed up the Java equivalent of this code).
To suggest why the performance of C++ and Java differs, it would be essential to see the source for both. I can see a number of performance issues in the C++; for some, it would be useful to know whether you do the same in the Java. For example, you flush the output stream via std::endl - do you call System.out.flush() or just append a '\n'? If the latter, you've just given the Java a distinct advantage.
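On the Java side, the analogue of reserving capacity is the ArrayList constructor that takes an initial capacity. A hypothetical sketch of the port (class name and loop counts are mine) that also returns the list so the work stays observable:

```java
import java.util.ArrayList;
import java.util.List;

public class PreallocDemo {
    static List<String> testvector(int x) {
        int a = x * 2000;
        // pre-size the backing array so no growth copies happen during the loop
        List<String> v = new ArrayList<>(2000);
        for (int i = a; i < a + 2000; i++) {
            v.add("X-" + i);
        }
        return v; // returning the list keeps the work from being optimised away
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int total = 0;
        for (int i = 0; i < 1000; i++) {
            total += testvector(i).size();
        }
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println("built " + total + " strings in " + ms + " ms");
    }
}
```

Without the initial capacity, ArrayList grows its backing array repeatedly, which is the same repeated-reallocation cost that vector::reserve avoids in C++.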
What are you actually trying to measure here? Putting ints into a vector?
You can start by pre-allocating space in the vector, since you know its size in advance:
instead of:
void testvector(int x)
{
    vector<string> v;
    int a = x * 2000;
    int z = a + 2000;
    string s("X-");
    for (int i = a; i < z; i++)
        v.push_back(s + to_string(i));
}
try:
void testvector(int x)
{
    int a = x * 2000;
    int z = a + 2000;
    string s("X-");
    vector<string> v;
    v.reserve(z - a); // one allocation up front, no reallocation during the loop
    for (int i = a; i < z; i++)
        v.push_back(s + to_string(i));
}
In your inner loop, you are formatting ints and pushing the resulting strings into a vector. If you single-step that at the machine-code level, I'll bet you find that a lot of the time goes into allocating and formatting the strings, and some goes into the push_back (not to mention deallocation when you release the vector).
This could easily vary between run-time-library implementations, based on the developer's sense of what people would reasonably want to do.
