I have written two programs implementing a simple algorithm for matrix multiplication, one in C++ and one in Java. Contrary to my expectations, the Java program runs about 2.5x faster than the C++ program. I am a novice at C++, and would like suggestions on what I can change in the C++ program to make it run faster.
My programs borrow code and data from this blog post http://martin-thoma.com/matrix-multiplication-python-java-cpp .
Here are the current compilation flags I am using:
g++ -O3 main.cc
javac Main.java
Here are the current compiler/runtime versions:
$ g++ --version
g++.exe (GCC) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ java -version
java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)
My computer is a ~2012-era Core i3 laptop running Windows with MinGW. Here are the current performance results:
$ time ./a.exe < ../Testing/2000.in
507584919
real 0m36.469s
user 0m0.031s
sys 0m0.030s
$ time java Main < ../Testing/2000.in
507584919
real 0m14.299s
user 0m0.031s
sys 0m0.015s
Here is the C++ program:
#include <iostream>
#include <cstdio>
using namespace std;
int *A;
int *B;
int height;
int width;
int * matMult(int A[], int B[]) {
int * C = new int[height*width](); // "()" zero-initializes; a plain new int[n] leaves the values indeterminate before "+="
int n = height;
for (int i = 0; i < n; i++) {
for (int k = 0; k < n; k++) {
for (int j = 0; j < n; j++) {
C[width*i+j]+=A[width*i+k] * B[width*k+j];
}
}
}
return C;
}
int main() {
std::ios::sync_with_stdio(false);
cin >> height;
cin >> width;
A = new int[width*height];
B = new int[width*height];
for (int i = 0; i < width*height; i++) {
cin >> A[i];
}
for (int i = 0; i < width*height; i++) {
cin >> B[i];
}
int *result = matMult(A,B);
cout << result[2];
}
Here is the Java program:
import java.util.*;
import java.io.*;
public class Main {
static int[] A;
static int[] B;
static int height;
static int width;
public static void main(String[] args) {
try {
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
height = Integer.parseInt(reader.readLine());
width = Integer.parseInt(reader.readLine());
A=new int[width*height];
B=new int[width*height];
int index = 0;
String thisLine;
while ((thisLine = reader.readLine()) != null) {
if (thisLine.trim().equals("")) {
break;
} else {
String[] lineArray = thisLine.split("\t");
for (String number : lineArray) {
A[index] = Integer.parseInt(number);
index++;
}
}
}
index = 0;
while ((thisLine = reader.readLine()) != null) {
if (thisLine.trim().equals("")) {
break;
} else {
String[] lineArray = thisLine.split("\t");
for (String number : lineArray) {
B[index] = Integer.parseInt(number);
index++;
}
}
}
int[] result = matMult(A,B);
System.out.println(result[2]);
reader.close();
} catch (Exception e) {
e.printStackTrace();
}
}
public static int[] matMult(int[] A, int[] B) {
int[] C = new int[height*width];
int n = height;
for (int i = 0; i < n; i++) {
for (int k = 0; k < n; k++) {
for (int j = 0; j < n; j++) {
C[width*i+j]+=A[width*i+k] * B[width*k+j];
}
}
}
return C;
}
}
Here is a link to a 2000x2000 test case: https://mega.nz/#!sglWxZqb!HBts_UlZnR4X9gZR7bG-ej3xf2A5vUv0wTDUW-kqFMA
Here is a link to a 2x2 test case: https://mega.nz/#!QwkV2SII!AtfGuxPV5bQeZtt9eHNNn36rnV4sGq0_sJzitjiFE8s
Any advice explaining what I am doing wrong in C++, or why my C++ implementation is running so much slower than Java here, would be much appreciated!
EDIT: As suggested, I modified the programs so that they do not actually perform a multiplication, but just read the arrays in and print out one number from each. Here are the performance results for that. The C++ program has slower IO. That only accounts for part of the difference however.
$ time ./IOonly.exe < ../Testing/2000.in
7
944
real 0m8.158s
user 0m0.000s
sys 0m0.046s
$ time java IOOnly < ../Testing/2000.in
7
944
real 0m1.461s
user 0m0.000s
sys 0m0.047s
I'm not able to analyze the Java execution, since it creates a temporary executable module that disappears after it's been "used". However, I assume that it does execute SSE instructions to get that speed [or that it unrolls the loop, which clang++ does if you disable SSE instructions].
But compiling with g++ (4.9.2) and clang++, I can clearly see that clang optimises the loop to use SSE instructions, where gcc doesn't. The resulting code is thus exactly 4 times slower. Changing the code so that it uses a constant value of 2000 in each dimension [so compiler "knows" the dimensions of the height and width], the gcc compiler also generates code that takes around 8s (on my machine!), compared to 27s with "variable" value [the clang compiled code is marginally faster as well here, but within the noise I'd say].
Overall conclusion: Quality/cleverness of compiler will highly affect the performance of tight loops. The more complex and varied the code is, the more likely it is that the C++ solution will generate better code, where simple and easy to compile problems are quite likely to be better in Java code [as a rule, but not guaranteed]. I expect the Java compiler uses profiling to determine the number of loop iterations, for example.
Edit:
The result of time can be used to determine if the reading of the file is taking a long time, but you need some kind of profiling tool to determine if the actual input is using a lot of CPU-time and such.
The Java engine uses a "just-in-time compiler", which profiles the running code to see how often a particular piece of code is hit (you can do that for C++ too, and big projects often do!). That allows it to, for example, unroll a loop, or determine at runtime the number of iterations in a loop. This code does 2000 * 2000 * 2000 inner-loop iterations, and the C++ compiler actually does a BETTER job when it KNOWS the dimensions. That tells us the Java runtime isn't actually doing better (at least not initially), just that it manages to improve the performance over time.
Unfortunately, due to the way that the Java runtime works, it doesn't leave the binary code behind, so I can't really analyze what it does.
The key here is that the actual operations you are doing are simple, and the logic is simple, it's just an awful lot of them, and you are doing them using a trivial implementation. Both Java and C++ will benefit from manually unrolling the loop, for example.
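As a rough illustration of that kind of manual restructuring, here is a minimal Java sketch. The class and method names are hypothetical, it assumes square n x n matrices stored row-major as in the question, and whether it actually runs faster than the plain loop is something to measure rather than assume. It hoists the loop invariants out of the inner loop and unrolls the inner loop by four:

```java
// Hypothetical sketch: same i-k-j algorithm as the question, with the
// invariants hoisted by hand and the j loop unrolled four times.
class MatMultSketch {
    static int[] multiply(int[] a, int[] b, int n) {
        int[] c = new int[n * n];            // Java zero-initializes arrays
        for (int i = 0; i < n; i++) {
            int rowC = n * i;                // offset of row i in a and c
            for (int k = 0; k < n; k++) {
                int aik = a[rowC + k];       // does not change inside the j loop
                int rowB = n * k;            // offset of row k in b
                int j = 0;
                for (; j + 3 < n; j += 4) {  // unrolled body: four columns per pass
                    c[rowC + j]     += aik * b[rowB + j];
                    c[rowC + j + 1] += aik * b[rowB + j + 1];
                    c[rowC + j + 2] += aik * b[rowB + j + 2];
                    c[rowC + j + 3] += aik * b[rowB + j + 3];
                }
                for (; j < n; j++) {         // leftover columns when n % 4 != 0
                    c[rowC + j] += aik * b[rowB + j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int[] c = multiply(new int[] {1, 2, 3, 4}, new int[] {5, 6, 7, 8}, 2);
        System.out.println(java.util.Arrays.toString(c)); // [19, 22, 43, 50]
    }
}
```

The same transformation carries over to the C++ version almost line for line; an optimizing compiler or JIT may already do part of it, so compare timings before and after.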
C++ is not faster than Java by default
C++ is fast as a language, but as soon as you incorporate libraries into the mix, you are bound to these libraries' speed.
The standard is hardly built for performance, period. The standard libraries are written with design and correctness in mind.
C++ gives you the opportunity to optimize!
If you are unhappy with the standard library's performance, you can, and you should, use your own optimized version.
For example, standard C++ IO objects are beautiful when it comes to design (stream, locales, facets, inner buffers) but that makes them terrible at performance.
If you are writing for Windows OS, you can use ReadFile and WriteConsole as your mechanism for IO.
If you switch to these functions instead of the standard library, your program can outperform Java, potentially by a wide margin; measure it to see how much.
I recently wrote a computation-intensive algorithm in Java, and then translated it to C++. To my surprise the C++ executed considerably slower. I have now written a much shorter Java test program, and a corresponding C++ program - see below. My original code featured a lot of array access, as does the test code. The C++ takes 5.5 times longer to execute (see comment at end of each program).
Conclusions after 1st 21 comments below ...
Test code:
1. g++ -o ... Java 5.5 times faster
2. g++ -O3 -o ... Java 2.9 times faster
3. g++ -fprofile-generate -march=native -O3 -o ... (run, then g++ -fprofile-use etc) Java 1.07 times faster.
My original project (much more complex than test code):
1. Java 1.8 times faster
2. C++ 1.9 times faster
3. C++ 2 times faster
Software environment:
Ubuntu 16.04 (64 bit).
Netbeans 8.2 / jdk 8u121 (java code executed inside netbeans)
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Compilation: g++ -o cpp_test cpp_test.cpp
Java code:
public class JavaTest {
public static void main(String[] args) {
final int ARRAY_LENGTH = 100;
final int FINISH_TRIGGER = 100000000;
int[] intArray = new int[ARRAY_LENGTH];
for (int i = 0; i < ARRAY_LENGTH; i++) intArray[i] = 1;
int i = 0;
boolean finished = false;
long loopCount = 0;
System.out.println("Start");
long startTime = System.nanoTime();
while (!finished) {
loopCount++;
intArray[i]++;
if (intArray[i] >= FINISH_TRIGGER) finished = true;
else if (i <(ARRAY_LENGTH - 1)) i++;
else i = 0;
}
System.out.println("Finish: " + loopCount + " loops; " +
((System.nanoTime() - startTime)/1e9) + " secs");
// 5 executions in range 5.98 - 6.17 secs (each 9999999801 loops)
}
}
C++ code:
//cpp_test.cpp:
#include <iostream>
#include <time.h> // declares timespec and clock_gettime
int main() {
const int ARRAY_LENGTH = 100;
const int FINISH_TRIGGER = 100000000;
int *intArray = new int[ARRAY_LENGTH];
for (int i = 0; i < ARRAY_LENGTH; i++) intArray[i] = 1;
int i = 0;
bool finished = false;
long long loopCount = 0;
std::cout << "Start\n";
timespec ts;
clock_gettime(CLOCK_REALTIME, &ts);
long long startTime = (1000000000*ts.tv_sec) + ts.tv_nsec;
while (!finished) {
loopCount++;
intArray[i]++;
if (intArray[i] >= FINISH_TRIGGER) finished = true;
else if (i < (ARRAY_LENGTH - 1)) i++;
else i = 0;
}
clock_gettime(CLOCK_REALTIME, &ts);
double elapsedTime =
((1000000000*ts.tv_sec) + ts.tv_nsec - startTime)/1e9;
std::cout << "Finish: " << loopCount << " loops; ";
std::cout << elapsedTime << " secs\n";
// 5 executions in range 33.07 - 33.45 secs (each 9999999801 loops)
}
The only time I could get the C++ program to outperform Java was when using profiling information. This shows that there's something in the runtime information (that Java gets by default) that allows for faster execution.
There's not much going on in your program apart from a non-trivial if statement. That is, without analysing the entire program, it's hard to predict which branch is most likely. This leads me to believe that this is a branch misprediction issue. Modern CPUs do instruction pipelining which allows for higher CPU throughput. However, this requires a prediction of what the next instructions to execute are. If the guess is wrong, the instruction pipeline must be cleared out, and the correct instructions loaded in (which takes time).
At compile time, the compiler doesn't have enough information to predict which branch is most likely. CPUs do some branch prediction of their own, but the default guess is generally along the lines of "loops keep looping" and "an if takes its if branch" (rather than the else).
Java, however, has the advantage of being able to use information at runtime as well as compile time. This allows Java to identify the middle branch as the one that occurs most frequently and so have this branch predicted for the pipeline.
Somehow both GCC and clang fail to unroll this loop and pull out the invariants even in -O3 and -Os, but Java does.
Java's final JITted assembly code is similar to this (in reality repeated twice):
while (true) {
loopCount++;
if (++intArray[i++] >= FINISH_TRIGGER) break;
loopCount++;
if (++intArray[i++] >= FINISH_TRIGGER) break;
loopCount++;
if (++intArray[i++] >= FINISH_TRIGGER) break;
loopCount++;
if (++intArray[i++] >= FINISH_TRIGGER) { if (i >= ARRAY_LENGTH) i = 0; break; }
if (i >= ARRAY_LENGTH) i = 0;
}
With this loop I'm getting exact same timings (6.4s) between C++ and Java.
Why is this legal to do? Because ARRAY_LENGTH is 100, which is a multiple of 4. So i can only exceed 100 and be reset to 0 every 4 iterations.
This looks like an opportunity for improvement for GCC and clang; they fail to unroll loops for which the total number of iterations is unknown, but even if unrolling is forced, they fail to recognize parts of the loop that apply to only certain iterations.
Regarding your findings in a more complex code (a.k.a. real life): Java's optimizer is exceptionally good for small loops, a lot of thought has been put into that, but Java loses a lot of time on virtual calls and GC.
In the end it comes down to machine instructions running on a concrete architecture; whoever comes up with the best set wins. Don't assume the compiler will "do the right thing": look at the generated code, profile, repeat.
For example, if you restructure your loop just a bit:
while (!finished) {
for (i=0; i<ARRAY_LENGTH; ++i) {
loopCount++;
if (++intArray[i] >= FINISH_TRIGGER) {
finished=true;
break;
}
}
}
Then C++ will outperform Java (5.9s vs 6.4s). (revised C++ assembly)
And if you can allow a slight overrun (increment more intArray elements after reaching the exit condition):
while (!finished) {
for (int i=0; i<ARRAY_LENGTH; ++i) {
++intArray[i];
}
loopCount+=ARRAY_LENGTH;
for (int i=0; i<ARRAY_LENGTH; ++i) {
if (intArray[i] >= FINISH_TRIGGER) {
loopCount-=ARRAY_LENGTH-i-1;
finished=true;
break;
}
}
}
Now clang is able to vectorize the loop and reaches the speed of 3.5s vs. Java's 4.8s (GCC is unfortunately still not able to vectorize it).
I was trying to measure the time to execute this loop :
for (boolean t : test) {
if (!t)
++count;
}
And was getting inconsistent results. Eventually I have managed to get consistent results with the following code :
public class Test {
public static void main(String[] args) {
int size = 100;
boolean[] test = new boolean[10_000_000];
java.util.Random r = new java.util.Random();
for (int n = 0; n < 10_000_000; ++n)
test[n] = !r.nextBoolean();
int expected = 0;
long acumulated = 0;
for (int repeat = -1; repeat < size; ++repeat) {
int count = 0;
long start = System.currentTimeMillis();
for (boolean t : test) {
if (!t)
++count;
}
long end = System.currentTimeMillis();
if (repeat != -1) // First run does not count, VM warming up
acumulated += end - start;
else // Use count to avoid compiler or JVM
expected = count; //optimization of inner loop
if ( count!=expected )
throw new Error("Tests don't run same ammount of times");
}
float average = (float) acumulated / size;
System.out.println("1st test : " + average);
int expectedBis = 0;
acumulated = 0;
if ( "reassign".equals(args[0])) {
for (int n = 0; n < 10_000_000; ++n)
test[n] = test[n];
}
for (int repeat = -1; repeat < size; ++repeat) {
int count = 0;
long start = System.currentTimeMillis();
for (boolean t : test) {
if (!t)
++count;
}
long end = System.currentTimeMillis();
if (repeat != -1) // First run does not count, VM warming up
acumulated += end - start;
else // Use count to avoid compiler or JVM
expectedBis = count; //optimization of inner loop
if ( count!=expected || count!=expectedBis)
throw new Error("Tests don't run same ammount of times");
}
average = (float) acumulated / size;
System.out.println("2nd test : " + average);
}
}
The results I get are :
$ java -jar Test.jar noreassign
1st test : 23.98
2nd test : 23.97
$ java -jar Test.jar reassign
1st test : 23.98
2nd test : 40.86
$ java -version
java version "1.7.0_79"
OpenJDK Runtime Environment (IcedTea 2.5.5) (Gentoo package icedtea-7.2.5.5)
OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)
The only difference is whether or not this loop runs before the 2nd test:
for (int n = 0; n < 10_000_000; ++n)
test[n] = test[n];
Why? Why does doing that reassignment cause the loops after it to take almost twice the time?
Getting profiling right is hard...
"As for why the JIT compiler causes such behaviour... that is beyond my skill and knowledge."
Three basic facts:
Code runs faster after JIT compilation.
JIT compilation is triggered after a chunk of code has run for a bit. (How long "a bit" is depends on the JVM platform and command line options.)
JIT compilation takes time.
In your case, when you insert the big assignment loop between test 1 and test 2, you are most likely moving the time point at which JIT compilation is triggered ... from during test 2 to between the 2 tests.
The simple way to address this in this case is to put the body of main into a loop and run it repeatedly. Then discard the anomalous results in the first few runs.
(Turning off JIT compilation is not a good answer. Normally, it is the performance characteristics of code after JIT compilation that is going to be indicative of how a real application performs ...)
By setting the compiler to NONE, you are disabling JIT compilation, taking it out of the equation.
This kind of anomaly is common when people attempt to write micro-benchmarks by hand. Read this Q&A:
How do I write a correct micro-benchmark in Java?
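The "run it repeatedly and discard the first few results" advice above could look roughly like the following. This is a minimal hand-rolled sketch, not a substitute for a real benchmark harness such as JMH; the class name, method names, and the choice of 60 rounds with 10 discarded are all assumptions made for illustration:

```java
import java.util.Random;

// Hypothetical sketch: time the same loop many times and ignore the
// early rounds, where JIT compilation is most likely to be happening.
class WarmupSketch {
    // The loop being measured: counts the false entries.
    static int countFalse(boolean[] data) {
        int count = 0;
        for (boolean t : data) {
            if (!t) ++count;
        }
        return count;
    }

    // Runs the measured loop `rounds` times; returns nanoseconds per round.
    static long[] timeRounds(boolean[] data, int rounds) {
        long[] nanos = new long[rounds];
        int expected = countFalse(data);
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            int count = countFalse(data);
            nanos[r] = System.nanoTime() - start;
            // Using the result keeps the JIT from discarding the loop.
            if (count != expected) throw new AssertionError("result changed");
        }
        return nanos;
    }

    // Average over all rounds except the first `discard` warm-up rounds.
    static double averageAfterWarmup(long[] nanos, int discard) {
        long sum = 0;
        for (int r = discard; r < nanos.length; r++) sum += nanos[r];
        return (double) sum / (nanos.length - discard);
    }

    public static void main(String[] args) {
        boolean[] data = new boolean[10_000_000];
        Random rnd = new Random(42);
        for (int n = 0; n < data.length; n++) data[n] = rnd.nextBoolean();
        long[] nanos = timeRounds(data, 60);
        System.out.printf("average after warm-up: %.2f ms%n",
                averageAfterWarmup(nanos, 10) / 1e6);
    }
}
```

Printing the individual entries of `nanos` as well makes it easy to see at which round the timings settle, which is a better guide for the discard count than a fixed guess.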
I would add this as a comment, but my reputation is too low, so it must be added as an answer.
I've created a jar with your exact code, and ran it several times. I also copied the code to C# and ran it in the .NET runtime as well.
Both Java and C# show the same exact time, with and without the 'reassign' loop.
What timing are you getting if you change the loop to
if ( "reassign".equals(args[0])) {
for (int n = 0; n < 5_000_000; ++n)
test[n] = test[n];
}
?
Marko Topolnik's and rossum's comments pointed me in the right direction.
It is a JIT compiler issue.
If I disable the JIT compiler I get these results :
$ java -jar Test.jar -Djava.compiler=NONE noreassign
1st test : 19.23
2nd test : 19.33
$ java -jar Test.jar -Djava.compiler=NONE reassign
1st test : 19.23
2nd test : 19.32
The strange slowdown disappears once the JIT compiler is deactivated.
As for why the JIT compiler causes such behaviour... that is beyond my skill and knowledge.
But it does not happen in all JVMs as Marius Dornean's tests show.
Has anyone compared the processing power of mobile devices with PCs? I have a very simple piece of matrix work. Coded in Java, it takes ~115 ms for my old PC to finish. On my tablet, THE VERY VERY SAME FUNCTION takes 17000 ms. I was very shocked. I didn't expect the tablet to be close to the PC - but I didn't expect it to be ~150x slower either!!
Has anyone had a similar experience? Any suggestion? Does it help if I write the code in C and use Android NDK?
The benchmark code in Java:
package mainpackage;
import java.util.Date;
public class mainclass {
public static void main(String[] args){
Date startD = new Date();
double[][] testOut;
double[] v = {1,0,0};
double t;
for (int i = 0; i < 100000; i++) {
t=Math.random();
testOut=rot_mat(v, t);
}
Date endD = new Date();
System.out.println("Time Taken ms: "+(-startD.getTime()+endD.getTime()));
}
public static double[][] rot_mat(double v[], double t)
{
double absolute;
double x[] = new double[3];
double temp[][] = new double[3][3];
double temp_2[][] = new double[3][3];
double sum;
int i;
int k;
int j;
// Normalize the v matrix into k
absolute = abs_val_vec(v);
for (i = 0; i < 3; i++)
{
x[i] = v[i] / absolute;
}
// Create 3x3 matrix kx
double kx[][] = {{0, -x[2], x[1]},
{x[2], 0, -x[0]},
{-x[1], x[0], 0}};
// Calculate output
// Calculate third term in output
for (i = 0; i < 3; i++)
{
for (j = 0; j < 3; j++)
{
sum = 0;
for (k = 0; k < 3; k++)
{
sum = sum + kx[i][k] * kx[k][j];
}
temp[i][j] = (1-Math.cos(t))*sum;
}
}
// Calculate second term in output
for (i = 0; i < 3; i++)
{
for (k = 0; k < 3; k++)
{
temp_2[i][k] = Math.sin(t)*kx[i][k];
}
}
// Calculate output
double[][] resOut = new double[3][3];
for (i = 0; i < 3; i++)
{
for (k = 0; k < 3; k++)
{
resOut[i][k] = temp_2[i][k] + temp[i][k] + ((i==k)?1:0);
}
}
return resOut;
}
private static double abs_val_vec (double v[])
{
double output;
output = Math.sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
return output;
}
}
Any suggestion?
Micro-benchmarks only measure the performance of the micro-benchmark. And, the only decent way to interpret micro-benchmarks is with micro-measurements. Hence, savvy programmers would use tools like Traceview to get a better sense of where their time is being taken.
I suspect that if you ran this through Traceview, and looked at LogCat, you would find that your time is being spent in two areas:
Memory allocation and garbage collection. Your micro-benchmark is chewing through ~3MB of heap space. In production code, you'd never do that, at least if you wanted to keep your job.
Floating-point operations. Depending upon your tablet, you may not have a floating-point co-processor, and doing floating-point math on the CPU sans a floating-point co-processor is very very slow.
Does it help if I write the code in C and use Android NDK?
Well, until you profile the code under Traceview, that will be difficult to answer. For example, if the time is mostly spent in sqrt(), cos(), and sin(), that already is native code, and you won't get much faster.
More importantly, even if this micro-benchmark might improve with native code, all that does is demonstrate that this micro-benchmark might improve with native code. For example, a C translation of this might be faster due to manual heap management (malloc() and free()) rather than garbage collection. But that is more an indictment of how poorly the micro-benchmark was written than it is a statement about how much faster C will be, as production Java code would be optimized better than this.
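To make the allocation point concrete, here is a hedged Java sketch of how the benchmark's function could write into a caller-supplied buffer instead of allocating several arrays per call. The class and method names are hypothetical. It also replaces the explicit matrix-product loops with the identity K^2 = x x^T - I, which holds for a unit vector x; that is a simplification introduced here, not something from the original code:

```java
// Hypothetical sketch: rotation matrix R = I + sin(t)*K + (1 - cos(t))*K^2
// written into a reusable 3x3 buffer, so the hot loop allocates nothing.
class RotMatSketch {
    static void rotMatInto(double[] v, double t, double[][] out) {
        double norm = Math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
        double x = v[0] / norm, y = v[1] / norm, z = v[2] / norm;
        double s = Math.sin(t);
        double c1 = 1 - Math.cos(t);
        // K is the cross-product matrix of (x, y, z); for a unit vector,
        // K^2[i][j] = x_i * x_j - (i == j ? 1 : 0).
        out[0][0] = 1 + c1 * (x * x - 1);
        out[0][1] = -s * z + c1 * x * y;
        out[0][2] =  s * y + c1 * x * z;
        out[1][0] =  s * z + c1 * x * y;
        out[1][1] = 1 + c1 * (y * y - 1);
        out[1][2] = -s * x + c1 * y * z;
        out[2][0] = -s * y + c1 * x * z;
        out[2][1] =  s * x + c1 * y * z;
        out[2][2] = 1 + c1 * (z * z - 1);
    }

    public static void main(String[] args) {
        double[] v = {1, 0, 0};
        double[][] out = new double[3][3];   // allocated once, reused every call
        long start = System.nanoTime();
        for (int i = 0; i < 100000; i++) {
            rotMatInto(v, Math.random(), out);
        }
        System.out.println("Time taken ms: " + (System.nanoTime() - start) / 1_000_000);
    }
}
```

If the timings on the device barely change with this version, that would suggest the trigonometric calls or the floating-point hardware dominate, rather than allocation and garbage collection.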
Beyond learning how to use Traceview, I suggest:
Reading the NDK documentation, as it includes information about when native code may make sense.
Reading up on Renderscript Compute. On some devices, using Renderscript Compute can offload integer math onto the GPU, for a massive performance boost. That would not help your floating-point micro-benchmark, but for other matrix calculations (e.g., image processing), Renderscript Compute may be well worth researching.
Processing power alone is not everything when you compare very different architectures. In fact, you're very likely not benchmarking the computing architectures alone.
A key principle in benchmarking: when you're dealing with something that depends on a lot of variables, isolate the one you want to test, and keep the others constant and preferably equal.
In your situation, some examples for variables that affect your result:
the actual computing architecture, which is a complex set of variables itself (processor design and implementation, memory hierarchy etc)
the OS
the different Java Virtual Machine implementation for the different variables above
the additional layers that Dalvik implies
There are at least eight sets of comparisons between PCs and Android devices for my numerous Android benchmarks in the following. Below are results from my Linpack benchmark (including Java) that show the Androids in a better light than your results. Other results (like Dhrystone) show that, on a per MHz basis, ARM’s CPUs can match Intel’s.
http://www.roylongbottom.org.uk/android%20benchmarks.htm
Linpack Benchmark Results
System ARM MHz Android Linpackv5 Linpackv7 LinpackSP NEONLinpack LinpackJava
See MFLOPS MFLOPS MFLOPS MFLOPS MFLOPS
T1 926EJ 800 2.2 5.63 5.67 9.61 N/A 2.33
P4 v7-A8 800 2.3.5 80.18 28.34 #G
T2 v7-A9 800 2.3.4 10.56 101.39 129.05 255.77 33.36
P5 v7-A9 1500 4.0.3 171.39 50.87 #G
T4 v7-A9 1500a 4.0.3 16.86 155.52 204.61 382.46 56.89
T6 v7-A9 1600 4.0.3 196.47
T7 v7-A9 1300a 4.1.2 17.08 151.05 201.30 376.00 56.44
T9 926EJ 800 2.2 5.66
T11 v7-A15 2000b 4.2.2 28.82 459.17 803.04 1334.90 143.06
T12 v7-A9 1600 4.1.2 147.07
T14 v7-A9 1500 4.0.4 180.95
P11 v7-A9 1400 4.0.4 19.89 184.44 235.54 454.21 56.99
P10 QU-S4 1500 4.0.3 254.90
Measured MHz a=1200, b=1700
Atom 1666 Linux 204.09 215.73 117.81
Atom 1666 Windows 183.22 118.70
Atom 1666 And x86 15.65
Core 2 2400 Linux 1288.00 901.00
Core 2 2400 Windows 1315.29 551.00
Core 2 2400 And x86 53.27
System - T = Tablet, P = Phone, #G = GreenComputing, QU = Qualcomm CPU
And x86 = Android x86
My general experience with Java 7 tells me that it is faster than Java 6. However, I've run into enough information that makes me believe that this is not always the case.
The first bit of information comes from Minecraft Snooper data found here. My intention was to look at that data to determine the effects of the different switches used to launch Minecraft. For example I wanted to know if using -Xmx4096m had a negative or positive effect on performance. Before I could get there I looked at the different version of Java being used. It covers everything from 1.5 to a developer using 1.8. In general as you increase the java version you see an increase in fps performance. Throughout the different versions of 1.6 you even see this gradual trend up. I honestly wasn't expecting to see as many different versions of java still in the wild but I guess people don't run the updates like they should.
Some time around the later versions of 1.6 you get the highest peaks. 1.7 performs about 10fps on average below the later versions of 1.6 but still higher than the early versions of 1.6. On a sample from my own system it's almost impossible to see the difference but when looking at the broader sample it's clear.
To control for the possibility that someone might have found a magic switch for Java, I only looked at the data with no switches being passed. That way I'd have a reasonable control before I started looking at the different flags.
I dismissed most of what I was seeing as this could be some Magic Java 6 that someone's just not sharing with me.
Now I've been working on another project that requires me to pass an array in an InputStream to be processed by another API. Initially I used a ByteArrayInputStream because it would work out of the box. When I looked at the code for it I noticed that every function was synchronized. Since this was unnecessary for this project I rewrote one with the synchronization stripped out. I then decided that I wanted to know what the general cost of Synchronization was for me in this situation.
I mocked up a simple test just to see. I timed everything with System.nanoTime() and used Java 1.6_20 x86, 1.7.0-b147 AMD64, and 1.7_15 AMD64, as well as the -server option. I expected the AMD64 versions to outperform based on architecture alone, plus any Java 7 advantages. I also looked at the 25th, 50th, and 75th percentile (blue,red,green). However, 1.6 with no -server beat the pants off of every other configuration.
So my question is.
What is in the 1.6 -server option that is impacting performance that is also defaulted to on in 1.7?
I know most of the speed enhancement in 1.7 came from defaulting some of the more radical performance options in 1.6 to on, but one of them is causing a performance difference. I just don't know which ones to look at.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
public class ByteInputStream extends InputStream {
public static void main(String args[]) throws IOException {
String song = "This is the song that never ends";
byte[] data = song.getBytes();
byte[] read = new byte[data.length];
ByteArrayInputStream bais = new ByteArrayInputStream(data);
ByteInputStream bis = new ByteInputStream(data);
long startTime, endTime;
for (int i = 0; i < 10; i++) {
/*code for ByteInputStream*/
/*
startTime = System.nanoTime();
for (int ctr = 0; ctr < 1000; ctr++) {
bis.mark(0);
bis.read(read);
bis.reset();
}
endTime = System.nanoTime();
System.out.println(endTime - startTime);
*/
/*code for ByteArrayInputStream*/
startTime = System.nanoTime();
for (int ctr = 0; ctr < 1000; ctr++) {
bais.mark(0);
bais.read(read);
bais.reset();
}
endTime = System.nanoTime();
System.out.println(endTime - startTime);
}
}
private final byte[] array;
private int pos;
private int min;
private int max;
private int mark;
public ByteInputStream(byte[] array) {
this(array, 0, array.length);
}
public ByteInputStream(byte[] array, int offset, int length) {
min = offset;
max = offset + length;
this.array = array;
pos = offset;
}
@Override
public int available() {
return max - pos;
}
@Override
public boolean markSupported() {
return true;
}
@Override
public void mark(int limit) {
mark = pos;
}
@Override
public void reset() {
pos = mark;
}
@Override
public long skip(long n) {
pos += n;
if (pos > max) {
pos = max;
}
return pos;
}
@Override
public int read() throws IOException {
if (pos >= max) {
return -1;
}
return array[pos++] & 0xFF;
}
@Override
public int read(byte b[], int off, int len) {
if (pos >= max) {
return -1;
}
if (pos + len > max) {
len = max - pos;
}
if (len <= 0) {
return 0;
}
System.arraycopy(array, pos, b, off, len);
pos += len;
return len;
}
@Override
public void close() throws IOException {
}
}// end class
I think, as the others are saying, that your tests are too short to see the core issues - the graph is showing nanoTime, and that implies the core section being measured completes in 0.0001 to 0.0006s.
Discussion
The key difference in -server and -client is that -server expects the JVM to be around for a long time and therefore expends effort early on for better long-term results. -client aims for fast startup times and good-enough performance.
In particular hotspot runs with more optimizations, and these take more CPU to execute. In other words, with -server, you may be seeing the cost of the optimizer outweighing any gains from the optimization.
See Real differences between "java -server" and "java -client"?
Alternatively, you may also be seeing the effects of tiered compilation where, in Java 7, hotspot doesn't kick in so fast. With only 1000 iterations, the full optimization of your code won't be done until later, and the benefits will therefore be lesser.
You might get some insight if you run java with the -Xprof option: the JVM will then dump some data about the time spent in various methods, both interpreted and compiled. It should give an idea about what was compiled, and the ratio of (cpu) time before hotspot kicked in.
However, to get a true picture, you really need to run this much longer - minutes, not milliseconds - to allow Java and the OS to warm up. It would be even better to loop the test in main (so you have a loop containing your instrumented main test loop) so that you can ignore the warm-up.
EDIT Changed seconds to minutes to ensure that hotspot, the jvm and the OS are properly 'warmed up'
I was playing with Go language concurrency and found something which is kinda opaque to me.
I wrote parallel matrix multiplication, that is, each task computes single line of product matrix, multiplying corresponding rows and columns of source matrices.
Here is Java program
public static double[][] parallelMultiply(int nthreads, final double[][] m1, final double[][] m2) {
final int n = m1.length, m = m1[0].length, l = m2[0].length;
assert m1[0].length == m2.length;
double[][] r = new double[n][];
ExecutorService e = Executors.newFixedThreadPool(nthreads);
List<Future<double[]>> results = new LinkedList<Future<double[]>>();
for (int ii = 0; ii < n; ++ii) {
final int i = ii;
Future<double[]> result = e.submit(new Callable<double[]>() {
public double[] call() throws Exception {
double[] row = new double[l];
for (int j = 0; j < l; ++j) {
for (int k = 0; k < m; ++k) {
row[j] += m1[i][k]*m2[k][j];
}
}
return row;
}
});
results.add(result);
}
try {
e.shutdown();
e.awaitTermination(1, TimeUnit.HOURS);
int i = 0;
for (Future<double[]> result : results) {
r[i] = result.get();
++i;
}
} catch (Exception ex) {
ex.printStackTrace();
return null;
}
return r;
}
and this is Go program
type Matrix struct {
n, m int
data [][]float64
}
func New(n, m int) *Matrix {
data := make([][]float64, n)
for i, _ := range data {
data[i] = make([]float64, m)
}
return &Matrix{n, m, data}
}
func (m *Matrix) Get(i, j int) float64 {
return m.data[i][j]
}
func (m *Matrix) Set(i, j int, v float64) {
m.data[i][j] = v
}
func MultiplyParallel(m1, m2 *Matrix) *Matrix {
r := New(m1.n, m2.m)
c := make(chan interface{}, m1.n)
for i := 0; i < m1.n; i++ {
go func(i int) {
innerLoop(r, m1, m2, i)
c <- nil
}(i)
}
for i := 0; i < m1.n; i++ {
<-c
}
return r
}
func innerLoop(r, m1, m2 *Matrix, i int) {
for j := 0; j < m2.m; j++ {
s := 0.0
for k := 0; k < m1.m; k++ {
s = s + m1.Get(i, k) * m2.Get(k, j)
}
r.Set(i, j, s)
}
}
When I run the Java program with nthreads=1 and nthreads=2, there is a nearly twofold speedup on my dual-core N450 Atom netbook.
When I run the Go program with GOMAXPROCS=1 and GOMAXPROCS=2, there is no speedup at all!
Even though the Java code uses additional storage for Futures and then collects their values into the result matrix, instead of updating the array directly in the worker code (which is what the Go version does), it runs much faster on several cores than the Go version.
Especially funny is that the Go version with GOMAXPROCS=2 loads both cores (htop displays 100% load on both processors while the program runs), but the computation time is the same as with GOMAXPROCS=1 (htop displays 100% load on only one core in this case).
Another concern is that the Java program is faster than the Go one even in simple single-threaded multiplication, but that is not exactly unexpected (taking benchmarks from here into account) and should not affect the multicore speedup factor.
What am I doing wrong here? Is there a way to speed up the Go program?
UPD:
It seems I found what I was doing wrong. I was timing the Java program with System.currentTimeMillis() and the Go program with the time shell command. I mistakenly took the 'user' time from the zsh output as the program's running time instead of the 'total' one. Now I have double-checked the computation speed, and Go gives me a nearly twofold speedup too (though slightly less than Java's):
% time env GOMAXPROCS=2 ./4-2-go -n 500 -q
env GOMAXPROCS=2 ./4-2-go -n 500 -q 22,34s user 0,04s system 99% cpu 22,483 total
% time env GOMAXPROCS=2 ./4-2-go -n 500 -q -p
env GOMAXPROCS=2 ./4-2-go -n 500 -q -p 24,09s user 0,10s system 184% cpu 13,080 total
Seems I have to be more attentive.
Still, the Java program takes about one fifth of the time on the same case. But that is a matter for another question, I think.
You are probably experiencing the effects of false sharing. In a nutshell, if two pieces of data happen to fall onto the same CPU cache line, modifying these two pieces of data from threads that execute on different CPU cores will trigger the expensive cache coherency protocol.
This kind of cache "ping-pong" is extremely hard to diagnose, and can happen on logically completely unrelated data, just because they happen to be placed close enough in memory. The 100% CPU load is typical of false sharing - your cores really are working 100%, they are just not working on your program - they are working on synchronizing their caches.
The fact that in the Java program each task works on thread-private data until the time comes to "integrate" it into the final result is what saves you from false sharing. I'm not familiar with Go, but judging by your own words, the threads write directly to the common array, which is exactly the kind of thing that could trigger false sharing. This is an example of how perfectly valid single-threaded reasoning can do exactly the opposite in a multi-threaded environment!
For more in-depth discussion on the topic, I warmly recommend Herb Sutter's article: Eliminate False Sharing, or a lecture: Machine Architecture: Things Your Programming Language Never Told You (and associated PDF slides).
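For readers who want to see the effect for themselves, here is a small Java sketch of a false-sharing experiment. Everything in it is illustrative: the class and method names are hypothetical, the 64-byte cache line is an assumption about the hardware, and because the JVM decides where the array sits in memory, a stride of 1 only makes it likely (not certain) that both slots share a line. Each thread updates only its own slot, so the results are always correct; only the timing should differ.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch: two threads each increment their own slot.
// stride = 1  -> slots are 8 bytes apart, probably on the same cache line.
// stride = 16 -> slots are 128 bytes apart, assumed to be on different lines.
class FalseSharingSketch {
    static AtomicLongArray run(int stride, long iters) {
        AtomicLongArray slots = new AtomicLongArray(stride + 1);
        // Atomic increments force real memory traffic on every iteration,
        // so the JIT cannot collapse the loop into a single addition.
        Thread a = new Thread(() -> {
            for (long n = 0; n < iters; n++) slots.incrementAndGet(0);
        });
        Thread b = new Thread(() -> {
            for (long n = 0; n < iters; n++) slots.incrementAndGet(stride);
        });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return slots;
    }

    public static void main(String[] args) {
        for (int stride : new int[] {1, 16}) {
            long start = System.nanoTime();
            run(stride, 50_000_000L);
            System.out.printf("stride %2d: %.0f ms%n",
                    stride, (System.nanoTime() - start) / 1e6);
        }
    }
}
```

On a multi-core machine the stride-1 run would be expected to take noticeably longer if false sharing is occurring, but the size of the gap depends on the CPU, so treat any single run as a hint rather than a measurement.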
If you are able to run this code in a Linux environment, you can use perf to measure the false sharing effect.
For Linux, Windows 32 and ditto 64 there are also AMD's CodeXL and CodeAnalyst. They will profile an application running on an AMD processor in much greater detail than one from Intel, since the applicable performance registers are different.