No speedup in multithread program - java

I was playing with Go concurrency and found something that is kind of opaque to me.
I wrote a parallel matrix multiplication: each task computes a single row of the product matrix by multiplying the corresponding rows and columns of the source matrices.
Here is the Java program:
public static double[][] parallelMultiply(int nthreads, final double[][] m1, final double[][] m2) {
    final int n = m1.length, m = m1[0].length, l = m2[0].length;
    assert m1[0].length == m2.length;
    double[][] r = new double[n][];
    ExecutorService e = Executors.newFixedThreadPool(nthreads);
    List<Future<double[]>> results = new LinkedList<Future<double[]>>();
    for (int ii = 0; ii < n; ++ii) {
        final int i = ii;
        Future<double[]> result = e.submit(new Callable<double[]>() {
            public double[] call() throws Exception {
                double[] row = new double[l];
                for (int j = 0; j < l; ++j) {
                    for (int k = 0; k < m; ++k) {
                        row[j] += m1[i][k]*m2[k][j];
                    }
                }
                return row;
            }
        });
        results.add(result);
    }
    try {
        e.shutdown();
        e.awaitTermination(1, TimeUnit.HOURS);
        int i = 0;
        for (Future<double[]> result : results) {
            r[i] = result.get();
            ++i;
        }
    } catch (Exception ex) {
        ex.printStackTrace();
        return null;
    }
    return r;
}
and this is the Go program:
type Matrix struct {
    n, m int
    data [][]float64
}

func New(n, m int) *Matrix {
    data := make([][]float64, n)
    for i := range data {
        data[i] = make([]float64, m)
    }
    return &Matrix{n, m, data}
}

func (m *Matrix) Get(i, j int) float64 {
    return m.data[i][j]
}

func (m *Matrix) Set(i, j int, v float64) {
    m.data[i][j] = v
}

func MultiplyParallel(m1, m2 *Matrix) *Matrix {
    r := New(m1.n, m2.m)
    c := make(chan interface{}, m1.n)
    for i := 0; i < m1.n; i++ {
        go func(i int) {
            innerLoop(r, m1, m2, i)
            c <- nil
        }(i)
    }
    for i := 0; i < m1.n; i++ {
        <-c
    }
    return r
}

func innerLoop(r, m1, m2 *Matrix, i int) {
    for j := 0; j < m2.m; j++ {
        s := 0.0
        for k := 0; k < m1.m; k++ {
            s = s + m1.Get(i, k)*m2.Get(k, j)
        }
        r.Set(i, j, s)
    }
}
When I run the Java program with nthreads=1 and nthreads=2, there is nearly a 2x speedup on my dual-core N450 Atom netbook.
When I run the Go program with GOMAXPROCS=1 and GOMAXPROCS=2, there is no speedup at all!
Even though the Java code uses additional storage for the Futures and then collects their values into the result matrix, instead of updating the array directly in the worker code (which is what the Go version does), it performs much faster on several cores than the Go version.
What's especially odd is that the Go version with GOMAXPROCS=2 loads both cores (htop displays 100% load on both processors while the program runs), yet the computation time is the same as with GOMAXPROCS=1 (where htop displays 100% load on only one core).
Another concern is that the Java program is faster than the Go one even in simple single-threaded multiplication, but that is not exactly unexpected (taking the benchmarks from here into account) and should not affect the multicore speedup factor.
What am I doing incorrectly here? Is there a way to speed up the Go program?
UPD:
It seems I found what I was doing incorrectly. I was timing the Java program with System.currentTimeMillis() and the Go program with the time shell command. I mistakenly took the 'user' time from the zsh output as the program's running time instead of the 'total' time. I double-checked the computation speed, and the Go program also gives a nearly 2x speedup (though slightly less than Java's):
% time env GOMAXPROCS=2 ./4-2-go -n 500 -q
env GOMAXPROCS=2 ./4-2-go -n 500 -q 22,34s user 0,04s system 99% cpu 22,483 total
% time env GOMAXPROCS=2 ./4-2-go -n 500 -q -p
env GOMAXPROCS=2 ./4-2-go -n 500 -q -p 24,09s user 0,10s system 184% cpu 13,080 total
It seems I have to be more attentive.
Still, the Java program runs about five times faster on the same case, but that is a matter for another question, I think.

You are probably experiencing the effects of false sharing. In a nutshell, if two pieces of data happen to fall onto the same CPU cache line, modifying these two pieces of data from threads that execute on different CPU cores will trigger the expensive cache coherency protocol.
This kind of cache "ping-pong" is extremely hard to diagnose, and can happen on logically completely unrelated data, just because they happen to be placed close enough in memory. The 100% CPU load is typical of false sharing - your cores really are working 100%, they are just not working on your program - they are working on synchronizing their caches.
The fact that in the Java program you have thread-private data until the time comes to "integrate" it into the final result is what saves you from false sharing. I'm not familiar with Go, but judging from your own words, the goroutines write directly to the shared array, which is exactly the kind of thing that could trigger false sharing. This is an example of how perfectly valid single-threaded reasoning does exactly the opposite in a multi-threaded environment!
For a more in-depth discussion of the topic, I warmly recommend Herb Sutter's article Eliminate False Sharing, or his lecture Machine Architecture: Things Your Programming Language Never Told You (and the associated PDF slides).
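If you want to see the effect in isolation, here is a minimal, self-contained Java sketch (illustrative only, not your code; the iteration count and cache-line assumptions are mine): two threads bump counters that sit either in adjacent array slots or roughly a cache line apart. On most multi-core x86 machines the "adjacent" run is noticeably slower.
import java.util.concurrent.atomic.AtomicLongArray;

public class FalseSharingDemo {
    static final int ITERS = 20_000_000;

    // Two threads increment two counters; only their distance in memory differs.
    static long run(AtomicLongArray counters, int idxA, int idxB) throws InterruptedException {
        Thread t1 = new Thread(() -> { for (int i = 0; i < ITERS; i++) counters.incrementAndGet(idxA); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < ITERS; i++) counters.incrementAndGet(idxB); });
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLongArray counters = new AtomicLongArray(32);
        System.out.println("adjacent slots:       " + run(counters, 0, 1) + " ms");
        System.out.println("far apart (128 bytes): " + run(counters, 0, 16) + " ms"); // 16 longs apart
    }
}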

If you are able to run this code in a Linux environment, you can use perf to measure the false sharing effect.
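For example (assuming a reasonably recent kernel and perf build; the available event names vary by CPU), something along these lines shows whether the parallel run is paying for cache-line contention:
perf stat -e cache-references,cache-misses ./4-2-go -n 500 -q -p
perf c2c record ./4-2-go -n 500 -q -p    # then inspect cache-line contention with: perf c2c report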

For Linux and Windows (32- and 64-bit) there are also AMD's CodeXL and CodeAnalyst. They will profile an application running on an AMD processor in much greater detail than one from Intel, since the applicable performance registers are different.

Related

Time how long a function runs (short duration)

I'm relatively new to Java programming, and I'm running into an issue calculating the amount of time it takes for a function to run.
First some background - I've got a lot of experience with Python, and I'm trying to recreate the functionality of the Jupyter Notebook/Lab %%timeit function, if you're familiar with that. Here's a pic of it in action (sorry, not enough karma to embed yet):
Snip of Jupyter %%timeit
What it does is run the contents of the cell (in this case a recursive function) either 1k, 10k, or 100k times, and give you the average run time of the function, and the standard deviation.
My first implementation (using the same recursive function) used System.nanoTime():
public static void main(String[] args) {
long t1, t2, diff;
long[] times = new long[1000];
int t;
for (int i=0; i< 1000; i++) {
t1 = System.nanoTime();
t = triangle(20);
t2 = System.nanoTime();
diff = t2-t1;
System.out.println(diff);
times[i] = diff;
}
long total = 0;
for (int j=0; j<times.length; j++) {
total += times[j];
}
System.out.println("Mean = " + total/1000.0);
}
But the mean is wildly thrown off -- for some reason, the first iteration of the function (on many runs) takes upwards of a million nanoseconds:
Pic of initial terminal output
Every iteration after the first dozen or so takes either 395 nanos or 0 -- so there could be a problem there too... not sure what's going on!
Also -- the code of the recursive function I'm timing:
static int triangle(int n) {
if (n == 1) {
return n;
} else {
return n + triangle(n -1);
}
}
Initially I had the line n = Math.abs(n) on the first line of the function, but then I removed it because... meh. I'm the only one using this.
I tried a number of different suggestions brought up in this SO post, but they each have their own problems... which I can go into if you need.
Anyway, thank you in advance for your help and expertise!
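For reference, the first-iteration spike is mostly JIT compilation and class loading on the very first calls, so one common pattern is to run an untimed warm-up pass before measuring and then report the mean and standard deviation, roughly like %%timeit does. A rough, illustrative sketch (drop it into the same class as triangle; the names are made up):
static double[] timeIt(Runnable task, int warmup, int runs) {
    for (int i = 0; i < warmup; i++) task.run();          // let the JIT compile the hot code first
    long[] samples = new long[runs];
    for (int i = 0; i < runs; i++) {
        long start = System.nanoTime();
        task.run();
        samples[i] = System.nanoTime() - start;
    }
    double mean = 0;
    for (long s : samples) mean += s;
    mean /= runs;
    double var = 0;
    for (long s : samples) var += (s - mean) * (s - mean);
    return new double[] { mean, Math.sqrt(var / runs) };  // mean and std dev, in nanoseconds
}
// usage: double[] stats = timeIt(() -> triangle(20), 10_000, 1_000);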

Why is this C++ code execution so slow compared to java?

I recently wrote a computation-intensive algorithm in Java and then translated it to C++. To my surprise, the C++ version executed considerably more slowly. I have now written a much shorter Java test program and a corresponding C++ program - see below. My original code featured a lot of array access, as does the test code. The C++ takes 5.5 times longer to execute (see the comment at the end of each program).
Conclusions after the first 21 comments below...
Test code:
g++ -o ... Java 5.5 times faster
g++ -O3 -o ... Java 2.9 times faster
g++ -fprofile-generate -march=native -O3 -o ... (run, then g++ -fprofile-use etc) Java 1.07 times faster.
My original project (much more complex than test code):
Java 1.8 times faster
C++ 1.9 times faster
C++ 2 times faster
Software environment:
Ubuntu 16.04 (64 bit).
Netbeans 8.2 / jdk 8u121 (java code executed inside netbeans)
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Compilation: g++ -o cpp_test cpp_test.cpp
Java code:
public class JavaTest {
public static void main(String[] args) {
final int ARRAY_LENGTH = 100;
final int FINISH_TRIGGER = 100000000;
int[] intArray = new int[ARRAY_LENGTH];
for (int i = 0; i < ARRAY_LENGTH; i++) intArray[i] = 1;
int i = 0;
boolean finished = false;
long loopCount = 0;
System.out.println("Start");
long startTime = System.nanoTime();
while (!finished) {
loopCount++;
intArray[i]++;
if (intArray[i] >= FINISH_TRIGGER) finished = true;
else if (i <(ARRAY_LENGTH - 1)) i++;
else i = 0;
}
System.out.println("Finish: " + loopCount + " loops; " +
((System.nanoTime() - startTime)/1e9) + " secs");
// 5 executions in range 5.98 - 6.17 secs (each 9999999801 loops)
}
}
C++ code:
//cpp_test.cpp:
#include <iostream>
#include <sys/time.h>
int main() {
const int ARRAY_LENGTH = 100;
const int FINISH_TRIGGER = 100000000;
int *intArray = new int[ARRAY_LENGTH];
for (int i = 0; i < ARRAY_LENGTH; i++) intArray[i] = 1;
int i = 0;
bool finished = false;
long long loopCount = 0;
std::cout << "Start\n";
timespec ts;
clock_gettime(CLOCK_REALTIME, &ts);
long long startTime = (1000000000*ts.tv_sec) + ts.tv_nsec;
while (!finished) {
loopCount++;
intArray[i]++;
if (intArray[i] >= FINISH_TRIGGER) finished = true;
else if (i < (ARRAY_LENGTH - 1)) i++;
else i = 0;
}
clock_gettime(CLOCK_REALTIME, &ts);
double elapsedTime =
((1000000000*ts.tv_sec) + ts.tv_nsec - startTime)/1e9;
std::cout << "Finish: " << loopCount << " loops; ";
std::cout << elapsedTime << " secs\n";
// 5 executions in range 33.07 - 33.45 secs (each 9999999801 loops)
}
The only time I could get the C++ program to outperform Java was when using profiling information. This shows that there's something in the runtime information (that Java gets by default) that allows for faster execution.
There's not much going on in your program apart from a non-trivial if statement. That is, without analysing the entire program, it's hard to predict which branch is most likely. This leads me to believe that this is a branch misprediction issue. Modern CPUs do instruction pipelining which allows for higher CPU throughput. However, this requires a prediction of what the next instructions to execute are. If the guess is wrong, the instruction pipeline must be cleared out, and the correct instructions loaded in (which takes time).
At compile time, the compiler doesn't have enough information to predict which branch is most likely. CPUs do a bit of branch prediction as well, but this is generally along the lines of assuming that loops loop and that ifs take the if branch (rather than the else).
Java, however, has the advantage of being able to use information at runtime as well as compile time. This allows Java to identify the middle branch as the one that occurs most frequently and so have this branch predicted for the pipeline.
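If you want to hand the C++ compiler similar runtime information, profile-guided optimization is the closest equivalent (this is essentially what the -fprofile-generate result quoted in the question already showed); roughly:
g++ -O3 -fprofile-generate -o cpp_test cpp_test.cpp
./cpp_test            # run a representative workload to record branch and loop statistics
g++ -O3 -fprofile-use -o cpp_test cpp_test.cpp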
Somehow both GCC and clang fail to unroll this loop and pull out the invariants even in -O3 and -Os, but Java does.
Java's final JITted assembly code is similar to this (in reality repeated twice):
while (true) {
loopCount++;
if (++intArray[i++] >= FINISH_TRIGGER) break;
loopCount++;
if (++intArray[i++] >= FINISH_TRIGGER) break;
loopCount++;
if (++intArray[i++] >= FINISH_TRIGGER) break;
loopCount++;
if (++intArray[i++] >= FINISH_TRIGGER) { if (i >= ARRAY_LENGTH) i = 0; break; }
if (i >= ARRAY_LENGTH) i = 0;
}
With this loop I'm getting exact same timings (6.4s) between C++ and Java.
Why is this legal to do? Because ARRAY_LENGTH is 100, which is a multiple of 4. So i can only exceed 100 and be reset to 0 every 4 iterations.
This looks like an opportunity for improvement for GCC and clang; they fail to unroll loops for which the total number of iterations is unknown, but even if unrolling is forced, they fail to recognize parts of the loop that apply to only certain iterations.
Regarding your findings in more complex code (a.k.a. real life): Java's optimizer is exceptionally good for small loops; a lot of thought has been put into that, but Java loses a lot of time on virtual calls and GC.
In the end it comes down to machine instructions running on a concrete architecture; whoever comes up with the best set wins. Don't assume the compiler will "do the right thing" - look at the generated code, profile, repeat.
For example, if you restructure your loop just a bit:
while (!finished) {
for (i=0; i<ARRAY_LENGTH; ++i) {
loopCount++;
if (++intArray[i] >= FINISH_TRIGGER) {
finished=true;
break;
}
}
}
Then C++ will outperform Java (5.9s vs 6.4s). (revised C++ assembly)
And if you can allow a slight overrun (increment more intArray elements after reaching the exit condition):
while (!finished) {
for (int i=0; i<ARRAY_LENGTH; ++i) {
++intArray[i];
}
loopCount+=ARRAY_LENGTH;
for (int i=0; i<ARRAY_LENGTH; ++i) {
if (intArray[i] >= FINISH_TRIGGER) {
loopCount-=ARRAY_LENGTH-i-1;
finished=true;
break;
}
}
}
Now clang is able to vectorize the loop and reaches the speed of 3.5s vs. Java's 4.8s (GCC is unfortunately still not able to vectorize it).
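If you want to verify claims like these yourself, you can inspect the code each side actually emits (options are version-dependent, and PrintAssembly additionally needs the hsdis disassembler plugin installed):
g++ -O3 -S -fverbose-asm cpp_test.cpp                               # writes the generated assembly to cpp_test.s
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly JavaTest     # dumps the JIT-compiled code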

Performance optimization: C++ vs Java not performing as expected

I have written two programs implementing a simple algorithm for matrix multiplication, one in C++ and one in Java. Contrary to my expectations, the Java program runs about 2.5x faster than the C++ program. I am a novice at C++, and would like suggestions on what I can change in the C++ program to make it run faster.
My programs borrow code and data from this blog post http://martin-thoma.com/matrix-multiplication-python-java-cpp .
Here are the current compilation flags I am using:
g++ -O3 main.cc
javac Main.java
Here are the current compiler/runtime versions:
$ g++ --version
g++.exe (GCC) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ java -version
java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)
My computer is a ~2012 era core i3 laptop running windows with MinGW. Here are the current performance results:
$ time ./a.exe < ../Testing/2000.in
507584919
real 0m36.469s
user 0m0.031s
sys 0m0.030s
$ time java Main < ../Testing/2000.in
507584919
real 0m14.299s
user 0m0.031s
sys 0m0.015s
Here is the C++ program:
#include <iostream>
#include <cstdio>
using namespace std;
int *A;
int *B;
int height;
int width;
int * matMult(int A[], int B[]) {
int * C = new int[height*width];
int n = height;
for (int i = 0; i < n; i++) {
for (int k = 0; k < n; k++) {
for (int j = 0; j < n; j++) {
C[width*i+j]+=A[width*i+k] * B[width*k+j];
}
}
}
return C;
}
int main() {
std::ios::sync_with_stdio(false);
cin >> height;
cin >> width;
A = new int[width*height];
B = new int[width*height];
for (int i = 0; i < width*height; i++) {
cin >> A[i];
}
for (int i = 0; i < width*height; i++) {
cin >> B[i];
}
int *result = matMult(A,B);
cout << result[2];
}
Here is the java program:
import java.util.*;
import java.io.*;
public class Main {
static int[] A;
static int[] B;
static int height;
static int width;
public static void main(String[] args) {
try {
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
height = Integer.parseInt(reader.readLine());
width = Integer.parseInt(reader.readLine());
A=new int[width*height];
B=new int[width*height];
int index = 0;
String thisLine;
while ((thisLine = reader.readLine()) != null) {
if (thisLine.trim().equals("")) {
break;
} else {
String[] lineArray = thisLine.split("\t");
for (String number : lineArray) {
A[index] = Integer.parseInt(number);
index++;
}
}
}
index = 0;
while ((thisLine = reader.readLine()) != null) {
if (thisLine.trim().equals("")) {
break;
} else {
String[] lineArray = thisLine.split("\t");
for (String number : lineArray) {
B[index] = Integer.parseInt(number);
index++;
}
}
}
int[] result = matMult(A,B);
System.out.println(result[2]);
reader.close();
} catch (Exception e) {
e.printStackTrace();
}
}
public static int[] matMult(int[] A, int[] B) {
int[] C = new int[height*width];
int n = height;
for (int i = 0; i < n; i++) {
for (int k = 0; k < n; k++) {
for (int j = 0; j < n; j++) {
C[width*i+j]+=A[width*i+k] * B[width*k+j];
}
}
}
return C;
}
}
Here is a link to a 2000x2000 test case: https://mega.nz/#!sglWxZqb!HBts_UlZnR4X9gZR7bG-ej3xf2A5vUv0wTDUW-kqFMA
Here is a link to a 2x2 test case: https://mega.nz/#!QwkV2SII!AtfGuxPV5bQeZtt9eHNNn36rnV4sGq0_sJzitjiFE8s
Any advice explaining what I am doing wrong in C++, or why my C++ implementation is running so much slower than Java here, would be much appreciated!
EDIT: As suggested, I modified the programs so that they do not actually perform a multiplication, but just read the arrays in and print out one number from each. Here are the performance results for that. The C++ program has slower IO. That only accounts for part of the difference however.
$ time ./IOonly.exe < ../Testing/2000.in
7
944
real 0m8.158s
user 0m0.000s
sys 0m0.046s
$ time java IOOnly < ../Testing/2000.in
7
944
real 0m1.461s
user 0m0.000s
sys 0m0.047s
I'm not able to analyze the Java execution, since it creates a temporary executable module that disappears after it's been "used". However, I assume that it does execute SSE instructions to get that speed [or that it unrolls the loop, which clang++ does if you disable SSE instructions].
But compiling with g++ (4.9.2) and clang++, I can clearly see that clang optimises the loop to use SSE instructions, where gcc doesn't. The gcc-generated code is thus exactly 4 times slower. Changing the code so that it uses a constant value of 2000 in each dimension [so the compiler "knows" the height and width], the gcc compiler also generates code that takes around 8 s (on my machine!), compared to 27 s with the "variable" value [the clang-compiled code is marginally faster here as well, but within the noise, I'd say].
Overall conclusion: the quality/cleverness of the compiler greatly affects the performance of tight loops. The more complex and varied the code is, the more likely it is that the C++ solution will generate better code; simple and easy-to-compile problems are quite likely to be better in Java [as a rule, but not guaranteed]. I expect the Java compiler uses profiling to determine the number of loop iterations, for example.
Edit:
The result of time can be used to determine if the reading of the file is taking a long time, but you need some kind of profiling tool to determine if the actual input is using a lot of CPU-time and such.
The Java engine uses a "just-in-time compiler", which uses profiling to determine the number of times a particular piece of code is hit (you can do that for C++ too, and big projects often do!), which allows it, for example, to unroll a loop or determine at runtime the number of iterations in a loop. The fact that this code does 2000 * 2000 * 2000 loop iterations, and that the C++ compiler actually does a BETTER job when it KNOWS the sizes involved, tells us that the Java runtime isn't actually doing better (at least not initially); it just manages to improve the performance over time.
Unfortunately, due to the way that the java runtime works, it doesn't leave the binary code behind, so I can't really analyze what it does.
The key here is that the actual operations you are doing are simple, and the logic is simple, it's just an awful lot of them, and you are doing them using a trivial implementation. Both Java and C++ will benefit from manually unrolling the loop, for example.
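For illustration, here is a sketch of that manual unrolling for the Java version (the same transformation applies to the C++ code): hoist the invariant A[width*i+k] out of the innermost loop and process four elements of the row per iteration. Whether it actually beats what the JIT already does has to be measured; the method name is just illustrative.
public static int[] matMultUnrolled(int[] A, int[] B, int height, int width) {
    int[] C = new int[height * width];
    int n = height;
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            int aik = A[width * i + k];          // loop-invariant for the j loop
            int j = 0;
            for (; j + 3 < n; j += 4) {          // main body, unrolled by 4
                C[width * i + j]     += aik * B[width * k + j];
                C[width * i + j + 1] += aik * B[width * k + j + 1];
                C[width * i + j + 2] += aik * B[width * k + j + 2];
                C[width * i + j + 3] += aik * B[width * k + j + 3];
            }
            for (; j < n; j++) {                 // remainder when n is not a multiple of 4
                C[width * i + j] += aik * B[width * k + j];
            }
        }
    }
    return C;
}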
C++ is not faster than Java by default
C++ is fast as a language, but as soon as you incorporate libraries into the mix, you are bound to those libraries' speed.
The standard is hardly built for performance, period. The standard libraries are written with design and correctness in mind.
C++ gives you the opportunity to optimize!
If you are unhappy with the standard library's performance, you can, and should, use your own optimized version.
For example, standard C++ IO objects are beautiful when it comes to design (streams, locales, facets, inner buffers), but that makes them terrible at performance.
If you are writing for the Windows OS, you can use ReadFile and WriteConsole as your IO mechanism.
If you switch to these functions instead of the standard library, your program outperforms Java by a few orders of magnitude.

Create a task that takes up a specific amount of CPU and RAM

I'm testing my company's software project and would like to see how it behaves under heavy load.
Is there any way to create a task that takes up a large amount of CPU and only stops when I tell it to?
If it's not possible programmatically, what are the other options? E.g., what software out there can quickly help me create such a condition?
Thanks in advance.
This can be done programmatically.
For consuming CPU:
A simple busy loop won't consume all of the CPU, because your CPU probably has multiple logical cores, so you need to create multiple threads. Here is the code:
DWORD WINAPI ConsumeSingleCore(LPVOID lpThreadParameter)
{
DWORD_PTR mask = (DWORD_PTR)1 << (int)(DWORD_PTR)lpThreadParameter; // pointer-sized 1 so the shift is safe for core indexes >= 32
::SetThreadAffinityMask(::GetCurrentThread(), mask);
for (;;) {}
}
void ConsumeAllCores()
{
SYSTEM_INFO systemInfo = { 0 };
::GetSystemInfo(&systemInfo);
for (DWORD i = 0; i < systemInfo.dwNumberOfProcessors; ++i)
{
::CreateThread(NULL, 0, ConsumeSingleCore, (LPVOID)i, 0, NULL);
}
}
For consuming memory:
Allocating enough objects on the heap will help, although it is not very accurate, because of the overhead caused by internal system structures like the heap. If you need an accurate number, I think using virtual memory directly is a good choice. Here is the code:
void ConsumeRAM()
{
SYSTEM_INFO systemInfo = { 0 };
::GetSystemInfo(&systemInfo);
DWORD memSize = 1024 * 1024 * 1024;
char *buffer = (char *)::VirtualAlloc(NULL, memSize, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
// Touch all the pages, so system will try to allocate physical memory for them.
for (DWORD memAddrOffset = 0; memAddrOffset < memSize; memAddrOffset += systemInfo.dwPageSize)
{
buffer[memAddrOffset] = 0;
}
return;
}
And if you just need some tools for testing, you could try CPU overload for consuming a certain number of cores and MemAlloc for consuming a certain amount of memory.
Any simple loop will consume CPU; processes typically only yield when they are waiting for the OS to do something - read from disk, access the network, allocate memory.
const int quiteBig = 20000;   // constant size so this compiles as standard C++ (no VLA)
int a[quiteBig];
while ( true ) { // or check for a terminating condition
for ( long i = 0; i < quiteBig ; i++ ) {
a[i] = i;
}
}
In C#:
For RAM
List<anyObject> hogger = new List<anyObject>();
for(long i = 0; i < someHugeNumber; i++)
hogger.Add(new anyObject());
Basically try and see how many objects you need to take up the space you want.
For CPU
while(true)
;
That is an instant 100% CPU
If you want a bit less
while(true)
{
for(long i = 0; i < someNumber; i++)
{
;
}
Thread.Sleep(1);
}
With someNumber you can adjust the number of busy cycles before the sleep - fiddle with that number to average down the CPU usage.
To terminate the while loops you can use a button on a form, a key combination, or something similar.
Yes, it is programmatically possible, as others have shown with examples of how to load the CPU. Note that such a CPU load generally only occupies one core, so you will need a copy of the program running for each core (assign a script to each core and specify which core it will run on).
You could also look into https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing
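In Java the same idea fits in a few lines; this is a sketch (plain Java does not expose thread-to-core pinning, so the OS decides where each spinning thread lands): start one busy thread per logical core and stop the whole thing with Ctrl+C or by killing the process.
public class CpuHog {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        for (int i = 0; i < cores; i++) {
            new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    // busy-spin: keeps roughly one logical core at 100% load
                }
            }).start();
        }
    }
}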

Processing Power - Mobile Devices vs Desktop - 100x difference?

Has anyone compared the processing power of mobile devices with PCs? I have a very simple matrix workload. Coded in Java, it takes ~115 ms for my old PC to finish the work. THE VERY SAME FUNCTION takes 17000 ms on my tablet. I was very shocked; I didn't expect the tablet to be close to the PC, but I didn't expect it to be ~150x slower either!
Has anyone had a similar experience? Any suggestions? Would it help if I wrote the code in C and used the Android NDK?
The benchmark code in Java:
package mainpackage;
import java.util.Date;
public class mainclass {
public static void main(String[] args){
Date startD = new Date();
double[][] testOut;
double[] v = {1,0,0};
double t;
for (int i = 0; i < 100000; i++) {
t=Math.random();
testOut=rot_mat(v, t);
}
Date endD = new Date();
System.out.println("Time Taken ms: "+(-startD.getTime()+endD.getTime()));
}
public static double[][] rot_mat(double v[], double t)
{
double absolute;
double x[] = new double[3];
double temp[][] = new double[3][3];
double temp_2[][] = new double[3][3];
double sum;
int i;
int k;
int j;
// Normalize the v matrix into k
absolute = abs_val_vec(v);
for (i = 0; i < 3; i++)
{
x[i] = v[i] / absolute;
}
// Create 3x3 matrix kx
double kx[][] = {{0, -x[2], x[1]},
{x[2], 0, -x[0]},
{-x[1], x[0], 0}};
// Calculate output
// Calculate third term in output
for (i = 0; i < 3; i++)
{
for (j = 0; j < 3; j++)
{
sum = 0;
for (k = 0; k < 3; k++)
{
sum = sum + kx[i][k] * kx[k][j];
}
temp[i][j] = (1-Math.cos(t))*sum;
}
}
// Calculate second term in output
for (i = 0; i < 3; i++)
{
for (k = 0; k < 3; k++)
{
temp_2[i][k] = Math.sin(t)*kx[i][k];
}
}
// Calculate output
double[][] resOut = new double[3][3];
for (i = 0; i < 3; i++)
{
for (k = 0; k < 3; k++)
{
resOut[i][k] = temp_2[i][k] + temp[i][k] + ((i==k)?1:0);
}
}
return resOut;
}
private static double abs_val_vec (double v[])
{
double output;
output = Math.sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
return output;
}
}
Any suggestion?
Micro-benchmarks only measure the performance of the micro-benchmark. And, the only decent way to interpret micro-benchmarks is with micro-measurements. Hence, savvy programmers would use tools like Traceview to get a better sense of where their time is being taken.
I suspect that if you ran this through Traceview, and looked at LogCat, you would find that your time is being spent in two areas:
Memory allocation and garbage collection. Your micro-benchmark is chewing through ~3MB of heap space. In production code, you'd never do that, at least if you wanted to keep your job.
Floating-point operations. Depending upon your tablet, you may not have a floating-point co-processor, and doing floating-point math on the CPU sans a floating-point co-processor is very very slow.
Does it help if I write the code in C and use Android NDK?
Well, until you profile the code under Traceview, that will be difficult to answer. For example, if the time is mostly spent in sqrt(), cos(), and sin(), that already is native code, and you won't get much faster.
More importantly, even if this micro-benchmark might improve with native code, all that does is demonstrate that this micro-benchmark might improve with native code. For example, a C translation of this might be faster due to manual heap management (malloc() and free()) rather than garbage collection. But that is more an indictment of how poorly the micro-benchmark was written than it is a statement about how much faster C will be, as production Java code would be optimized better than this.
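For instance, much of the allocation pressure could be removed without leaving Java by letting rot_mat write into caller-supplied buffers instead of allocating kx, temp, temp_2 and the result on every call. A rough, illustrative sketch of that change (mathematically equivalent to the original, single-threaded use assumed):
static final double[][] kx = new double[3][3];   // reused scratch buffer

static void rot_mat(double[] v, double t, double[][] out) {
    double a = Math.sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);   // normalize v
    double x0 = v[0]/a, x1 = v[1]/a, x2 = v[2]/a;
    kx[0][0] = 0;   kx[0][1] = -x2; kx[0][2] = x1;
    kx[1][0] = x2;  kx[1][1] = 0;   kx[1][2] = -x0;
    kx[2][0] = -x1; kx[2][1] = x0;  kx[2][2] = 0;
    double c = 1 - Math.cos(t), s = Math.sin(t);
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            double sum = 0;
            for (int k = 0; k < 3; k++) sum += kx[i][k] * kx[k][j];
            // same result as the original: sin(t)*kx + (1-cos(t))*kx*kx + I
            out[i][j] = s * kx[i][j] + c * sum + ((i == j) ? 1 : 0);
        }
    }
}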
Beyond learning how to use Traceview, I suggest:
Reading the NDK documentation, as it includes information about when native code may make sense.
Reading up on Renderscript Compute. On some devices, using Renderscript Compute can offload integer math onto the GPU, for a massive performance boost. That would not help your floating-point micro-benchmark, but for other matrix calculations (e.g., image processing), Renderscript Compute may be well worth researching.
Processing power alone is not everything when you compare very different architectures. In fact, you're very likely not benchmarking the computing architectures alone.
A key factor in benchmarking: when you're dealing with something that involves a lot of variables, isolate the one you want to test and keep the others constant and preferably equal.
In your situation, some examples for variables that affect your result:
the actual computing architecture, which is a complex set of variables itself (processor design and implementation, memory hierarchy etc)
the OS
the different Java Virtual Machine implementations for the different platforms above
the additional layers that Dalvik implies
There are at least eight sets of comparisons between PCs and Android devices among my numerous Android benchmarks at the link below. Here are results from my Linpack benchmark (including Java) that show the Androids in a better light than your results. Other results (like Dhrystone) show that, on a per-MHz basis, ARM's CPUs can match Intel's.
http://www.roylongbottom.org.uk/android%20benchmarks.htm
Linpack Benchmark Results
System ARM MHz Android Linpackv5 Linpackv7 LinpackSP NEONLinpack LinpackJava
See MFLOPS MFLOPS MFLOPS MFLOPS MFLOPS
T1 926EJ 800 2.2 5.63 5.67 9.61 N/A 2.33
P4 v7-A8 800 2.3.5 80.18 28.34 #G
T2 v7-A9 800 2.3.4 10.56 101.39 129.05 255.77 33.36
P5 v7-A9 1500 4.0.3 171.39 50.87 #G
T4 v7-A9 1500a 4.0.3 16.86 155.52 204.61 382.46 56.89
T6 v7-A9 1600 4.0.3 196.47
T7 v7-A9 1300a 4.1.2 17.08 151.05 201.30 376.00 56.44
T9 926EJ 800 2.2 5.66
T11 v7-A15 2000b 4.2.2 28.82 459.17 803.04 1334.90 143.06
T12 v7-A9 1600 4.1.2 147.07
T14 v7-A9 1500 4.0.4 180.95
P11 v7-A9 1400 4.0.4 19.89 184.44 235.54 454.21 56.99
P10 QU-S4 1500 4.0.3 254.90
Measured MHz a=1200, b=1700
Atom 1666 Linux 204.09 215.73 117.81
Atom 1666 Windows 183.22 118.70
Atom 1666 And x86 15.65
Core 2 2400 Linux 1288.00 901.00
Core 2 2400 Windows 1315.29 551.00
Core 2 2400 And x86 53.27
System - T = Tablet, P = Phone, #G = GreenComputing, QU = Qualcomm CPU
And 86 = Android x86
