Has anyone compared the processing power of mobile devices with PCs? I have a very simple matrix routine. Coded in Java, it takes ~115 ms on my old PC to finish the work. The very same function takes ~17000 ms on my tablet. I was shocked: I didn't expect the tablet to be close to the PC, but I didn't expect it to be ~150x slower either!
Has anyone had a similar experience? Any suggestions? Would it help if I wrote the code in C and used the Android NDK?
The benchmark code in Java:
package mainpackage;

import java.util.Date;

public class mainclass {

    public static void main(String[] args) {
        Date startD = new Date();
        double[][] testOut;
        double[] v = {1, 0, 0};
        double t;
        for (int i = 0; i < 100000; i++) {
            t = Math.random();
            testOut = rot_mat(v, t);
        }
        Date endD = new Date();
        System.out.println("Time Taken ms: " + (endD.getTime() - startD.getTime()));
    }

    public static double[][] rot_mat(double v[], double t) {
        double absolute;
        double x[] = new double[3];
        double temp[][] = new double[3][3];
        double temp_2[][] = new double[3][3];
        double sum;
        int i;
        int k;
        int j;

        // Normalize the vector v into x
        absolute = abs_val_vec(v);
        for (i = 0; i < 3; i++) {
            x[i] = v[i] / absolute;
        }

        // Create the 3x3 cross-product matrix kx
        double kx[][] = {{0, -x[2], x[1]},
                         {x[2], 0, -x[0]},
                         {-x[1], x[0], 0}};

        // Calculate the third term in the output: (1 - cos(t)) * kx^2
        for (i = 0; i < 3; i++) {
            for (j = 0; j < 3; j++) {
                sum = 0;
                for (k = 0; k < 3; k++) {
                    sum = sum + kx[i][k] * kx[k][j];
                }
                temp[i][j] = (1 - Math.cos(t)) * sum;
            }
        }

        // Calculate the second term in the output: sin(t) * kx
        for (i = 0; i < 3; i++) {
            for (k = 0; k < 3; k++) {
                temp_2[i][k] = Math.sin(t) * kx[i][k];
            }
        }

        // Combine: output = I + sin(t)*kx + (1 - cos(t))*kx^2
        double[][] resOut = new double[3][3];
        for (i = 0; i < 3; i++) {
            for (k = 0; k < 3; k++) {
                resOut[i][k] = temp_2[i][k] + temp[i][k] + ((i == k) ? 1 : 0);
            }
        }
        return resOut;
    }

    private static double abs_val_vec(double v[]) {
        return Math.sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
    }
}
Micro-benchmarks only measure the performance of the micro-benchmark. And, the only decent way to interpret micro-benchmarks is with micro-measurements. Hence, savvy programmers would use tools like Traceview to get a better sense of where their time is being taken.
I suspect that if you ran this through Traceview, and looked at LogCat, you would find that your time is being spent in two areas:
Memory allocation and garbage collection. Your micro-benchmark is chewing through ~3MB of heap space. In production code, you'd never do that, at least if you wanted to keep your job.
Floating-point operations. Depending upon your tablet, you may not have a floating-point co-processor, and doing floating-point math on the CPU without one is very slow.
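To make the allocation point concrete, here is a hedged sketch (the class and method names are mine, not the poster's) of the same rotation computed into a caller-supplied matrix, so the timing loop performs no per-iteration allocation. It uses the closed form of Rodrigues' formula, which is what the question's three loops compute term by term:

```java
// Sketch: pass a preallocated 3x3 destination into the rotation routine so the
// benchmark loop allocates nothing per iteration. rotMatInto is a hypothetical
// allocation-free variant of the question's rot_mat.
public class ReuseDemo {
    static void rotMatInto(double[] v, double t, double[][] out) {
        double norm = Math.sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
        double x0 = v[0]/norm, x1 = v[1]/norm, x2 = v[2]/norm;
        double c = Math.cos(t), s = Math.sin(t), ic = 1 - c;
        // Rodrigues' formula: R = I + sin(t)*K + (1-cos(t))*K^2,
        // with K the cross-product matrix of the unit axis (x0, x1, x2).
        out[0][0] = c + ic*x0*x0;     out[0][1] = ic*x0*x1 - s*x2;  out[0][2] = ic*x0*x2 + s*x1;
        out[1][0] = ic*x1*x0 + s*x2;  out[1][1] = c + ic*x1*x1;     out[1][2] = ic*x1*x2 - s*x0;
        out[2][0] = ic*x2*x0 - s*x1;  out[2][1] = ic*x2*x1 + s*x0;  out[2][2] = c + ic*x2*x2;
    }

    public static void main(String[] args) {
        double[][] out = new double[3][3];   // allocated once, reused below
        double[] v = {1, 0, 0};
        long start = System.nanoTime();
        for (int i = 0; i < 100000; i++) {
            rotMatInto(v, Math.random(), out);
        }
        System.out.println("ns: " + (System.nanoTime() - start));
    }
}
```

Whether this closes most of the gap on a given tablet is exactly what Traceview would tell you.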
Does it help if I write the code in C and use Android NDK?
Well, until you profile the code under Traceview, that will be difficult to answer. For example, if the time is mostly spent in sqrt(), cos(), and sin(), that already is native code, and you won't get much faster.
More importantly, even if this micro-benchmark might improve with native code, all that does is demonstrate that this micro-benchmark might improve with native code. For example, a C translation of this might be faster due to manual heap management (malloc() and free()) rather than garbage collection. But that is more an indictment of how poorly the micro-benchmark was written than it is a statement about how much faster C will be, as production Java code would be optimized better than this.
Beyond learning how to use Traceview, I suggest:
Reading the NDK documentation, as it includes information about when native code may make sense.
Reading up on Renderscript Compute. On some devices, using Renderscript Compute can offload integer math onto the GPU, for a massive performance boost. That would not help your floating-point micro-benchmark, but for other matrix calculations (e.g., image processing), Renderscript Compute may be well worth researching.
Processing power alone is not everything when you compare very different architectures; in fact, you are very likely not benchmarking the computing architectures alone.
A key principle of benchmarking: when you're dealing with something that depends on many variables, isolate the one you want to test and keep the others constant and preferably equal.
In your situation, some examples of variables that affect your result:
the actual computing architecture, which is a complex set of variables itself (processor design and implementation, memory hierarchy, etc.)
the OS
the different Java Virtual Machine implementations for the variables above
the additional layers that Dalvik implies
My numerous Android benchmarks, linked below, include at least eight sets of comparisons between PCs and Android devices. Below are results from my Linpack benchmark (including Java) that show the Androids in a better light than your results do. Other results (such as Dhrystone) show that, on a per-MHz basis, ARM's CPUs can match Intel's.
http://www.roylongbottom.org.uk/android%20benchmarks.htm
Linpack Benchmark Results

System  ARM     MHz   Android  Linpackv5  Linpackv7  LinpackSP  NEONLinpack  LinpackJava
                                  MFLOPS     MFLOPS     MFLOPS       MFLOPS       MFLOPS
T1 926EJ 800 2.2 5.63 5.67 9.61 N/A 2.33
P4 v7-A8 800 2.3.5 80.18 28.34 #G
T2 v7-A9 800 2.3.4 10.56 101.39 129.05 255.77 33.36
P5 v7-A9 1500 4.0.3 171.39 50.87 #G
T4 v7-A9 1500a 4.0.3 16.86 155.52 204.61 382.46 56.89
T6 v7-A9 1600 4.0.3 196.47
T7 v7-A9 1300a 4.1.2 17.08 151.05 201.30 376.00 56.44
T9 926EJ 800 2.2 5.66
T11 v7-A15 2000b 4.2.2 28.82 459.17 803.04 1334.90 143.06
T12 v7-A9 1600 4.1.2 147.07
T14 v7-A9 1500 4.0.4 180.95
P11 v7-A9 1400 4.0.4 19.89 184.44 235.54 454.21 56.99
P10 QU-S4 1500 4.0.3 254.90
Measured MHz a=1200, b=1700
Atom 1666 Linux 204.09 215.73 117.81
Atom 1666 Windows 183.22 118.70
Atom 1666 And x86 15.65
Core 2 2400 Linux 1288.00 901.00
Core 2 2400 Windows 1315.29 551.00
Core 2 2400 And x86 53.27
System: T = Tablet, P = Phone, #G = GreenComputing, QU = Qualcomm CPU, And x86 = Android x86
Related
I am trying to implement real-time digital filtering of audio signal in Android. I used standard code for my high-pass filter:
void doFilter(final short in[], short out[], int sizeIn) {
    // H (filter coefficients), size (tap count) and dataTail (the previous
    // block's last samples) are class fields not shown in the question.
    int i, j;
    for (i = 0; i < sizeIn; i++) {
        out[i] = 0;
        for (j = 0; j < size; j++)
            if (i >= j) out[i] += H[j] * in[i - j];
            else out[i] += H[j] * dataTail[i + size - j];
    }
    System.arraycopy(in, sizeIn - size - 1, dataTail, 0, size);
}
The problem is that this code is too slow to filter the microphone signal in real time: it takes 200 ms for a filter of 100 taps and an input block of 1700 samples. What is the reason, and what are ways to solve this problem? Please also advise a complete library for fast signal filtering. Thanks.
Java is very slow for large numbers of calculations compared to C/C++. Consider using the Android NDK, via JNI, to implement your main logic in C/C++.
For a length-100 FIR filter kernel, you will likely need to use a zero-padded FFT/IFFT overlap-add (or overlap-save) fast-convolution method instead of a time-domain convolution. This would be better written in C and called through JNI. I don't know of a library, but the algorithms are explained in many DSP textbooks.
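Even before reaching for the FFT or the NDK, the `(i >= j)` branch inside the inner loop runs on every one of the ~170,000 tap evaluations per block. A hedged sketch of hoisting it out by splitting the loop (class name and field layout are mine; I also assume the tail should hold the last `size` samples, since the question's `sizeIn - size - 1` copy offset looks off by one for seamless block boundaries):

```java
// Sketch: split the j loop at the block boundary so the (i >= j) branch
// disappears from the hot path. H, size, and dataTail mirror the fields the
// question's code uses but does not show.
public class FirFilter {
    private final double[] H;        // filter coefficients
    private final int size;          // number of taps
    private final double[] dataTail; // last `size` samples of the previous block

    public FirFilter(double[] coefficients) {
        H = coefficients;
        size = coefficients.length;
        dataTail = new double[size];
    }

    public void doFilter(short[] in, short[] out, int sizeIn) {
        for (int i = 0; i < sizeIn; i++) {
            double acc = 0;
            // Taps that fall inside the current block: j <= i.
            for (int j = Math.min(i, size - 1); j >= 0; j--)
                acc += H[j] * in[i - j];
            // Taps that reach back into the previous block: j > i.
            for (int j = i + 1; j < size; j++)
                acc += H[j] * dataTail[i + size - j];
            out[i] = (short) acc;
        }
        // Keep the last `size` input samples for the next call.
        System.arraycopy(in, sizeIn - size, dataTail, 0, size);
    }
}
```

For a 100-tap filter this only trims constant factors; the overlap-add FFT method above is what changes the asymptotics.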
I have written two programs implementing a simple algorithm for matrix multiplication, one in C++ and one in Java. Contrary to my expectations, the Java program runs about 2.5x faster than the C++ program. I am a novice at C++, and would like suggestions on what I can change in the C++ program to make it run faster.
My programs borrow code and data from this blog post http://martin-thoma.com/matrix-multiplication-python-java-cpp .
Here are the current compilation flags I am using:
g++ -O3 main.cc
javac Main.java
Here are the current compiler/runtime versions:
$ g++ --version
g++.exe (GCC) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ java -version
java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)
My computer is a ~2012 era Core i3 laptop running Windows with MinGW. Here are the current performance results:
$ time ./a.exe < ../Testing/2000.in
507584919
real 0m36.469s
user 0m0.031s
sys 0m0.030s
$ time java Main < ../Testing/2000.in
507584919
real 0m14.299s
user 0m0.031s
sys 0m0.015s
Here is the C++ program:
#include <iostream>
#include <cstdio>

using namespace std;

int *A;
int *B;
int height;
int width;

int *matMult(int A[], int B[]) {
    // Note: unlike Java, C++ leaves `new int[n]` uninitialized, so the +=
    // below was accumulating into indeterminate values. The trailing ()
    // value-initializes the array to zero.
    int *C = new int[height * width]();
    int n = height;
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            for (int j = 0; j < n; j++) {
                C[width * i + j] += A[width * i + k] * B[width * k + j];
            }
        }
    }
    return C;
}

int main() {
    std::ios::sync_with_stdio(false);
    cin >> height;
    cin >> width;
    A = new int[width * height];
    B = new int[width * height];
    for (int i = 0; i < width * height; i++) {
        cin >> A[i];
    }
    for (int i = 0; i < width * height; i++) {
        cin >> B[i];
    }
    int *result = matMult(A, B);
    cout << result[2];
}
Here is the java program:
import java.util.*;
import java.io.*;

public class Main {
    static int[] A;
    static int[] B;
    static int height;
    static int width;

    public static void main(String[] args) {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
            height = Integer.parseInt(reader.readLine());
            width = Integer.parseInt(reader.readLine());
            A = new int[width * height];
            B = new int[width * height];
            int index = 0;
            String thisLine;
            while ((thisLine = reader.readLine()) != null) {
                if (thisLine.trim().equals("")) {
                    break;
                } else {
                    String[] lineArray = thisLine.split("\t");
                    for (String number : lineArray) {
                        A[index] = Integer.parseInt(number);
                        index++;
                    }
                }
            }
            index = 0;
            while ((thisLine = reader.readLine()) != null) {
                if (thisLine.trim().equals("")) {
                    break;
                } else {
                    String[] lineArray = thisLine.split("\t");
                    for (String number : lineArray) {
                        B[index] = Integer.parseInt(number);
                        index++;
                    }
                }
            }
            int[] result = matMult(A, B);
            System.out.println(result[2]);
            reader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static int[] matMult(int[] A, int[] B) {
        int[] C = new int[height * width];
        int n = height;
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                for (int j = 0; j < n; j++) {
                    C[width * i + j] += A[width * i + k] * B[width * k + j];
                }
            }
        }
        return C;
    }
}
Here is a link to a 2000x2000 test case: https://mega.nz/#!sglWxZqb!HBts_UlZnR4X9gZR7bG-ej3xf2A5vUv0wTDUW-kqFMA
Here is a link to a 2x2 test case: https://mega.nz/#!QwkV2SII!AtfGuxPV5bQeZtt9eHNNn36rnV4sGq0_sJzitjiFE8s
Any advice explaining what I am doing wrong in C++, or why my C++ implementation is running so much slower than Java here, would be much appreciated!
EDIT: As suggested, I modified the programs so that they do not actually perform the multiplication, but just read the arrays in and print out one number from each. Here are the performance results for that: the C++ program has slower I/O, but that only accounts for part of the difference.
$ time ./IOonly.exe < ../Testing/2000.in
7
944
real 0m8.158s
user 0m0.000s
sys 0m0.046s
$ time java IOOnly < ../Testing/2000.in
7
944
real 0m1.461s
user 0m0.000s
sys 0m0.047s
I'm not able to analyze the Java execution, since the JIT compiler's output is a temporary in-memory module that disappears after it has been used. However, I assume that it executes SSE instructions to get that speed (or that it unrolls the loop, which clang++ does if you disable SSE instructions).
Compiling with g++ (4.9.2) and clang++, I can clearly see that clang optimises the loop to use SSE instructions, where gcc doesn't. The gcc-generated code is thus almost exactly four times slower. Changing the code to use a constant value of 2000 in each dimension (so the compiler "knows" the height and width), the gcc compiler also generates code that takes around 8s (on my machine!), compared to 27s with the "variable" version (the clang-compiled code is marginally faster here too, but within the noise, I'd say).
Overall conclusion: the quality and cleverness of the compiler will strongly affect the performance of tight loops. The more complex and varied the code, the more likely it is that the C++ toolchain will generate better code; simple, easy-to-compile problems are quite likely to be better in Java (as a rule, but not guaranteed). I expect the Java compiler uses profiling to determine the number of loop iterations, for example.
Edit:
The output of time can tell you whether reading the file takes a long time, but you need some kind of profiling tool to determine whether the actual input parsing is using a lot of CPU time and such.
The Java engine uses a just-in-time compiler, which uses profiling to determine how many times a particular piece of code is hit (you can do that for C++ too, and big projects often do!), which allows it, for example, to unroll a loop or determine the number of iterations in a loop at runtime. Given that this code does 2000 * 2000 * 2000 iterations, and that the C++ compiler does a BETTER job when it KNOWS the sizes, this tells us that the Java runtime isn't actually doing better initially, just that it manages to improve the performance over time.
Unfortunately, due to the way the Java runtime works, it doesn't leave the binary code behind, so I can't really analyze what it does.
The key here is that the operations you are doing are simple and the logic is simple; there is just an awful lot of them, and you are doing them with a trivial implementation. Both Java and C++ will benefit from manually unrolling the loop, for example.
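As a hedged illustration of the manual-unrolling point, here is the question's i-k-j multiply in Java with the inner j loop unrolled by four (the class name and the hoisted temporary are mine; whether this beats the JIT's own unrolling has to be measured, not assumed):

```java
// Sketch: the same i-k-j loop order as the question, with the j loop unrolled
// 4-way and the loop-invariant A element hoisted into a local.
public class Unrolled {
    static int[] matMult(int[] A, int[] B, int n) {
        int[] C = new int[n * n];            // Java zero-initializes arrays
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                int a = A[n*i + k];          // invariant across the j loop
                int j = 0;
                for (; j + 4 <= n; j += 4) { // 4-way unrolled body
                    C[n*i + j]     += a * B[n*k + j];
                    C[n*i + j + 1] += a * B[n*k + j + 1];
                    C[n*i + j + 2] += a * B[n*k + j + 2];
                    C[n*i + j + 3] += a * B[n*k + j + 3];
                }
                for (; j < n; j++)           // remainder when n % 4 != 0
                    C[n*i + j] += a * B[n*k + j];
            }
        }
        return C;
    }
}
```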
C++ is not faster than Java by default.
C++ is fast as a language, but as soon as you incorporate libraries into the mix, you are bound to those libraries' speed. The standard library is hardly built for raw performance; it is written with design and correctness in mind.
C++ gives you the opportunity to optimize! If you are unhappy with the standard library's performance, you can, and should, substitute your own optimized version.
For example, the standard C++ IO objects are beautiful in terms of design (streams, locales, facets, inner buffers), but that design makes them terrible at performance.
If you are writing for Windows, you can use ReadFile and WriteConsole as your IO mechanism. If you switch to these functions instead of the standard streams, your program can outperform the Java one by a wide margin.
The two following versions of the same function (which basically tries to recover a password by brute force) do not give the same performance:
Version 1:
private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
private static final int N_CHARS = CHARS.length;
private static final int MAX_LENGTH = 8;

private static char[] recoverPassword() {
    char word[];
    int refi, i, indexes[];
    for (int length = 1; length <= MAX_LENGTH; length++) {
        refi = length - 1;
        word = new char[length];
        indexes = new int[length];
        indexes[length - 1] = 1;
        while (true) {
            i = length - 1;
            while ((++indexes[i]) == N_CHARS) {
                word[i] = CHARS[indexes[i] = 0];
                if (--i < 0)
                    break;
            }
            if (i < 0)
                break;
            word[i] = CHARS[indexes[i]];
            if (isValid(word))
                return word;
        }
    }
    return null;
}
Version 2:
private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
private static final int N_CHARS = CHARS.length;
private static final int MAX_LENGTH = 8;

private static char[] recoverPassword() {
    char word[];
    int refi, i, indexes[];
    for (int length = 1; length <= MAX_LENGTH; length++) {
        refi = length - 1;
        word = new char[length];
        indexes = new int[length];
        indexes[length - 1] = 1;
        while (true) {
            i = refi;
            while ((++indexes[i]) == N_CHARS) {
                word[i] = CHARS[indexes[i] = 0];
                if (--i < 0)
                    break;
            }
            if (i < 0)
                break;
            word[i] = CHARS[indexes[i]];
            if (isValid(word))
                return word;
        }
    }
    return null;
}
I would expect version 2 to be faster, as it does (and that is the only difference):
i = refi;
...as compare to version 1:
i = length -1;
However, it's the opposite: version 1 is faster, by over 3%!
Does someone know why? Is it due to some optimization done by the compiler?
Thank you all for your answers so far.
Just to add that the goal is actually not to optimize this piece of code (which is already quite optimized), but more to understand, from a compiler / CPU / architecture perspective, what could explain such performance difference.
Your answers have been very helpful, thanks again!
Key point: it is difficult to check this in a micro-benchmark, because you cannot say for sure how the code has been optimised without reading the generated machine code, and even then the CPU can do plenty of tricks to optimise it further, e.g. turning the x86 code into RISC-style micro-operations before actually executing it.
A computation takes as little as one cycle, and the CPU can perform up to three of them at once. An access to L1 cache takes 4 cycles; for L2, L3, and main memory it takes about 11, 40-75, and 200 cycles respectively.
Storing values to avoid a simple calculation is actually slower in many cases. BTW, division and modulus are quite expensive, and caching such a value can be worth it when micro-tuning your code.
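To illustrate the division/modulus point, here is a small sketch (names are mine): when a hot loop needs both the quotient and the remainder of the same pair, one division plus a multiply-subtract replaces a separate `%`:

```java
// Sketch: compute row and column from a flat index with one division.
public class DivCache {
    static int[] splitIndex(int flat, int width) {
        int row = flat / width;        // the one expensive division
        int col = flat - row * width;  // remainder derived without '%'
        return new int[]{row, col};
    }
}
```

Whether the JIT already performs this rewrite is exactly the kind of thing only the generated machine code can confirm.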
The correct answer should be retrievable with a decompiler (i.e. a .class -> .java converter), but my guess is that the compiler might have decided to get rid of refi altogether and to store length - 1 in an auxiliary register.
I'm more of a C++ guy, but I would start by trying:
final int refi = length - 1;
inside the for loop. Also you should probably use:
indexes[refi] = 1;
Comparing running times of code does not give exact or guaranteed results.
First of all, this is not the way to compare performance; a running-time analysis is needed here. Both versions have the same loop structure, so their asymptotic running times are the same. You may measure different running times when you run the code, but those mostly differ because of cache hits, I/O times, and thread & process scheduling. There is no guarantee that code always completes in an exact time.
However, there is still a difference in your code; to understand it you should look at your CPU architecture. I can explain it in terms of the x86 architecture.
What happens behind the scenes?
i = refi;
The CPU loads refi and i into its registers; there are two accesses to RAM if the values are not in the cache, and the value of i will be written back to RAM. However, the timing always varies with thread & process scheduling. Furthermore, if the values are in virtual memory it will take longer.
i = length - 1;
The CPU also accesses i and length from RAM or cache; there is the same number of accesses. In addition, there is a subtraction here, which means extra CPU cycles. That is why you would expect this version to take longer, and the issues I mentioned above explain why it may not.
Summary
As I explained, this is not the way to compare performance. I think there is no real difference between these versions; there are lots of optimizations inside the CPU and in the compiler. You can see the optimized code if you decompile the .class files.
My advice is that it is better to focus on Big-O running-time analysis: finding a better algorithm is the best way to optimize code. If you still have bottlenecks after that, you may try micro-benchmarking.
See also
Analysis of algorithms
Big O notation
Microprocessor
Compiler optimization
CPU Scheduling
To start with, you can't really compare the performance by just running your program: micro-benchmarking in Java is complicated.
Also, a subtraction on modern CPUs can take as little as a third of a clock cycle on average; on a 3 GHz CPU, that is about 0.1 nanoseconds. And nothing tells you that the subtraction actually happens, as the compiler might have modified the code.
So:
You should try to check the generated assembly code.
If you really care about the performance, create an appropriate micro-benchmark.
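JMH is the standard tool for that. As a minimal hedged sketch of the core idea behind it (the harness below is mine, not JMH): warm the code up so the JIT compiles it before you start timing, and keep a result "sink" alive so dead-code elimination cannot remove the work being measured.

```java
// Sketch of a warmed-up timing loop. The first calls run interpreted; only
// after warm-up do you measure JIT-compiled code.
public class Bench {
    static long timeNs(Runnable body, int warmupIters, int measureIters) {
        for (int i = 0; i < warmupIters; i++) body.run();   // let the JIT kick in
        long t0 = System.nanoTime();
        for (int i = 0; i < measureIters; i++) body.run();
        return (System.nanoTime() - t0) / measureIters;     // average ns per call
    }

    public static void main(String[] args) {
        long[] sink = new long[1];   // writing here defeats dead-code elimination
        long avg = timeNs(() -> {
            long s = 0;
            for (int k = 0; k < 1000; k++) s += k;
            sink[0] = s;
        }, 10_000, 10_000);
        System.out.println("avg ns/call: " + avg + " (sink=" + sink[0] + ")");
    }
}
```

A real JMH benchmark also handles forking, multiple iterations, and statistical reporting, which this sketch deliberately omits.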
I was playing with Go language concurrency and found something which is kinda opaque to me.
I wrote a parallel matrix multiplication in which each task computes a single row of the product matrix, multiplying the corresponding rows and columns of the source matrices.
Here is Java program
public static double[][] parallelMultiply(int nthreads, final double[][] m1, final double[][] m2) {
    final int n = m1.length, m = m1[0].length, l = m2[0].length;
    assert m1[0].length == m2.length;
    double[][] r = new double[n][];
    ExecutorService e = Executors.newFixedThreadPool(nthreads);
    List<Future<double[]>> results = new LinkedList<Future<double[]>>();
    for (int ii = 0; ii < n; ++ii) {
        final int i = ii;
        Future<double[]> result = e.submit(new Callable<double[]>() {
            public double[] call() throws Exception {
                double[] row = new double[l];
                for (int j = 0; j < l; ++j) {
                    for (int k = 0; k < m; ++k) {
                        row[j] += m1[i][k] * m2[k][j];
                    }
                }
                return row;
            }
        });
        results.add(result);
    }
    try {
        e.shutdown();
        e.awaitTermination(1, TimeUnit.HOURS);
        int i = 0;
        for (Future<double[]> result : results) {
            r[i] = result.get();
            ++i;
        }
    } catch (Exception ex) {
        ex.printStackTrace();
        return null;
    }
    return r;
}
and this is Go program
type Matrix struct {
    n, m int
    data [][]float64
}

func New(n, m int) *Matrix {
    data := make([][]float64, n)
    for i := range data {
        data[i] = make([]float64, m)
    }
    return &Matrix{n, m, data}
}

func (m *Matrix) Get(i, j int) float64 {
    return m.data[i][j]
}

func (m *Matrix) Set(i, j int, v float64) {
    m.data[i][j] = v
}

func MultiplyParallel(m1, m2 *Matrix) *Matrix {
    r := New(m1.n, m2.m)
    c := make(chan interface{}, m1.n)
    for i := 0; i < m1.n; i++ {
        go func(i int) {
            innerLoop(r, m1, m2, i)
            c <- nil
        }(i)
    }
    for i := 0; i < m1.n; i++ {
        <-c
    }
    return r
}

func innerLoop(r, m1, m2 *Matrix, i int) {
    for j := 0; j < m2.m; j++ {
        s := 0.0
        for k := 0; k < m1.m; k++ {
            s += m1.Get(i, k) * m2.Get(k, j)
        }
        r.Set(i, j, s)
    }
}
When I use Java program with nthreads=1 and nthreads=2 there is nearly double speedup on my dual-core N450 Atom netbook.
When I use Go program with GOMAXPROCS=1 and GOMAXPROCS=2 there is no speedup at all!
Even though the Java code uses additional storage for Futures and then collects their values into the result matrix, instead of updating the array directly in the worker code (which is what the Go version does), it performs much faster on several cores than the Go version.
Especially funny is that the Go version with GOMAXPROCS=2 loads both cores (htop displays 100% load on both processors while the program runs), yet the computation time is the same as with GOMAXPROCS=1 (where htop displays 100% load on only one core).
Another concern is that the Java program is faster than the Go one even in simple single-threaded multiplication, but that is not entirely unexpected (taking the benchmarks from here into account) and should not affect the multicore speedup factor.
What I'm doing incorrectly here? Is there a way to speedup Go program?
UPD:
It seems I found what I was doing incorrectly. I was measuring the Java program with System.currentTimeMillis() and the Go program with the time shell command, and I mistakenly took the 'user' time from the zsh output as the program's running time instead of the 'total' one. Now I have double-checked the computation speed and it gives me nearly double speedup too (though slightly less than Java's):
% time env GOMAXPROCS=2 ./4-2-go -n 500 -q
env GOMAXPROCS=2 ./4-2-go -n 500 -q 22,34s user 0,04s system 99% cpu 22,483 total
% time env GOMAXPROCS=2 ./4-2-go -n 500 -q -p
env GOMAXPROCS=2 ./4-2-go -n 500 -q -p 24,09s user 0,10s system 184% cpu 13,080 total
Seems I have to be more attentive.
Still, the Java program is about five times faster on the same case, but that is a matter for another question, I think.
You are probably experiencing the effects of false sharing. In a nutshell, if two pieces of data happen to fall onto the same CPU cache line, modifying these two pieces of data from threads that execute on different CPU cores will trigger the expensive cache coherency protocol.
This kind of cache "ping-pong" is extremely hard to diagnose, and can happen on logically completely unrelated data, just because they happen to be placed close enough in memory. The 100% CPU load is typical of false sharing - your cores really are working 100%, they are just not working on your program - they are working on synchronizing their caches.
The fact that in the Java program you have thread-private data until the time comes to "integrate" it into the final result is what saves you from false sharing. I'm not familiar with Go, but judging from your own words, the goroutines are writing directly into the common array, which is exactly the kind of thing that could trigger false sharing. This is an example of how perfectly valid single-threaded reasoning does exactly the opposite in a multi-threaded environment!
For more in-depth discussion on the topic, I warmly recommend Herb Sutter's article: Eliminate False Sharing, or a lecture: Machine Architecture: Things Your Programming Language Never Told You (and associated PDF slides).
If you are able to run this code in a Linux environment, you can use perf to measure the false-sharing effect.
For Linux and for both 32-bit and 64-bit Windows there are also AMD's CodeXL and CodeAnalyst; they will profile an application running on an AMD processor in much greater detail than one running on an Intel processor, since the applicable performance registers differ.
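A hedged sketch of the padding remedy in Java (class and field names are mine; note that the HotSpot JVM may reorder instance fields, so production code would use the `@Contended` annotation instead of manual filler fields):

```java
// Sketch: two counters updated by two threads. In Shared they likely sit on
// the same cache line; in Padded, filler longs push them ~64 bytes apart.
// Timings vary by machine; this only demonstrates the layout idea.
public class FalseSharingDemo {
    static class Shared {
        volatile long a, b;               // likely the same cache line
    }
    static class Padded {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7;  // 56 bytes of padding (may be reordered!)
        volatile long b;
    }

    static long run(Runnable r1, Runnable r2) {
        long t0 = System.nanoTime();
        Thread x = new Thread(r1), y = new Thread(r2);
        x.start(); y.start();
        try {
            x.join(); y.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        final int N = 10_000_000;
        Shared s = new Shared();
        Padded p = new Padded();
        long tShared = run(() -> { for (int i = 0; i < N; i++) s.a++; },
                           () -> { for (int i = 0; i < N; i++) s.b++; });
        long tPadded = run(() -> { for (int i = 0; i < N; i++) p.a++; },
                           () -> { for (int i = 0; i < N; i++) p.b++; });
        System.out.println("shared ns: " + tShared + ", padded ns: " + tPadded);
    }
}
```

On machines where the two fields really do share a line, the padded variant is typically much faster, which is the "cache ping-pong" the answer describes.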
I'm looking for Java code (or a library) that calculates the earth mover's distance (EMD) between two histograms, either directly or indirectly (e.g. using the Hungarian algorithm). I found several implementations of this in C/C++ (e.g. "Fast and Robust Earth Mover's Distances"), but I'm wondering if there is a Java version readily available.
I will be using the EMD calculation to evaluate the approach given by this paper in the context of a science project I'm working on.
Update
Using a variety of resources I estimate that the code below should do the trick. determineMinCostAssignment is the calculation of the optimal assignment as determined by the Hungarian algorithm. For this I will be using the code from http://konstantinosnedas.com/dev/soft/munkres.htm
My main concern is the calculated flow: I am not sure if this is correct. Is there someone who can verify that this is correct or not?
/**
 * Determines the Earth Mover's Distance between two histograms, assuming an
 * equal distance between two buckets of a histogram. The distance between two
 * buckets is equal to the difference in the indexes of the buckets.
 *
 * @param threshold
 *            The maximum distance to use between two buckets.
 */
public static double determineEarthMoversDistance(double[] histogram1, double[] histogram2, int threshold) {
    if (histogram1.length != histogram2.length)
        throw new InvalidParameterException("Each histogram must have the same number of elements");
    double[][] groundDistances = new double[histogram1.length][histogram2.length];
    for (int i = 0; i < histogram1.length; ++i) {
        for (int j = 0; j < histogram2.length; ++j) {
            int abs_diff = Math.abs(i - j);
            groundDistances[i][j] = Math.min(abs_diff, threshold);
        }
    }
    int[][] assignment = determineMinCostAssignment(groundDistances);
    double costSum = 0, flowSum = 0;
    for (int i = 0; i < assignment.length; i++) {
        double cost = groundDistances[assignment[i][0]][assignment[i][1]];
        double flow = histogram2[assignment[i][1]];
        costSum += cost * flow;
        flowSum += flow;
    }
    return costSum / flowSum;
}
Here's a pure Java port of the FastEMD algorithm that I just released:
https://github.com/telmomenezes/JFastEMD
The website "Fast and Robust Earth Mover's Distances" has a Java wrapper for the C/C++ code with compiled binary for Linux and Windows.
This is what I use for Java/Scala:
import org.apache.commons.math3.ml.distance.EarthMoversDistance
new EarthMoversDistance().compute(observed, expected)
https://github.com/wihoho/VideoRecognition
This project adapts the author's C implementation into a Python module through a file interface; the modified C code is under the folder EarthMoverDistance SourceCode.
I am pretty sure you can do the same thing with Java: just add a file interface to connect the C implementation of EMD to your Java code.