I use javaplex, a toolbox written in Java: [https://github.com/javaplex/javaplex.github.io][1]
Here is the link to the input matrix mat: [https://drive.google.com/file/d/0B3uM9Np2kJYoTmtkRHV2WU5JeGc/view?usp=sharing][3]
The first loop, which works fine, is:
cd C:\ProjetBCodesMatlab\Jplex
javaaddpath('./lib/javaplex.jar');
import edu.stanford.math.plex4.*;
javaaddpath('./lib/plex-viewer.jar');
import edu.stanford.math.plex_viewer.*;
cd './utility';
addpath(pwd);
cd '..';
max_dimension = 3;
max_filtration_value = 1000;
num_divisions = 10000;
options.max_filtration_value = max_filtration_value;
options.max_dimension = max_dimension - 1;
%------------------------------------------------------------
for i = 1:10
    maColonne = mat(i,:);
    intervals = Calcul_interval(maColonne, options, max_dimension, max_filtration_value, num_divisions);
    intervals
    multinterval{i} = intervals;
end
I use an i7. When I execute feature('numCores'), I get:
MATLAB detected: 4 physical cores.
MATLAB detected: 8 logical cores.
MATLAB was assigned: 8 logical cores by the OS.
MATLAB is using: 4 logical cores.
MATLAB is not using all logical cores because hyper-threading is enabled.
When I run the same code with parfor:
cd C:\ProjetBCodesMatlab\Jplex
javaaddpath('./lib/javaplex.jar');
import edu.stanford.math.plex4.*;
javaaddpath('./lib/plex-viewer.jar');
import edu.stanford.math.plex_viewer.*;
cd './utility';
addpath(pwd);
cd '..';
max_dimension = 3;
max_filtration_value = 1000;
num_divisions = 10000;
options.max_filtration_value = max_filtration_value;
options.max_dimension = max_dimension - 1;
%------------------------------------------------------------
parfor i = 1:10
    maColonne = mat(i,:);
    intervals = Calcul_interval(maColonne, options, max_dimension, max_filtration_value, num_divisions);
    intervals
    multinterval{i} = intervals;
end
I get this error: "... is not serializable".
Finally, I reproduced it. It cannot be serialized because your variable intervals holds an object of type edu.stanford.math.plex4.homology.barcodes.BarcodeCollection, which is not serializable. You have to make it serializable or extract the relevant data on the workers; the Parallel Computing Toolbox can only transport data that can be serialized.
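In practice that means converting intervals to plain numeric data (for example, a matrix of birth/death endpoints) inside the parfor body, so only serializable values cross the worker boundary. On the Java side, whether an object can be transported reduces to whether ObjectOutputStream can write it; here is a minimal, generic sketch of that check (the class and method names are my own, not part of javaplex or MATLAB):

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;

public class SerializableCheck {
    // Returns true if obj can be serialized, i.e. could be transported
    // between processes the way the Parallel Computing Toolbox requires.
    static boolean isSerializable(Object obj) {
        try (ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(obj);
            return true;
        } catch (Exception e) {
            // NotSerializableException for types that don't implement Serializable
            return false;
        }
    }

    public static void main(String[] args) {
        // A plain double[][] of (birth, death) pairs is serializable...
        double[][] intervals = {{0.0, 1.5}, {0.2, 0.9}};
        System.out.println(isSerializable(intervals));    // true
        // ...while an arbitrary object that does not implement Serializable is not.
        System.out.println(isSerializable(new Object()));  // false
    }
}
```

So the workaround is to do the extraction on the worker, storing only arrays of numbers in multinterval, rather than the BarcodeCollection objects themselves.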
I am trying to compare a simple addition task on both the CPU and the GPU, but the results I get are quite strange.
First of all, let me explain how I managed to run the GPU task.
Now let's dive into the code. It simply adds two arrays, once on the CPU and once with an Aparapi GPU kernel:
package gpu;
import com.aparapi.Kernel;
import com.aparapi.Range;
public class Try {
public static void main(String[] args) {
final int size = 512;
final float[] a = new float[size];
final float[] b = new float[size];
for (int i = 0; i < size; i++) {
a[i] = (float) (Math.random() * 100);
b[i] = (float) (Math.random() * 100);
}
//##############CPU-TASK########################
long start = System.nanoTime();
final float[] sum = new float[size];
for(int i=0;i<size;i++){
sum[i] = a[i] + b[i];
}
long finish = System.nanoTime();
long timeElapsed = finish - start;
//######################################
//##############GPU-TASK########################
final float[] sum2 = new float[size];
Kernel kernel = new Kernel(){
@Override public void run() {
int gid = getGlobalId();
sum2[gid] = a[gid] + b[gid];
}
};
long start1 = System.nanoTime();
kernel.execute(Range.create(size));
long finish2 = System.nanoTime();
long timeElapsed2 = finish2 - start1;
//##############GPU-TASK########################
System.out.println("cpu"+timeElapsed);
System.out.println("gpu"+timeElapsed2);
kernel.dispose();
}
}
My specs are:
Aparapi is running on an untested OpenCL platform version: OpenCL 3.0 CUDA 11.6.13
Intel Core i7 6850K @ 3.60GHz Broadwell-E/EP 14nm Technology
2047MB NVIDIA GeForce GTX 1060 6GB (ASUStek Computer Inc)
The results that I get are this:
cpu12000
gpu5732829900
My question is: why is the GPU performance so slow? Why does the CPU outperform the GPU? I expected the GPU to be faster than the CPU. Are my calculations wrong? Is there any way to improve this?
This code measures the host-side execution time for the GPU task. That means the measured time includes the task execution on the GPU, the time to copy the input data to the GPU, the time to read the results back from the GPU, and the overhead introduced by Aparapi. In addition, according to the documentation for the Kernel class, Aparapi uses lazy initialization:
On the first call to Kernel.execute(int _globalSize), Aparapi will determine the EXECUTION_MODE of the kernel.
This decision is made dynamically based on two factors:
Whether OpenCL is available (appropriate drivers are installed and the OpenCL and Aparapi dynamic libraries are included on the system path).
Whether the bytecode of the run() method (and every method that can be called directly or indirectly from the run() method) can be converted into OpenCL.
Therefore, the host-side execution time for the GPU task cannot be compared with the execution time of the CPU task, because it includes additional work that is performed only once.
In this case, it is necessary to use getProfileInfo() call to get the execution time breakdown for the kernel:
kernel.execute(Range.create(size));
List<ProfileInfo> profileInfo = kernel.getProfileInfo();
for (final ProfileInfo p : profileInfo) {
System.out.println(p.getType() + " " + p.getLabel() + " " + (p.getEnd() - p.getStart()) + "ns");
}
Also, please note that the following property must be set: -Dcom.aparapi.enableProfiling=true. For more information, please see the Profiling the Kernel article and the implementation of the ProfileInfo class.
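Independent of Aparapi's profiling API, the usual way to keep a one-time initialization cost out of a measurement is to execute the kernel once as a warm-up and only time subsequent runs. A minimal, library-free sketch of the pattern (the simulated sleep merely stands in for Aparapi's lazy OpenCL setup; it is not Aparapi code):

```java
public class WarmUpTiming {
    // A stand-in for kernel.execute(): the real call pays a one-time
    // initialization cost (OpenCL setup, bytecode conversion) on first use.
    static boolean initialized = false;

    static void work(float[] a, float[] b, float[] sum) {
        if (!initialized) {              // simulated lazy initialization
            try { Thread.sleep(100); } catch (InterruptedException e) { }
            initialized = true;
        }
        for (int i = 0; i < a.length; i++) {
            sum[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        int size = 512;
        float[] a = new float[size], b = new float[size], sum = new float[size];
        for (int i = 0; i < size; i++) { a[i] = i; b[i] = 2 * i; }

        work(a, b, sum);                 // warm-up run: absorbs the one-time cost

        long start = System.nanoTime();
        work(a, b, sum);                 // timed run measures only the steady state
        long steadyStateNs = System.nanoTime() - start;
        System.out.println("steady-state ns: " + steadyStateNs);
    }
}
```

With this pattern applied to the original benchmark (calling kernel.execute once before starting the timer), the GPU number drops dramatically, though for a 512-element addition the data-transfer cost will still dominate.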
I have the following code for Spark:
package my.spark;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
public class ExecutionTest {
public static void main(String[] args) {
SparkSession spark = SparkSession
.builder()
.appName("ExecutionTest")
.getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
int slices = 2;
int n = slices;
List<String> list = new ArrayList<>(n);
for (int i = 0; i < n; i++) {
list.add("" + i);
}
JavaRDD<String> dataSet = jsc.parallelize(list, slices);
dataSet.foreach(str -> {
System.out.println("value: " + str);
Thread.sleep(10000);
});
System.out.println("done");
spark.stop();
}
}
I have run a master node and two workers (everything on localhost, on Windows) using the commands:
bin\spark-class org.apache.spark.deploy.master.Master
and (two times):
bin\spark-class org.apache.spark.deploy.worker.Worker spark://<local-ip>:7077
Everything started correctly.
After submitting my job using command:
bin\spark-submit --class my.spark.ExecutionTest --master spark://<local-ip>:7077 file:///<pathToFatJar>/FatJar.jar
The command started, but the value: 0 and value: 1 outputs are written by one of the workers (as displayed under Logs > stdout on the page associated with that worker). The second worker has nothing in Logs > stdout. As far as I understand, this means that each iteration is done by the same worker.
How to run these tasks on two different running workers?
It is possible, but I'm not sure whether it will work correctly every time and everywhere. However, while testing, it worked as expected every time.
I have tested my code using host machine with Windows 10 x64, and 4 Virtual Machines (VM): VirtualBox with Debian 9 (stretch) kernel 4.9.0 x64, Host-Only network, Java 1.8.0_144, Apache Spark 2.2.0 for Hadoop 2.7 (spark-2.2.0-bin-hadoop2.7.tar.gz).
I have been using a master and 3 slaves on VMs, and one more slave on Windows:
debian-master - 1 CPU, 1 GB RAM
debian-slave1 - 1 CPU, 1 GB RAM
debian-slave2 - 1 CPU, 1 GB RAM
debian-slave3 - 2 CPU, 1 GB RAM
windows-slave - 4 CPU, 8 GB RAM
I was submitting my jobs from Windows machine to the master located on VM.
The beginning is the same as before:
SparkSession spark = SparkSession
.builder()
.config("spark.cores.max", coresCount) // not necessary
.appName("ExecutionTest")
.getOrCreate();
[important] coresCount is essential for partitioning: I have to partition the data using the number of cores in use, not the number of workers/executors.
Next, I have to create the JavaSparkContext and the RDD. Reusing the RDD allows executing multiple times on (probably) the same set of workers.
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
JavaRDD<Integer> rddList
= jsc.parallelize(
IntStream.range(0, coresCount * 2)
.boxed().collect(Collectors.toList()))
.repartition(coresCount);
I have created rddList with coresCount * 2 elements. A number of elements equal to coresCount does not allow running on all associated workers (in my case). Maybe coresCount + 1 would be enough, but I have not tested it, as coresCount * 2 is not much either.
Next thing to do is to run commands:
List<String> hostsList
= rddList.map(value -> {
Thread.sleep(3_000);
return InetAddress.getLocalHost().getHostAddress();
})
.distinct()
.collect();
System.out.println("-----> hostsList = " + hostsList);
The Thread.sleep(3_000) is necessary for proper distribution of the tasks. Three seconds is enough for me. The value could probably be smaller, and sometimes a higher value may be necessary (I guess it depends on how fast the workers get tasks to execute from the master).
The above code will run once per core associated with a worker, so more than once per worker. To run exactly one command on each worker, I used the following code:
/* as static field of class */
private static final AtomicBoolean ONE_ON_WORKER = new AtomicBoolean(false);
...
long nodeCount
= rddList.map(value -> {
Thread.sleep(3_000);
if (ONE_ON_WORKER.getAndSet(true) == false) {
System.out.println("Executed on "
+ InetAddress.getLocalHost().getHostName());
return 1;
} else {
return 0;
}
})
.filter(val -> val != 0)
.count();
System.out.println("-----> finished using #nodes = " + nodeCount);
And of course, at the end, the stop:
spark.stop();
I have written two programs implementing a simple algorithm for matrix multiplication, one in C++ and one in Java. Contrary to my expectations, the Java program runs about 2.5x faster than the C++ program. I am a novice at C++, and would like suggestions on what I can change in the C++ program to make it run faster.
My programs borrow code and data from this blog post http://martin-thoma.com/matrix-multiplication-python-java-cpp .
Here are the current compilation flags I am using:
g++ -O3 main.cc
javac Main.java
Here are the current compiler/runtime versions:
$ g++ --version
g++.exe (GCC) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ java -version
java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)
My computer is a ~2012-era Core i3 laptop running Windows with MinGW. Here are the current performance results:
$ time ./a.exe < ../Testing/2000.in
507584919
real 0m36.469s
user 0m0.031s
sys 0m0.030s
$ time java Main < ../Testing/2000.in
507584919
real 0m14.299s
user 0m0.031s
sys 0m0.015s
Here is the C++ program:
#include <iostream>
#include <cstdio>
using namespace std;
int *A;
int *B;
int height;
int width;
int * matMult(int A[], int B[]) {
int * C = new int[height*width]();  // note the (): value-initializes to zero; plain new int[] leaves the array uninitialized
int n = height;
for (int i = 0; i < n; i++) {
for (int k = 0; k < n; k++) {
for (int j = 0; j < n; j++) {
C[width*i+j]+=A[width*i+k] * B[width*k+j];
}
}
}
return C;
}
int main() {
std::ios::sync_with_stdio(false);
cin >> height;
cin >> width;
A = new int[width*height];
B = new int[width*height];
for (int i = 0; i < width*height; i++) {
cin >> A[i];
}
for (int i = 0; i < width*height; i++) {
cin >> B[i];
}
int *result = matMult(A,B);
cout << result[2];
}
Here is the java program:
import java.util.*;
import java.io.*;
public class Main {
static int[] A;
static int[] B;
static int height;
static int width;
public static void main(String[] args) {
try {
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
height = Integer.parseInt(reader.readLine());
width = Integer.parseInt(reader.readLine());
A=new int[width*height];
B=new int[width*height];
int index = 0;
String thisLine;
while ((thisLine = reader.readLine()) != null) {
if (thisLine.trim().equals("")) {
break;
} else {
String[] lineArray = thisLine.split("\t");
for (String number : lineArray) {
A[index] = Integer.parseInt(number);
index++;
}
}
}
index = 0;
while ((thisLine = reader.readLine()) != null) {
if (thisLine.trim().equals("")) {
break;
} else {
String[] lineArray = thisLine.split("\t");
for (String number : lineArray) {
B[index] = Integer.parseInt(number);
index++;
}
}
}
int[] result = matMult(A,B);
System.out.println(result[2]);
reader.close();
} catch (Exception e) {
e.printStackTrace();
}
}
public static int[] matMult(int[] A, int[] B) {
int[] C = new int[height*width];
int n = height;
for (int i = 0; i < n; i++) {
for (int k = 0; k < n; k++) {
for (int j = 0; j < n; j++) {
C[width*i+j]+=A[width*i+k] * B[width*k+j];
}
}
}
return C;
}
}
Here is a link to a 2000x2000 test case: https://mega.nz/#!sglWxZqb!HBts_UlZnR4X9gZR7bG-ej3xf2A5vUv0wTDUW-kqFMA
Here is a link to a 2x2 test case: https://mega.nz/#!QwkV2SII!AtfGuxPV5bQeZtt9eHNNn36rnV4sGq0_sJzitjiFE8s
Any advice explaining what I am doing wrong in C++, or why my C++ implementation is running so much slower than Java here, would be much appreciated!
EDIT: As suggested, I modified the programs so that they do not actually perform the multiplication, but just read the arrays in and print one number from each. Here are the performance results for that. The C++ program has slower IO; that only accounts for part of the difference, however.
$ time ./IOonly.exe < ../Testing/2000.in
7
944
real 0m8.158s
user 0m0.000s
sys 0m0.046s
$ time java IOOnly < ../Testing/2000.in
7
944
real 0m1.461s
user 0m0.000s
sys 0m0.047s
I'm not able to analyze the Java execution, since it creates a temporary executable module that disappears after it's been "used". However, I assume that it either executes SSE instructions to get that speed, or unrolls the loop (which clang++ does if you disable SSE instructions).
But compiling with g++ (4.9.2) and clang++, I can clearly see that clang optimizes the loop to use SSE instructions, where gcc doesn't. The resulting code is thus exactly 4 times slower. Changing the code so that it uses a constant value of 2000 in each dimension (so the compiler "knows" the height and width), the gcc-generated code also takes around 8s (on my machine!), compared to 27s with the "variable" version (the clang-compiled code is marginally faster here as well, but within the noise, I'd say).
Overall conclusion: the quality/cleverness of the compiler highly affects the performance of tight loops. The more complex and varied the code, the more likely it is that the C++ solution will generate better code; simple, easy-to-compile problems are quite likely to be better in Java (as a rule, but not guaranteed). I expect the Java compiler uses profiling to determine the number of loop iterations, for example.
Edit:
The result of time can be used to determine if the reading of the file is taking a long time, but you need some kind of profiling tool to determine if the actual input is using a lot of CPU-time and such.
The Java engine uses a "just-in-time compiler", which uses profiling to determine the number of times a particular piece of code is hit (you can do that for C++ too, and big projects often do!). This allows it, for example, to unroll a loop, or to determine at runtime the number of iterations in a loop. Given that this code does 2000 * 2000 * 2000 iterations, and that the C++ compiler actually does a BETTER job when it KNOWS the sizes involved, this tells us that the Java runtime isn't actually doing better (at least not initially), just that it manages to improve the performance over time.
Unfortunately, due to the way that the java runtime works, it doesn't leave the binary code behind, so I can't really analyze what it does.
The key here is that the actual operations you are doing are simple, and the logic is simple, it's just an awful lot of them, and you are doing them using a trivial implementation. Both Java and C++ will benefit from manually unrolling the loop, for example.
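As an illustration of that last point, here is a sketch of the same i-k-j loop with the inner loop manually unrolled by four and the A element hoisted into a local. Whether this actually helps depends on what the JIT or compiler already does; treat it as an experiment to try, not a guaranteed win:

```java
public class UnrolledMatMult {
    // Same i-k-j multiply as in the question (square n x n, row-major),
    // with the inner j-loop unrolled by 4 and A[n*i+k] hoisted out of it.
    static int[] matMult(int[] A, int[] B, int n) {
        int[] C = new int[n * n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                int a = A[n * i + k];   // loop-invariant for the j-loop
                int j = 0;
                for (; j + 3 < n; j += 4) {      // unrolled body: 4 updates per iteration
                    C[n * i + j]     += a * B[n * k + j];
                    C[n * i + j + 1] += a * B[n * k + j + 1];
                    C[n * i + j + 2] += a * B[n * k + j + 2];
                    C[n * i + j + 3] += a * B[n * k + j + 3];
                }
                for (; j < n; j++) {             // remainder when n is not a multiple of 4
                    C[n * i + j] += a * B[n * k + j];
                }
            }
        }
        return C;
    }

    public static void main(String[] args) {
        int[] A = {1, 2, 3, 4};   // 2x2 row-major: [1 2; 3 4]
        int[] B = {5, 6, 7, 8};   // [5 6; 7 8]
        int[] C = matMult(A, B, 2);
        System.out.println(C[0] + " " + C[1] + " " + C[2] + " " + C[3]); // 19 22 43 50
    }
}
```

The same transformation can be applied line-for-line to the C++ version; blocking (tiling) the loops for cache is the usual next step after unrolling.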
C++ is not faster than Java by default
C++ is fast as a language, but as soon as you incorporate libraries into the mix, you are bound to those libraries' speed.
The standard library is hardly built for performance, period; it is written with design and correctness in mind.
C++ gives you the opportunity to optimize!
If you are unhappy with the standard library's performance, you can, and you should, use your own optimized version.
For example, standard C++ IO objects are beautiful when it comes to design (stream, locales, facets, inner buffers) but that makes them terrible at performance.
If you are writing for Windows OS, you can use ReadFile and WriteConsole as your mechanism for IO.
If you switch to these functions instead of the standard library, your program can outperform the Java one by a few orders of magnitude.
Has anyone compared the processing power of mobile devices with PCs? I have a very simple matrix workload. Coded in Java, it takes ~115 ms for my old PC to finish; the very same function takes ~17000 ms on my Android tablet. I was very shocked. I didn't expect the tablet to be close to the PC, but I didn't expect it to be ~150x slower either!
Has anyone had a similar experience? Any suggestion? Does it help if I write the code in C and use Android NDK?
The benchmark code in Java:
package mainpackage;
import java.util.Date;
public class mainclass {
public static void main(String[] args){
Date startD = new Date();
double[][] testOut;
double[] v = {1,0,0};
double t;
for (int i = 0; i < 100000; i++) {
t=Math.random();
testOut=rot_mat(v, t);
}
Date endD = new Date();
System.out.println("Time Taken ms: "+(-startD.getTime()+endD.getTime()));
}
public static double[][] rot_mat(double v[], double t)
{
double absolute;
double x[] = new double[3];
double temp[][] = new double[3][3];
double temp_2[][] = new double[3][3];
double sum;
int i;
int k;
int j;
// Normalize the v matrix into k
absolute = abs_val_vec(v);
for (i = 0; i < 3; i++)
{
x[i] = v[i] / absolute;
}
// Create 3x3 matrix kx
double kx[][] = {{0, -x[2], x[1]},
{x[2], 0, -x[0]},
{-x[1], x[0], 0}};
// Calculate output
// Calculate third term in output
for (i = 0; i < 3; i++)
{
for (j = 0; j < 3; j++)
{
sum = 0;
for (k = 0; k < 3; k++)
{
sum = sum + kx[i][k] * kx[k][j];
}
temp[i][j] = (1-Math.cos(t))*sum;
}
}
// Calculate second term in output
for (i = 0; i < 3; i++)
{
for (k = 0; k < 3; k++)
{
temp_2[i][k] = Math.sin(t)*kx[i][k];
}
}
// Calculate output
double[][] resOut = new double[3][3];
for (i = 0; i < 3; i++)
{
for (k = 0; k < 3; k++)
{
resOut[i][k] = temp_2[i][k] + temp[i][k] + ((i==k)?1:0);
}
}
return resOut;
}
private static double abs_val_vec (double v[])
{
double output;
output = Math.sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
return output;
}
}
Any suggestion?
Micro-benchmarks only measure the performance of the micro-benchmark. And the only decent way to interpret micro-benchmarks is with micro-measurements. Hence, savvy programmers would use tools like Traceview to get a better sense of where their time is being spent.
I suspect that if you ran this through Traceview, and looked at LogCat, you would find that your time is being spent in two areas:
Memory allocation and garbage collection. Your micro-benchmark is chewing through ~3MB of heap space. In production code, you'd never do that, at least if you wanted to keep your job.
Floating-point operations. Depending upon your tablet, you may not have a floating-point co-processor, and doing floating-point math on the CPU without a floating-point co-processor is very, very slow.
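To illustrate the allocation point: rot_mat allocates four arrays per call, 100,000 times over. A sketch of an allocation-free variant that expands R = I + sin(t)·K + (1-cos(t))·K² by hand into a caller-provided buffer (this is a hypothetical refactor of the benchmark, not the original code):

```java
public class RotMatNoAlloc {
    // Rodrigues rotation matrix R = I + sin(t)*K + (1 - cos(t))*K*K,
    // written into a caller-provided 3x3 buffer: no per-call allocation,
    // so the garbage collector never runs during the benchmark loop.
    static void rotMat(double[] v, double t, double[][] out) {
        double abs = Math.sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
        double x0 = v[0] / abs, x1 = v[1] / abs, x2 = v[2] / abs;
        double s = Math.sin(t), c1 = 1 - Math.cos(t);
        // K (the cross-product matrix of x) and K*K expanded by hand; no temp arrays.
        out[0][0] = 1 + c1 * (-x2*x2 - x1*x1);
        out[0][1] = s * (-x2) + c1 * (x0*x1);
        out[0][2] = s * x1 + c1 * (x0*x2);
        out[1][0] = s * x2 + c1 * (x0*x1);
        out[1][1] = 1 + c1 * (-x2*x2 - x0*x0);
        out[1][2] = s * (-x0) + c1 * (x1*x2);
        out[2][0] = s * (-x1) + c1 * (x0*x2);
        out[2][1] = s * x0 + c1 * (x1*x2);
        out[2][2] = 1 + c1 * (-x1*x1 - x0*x0);
    }

    public static void main(String[] args) {
        double[][] out = new double[3][3];   // allocated once, reused every iteration
        double[] v = {1, 0, 0};
        long start = System.nanoTime();
        for (int i = 0; i < 100000; i++) {
            rotMat(v, Math.random(), out);
        }
        System.out.println("Time taken ms: " + (System.nanoTime() - start) / 1_000_000);
    }
}
```

Comparing this against the original under Traceview would separate the GC cost from the floating-point cost.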
Does it help if I write the code in C and use Android NDK?
Well, until you profile the code under Traceview, that will be difficult to answer. For example, if the time is mostly spent in sqrt(), cos(), and sin(), that already is native code, and you won't get much faster.
More importantly, even if this micro-benchmark might improve with native code, all that does is demonstrate that this micro-benchmark might improve with native code. For example, a C translation of this might be faster due to manual heap management (malloc() and free()) rather than garbage collection. But that is more an indictment of how poorly the micro-benchmark was written than it is a statement about how much faster C will be, as production Java code would be optimized better than this.
Beyond learning how to use Traceview, I suggest:
Reading the NDK documentation, as it includes information about when native code may make sense.
Reading up on Renderscript Compute. On some devices, using Renderscript Compute can offload integer math onto the GPU, for a massive performance boost. That would not help your floating-point micro-benchmark, but for other matrix calculations (e.g., image processing), Renderscript Compute may be well worth researching.
Processing power alone is not everything when you compare very different architectures. In fact, you're very likely not benchmarking the computing architectures alone.
A key factor in benchmarking: when you're dealing with something that takes a lot of variables into account, isolate the one you want to test, and keep the others constant and preferably equal.
In your situation, some examples for variables that affect your result:
the actual computing architecture, which is a complex set of variables itself (processor design and implementation, memory hierarchy etc)
the OS
the different Java Virtual Machine implementation for the different variables above
the additional layers that Dalvik implies
There are at least eight sets of comparisons between PCs and Android devices among my numerous Android benchmarks at the link below. Below are results from my Linpack benchmark (including Java) that show the Androids in a better light than your results. Other results (like Dhrystone) show that, on a per-MHz basis, ARM's CPUs can match Intel's.
http://www.roylongbottom.org.uk/android%20benchmarks.htm
Linpack Benchmark Results
System ARM MHz Android Linpackv5 Linpackv7 LinpackSP NEONLinpack LinpackJava
See MFLOPS MFLOPS MFLOPS MFLOPS MFLOPS
T1 926EJ 800 2.2 5.63 5.67 9.61 N/A 2.33
P4 v7-A8 800 2.3.5 80.18 28.34 #G
T2 v7-A9 800 2.3.4 10.56 101.39 129.05 255.77 33.36
P5 v7-A9 1500 4.0.3 171.39 50.87 #G
T4 v7-A9 1500a 4.0.3 16.86 155.52 204.61 382.46 56.89
T6 v7-A9 1600 4.0.3 196.47
T7 v7-A9 1300a 4.1.2 17.08 151.05 201.30 376.00 56.44
T9 926EJ 800 2.2 5.66
T11 v7-A15 2000b 4.2.2 28.82 459.17 803.04 1334.90 143.06
T12 v7-A9 1600 4.1.2 147.07
T14 v7-A9 1500 4.0.4 180.95
P11 v7-A9 1400 4.0.4 19.89 184.44 235.54 454.21 56.99
P10 QU-S4 1500 4.0.3 254.90
Measured MHz a=1200, b=1700
Atom 1666 Linux 204.09 215.73 117.81
Atom 1666 Windows 183.22 118.70
Atom 1666 And x86 15.65
Core 2 2400 Linux 1288.00 901.00
Core 2 2400 Windows 1315.29 551.00
Core 2 2400 And x86 53.27
System - T = Tablet, P = Phone, #G = GreenComputing, QU = Qualcomm CPU
And 86 = Android x86
What are the semantics of compare-and-swap in Java? Namely, does the compareAndSet method of an AtomicInteger just guarantee ordered access between different threads to the particular memory location of the atomic integer instance, or does it guarantee ordered access to all locations in memory, i.e. does it act as if it were a volatile (a memory fence)?
From the docs:
weakCompareAndSet atomically reads and conditionally writes a variable but does not create any happens-before orderings, so provides no guarantees with respect to previous or subsequent reads and writes of any variables other than the target of the weakCompareAndSet.
compareAndSet and all other read-and-update operations such as getAndIncrement have the memory effects of both reading and writing volatile variables.
It's apparent from the API documentation that compareAndSet acts as if it were a volatile variable. However, weakCompareAndSet is supposed to change only its specific memory location. Thus, if that memory location is exclusive to the cache of a single processor, weakCompareAndSet is supposed to be much faster than the regular compareAndSet.
I'm asking this because I've benchmarked the following methods by running threadnum different threads, varying threadnum from 1 to 8 and fixing totalwork = 1e9 (the code is written in Scala, a statically compiled JVM language, but both its meaning and its bytecode translation are isomorphic to those of Java in this case; these short snippets should be clear):
val atomic_cnt = new AtomicInteger(0)
val atomic_tlocal_cnt = new java.lang.ThreadLocal[AtomicInteger] {
override def initialValue = new AtomicInteger(0)
}
def loop_atomic_tlocal_cas = {
var i = 0
val until = totalwork / threadnum
val acnt = atomic_tlocal_cnt.get
while (i < until) {
i += 1
acnt.compareAndSet(i - 1, i)
}
acnt.get + i
}
def loop_atomic_weakcas = {
var i = 0
val until = totalwork / threadnum
val acnt = atomic_cnt
while (i < until) {
i += 1
acnt.weakCompareAndSet(i - 1, i)
}
acnt.get + i
}
def loop_atomic_tlocal_weakcas = {
var i = 0
val until = totalwork / threadnum
val acnt = atomic_tlocal_cnt.get
while (i < until) {
i += 1
acnt.weakCompareAndSet(i - 1, i)
}
acnt.get + i
}
on an AMD with 4 dual-core 2.8 GHz processors, and on a 2.67 GHz 4-core i7 processor. The JVM is Sun's Server HotSpot JVM 1.6. The results show no performance difference.
Specs: AMD 8220 4x dual-core @ 2.8 GHz
Test name: loop_atomic_tlocal_cas
Thread num.: 1
Run times: (showing last 3)
7504.562 7502.817 7504.626 (avg = 7415.637 min = 7147.628 max = 7504.886 )
Thread num.: 2
Run times: (showing last 3)
3751.553 3752.589 3751.519 (avg = 3713.5513 min = 3574.708 max = 3752.949 )
Thread num.: 4
Run times: (showing last 3)
1890.055 1889.813 1890.047 (avg = 2065.7207 min = 1804.652 max = 3755.852 )
Thread num.: 8
Run times: (showing last 3)
960.12 989.453 970.842 (avg = 1058.8776 min = 940.492 max = 1893.127 )
Test name: loop_atomic_weakcas
Thread num.: 1
Run times: (showing last 3)
7325.425 7057.03 7325.407 (avg = 7231.8682 min = 7057.03 max = 7325.45 )
Thread num.: 2
Run times: (showing last 3)
3663.21 3665.838 3533.406 (avg = 3607.2149 min = 3529.177 max = 3665.838 )
Thread num.: 4
Run times: (showing last 3)
3664.163 1831.979 1835.07 (avg = 2014.2086 min = 1797.997 max = 3664.163 )
Thread num.: 8
Run times: (showing last 3)
940.504 928.467 921.376 (avg = 943.665 min = 919.985 max = 997.681 )
Test name: loop_atomic_tlocal_weakcas
Thread num.: 1
Run times: (showing last 3)
7502.876 7502.857 7502.933 (avg = 7414.8132 min = 7145.869 max = 7502.933 )
Thread num.: 2
Run times: (showing last 3)
3752.623 3751.53 3752.434 (avg = 3710.1782 min = 3574.398 max = 3752.623 )
Thread num.: 4
Run times: (showing last 3)
1876.723 1881.069 1876.538 (avg = 4110.4221 min = 1804.62 max = 12467.351 )
Thread num.: 8
Run times: (showing last 3)
959.329 1010.53 969.767 (avg = 1072.8444 min = 959.329 max = 1880.049 )
Specs: Intel i7 quad-core @ 2.67 GHz
Test name: loop_atomic_tlocal_cas
Thread num.: 1
Run times: (showing last 3)
8138.3175 8130.0044 8130.1535 (avg = 8119.2888 min = 8049.6497 max = 8150.1950 )
Thread num.: 2
Run times: (showing last 3)
4067.7399 4067.5403 4068.3747 (avg = 4059.6344 min = 4026.2739 max = 4068.5455 )
Thread num.: 4
Run times: (showing last 3)
2033.4389 2033.2695 2033.2918 (avg = 2030.5825 min = 2017.6880 max = 2035.0352 )
Test name: loop_atomic_weakcas
Thread num.: 1
Run times: (showing last 3)
8130.5620 8129.9963 8132.3382 (avg = 8114.0052 min = 8042.0742 max = 8132.8542 )
Thread num.: 2
Run times: (showing last 3)
4066.9559 4067.0414 4067.2080 (avg = 4086.0608 min = 4023.6822 max = 4335.1791 )
Thread num.: 4
Run times: (showing last 3)
2034.6084 2169.8127 2034.5625 (avg = 2047.7025 min = 2032.8131 max = 2169.8127 )
Test name: loop_atomic_tlocal_weakcas
Thread num.: 1
Run times: (showing last 3)
8132.5267 8132.0299 8132.2415 (avg = 8114.9328 min = 8043.3674 max = 8134.0418 )
Thread num.: 2
Run times: (showing last 3)
4066.5924 4066.5797 4066.6519 (avg = 4059.1911 min = 4025.0703 max = 4066.8547 )
Thread num.: 4
Run times: (showing last 3)
2033.2614 2035.5754 2036.9110 (avg = 2033.2958 min = 2023.5082 max = 2038.8750 )
While it's possible that the thread locals in the example above end up in the same cache lines, it seems to me that there is no observable performance difference between regular CAS and its weak version.
This could mean that, in fact, a weak compare-and-swap acts as a fully fledged memory fence, i.e. as if it were a volatile variable.
Question: Is this observation correct? Also, is there a known architecture or Java distribution for which a weak compare and set is actually faster? If not, what is the advantage of using a weak CAS in the first place?
A weak compare and swap could act as a full volatile variable, depending on the implementation of the JVM, sure. In fact, I wouldn't be surprised if on certain architectures it is not possible to implement a weak CAS in a notably more performant way than the normal CAS. On these architectures, it may well be the case that weak CASes are implemented exactly the same as a full CAS. Or it might simply be that your JVM has not had much optimisation put into making weak CASes particularly fast, so the current implementation just invokes a full CAS because it's quick to implement, and a future version will refine this.
The JLS simply says that a weak CAS does not establish a happens-before relationship, so it's simply that there is no guarantee that the modification it causes is visible in other threads. All you get in this case is the guarantee that the compare-and-set operation is atomic, but with no guarantees about the visibility of the (potentially) new value. That's not the same as guaranteeing that it won't be seen, so your tests are consistent with this.
In general, try to avoid making any conclusions about concurrency-related behaviour through experimentation. There are so many variables to take into account, that if you don't follow what the JLS guarantees to be correct, then your program could break at any time (perhaps on a different architecture, perhaps under more aggressive optimisation that's prompted by a slight change in the layout of your code, perhaps under future builds of the JVM that don't exist yet, etc.). There's never a reason to assume you can get away with something that's stated not to be guaranteed, because experiments show that "it works".
The x86 instruction for "atomically compare and swap" is LOCK CMPXCHG. This instruction creates a full memory fence.
There is no instruction that does this job without creating a memory fence, so it is very likely that both compareAndSet and weakCompareAndSet map to LOCK CMPXCHG and perform a full memory fence.
But that's for x86, other architectures (including future variants of x86) may do things differently.
weakCompareAndSet is not guaranteed to be faster; it's just permitted to be faster. You can look at the open-source code of the OpenJDK to see what some smart people decided to do with this permission:
source code of compareAndSet
source code of weakCompareAndSet
Namely: They're both implemented as the one-liner
return unsafe.compareAndSwapObject(this, valueOffset, expect, update);
They have exactly the same performance, because they have exactly the same implementation! (in OpenJDK at least). Other people have remarked on the fact that you can't really do any better on x86 anyway, because the hardware already gives you a bunch of guarantees "for free". It's only on simpler architectures like ARM that you have to worry about it.
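The intended use of the weak form follows from its contract: because it may fail spuriously, it is meant to sit inside a retry loop, where a single failed attempt costs nothing. A minimal sketch of that idiom:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class WeakCasLoop {
    // Idiomatic use of weakCompareAndSet: always inside a retry loop,
    // because the weak form is allowed to fail spuriously even when
    // the current value matches the expected one.
    static int incrementAndGet(AtomicInteger counter) {
        while (true) {
            int current = counter.get();
            int next = current + 1;
            if (counter.weakCompareAndSet(current, next)) {
                return next;
            }
            // spurious failure or a concurrent update: just retry
        }
    }

    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger(0);
        for (int i = 0; i < 1000; i++) {
            incrementAndGet(counter);
        }
        System.out.println(counter.get()); // 1000
    }
}
```

On architectures where weak CAS can be cheaper (e.g. LL/SC-based ones like ARM), the loop absorbs the occasional spurious failure while still benefiting from the cheaper primitive.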