Why is accessing a volatile variable about 100 times slower than a member? - java

Here I wrote a test of the access speed of a local variable, a member, and a volatile member:
public class VolatileTest {
public int member = -100;
public volatile int volatileMember = -100;
public static void main(String[] args) {
int testloop = 10;
for (int i = 1; i <= testloop; i++) {
System.out.println("Round:" + i);
VolatileTest vt = new VolatileTest();
vt.runTest();
System.out.println();
}
}
public void runTest() {
int local = -100;
int loop = 1;
int loop2 = Integer.MAX_VALUE;
long startTime;
startTime = System.currentTimeMillis();
for (int i = 0; i < loop; i++) {
for (int j = 0; j < loop2; j++) {
}
for (int j = 0; j < loop2; j++) {
}
}
System.out.println("Empty:" + (System.currentTimeMillis() - startTime));
startTime = System.currentTimeMillis();
for (int i = 0; i < loop; i++) {
for (int j = 0; j < loop2; j++) {
local++;
}
for (int j = 0; j < loop2; j++) {
local--;
}
}
System.out.println("Local:" + (System.currentTimeMillis() - startTime));
startTime = System.currentTimeMillis();
for (int i = 0; i < loop; i++) {
for (int j = 0; j < loop2; j++) {
member++;
}
for (int j = 0; j < loop2; j++) {
member--;
}
}
System.out.println("Member:" + (System.currentTimeMillis() - startTime));
startTime = System.currentTimeMillis();
for (int i = 0; i < loop; i++) {
for (int j = 0; j < loop2; j++) {
volatileMember++;
}
for (int j = 0; j < loop2; j++) {
volatileMember--;
}
}
System.out.println("VMember:" + (System.currentTimeMillis() - startTime));
}
}
And here is a result on my X220 (I5 CPU):
Round:1
Empty:5
Local:10
Member:312
VMember:33378
Round:2
Empty:31
Local:0
Member:294
VMember:33180
Round:3
Empty:0
Local:0
Member:306
VMember:33085
Round:4
Empty:0
Local:0
Member:300
VMember:33066
Round:5
Empty:0
Local:0
Member:303
VMember:33078
Round:6
Empty:0
Local:0
Member:299
VMember:33398
Round:7
Empty:0
Local:0
Member:305
VMember:33139
Round:8
Empty:0
Local:0
Member:307
VMember:33490
Round:9
Empty:0
Local:0
Member:350
VMember:35291
Round:10
Empty:0
Local:0
Member:332
VMember:33838
It surprised me that access to a volatile member is 100 times slower than access to a normal member. I know volatile members have some special properties: a modification is immediately visible to all threads, and an access to a volatile variable acts as a "memory barrier". But can these side effects really be the main cause of a 100-fold slowdown?
PS: I also ran the test on a Core 2 CPU machine. The ratio there is about 9:50, so about 5 times slower. It seems this also depends on the CPU architecture. 5 times is still a lot, right?

The volatile members are never cached, so they are read directly from the main memory.

Access to a volatile field prevents some JIT optimisations. This is especially important if you have a loop which doesn't really do anything, as the JIT can optimise such loops away (unless they touch a volatile field). If you run the loops for longer, the discrepancy should increase further.
In a more realistic test, you might expect volatile to be between 30% and 10x slower for critical code. In most real programs it makes very little difference because the CPU is smart enough to "realise" that only one core is using the volatile field and cache it rather than using main memory.
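The point about loops being optimised away can be probed with a benchmark whose loops produce a value that is actually used. The following is only a minimal sketch: the class name FieldAccessSketch, its method names, and the iteration count are hypothetical and not taken from the question, and the timings it prints are indicative at best.

```java
final class FieldAccessSketch {
    int plain = 0;
    volatile int vol = 0;

    // Increments the plain field n times. The JIT is free to keep the value
    // in a register or to simplify the loop, because no other thread is
    // guaranteed to observe the intermediate values.
    int bumpPlain(int n) {
        for (int i = 0; i < n; i++) {
            plain++;
        }
        return plain;
    }

    // Increments the volatile field n times. In practice HotSpot performs a
    // volatile read and a volatile write on every iteration here.
    // Note: vol++ is a read followed by a write, it is not atomic.
    int bumpVolatile(int n) {
        for (int i = 0; i < n; i++) {
            vol++;
        }
        return vol;
    }

    public static void main(String[] args) {
        int n = 50_000_000; // hypothetical count, much smaller than in the question
        for (int round = 1; round <= 5; round++) {
            FieldAccessSketch s = new FieldAccessSketch();
            long t0 = System.nanoTime();
            int a = s.bumpPlain(n);
            long t1 = System.nanoTime();
            int b = s.bumpVolatile(n);
            long t2 = System.nanoTime();
            // Printing the results keeps both loops observable.
            System.out.println("Round " + round
                    + ": plain=" + (t1 - t0) / 1_000_000 + " ms"
                    + ", volatile=" + (t2 - t1) / 1_000_000 + " ms"
                    + " (results " + a + ", " + b + ")");
        }
    }
}
```

Later rounds are usually more representative than the first one, since the methods have been compiled by then. For numbers you intend to rely on, a dedicated harness such as JMH is the safer choice.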

Access to a volatile variable prevents the CPU from re-ordering the instructions before and after the access, and this generally slows down execution.

Using volatile reads from memory directly, so that every CPU core sees the change the next time it reads the variable. No CPU cache is used, which means registers and the L1 to L3 caches are bypassed. Typical read latencies:
register 1 clock cycle
L1 cache 4 clock cycles
L2 cache 11 clock cycles
L3 cache 30~40 clock cycles
Memory 100+ clock cycles
That's why your result is about 100 times slower when using volatile.

Related

Is it possible to run my following code in parallel?

The following code performs Gaussian elimination on a matrix. My job is to apply a Java concurrency technique to turn it into a parallel program.
However, the problem is that each iteration of the outer loop has a data dependency on the previous iteration. I have also found that it is too costly to use a parallel technique inside the outer loop. Can someone help me with this? How can I make the following code run in parallel? Is there any Java concurrency technique that can handle this situation?
for (int i = 0; i <1; i++) {
int max = i;
for (int j = i + 1; j < N; j++) {
if (Math.abs(matrix[j][i]) > Math.abs(matrix[max][i])) {
max = j;
}
}
double[] temp = matrix[i];
matrix[i] = matrix[max];
matrix[max] = temp;
for (int k = i + 1; k < N; k++)
{
double alpha = matrix[k][i] / matrix[i][i];
for (int j = i; j < N; j++)
{
matrix[k][j] -= alpha * matrix[i][j];
}
}
}
The work done by the k loop can be made parallel, since the modified data is non-overlapping.
Simply delegate the work done by the body of the loop to a thread.
Easiest way is to use a Java 8 parallel stream, i.e. replace the k loop with:
final int ii = i; // since 'i' is not effectively-final
IntStream.range(ii + 1, N).parallel().forEach(k -> {
double alpha = matrix[k][ii] / matrix[ii][ii];
for (int j = ii; j < N; j++) {
matrix[k][j] -= alpha * matrix[ii][j];
}
});
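Put together, the suggestion might look like the sketch below. It rests on several assumptions: the class name ParallelEliminationSketch and the method eliminate are hypothetical, the outer loop runs over all rows (the question's loop bound is left as it was posted), and a zero pivot is not handled. Only the row updates for a fixed pivot run in parallel; the outer loop stays sequential because of the data dependency described in the question.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class ParallelEliminationSketch {
    // Forward elimination with partial pivoting. For each pivot row, the
    // updates of the rows below it are independent and run in parallel.
    static void eliminate(double[][] matrix) {
        final int n = matrix.length;
        for (int i = 0; i < n; i++) {
            // Sequential part: find the pivot row and swap it into place.
            int max = i;
            for (int j = i + 1; j < n; j++) {
                if (Math.abs(matrix[j][i]) > Math.abs(matrix[max][i])) {
                    max = j;
                }
            }
            double[] temp = matrix[i];
            matrix[i] = matrix[max];
            matrix[max] = temp;

            // Parallel part: each task writes only to its own row k and
            // only reads the pivot row, so the writes do not overlap.
            final int pivot = i;
            final double[] pivotRow = matrix[pivot];
            IntStream.range(pivot + 1, n).parallel().forEach(k -> {
                double[] row = matrix[k];
                double alpha = row[pivot] / pivotRow[pivot];
                for (int j = pivot; j < row.length; j++) {
                    row[j] -= alpha * pivotRow[j];
                }
            });
        }
    }

    public static void main(String[] args) {
        double[][] m = { {2, 1, -1}, {-3, -1, 2}, {-2, 1, 2} };
        eliminate(m);
        for (double[] row : m) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

For small matrices the cost of handing work to the common fork/join pool can outweigh the gain, so it is worth measuring against the sequential loop for the sizes you actually use.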

Multi-threaded Matrix Multiplication in Java. Average times are off. Am I using executors correctly?

I'm trying to do multi-threaded matrix multiplication in which I compare the execution times for a different number of threads starting from 1 to 100, incremented by 10 each iteration.
Basically I create two 100x100 matrices with random numbers ranging from -10.0 to 10.0 in their cells and then multiply them together. I will do that 25 times using a different amount of threads (again incremented by 10 each time: so the first iteration will use 1 thread, second iterations will use 10 threads, third will use 20 threads etc...) and find the average completion time and store that time in a file.
The problem I'm having is that I'm not exactly sure if I'm using the Executors correctly. For example, to me this snippet of code (I've also provided the entire program code below this snippet) is saying that I've created 10 threads and in each thread I will use the .execute method to run my LoopTaskA, which happens to be the multiplication of the matrices. So what I am trying to do is have one multiplication that is split across these 10 threads. Is that what I'm doing here? Or am I multiplying 10 times across 10 threads (i.e. one multiplication per thread)?
The reason I'm asking this is because when I run the entire program, for every increase in thread count I get an increase in the average completion time. Shouldn't the completion time be reduced if I increase the number of threads, since I'm splitting the workload?
According to this other question I found on this same website: maybe not? But I'm still unsure about what I'm doing wrong.
for(int i = 0; i < 25; i++)
{
ExecutorService execService = Executors.newFixedThreadPool(10);
startTime = System.nanoTime();
for(int j = 0; j < 10; j++)
{
execService.execute(new LoopTaskA(m1,m2));
}
import java.util.*;
import java.io.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class MatrixMultiplication {
static double[][] m1 = new double[100][100];
static double[][] m2 = new double[100][100];
public static void main(String[] args){
long startTime;
long endTime;
long completionTime;
ArrayList<Long> myTimes = new ArrayList<Long>();
long addingNumber = 0;
long averageTime;
String filepath = "exe_time.csv";
createMatrix();
/*This for loop will create 1 thread and then use the execute method from execService
to multiply the two 100x100 matrices together. The completionTime is how long it takes
for the whole process to finish. We want to run this thread 25 times and then take the average
of those completion times*/
for(int i = 0; i < 25; i++)
{
ExecutorService execService = Executors.newFixedThreadPool(1);
startTime = System.nanoTime();
execService.execute(new LoopTaskA(m1,m2));
execService.shutdown();
endTime = System.nanoTime();
completionTime = (endTime - startTime);
myTimes.add(completionTime);
System.out.println("The completion time for one iteration is: " + completionTime);
}
/*Takes the completion times that were stored in an arraylist and finds the average*/
for(int i = 0; i < 25; i++)
{
addingNumber = addingNumber + myTimes.remove(0);
}
averageTime = (addingNumber / 25);
System.out.println("The average run time in nanoseconds for 1 thread that ran 25 times is: " + averageTime);
saveRecord(averageTime, filepath);
/*We call createMatrix again here so we start with a fresh new matrix*/
createMatrix();
/*We are doing the same thing as before but now we have 10 threads and not 1*/
for(int i = 0; i < 25; i++)
{
ExecutorService execService = Executors.newFixedThreadPool(10);
startTime = System.nanoTime();
for(int j = 0; j < 10; j++)
{
execService.execute(new LoopTaskA(m1,m2));
}
execService.shutdown();
endTime = System.nanoTime();
completionTime = (endTime - startTime);
myTimes.add(completionTime);
System.out.println("The completion time for one iteration is: " + completionTime);
}
for(int i = 0; i < 25; i++)
{
addingNumber = addingNumber + myTimes.remove(0);
}
averageTime = (addingNumber / 25);
System.out.println("The average run time in nanoseconds for 10 threads that ran 25 times is: " + averageTime);
saveRecord(averageTime, filepath);
createMatrix();
/*We are doing the same thing as before but now we have 20 threads and not 10*/
for(int i = 0; i < 25; i++)
{
ExecutorService execService = Executors.newFixedThreadPool(20);
startTime = System.nanoTime();
for(int j = 0; j < 20; j++)
{
execService.execute(new LoopTaskA(m1,m2));
}
execService.shutdown();
endTime = System.nanoTime();
completionTime = (endTime - startTime);
myTimes.add(completionTime);
System.out.println("The completion time for one iteration is: " + completionTime);
}
for(int i = 0; i < 25; i++)
{
addingNumber = addingNumber + myTimes.remove(0);
}
averageTime = (addingNumber / 25);
System.out.println("The average run time in nanoseconds for 20 threads that ran 25 times is: " + averageTime);
saveRecord(averageTime, filepath);
}
/*Creates the matrix input by taking a random number from the range of
-10 to 10 and then truncates the number to two decimal places*/
public static double matrixInput(){
double max = 10.0;
double min = -10.0;
Random ran = new Random();
double random = min + (max - min) * ran.nextDouble();
double truncatedRan = Math.floor(random*100)/100;
return truncatedRan;
}
/*Places that random number generated in the matrixInput method into a cell of the matrix.
The goal is to create 2 random 100x100 matrices. The first 100x100 matrix is m1. The second is m2.*/
public static void createMatrix(){
for (int row = 0; row < m1.length; row++)
{
for (int col = 0; col < m1[0].length; col++)
{
m1[row][col] = matrixInput();
}
}
for (int row = 0; row < m2.length; row++)
{
for (int col = 0; col < m2[0].length; col++)
{
m2[row][col] = matrixInput();
}
}
}
/*Method that creates a .csv (comma-separated values) file which stores
the average time*/
public static void saveRecord(long averageTime, String filepath)
{
try
{
FileWriter fw = new FileWriter(filepath,true);
BufferedWriter bw = new BufferedWriter(fw);
PrintWriter pw = new PrintWriter(bw);
pw.println(averageTime + ",");
pw.flush();
pw.close();
System.out.println("File has been saved.");
}
catch(Exception E)
{
System.out.println("File has NOT been saved.");
}
}
}
import java.util.*;
public class LoopTaskA implements Runnable{
double[][] m1;
double[][] m2;
@Override
public void run(){
double sum = 0;
/*This is to calculate the resulting matrix.We need to know the number or rows of m1
and the number of columns in m2 (both of which will be 100 since we want a 100x100 matrix)*/
double r[][] = new double [100][100];
/*This multiplies the two 100x100 matrices together. You can think of i here as the row number (which is 100).
The range of j will depend upon the number of columns in the resultant matrix (range of j = 100)
The k value will depend upon the number of columns in the first matrix or the number of rows in
the second matrix, both of these 100*/
for(int i = 0; i < 100; i++)
{
for(int j = 0; j < 100; j++)
{
for(int k = 0; k < 100; k++)
{
sum = sum + m1[i][k] * m2[k][j];
}
r[i][j] = Math.floor(sum*100)/100;
sum = 0; //reset to 0 so you can do the calculation for the next value.
}
}
/* for(int i = 0; i < 100; i++)
{
for(int j = 0; j < 100; j++)
{
System.out.print(r[i][j] + " ");
}
System.out.println();
} */
}
public LoopTaskA(double[][] m1, double[][] m2){
this.m1 = m1;
this.m2 = m2;
}
}
I found only one problem in your code: you should call awaitTermination to block the current thread after shutdown. shutdown does not wait for previously submitted tasks to complete execution.
Shouldn't the completion time be reduced if I increase the number of
threads since i'm splitting the workload?
No, the available hardware resources (for example, the number of processors) are bounded.
Multithreading does not always bring higher performance, you can check this question.
Also, a ThreadPoolExecutor you created by Executors.newFixedThreadPool() is used to address specific problems:
Thread pools address two different problems: they usually provide
improved performance when executing large numbers of asynchronous
tasks, due to reduced per-task invocation overhead, and they provide
a means of bounding and managing the resources, including threads,
consumed when executing a collection of tasks.
So, technically, you are using ExecutorService the right way. But it does not guarantee that you can get higher performance when you increase the number of threads. And
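One point worth spelling out: in the posted program every LoopTaskA computes the full 100x100 product, so 10 tasks mean 10 complete multiplications, not one multiplication split 10 ways. A minimal sketch of actually splitting a single multiplication by bands of rows, combined with the awaitTermination call mentioned above, could look like this. The class name SplitMultiplySketch, the multiply method, the one-minute timeout, and the thread counts in main are all hypothetical choices for illustration.

```java
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class SplitMultiplySketch {
    // Computes a * b once, giving each task a contiguous band of result rows.
    static double[][] multiply(double[][] a, double[][] b, int threads) {
        final int rows = a.length;
        final int inner = b.length;
        final int cols = b[0].length;
        final double[][] result = new double[rows][cols];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int band = (rows + threads - 1) / threads; // rows per task, rounded up
        for (int t = 0; t < threads; t++) {
            final int from = t * band;
            final int to = Math.min(rows, from + band);
            pool.execute(() -> {
                // Each task writes only rows [from, to), so tasks never overlap.
                for (int i = from; i < to; i++) {
                    for (int j = 0; j < cols; j++) {
                        double sum = 0;
                        for (int k = 0; k < inner; k++) {
                            sum += a[i][k] * b[k][j];
                        }
                        result[i][j] = sum;
                    }
                }
            });
        }
        pool.shutdown(); // stop accepting tasks; does not wait for them
        try {
            // Block until the submitted tasks have finished (or the timeout hits).
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("multiplication timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return result;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        double[][] m1 = new double[100][100];
        double[][] m2 = new double[100][100];
        for (int i = 0; i < 100; i++) {
            for (int j = 0; j < 100; j++) {
                m1[i][j] = -10.0 + 20.0 * random.nextDouble();
                m2[i][j] = -10.0 + 20.0 * random.nextDouble();
            }
        }
        for (int threads : new int[] {1, 2, 4, 8}) {
            long start = System.nanoTime();
            multiply(m1, m2, threads);
            long elapsed = System.nanoTime() - start; // includes waiting for the tasks
            System.out.println(threads + " thread(s): " + elapsed + " ns");
        }
    }
}
```

Because the end time is taken only after awaitTermination returns, the measurement covers the work itself and not just task submission. For a 100x100 matrix the cost of creating the pool may still dominate, so more threads will not necessarily be faster.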

Calling size() in a for loop condition, bad efficiency? [duplicate]

I just wanted to know in general, is this code inefficient:
for (int i = 0; i < array.size(); i++) {
//do something
}
as opposed to:
int x = array.size();
for (int i = 0; i < x; i++) {
//do something
}
or is it negligible? (How about in nested for loops?)
Assuming array is an ArrayList, it's of almost no difference since the implementation of size() merely accesses a member field:
public int size() {
return size;
}
The second code just saves the field value in a local variable and re-uses it in the loop instead of accessing the field every time, so that's just a difference between an access to a local variable versus an access to a field (accessing a local variable is slightly faster).
You can test it yourself doing some test like below:
public static void main(String[] args) {
ArrayList<Long> array = new ArrayList<Long>(99999);
int i = 0;
while (i < 99999) {
array.add(1L);
i++;
}
long ini1 = System.currentTimeMillis();
i = 0;
for (int j = 0; j < array.size(); j++) {
i += array.get(j);
}
long end1 = System.currentTimeMillis();
System.out.println("Time1: " + (end1 - ini1));
long ini2 = System.currentTimeMillis();
i = 0;
for (int j = 0; j < 99999; j++) {
i += array.get(j);
}
long end2 = System.currentTimeMillis();
System.out.println("Time2: " + (end2 - ini2));
}
Output:
Time1: 13
Time2: 10
I think the difference is irrelevant in most applications and cases. I ran the test several times and the times vary, but the difference stays "constant", at least in terms of percentage...
Arrays don't have a size, but a length:
for (int i = 0; i < array.length; i++) {
//do something
}
Efficiency is O(1).
Actually, performance is almost the same if the array is not very big.
You can always write it like this:
for (int i = 0, x = array.length; i < x; i++) {
//do something
}

Null/Object and Null/Null comparison efficiency

This question led me to do some testing:
public class Stack
{
public static void main(String[] args)
{
Object obj0 = null;
Object obj1 = new Object();
long start;
long end;
double difference;
double differenceAvg = 0;
for (int j = 0; j < 100; j++)
{
start = System.nanoTime();
for (int i = 0; i < 1000000000; i++)
if (obj0 == null);
end = System.nanoTime();
difference = end - start;
differenceAvg +=difference;
}
System.out.println(differenceAvg/100);
differenceAvg = 0;
for (int j = 0; j < 100; j++)
{
start = System.nanoTime();
for (int i = 0; i < 1000000000; i++)
if (null == obj0);
end = System.nanoTime();
difference = end - start;
differenceAvg +=difference;
}
System.out.println(differenceAvg/100);
differenceAvg = 0;
for (int j = 0; j < 100; j++)
{
start = System.nanoTime();
for (int i = 0; i < 1000000000; i++)
if (obj1 == null);
end = System.nanoTime();
difference = end - start;
differenceAvg +=difference;
}
System.out.println(differenceAvg/100);
differenceAvg = 0;
for (int j = 0; j < 100; j++)
{
start = System.nanoTime();
for (int i = 0; i < 1000000000; i++)
if (null == obj1);
end = System.nanoTime();
difference = end - start;
differenceAvg +=difference;
}
System.out.println(differenceAvg/100);
}
}
Tangential to the other post, it's interesting to note how much faster the comparison is when the Object that we're comparing is initialized. The first two numbers in each output are when the Object was null and the latter two numbers are when the Object was initialized. I ran 21 additional executions of the program, in all 30 executions, the comparison was much faster when the Object was initialized. What's going on here?
If you move last two loops to the beginning you will get the same results, so comparisons are irrelevant.
It's all about JIT compiler warm-up. During the first 2 loops Java starts by interpreting bytecode. After some iterations, it determines that the code path is "hot", so it compiles it to machine code and removes the loops that have no effect, so you are basically measuring System.nanoTime and double arithmetic.
I'm not really sure why two loops are slow. I think that after it finds two hot paths it decides to optimize entire method.
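To separate the warm-up effect from the comparison itself, the measured loop can be moved into its own method, run untimed a number of times first, and made to return a value so it is not trivially dead code. The sketch below assumes hypothetical names (WarmupSketch, countNulls, timeNanos) and hypothetical round counts. The JIT may still simplify such a loop considerably, so treat the numbers as rough.

```java
final class WarmupSketch {
    // Counts how many of n comparisons against null are true. Returning the
    // count gives the loop a result that the caller actually uses.
    static int countNulls(Object candidate, int n) {
        int hits = 0;
        for (int i = 0; i < n; i++) {
            if (candidate == null) {
                hits++;
            }
        }
        return hits;
    }

    // Times one call of countNulls and checks that the result is plausible.
    static long timeNanos(Object candidate, int n) {
        long start = System.nanoTime();
        int hits = countNulls(candidate, n);
        long elapsed = System.nanoTime() - start;
        if (hits != 0 && hits != n) {
            throw new AssertionError("unexpected count: " + hits);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        Object obj0 = null;
        Object obj1 = new Object();
        int n = 10_000_000; // hypothetical count

        // Warm-up: run both cases untimed so the method gets compiled
        // before any measurement is taken.
        for (int i = 0; i < 20; i++) {
            countNulls(obj0, n);
            countNulls(obj1, n);
        }

        // Measurement: alternate the two cases so neither benefits from
        // always running later than the other.
        for (int round = 1; round <= 5; round++) {
            long nullCase = timeNanos(obj0, n);
            long objectCase = timeNanos(obj1, n);
            System.out.println("Round " + round
                    + ": null=" + nullCase + " ns, object=" + objectCase + " ns");
        }
    }
}
```

If the two cases come out close to each other after warm-up, that supports the explanation that the original difference came from the order of the loops and not from the comparison.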

Traversal performance of multidimensional array in Java

In the code and results below, we can see that "Traverse2" is much faster than "Traverse1", even though they traverse the same number of elements.
1. How does this difference arise?
2. Does putting the longer iteration inside the shorter iteration give better performance?
public class TraverseTest {
public static void main(String[] args)
{
int a[][] = new int[100][10];
System.out.println(System.currentTimeMillis());
//Traverse1
for(int i = 0; i < 100; i++)
{
for(int j = 0; j < 10; j++)
a[i][j] = 1;
}
System.out.println(System.currentTimeMillis());
//Traverse2
for(int i = 0; i < 10; i++)
{
for(int j = 0; j < 100; j++)
a[j][i] = 2;
}
System.out.println(System.currentTimeMillis());
}
}
Result:
1347116569345
1347116569360
1347116569360
If I change it to
System.out.println(System.nanoTime());
The result will be:
4888285195629
4888285846760
4888285914219
It means that putting the longer iteration inside gives better performance. And that seems to conflict with cache-hit theory.
I suspect that any strangeness in the results you are seeing in this micro-benchmark are due to flaws in the benchmark itself.
For example:
Your benchmark does not take account of "JVM warmup" effects, such as the fact that the JIT compiler does not compile to native code immediately. (This only happens after the code has executed for a bit, and the JVM has measured some usage numbers to aid optimization.) The correct way to deal with this is to put the whole lot inside a loop that runs a few times, and discard any initial sets of times that look "odd" ... due to warmup effects.
The loops in your benchmark could in theory be optimized away. The JIT compiler might be able to deduce that they don't do any work that affects the program's output.
Finally, I'd just like to remind you that hand-optimizing like this is usually a bad idea ... unless you've got convincing evidence that it is worth your while hand-optimizing AND that this code is really where the application is spending significant time.
First, always run microbenchmark tests several times in a loop. Then you'll see both times are 0, as the array sizes are too small. To get non-zero times, increase the array sizes by a factor of 100. My times are roughly 32 ms for Traverse1 and 250 ms for Traverse2.
The difference arises because the processor uses cache memory. Access to sequential memory addresses is much faster.
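A minimal sketch of such a repeated measurement follows. The class name TraversalSketch, the method names, the 2000x2000 size, and the round count are hypothetical; each method returns a sum so that the loops have a visible result.

```java
final class TraversalSketch {
    // Writes 1 into every cell, walking each row array from start to end,
    // which touches consecutive memory within a row.
    static long rowMajor(int[][] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                a[i][j] = 1;
                sum += a[i][j];
            }
        }
        return sum;
    }

    // Writes 2 into every cell, moving to a different row array on every
    // step. Assumes a rectangular array (all rows have the same length).
    static long columnMajor(int[][] a) {
        long sum = 0;
        for (int j = 0; j < a[0].length; j++) {
            for (int i = 0; i < a.length; i++) {
                a[i][j] = 2;
                sum += a[i][j];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[][] a = new int[2000][2000]; // hypothetical size, large enough to time
        for (int round = 1; round <= 10; round++) {
            long t0 = System.nanoTime();
            long s1 = rowMajor(a);
            long t1 = System.nanoTime();
            long s2 = columnMajor(a);
            long t2 = System.nanoTime();
            // The early rounds include JIT warm-up and are best ignored.
            System.out.println("Round " + round
                    + ": row-major=" + (t1 - t0) / 1_000 + " us"
                    + ", column-major=" + (t2 - t1) / 1_000 + " us"
                    + " (sums " + s1 + ", " + s2 + ")");
        }
    }
}
```

Comparing the later rounds gives a fairer picture than a single pass, since by then both methods have been compiled and the array has already been touched once.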
My output (with your original code, 100i/10j vs 10i/100j):
1347118083906
1347118083906
1347118083906
You are using a very bad time resolution for a very quick calculation.
I changed both the i and j limits to 1000.
int a[][] = new int[1000][1000];
System.out.println(System.currentTimeMillis());
//Traverse1
for(int i = 0; i < 1000; i++)
{
for(int j = 0; j < 1000; j++)
a[i][j] = 1;
}
System.out.println(System.currentTimeMillis());
//Traverse2
for(int i = 0; i < 1000; i++)
{
for(int j = 0; j < 1000; j++)
a[j][i] = 2;
}
System.out.println(System.currentTimeMillis());
output:
1347118210671
1347118210687 //difference is 16 ms
1347118210703 //difference is 16 ms again -_-
Two possibilities:
Java HotSpot changes the second loop into the first type, or optimizes
by exchanging i and j.
The time resolution is still not enough.
So I changed the output to System.nanoTime():
int a[][] = new int[1000][1000];
System.out.println(System.nanoTime());
//Traverse1
for(int i = 0; i < 1000; i++)
{
for(int j = 0; j < 1000; j++)
a[i][j] = 1;
}
System.out.println(System.nanoTime());
//Traverse2
for(int i = 0; i < 1000; i++)
{
for(int j = 0; j < 1000; j++)
a[j][i] = 2;
}
System.out.println(System.nanoTime());
Output:
16151040043078
16151047859993 //difference is 7800000 nanoseconds
16151061346623 //difference is 13500000 nanoseconds --->this is half speed
1. How does this difference arise?
Note that, even setting aside that you used the wrong time resolution, you are comparing unequal cases. The first is contiguous access while the second is not.
Let's say the first nested loops are just a warm-up for the second one; then your assumption that "the second is much faster" would be even more wrong.
Don't forget that a 2D array is an "array of arrays" in Java. So the right-most index walks a contiguous area, which makes the first version faster.
2. Does putting the longer iteration inside the shorter iteration give better performance?
for(int i = 0; i < 10; i++)
{
for(int j = 0; j < 100; j++)
a[j][i] = 2;
}
Increasing the first index is slower because the next iteration lands kilobytes away, so you cannot reuse your cache line.
Absolutely not!
In my view, the size of the array also affects the result. For example:
public class TraverseTest {
public static void main(String[] args)
{
int a[][] = new int[10000][2];
System.out.println(System.currentTimeMillis());
//Traverse1
for(int i = 0; i < 10000; i++)
{
for(int j = 0; j < 2; j++)
a[i][j] = 1;
}
System.out.println(System.currentTimeMillis());
//Traverse2
for(int i = 0; i < 2; i++)
{
for(int j = 0; j < 10000; j++)
a[j][i] = 2;
}
System.out.println(System.currentTimeMillis());
}
}
Counting every evaluation of a loop condition, Traverse1 needs 10001 + 10000*3 = 40001 comparisons to decide whether to exit the iteration,
whereas Traverse2 only needs 3 + 2*10001 = 20005 comparisons.
Traverse1 needs about 2 times the number of comparisons of Traverse2.
