Greetings noble community,
I want to have the following loop:
for(i = 0; i < MAX; i++)
A[i] = B[i] + C[i];
This will run in parallel on a shared-memory quad-core computer using threads. The two alternatives below are being considered for the code to be executed by these threads, where tid is the id of the thread: 0, 1, 2 or 3.
(for simplicity, assume MAX is a multiple of 4)
Option 1:
for(i = tid; i < MAX; i += 4)
A[i] = B[i] + C[i];
Option 2:
for(i = tid*(MAX/4); i < (tid+1)*(MAX/4); i++)
A[i] = B[i] + C[i];
My question is whether one of them is more efficient than the other, and why?
The second one is better than the first. Simple answer: the second one minimizes false sharing.
A modern CPU does not load bytes into the cache one at a time. It reads a whole batch at once, called a cache line. When two threads modify different variables that sit on the same cache line, each write invalidates the other core's copy of that line, so the other core must reload it before it can continue.
When would this happen?
Basically, elements that are close together in memory end up in the same cache line. Neighbouring array elements share a cache line because an array is just one contiguous chunk of memory. Likewise, foo1 and foo2 below might share a cache line, since they are declared next to each other in the same class.
class Foo {
private int foo1;
private int foo2;
}
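To relate this back to the loop in the question, here is a rough Java sketch of the two partitionings. PartitionSketch and its method names are made up for illustration (the question's code looks like C), and the sketch only shows which indices each thread writes; it does not measure anything.

```java
final class PartitionSketch {
    static final int THREADS = 4;

    // Option 1: thread tid handles i = tid, tid+4, tid+8, ... (interleaved),
    // so neighbouring elements of `a` are written by different threads.
    static void interleaved(int[] a, int[] b, int[] c, int tid) {
        for (int i = tid; i < a.length; i += THREADS) {
            a[i] = b[i] + c[i];
        }
    }

    // Option 2: thread tid handles one contiguous quarter of the array (blocked),
    // so threads only meet at the three chunk boundaries.
    static void blocked(int[] a, int[] b, int[] c, int tid) {
        int chunk = a.length / THREADS; // assumes length is a multiple of 4
        for (int i = tid * chunk; i < (tid + 1) * chunk; i++) {
            a[i] = b[i] + c[i];
        }
    }

    // Runs one of the two options on four threads and returns the filled array.
    static int[] run(int[] b, int[] c, boolean useBlocked) {
        int[] a = new int[b.length];
        Thread[] workers = new Thread[THREADS];
        for (int t = 0; t < THREADS; t++) {
            final int tid = t;
            workers[t] = new Thread(() -> {
                if (useBlocked) {
                    blocked(a, b, c, tid);
                } else {
                    interleaved(a, b, c, tid);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join(); // join makes the workers' writes visible to the caller
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return a;
    }
}
```

Both variants produce the same array; the difference is only in how often two threads write to the same cache line of a.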
How bad is false sharing?
I'll refer to Example 6 from the Gallery of Processor Cache Effects:
private static int[] s_counter = new int[1024];
private void UpdateCounter(int position)
{
for (int j = 0; j < 100000000; j++)
{
s_counter[position] = s_counter[position] + 3;
}
}
On my quad-core machine, if I call UpdateCounter with parameters 0,1,2,3 from four different threads, it will take 4.3 seconds until all threads are done.
On the other hand, if I call UpdateCounter with parameters 16,32,48,64 the operation will be done in 0.28 seconds!
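If you want to try something similar on the JVM, the following is a loose Java port of that experiment. FalseSharingSketch and run are hypothetical names, and I'd expect the numbers to differ from the ones quoted above: the JIT is free to keep the counter in a register inside the loop, which can hide the effect entirely.

```java
import java.util.Arrays;

final class FalseSharingSketch {
    // Shared array, as in the quoted example; each thread updates one slot.
    static final int[] counters = new int[1024];

    // Starts one thread per position; each adds 3 to its slot `iterations` times.
    // Returns the elapsed wall-clock time in nanoseconds.
    static long run(int[] positions, int iterations) {
        Arrays.fill(counters, 0);
        Thread[] workers = new Thread[positions.length];
        long start = System.nanoTime();
        for (int t = 0; t < positions.length; t++) {
            final int position = positions[t];
            workers[t] = new Thread(() -> {
                for (int j = 0; j < iterations; j++) {
                    counters[position] = counters[position] + 3;
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return System.nanoTime() - start;
    }
}
```

Calling run(new int[] {0, 1, 2, 3}, n) and run(new int[] {16, 32, 48, 64}, n) with a large n and comparing the two return values mirrors the two measurements above.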
How to detect false sharing?
Linux perf can be used to detect cache misses and therefore help you analyse such problems.
Referring to the analysis in CPU Cache Effects and Linux Perf, here is perf reporting L1 cache misses for almost the same code as the example above:
Performance counter stats for './cache_line_test 0 1 2 3':
10,055,747 L1-dcache-load-misses # 1.54% of all L1-dcache hits [51.24%]
Performance counter stats for './cache_line_test 16 32 48 64':
36,992 L1-dcache-load-misses # 0.01% of all L1-dcache hits [50.51%]
This shows that the number of L1 data-cache load misses drops from 10,055,747 to 36,992 without false sharing. The performance cost is not in the L1 miss itself; it comes from the chain of accesses that follows each miss: going to the L2 cache, then the L3 cache, then main memory.
Is there some good practice in industry?
LMAX Disruptor is a high-performance inter-thread messaging library, and it's the default messaging system for intra-worker communication in Apache Storm.
The underlying data structure is a simple ring buffer, but to make it fast it uses a lot of tricks to reduce false sharing.
For example, it defines the superclass RingBufferPad to create padding between elements in RingBuffer:
abstract class RingBufferPad
{
protected long p1, p2, p3, p4, p5, p6, p7;
}
Also, when it allocates memory for the buffer, it adds padding at both the front and the tail so that the buffer won't be affected by data in adjacent memory:
this.entries = new Object[sequencer.getBufferSize() + 2 * BUFFER_PAD];
You probably want to learn more about all the magic tricks. Take a look at one of the author's posts: Dissecting the Disruptor: Why it's so fast
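As a rough illustration of the padding idea (not the Disruptor's actual code), a per-thread counter could be padded on both sides like this. All class names here are made up, and the trick assumes the JVM lays out superclass fields before subclass fields and does not strip the unused longs; neither is guaranteed by the language.

```java
// Seven longs (56 bytes) in front of the hot field.
class LeftPad {
    protected long p1, p2, p3, p4, p5, p6, p7;
}

// The only field that is actually read and written.
class CounterValue extends LeftPad {
    protected volatile long value;
}

// Seven more longs behind the hot field.
class RightPad extends CounterValue {
    protected long p9, p10, p11, p12, p13, p14, p15;
}

// Intended for one writer thread per counter: the read-modify-write in add()
// is not atomic, so two writers on the same counter could lose updates.
final class PaddedCounter extends RightPad {
    void add(long delta) {
        value = value + delta;
    }

    long get() {
        return value;
    }
}
```

With one PaddedCounter per thread, two counters should not end up on the same 64-byte cache line, provided the layout assumption above holds.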
There are two different reasons why you should prefer option 2 over option 1. One of these is cache locality / cache contention, as explained in @qqibrow's answer; I won't explain that here as there's already a good answer explaining it.
The other reason is vectorisation. Most high-end modern processors have vector units which are capable of running the same instruction simultaneously on multiple different data (in particular, if the processor has multiple cores, it almost certainly has a vector unit, maybe even multiple vector units, on each core). For example, without the vector unit, the processor has an instruction to do an addition:
A = B + C;
and the corresponding instruction in the vector unit will do multiple additions at the same time:
A1 = B1 + C1;
A2 = B2 + C2;
A3 = B3 + C3;
A4 = B4 + C4;
(The exact number of additions will vary by processor model; on ints, common "vector widths" include 4 and 8 simultaneous additions, and some recent processors can do 16.)
Your for loop looks like an obvious candidate for using the vector unit; as long as none of A, B, and C are pointers into the same array but with different offsets (which is possible in C++ but not Java), the compiler would be allowed to optimise option 2 into
for(i = tid*(MAX/4); i < (tid+1)*(MAX/4); i+=4) {
A[i+0] = B[i+0] + C[i+0];
A[i+1] = B[i+1] + C[i+1];
A[i+2] = B[i+2] + C[i+2];
A[i+3] = B[i+3] + C[i+3];
}
However, one limitation of the vector unit is related to memory accesses: vector units are only fast at accessing memory when they're accessing adjacent locations (such as adjacent elements in an array, or adjacent fields of a C struct). The option 2 code above is pretty much the best case for vectorisation of the code: the vector unit can access all the elements it needs from each array as a single block. If you tried to vectorise the option 1 code, the vector unit would take so long trying to find all the values it's working on in memory that the gains from vectorisation would be negated; it would be unlikely to run any faster than the non-vectorised code, because the memory access wouldn't be any faster, and the addition takes no time by comparison (because the processor can do the addition while it's waiting for the values to arrive from memory).
It isn't guaranteed that a compiler will be able to make use of the vector unit, but it would be much more likely to do so with option 2 than option 1. So you might find that option 2's advantage over option 1 is a factor of 4/8/16 more than you'd expect if you only took cache effects into account.
Hi guys, I'm trying to make a load generator, and my goal is to compare how much of my system's resources are consumed when spawning Erlang processes as compared to spawning threads (Java). I am doing this by having the program count to 1000000000 10 times. Java takes roughly 35 seconds to finish the whole process with 10 threads created; Erlang takes ages with 10 processes, and I grew impatient with it because it spent over 4 minutes counting. If I just make Erlang and Java count to 1000000000 without spawning threads/processes, Erlang takes 1 minute and 32 seconds and Java takes a good 3 or so seconds. I know Erlang is not made for crunching numbers, but that much of a difference is alarming. Why is there such a big difference? Both use my CPU to 100%, but there is no spike in RAM. I am not sure what other methods can be used to make this comparison, so I am open to any suggestions as well.
here is the code for both versions
-module(loop).
-compile(export_all).
start(NumberOfProcesses) ->
loop(0, NumberOfProcesses).
%%Processes to spawn
loop(A, NumberOfProcesses) ->
if A < NumberOfProcesses ->
spawn(loop, outerCount, [0]),
loop(A+1, NumberOfProcesses);
true -> ok
end.
%%outer loop
outerCount(A) ->
if A < 10 ->
innerCount(0),
outerCount(A + 1);
true -> ok
end.
%%inner loop
innerCount(A) ->
if A < 1000000000 ->
innerCount(A+1);
true -> ok
end.
and the Java version:
import java.util.Scanner;
class Loop implements Runnable
{
public static void main(String[] args)
{
System.out.println("Input number of processes");
Scanner scan = new Scanner(System.in);
String theNumber = scan.nextLine();
for (int t = 0; t < Integer.parseInt(theNumber); t++)
{
new Thread(new Loop()).start();
}
}
public void run()
{
int i;
for (i = 0; i < 10; i++)
{
for (int j = 0; j < 1000000000; j++);
}
}
}
Are you running a 32- or 64-bit version of Erlang? If it's 32 bit, then the inner loop limit 1000000000 won't fit in a single-word fixnum (max 28 bits incl. sign), and the loop will start to do bignum arithmetic on the heap which is way way more expensive than just incrementing a word and looping (it will also cause garbage collection to happen now and then, to get rid of old unused numbers from the heap). Changing the outer loop from 10 to 1000 and removing 2 zeros correspondingly from the inner loop should make it use fixnum arithmetic only even on a 32-bit BEAM.
Then, it's also a question of whether the Java version is actually doing any work at all, or if the loop gets optimized away to a no-op at some point. (The Erlang compiler doesn't do that sort of trick - at least not yet.)
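One way to make it less likely that the Java loop is removed is to have it produce a value that is actually used. A minimal sketch (CountingSketch is a hypothetical name, and a JIT may still simplify even this, so treat any timings with suspicion):

```java
final class CountingSketch {
    // Counts to `limit` `rounds` times and returns a total that depends on
    // every iteration, so the loops cannot simply be dropped as dead code.
    static long count(int rounds, int limit) {
        long total = 0;
        for (int i = 0; i < rounds; i++) {
            for (int j = 0; j < limit; j++) {
                total += j & 1; // adds 1 for every odd j
            }
        }
        return total;
    }
}
```

Printing or otherwise using the returned total in the calling thread keeps the work observable.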
RichardC's answer gives some clues to understand the difference in execution time. I will also add that if your Java code is compiled, it may benefit a lot from the microprocessor's branch prediction, and thus make better use of the cache memories.
But the more important point, in my opinion, is that you are not choosing the right ratio of processes to processing to evaluate the cost of process spawning.
The test uses 10 processes that each do some significant work. I would have chosen a test where many processes are spawned (some thousands? I don't know how many threads the JVM can manage), each process doing very little. For example, the code below spawns twice as many processes at each step and waits for the deepest processes to send back the message done. With a depth of 17, which means 262143 processes in total and 131072 returned messages, it takes less than 0.5 s on my very slow PC, that is less than 2 µs per process (of course, the dual-core, dual-thread CPU should be used).
-module (cascade).
-compile([export_all]).
test() ->
timer:tc(?MODULE,start,[]).
start() ->
spawn(?MODULE,child,[self(),17]),
loop(1024*128).
loop(0) -> done;
loop(N) ->
receive
done -> loop(N-1)
end.
child(P,0) -> P ! done;
child(P,N) ->
spawn(?MODULE,child,[P,N-1]),
spawn(?MODULE,child,[P,N-1]).
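For the Java side of the comparison, a counterpart using one plain thread per Erlang process might look roughly like the sketch below. ThreadCascadeSketch is a made-up name, and I have not tried it at depth 17; a quarter of a million OS threads may well exhaust the machine, so start with a small depth.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

final class ThreadCascadeSketch {
    // Starts one thread running child(...) and counts it.
    private static void spawn(CountDownLatch leaves, AtomicInteger started, int depth) {
        started.incrementAndGet();
        new Thread(() -> child(leaves, started, depth)).start();
    }

    // Mirrors child/2: at depth 0 report "done", otherwise start two children.
    private static void child(CountDownLatch leaves, AtomicInteger started, int depth) {
        if (depth == 0) {
            leaves.countDown();
            return;
        }
        spawn(leaves, started, depth - 1);
        spawn(leaves, started, depth - 1);
    }

    // Runs the cascade, waits for all 2^depth leaves, and returns the
    // number of threads that were started in total.
    static int run(int depth) {
        CountDownLatch leaves = new CountDownLatch(1 << depth);
        AtomicInteger started = new AtomicInteger();
        spawn(leaves, started, depth);
        try {
            leaves.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return started.get();
    }
}
```

Wrapping a call to run(depth) in two System.nanoTime() readings and dividing by the returned thread count gives a per-thread figure to set against the per-process figure above.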
There are a few problems here.
I don't know how you can evaluate what the Java compiler is doing, but I'd wager it's optimizing the loop out of existence. I think you'd have to have the loop do something meaningful to make any sort of comparison.
More importantly, the Erlang code is not doing what you think it's doing, as best as I can tell. It appears that each process is counting up to 1000000000, and then doing it again for a total of 10 times.
Perhaps worse, your functions are not tail recursive, so your functions keep accumulating in memory waiting for the last one to execute. (Edit: I may be wrong about that. Unaccustomed to the if statement.)
Here's Erlang that does what you want it to do. It's still very slow.
-module(realloop).
-compile(export_all).
start(N) ->
loop(0, N).
loop(N, N) ->
io:format("Spawned ~B processes~n", [N]);
loop(A, N) ->
spawn(realloop, count, [0, 1000000000]),
loop(A+1, N).
count(Upper, Upper) ->
io:format("Reached ~B~n", [Upper]);
count(Lower, Upper) ->
count(Lower+1, Upper).
I am writing code that does some calculations on array values and stores the result back into the array. Demo code is as follows:
public class Test {
private int[] x = new int[100000000];
/**
* @param args
* @throws Exception
*/
public static void main(String[] args) throws Exception {
Test t = new Test();
long start = System.nanoTime();
for(int i=0;i<100000000;i++) {
t.testing(i);
}
System.out.println("time = " + (System.nanoTime() - start)/1000);
}
public void testing(int a) throws Exception {
int b=1,c=0;
if(b<c || b < 1) {
throw new Exception("Invalid inputs");
}
int d= a>>b;
int e = a & 0x0f;
int f = x[d];
int g = x[e];
x[d] = f | g;
}
}
The main logic of the program lies in
int d= a>>b;
int e = a & 0x0f;
x[d] = f | g;
When I tested this code, it took 110 ms.
But if, instead of assigning the result back to x[d], I assign it to a variable as in
int h = f | g;
it took only 3 ms.
I want to assign the result back to the array, but doing so hurts performance by a big margin.
This is a time-critical program.
So I want to know if there's any alternative to arrays in Java, or any other way I can avoid this slowdown.
I tested this code under the default Sun JVM configuration.
P.S. I tried the Unsafe API, but it isn't helping.
What you want to beware of is the JVM optimising the code to nothing because it isn't doing anything useful.
In your case you are performing 100 million calls in 110 ms, or about 1.1 nanoseconds per call. Given that a single L1 cache access takes about 4 clock cycles, this is pretty fast. In your test where you got 100 million calls in 3 ms, this suggests it is taking 0.03 nanoseconds per call, or about 1/10th of a clock cycle. To me this doesn't sound likely, and I would expect that if you doubled the length of the loop it would still take 3 ms, i.e. you are timing how long it takes to detect and eliminate the code.
A basic problem you have is that you have an array which is 400 MB in size. This will not fit in L1, L2 or L3 cache. Instead it could be going to main memory and this typically takes 200 clock cycles. The best option is to reduce the size of your array so it at least fits in your L3 cache. How big is your L3 cache? If it is say 24 MB, try reducing the array to just 16 MB and you should see a performance improvement.
There are a number of things that could be happening. First of all, try running each version of your program multiple times consecutively and averaging the results. Secondly, assigning to an array element in Java involves a bounds check (throwing ArrayIndexOutOfBoundsException when necessary). This is naturally going to be a bit slower than a variable assignment. If you have a really time-sensitive piece of code, consider using JNI for the numerical operations: http://docs.oracle.com/javase/6/docs/technotes/guides/jni/. This will often make your array logic faster.
That's because h is a local variable and is allocated on the stack, whereas the array is stored in the main memory, which is way slower to write to.
Also note that, if this is really a high-performance application, you should put your main logic inside the for loop and avoid the overhead of calling a method. The instructions could be inlined for you, but you should not rely on it.
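Pulling the suggestions in these answers together, a variant of the test could keep the logic inside the loop, use an array small enough for the cache, and return a checksum so the stores cannot be thrown away. This is only a sketch: ArrayOrSketch and its parameters are my own names, not from the question.

```java
final class ArrayOrSketch {
    // Runs the question's logic inline over an array of `size` ints and
    // returns a checksum of the array, so the stores have a visible effect.
    static long run(int size, int iterations) {
        if (size < 16 || ((iterations - 1) >> 1) >= size) {
            throw new IllegalArgumentException("array too small for the indices used");
        }
        int[] x = new int[size];
        for (int i = 0; i < size; i++) {
            x[i] = i; // non-zero contents so the OR actually changes something
        }
        final int b = 1;
        for (int a = 0; a < iterations; a++) {
            int d = a >> b;
            int e = a & 0x0f;
            x[d] = x[d] | x[e];
        }
        long checksum = 0;
        for (int v : x) {
            checksum += v;
        }
        return checksum;
    }
}
```

Timing run(...) for a size that fits in L3 and again for the original 100000000 elements, and printing the checksum each time, should give numbers that are easier to interpret than the 3 ms figure.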
I'm writing a program in Java
In this program I'm reading and changing an array of data. This is an example of the code:
public double computation() {
char c = 0;
char target = 'a';
int x = 0, y = 1;
for (int i = 0; i < data.length; i++) {
// Read Data
c = data[index[i]];
if (c == target)
x++;
else
y++;
//Change Value
if (Character.isUpperCase(c))
c = Character.toLowerCase(c);
else
c = Character.toUpperCase(c);
//Write Data
data[index[i]] = c;
}
return (double) x / (double) y;
}
BTW, the INDEX array contains DATA array's indexes in random order to prevent prefetching. I'm forcing all of my cache accesses to be missed by using random indexes in INDEX array.
Now I want to check what is the behavior of the CPU cache by collecting information about its hit ratio.
Is there any developed tool for this purpose? If not is there any technique?
On Linux it is possible to collect such information via OProfile. Each CPU has performance event counters. See here for the list of the AMD K15 family events: http://oprofile.sourceforge.net/docs/amd-family15h-events.php
OProfile regularly samples the event counter(s) together with the program counter. After a program run you can analyze how many events happened and (statistically) at which program positions.
OProfile has built-in Java support. It interacts with the Java JIT and creates a synthetic symbol table to look up the Java method name for a piece of generated JIT code.
The initial setup is not entirely easy. If you're interested, I can guide you through it or write a little more about it.
I don't think you can reach such low level information from Java but someone might know better. You could write the same program with no cache misses and check the difference. This is what I suggested in this other post for example.
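To make that comparison concrete, here is a sketch that walks the same data once in order and once through a shuffled index. AccessOrderSketch and its methods are hypothetical names; the timing itself is left to the caller, and wall-clock time is only an indirect hint about cache behaviour, not a hit ratio.

```java
import java.util.Random;

final class AccessOrderSketch {
    // Returns the indices 0..n-1 in order (cache-friendly traversal).
    static int[] sequential(int n) {
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) {
            idx[i] = i;
        }
        return idx;
    }

    // Returns the indices 0..n-1 in random order (Fisher-Yates shuffle),
    // using a fixed seed so runs are repeatable.
    static int[] shuffled(int n, long seed) {
        int[] idx = sequential(n);
        Random rnd = new Random(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = idx[i];
            idx[i] = idx[j];
            idx[j] = tmp;
        }
        return idx;
    }

    // Reads data through the index array and returns the sum of the values,
    // so the reads cannot be optimised away.
    static long traverse(char[] data, int[] index) {
        long sum = 0;
        for (int i = 0; i < index.length; i++) {
            sum += data[index[i]];
        }
        return sum;
    }
}
```

Both traversals return the same sum, so any difference in elapsed time between them comes from the access order alone.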
In "Core java 1" I've read
CAUTION: An ArrayList<Integer> is far less efficient than an int[] array because each value is separately wrapped inside an object. You would only want to use this construct for small collections when programmer convenience is more important than efficiency.
But in my software I've already used ArrayList instead of plain arrays due to some requirements, even though the software is supposed to have high performance, and after reading the quoted text I started to panic! One thing I can change is to turn the double variables into Double so as to prevent autoboxing, but I don't know whether that is worth it or not, for example in the following sample algorithm:
public void multiply(final double val)
{
final int rows = getSize1();
final int cols = getSize2();
for (int i = 0; i < rows; i++)
{
for (int j = 0; j < cols; j++)
{
this.get(i).set(j, this.get(i).get(j) * val);
}
}
}
My question is: does changing double to Double make a difference, or is that a micro-optimization that won't affect anything? Keep in mind I might be using large matrices. Second, should I consider redesigning the whole program?
The big issue with double versus Double is that the latter adds some amount of memory overhead -- 8 bytes per object on a Sun 32-bit JVM, possibly more or less on others. Then you need another 4 bytes (8 on a 64-bit JVM) to refer to the object.
So, assuming that you have 1,000,000 objects, the differences are as follows:
double[1000000]
8 bytes per entry; total = 8,000,000 bytes
Double[1000000]
16 bytes per object instance + 4 bytes per reference; total = 20,000,000 bytes
Whether or not this matters depends very much on your application. Unless you find yourself running out of memory, assume that it doesn't matter.
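The arithmetic above can be written out as a small helper, with the per-object and per-reference sizes passed in, since they vary between JVMs. BoxingFootprintSketch is just an illustrative name; the 16 and 4 used below are the figures from this answer, not measured values.

```java
final class BoxingFootprintSketch {
    // A double[] stores 8 bytes per entry (array header ignored).
    static long primitiveBytes(long entries) {
        return entries * 8;
    }

    // A Double[] stores one reference per entry plus one object per entry.
    static long boxedBytes(long entries, int objectBytes, int referenceBytes) {
        return entries * (objectBytes + referenceBytes);
    }
}
```

For 1,000,000 entries, primitiveBytes gives 8,000,000 and boxedBytes with 16 and 4 gives 20,000,000, matching the totals listed above.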
It changes the place where autoboxing happens, but nothing else.
And 2nd - no, don't worry about this. It is unlikely to be a bottleneck. You can make some benchmarks to measure it for the size of your data, to prove that the difference is insignificant in regard to your application performance.
Double is dramatically more expensive than double, however in 90% of cases it doesn't matter.
If you wanted an efficient matrix class, I would suggest you use one of the libraries which already do this efficiently. e.g. Jama.
Changing the double argument into Double won't help much, it will worsen performance slightly because it needs to be unboxed for the multiplication.
What will help is preventing multiple calls to get() as in:
for (int i = 0; i < rows; i++)
{
List<Double> row = this.get(i);
for (int j = 0; j < cols; j++)
{
row.set(j, row.get(j) * val);
}
}
(btw, I guessed the type for row.)
Assuming that you use a list of lists, using iterators instead of getting and setting via loop indices will win some more performance.
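A sketch of the iterator version, assuming the matrix really is a List<List<Double>> (the question doesn't show its type, and MatrixScaleSketch is a made-up name):

```java
import java.util.List;
import java.util.ListIterator;

final class MatrixScaleSketch {
    // Multiplies every element in place. ListIterator.set replaces the
    // element most recently returned by next(), so no index lookups are needed.
    static void multiply(List<List<Double>> rows, double val) {
        for (List<Double> row : rows) {
            ListIterator<Double> it = row.listIterator();
            while (it.hasNext()) {
                it.set(it.next() * val);
            }
        }
    }
}
```

Each element is still unboxed for the multiplication and boxed again on set, so this saves the index-based lookups, not the boxing.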
I am trying to make a Java port of a simple feed-forward neural network.
This obviously involves lots of numeric calculations, so I am trying to optimize my central loop as much as possible. The results should be correct within the limits of the float data type.
My current code looks as follows (error handling & initialization removed):
/**
* Simple implementation of a feedforward neural network. The network supports
* including a bias neuron with a constant output of 1.0 and weighted synapses
* to hidden and output layers.
*
* @author Martin Wiboe
*/
public class FeedForwardNetwork {
private final int outputNeurons; // No of neurons in output layer
private final int inputNeurons; // No of neurons in input layer
private int largestLayerNeurons; // No of neurons in largest layer
private final int numberLayers; // No of layers
private final int[] neuronCounts; // Neuron count in each layer, 0 is input
// layer.
private final float[][][] fWeights; // Weights between neurons.
// fWeight[fromLayer][fromNeuron][toNeuron]
// is the weight from fromNeuron in
// fromLayer to toNeuron in layer
// fromLayer+1.
private float[][] neuronOutput; // Temporary storage of output from previous layer
public float[] compute(float[] input) {
// Copy input values to input layer output
for (int i = 0; i < inputNeurons; i++) {
neuronOutput[0][i] = input[i];
}
// Loop through layers
for (int layer = 1; layer < numberLayers; layer++) {
// Loop over neurons in the layer and determine weighted input sum
for (int neuron = 0; neuron < neuronCounts[layer]; neuron++) {
// Bias neuron is the last neuron in the previous layer
int biasNeuron = neuronCounts[layer - 1];
// Get weighted input from bias neuron - output is always 1.0
float activation = 1.0F * fWeights[layer - 1][biasNeuron][neuron];
// Get weighted inputs from rest of neurons in previous layer
for (int inputNeuron = 0; inputNeuron < biasNeuron; inputNeuron++) {
activation += neuronOutput[layer-1][inputNeuron] * fWeights[layer - 1][inputNeuron][neuron];
}
// Store neuron output for next round of computation
neuronOutput[layer][neuron] = sigmoid(activation);
}
}
// Return output from network = output from last layer
float[] result = new float[outputNeurons];
for (int i = 0; i < outputNeurons; i++)
result[i] = neuronOutput[numberLayers - 1][i];
return result;
}
private final static float sigmoid(final float input) {
return (float) (1.0F / (1.0F + Math.exp(-1.0F * input)));
}
}
I am running the JVM with the -server option, and as of now my code is between 25% and 50% slower than similar C code. What can I do to improve this situation?
Thank you,
Martin Wiboe
Edit #1: After seeing the vast amount of responses, I should probably clarify the numbers in our scenario. During a typical run, the method will be called about 50,000 times with different inputs. A typical network would have numberLayers = 3 layers with 190, 2 and 1 neurons, respectively. The innermost loop will therefore have about 2*191+3=385 iterations (when counting the added bias neuron in layers 0 and 1).
Edit #2: After implementing the various suggestions in this thread, our implementation is practically as fast as the C version (within ~2 %). Thanks for all the help! All of the suggestions have been helpful, but since I can only mark one answer as the correct one, I will give it to @Durandal for both suggesting array optimizations and being the only one to precalculate the for loop header.
Some tips.
in your innermost loop, think about how you are traversing your CPU cache and rearrange your matrix so you are accessing the outermost array sequentially. This will result in you accessing your cache in order rather than jumping all over the place. A cache hit can be two orders of magnitude faster than a cache miss.
e.g restructure fWeights so it is accessed as
activation += neuronOutput[layer-1][inputNeuron] * fWeights[layer - 1][neuron][inputNeuron];
don't perform work inside the loop (every time) which can be done outside the loop (once). Don't perform the [layer -1] lookup every time when you can place this in a local variable. Your IDE should be able to refactor this easily.
multi-dimensional arrays in Java are not as efficient as they are in C. They are actually multiple layers of single dimensional arrays. You can restructure the code so you're only using a single dimensional array.
don't return a new array when you can pass the result array as an argument. (Saves creating a new object on each call).
rather than performing layer-1 all over the place, why not use layer1 as layer-1 and use layer1+1 instead of layer?
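Several of these tips can be combined in one sketch: the weights of a single layer flattened into one array, stored so the inner loop reads them sequentially, with the result written into a caller-supplied array. The names are mine, and the bias term and sigmoid from the question are left out to keep it short.

```java
final class FlatWeightsSketch {
    // flatWeights holds one row per target neuron, rows stored back to back:
    // the weight from input i to neuron n is flatWeights[n * inputs.length + i].
    static float weightedSum(float[] flatWeights, float[] inputs, int toNeuron) {
        int base = toNeuron * inputs.length; // computed once, outside the loop
        float sum = 0f;
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * flatWeights[base + i]; // sequential reads
        }
        return sum;
    }

    // Fills `out` in place instead of allocating a new result array per call.
    static void layer(float[] flatWeights, float[] inputs, float[] out) {
        for (int n = 0; n < out.length; n++) {
            out[n] = weightedSum(flatWeights, inputs, n);
        }
    }
}
```

Whether this beats the nested float[][][] on a given JVM is something to measure rather than assume.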
Disregarding the actual math, the array indexing in Java can be a performance hog in itself. Consider that Java has no real multidimensional arrays, but rather implements them as arrays of arrays. In your innermost loop, you access over multiple indices, some of which are in fact constant in that loop. Part of the array access can be moved outside of the loop:
final float[] neuronOutputSlice = neuronOutput[layer - 1];
final float[][] fWeightsSlice = fWeights[layer - 1];
for (int inputNeuron = 0; inputNeuron < biasNeuron; inputNeuron++) {
activation += neuronOutputSlice[inputNeuron] * fWeightsSlice[inputNeuron][neuron];
}
It is possible that the server JIT performs a similar loop-invariant code motion; the only way to find out is to change it and profile. On the client JIT this should improve performance no matter what.
Another thing you can try is to precalculate the for-loop exit conditions, like this:
for (int neuron = 0; neuron < neuronCounts[layer]; neuron++) { ... }
// transform to precalculated exit condition (move invariant array access outside loop)
for (int neuron = 0, neuronCount = neuronCounts[layer]; neuron < neuronCount; neuron++) { ... }
Again, the JIT may already do this for you, so profile to see if it helps.
Is there a point to multiplying with 1.0F that eludes me here?:
float activation = 1.0F * fWeights[layer - 1][biasNeuron][neuron];
Other things that could potentially improve speed at cost of readability: inline sigmoid() function manually (the JIT has a very tight limit for inlining and the function might be larger).
It can be slightly faster to run a loop backwards (where it doesn't change the outcome, of course), since testing the loop index against zero is a little cheaper than checking against a local variable (the innermost loop is a potential candidate again, but don't expect the output to be 100% identical in all cases, since adding floats a + b + c is potentially not the same as a + c + b).
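The caveat about float addition order is easy to demonstrate. In the sketch below (FloatOrderSketch is a hypothetical name) the same three values are summed forwards and backwards:

```java
final class FloatOrderSketch {
    // Sums from the first element to the last.
    static float forward(float[] v) {
        float sum = 0f;
        for (int i = 0; i < v.length; i++) {
            sum += v[i];
        }
        return sum;
    }

    // Sums from the last element to the first, comparing the index against zero.
    static float backward(float[] v) {
        float sum = 0f;
        for (int i = v.length - 1; i >= 0; i--) {
            sum += v[i];
        }
        return sum;
    }
}
```

For the input {1e8f, -1e8f, 1f}, forward gives 1.0 while backward gives 0.0: going backwards, the 1 is absorbed when it is added to -1e8, because floats near 1e8 are spaced 8 apart.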
For a start, don't do this:
// Copy input values to input layer output
for (int i = 0; i < inputNeurons; i++) {
neuronOutput[0][i] = input[i];
}
But this:
System.arraycopy( input, 0, neuronOutput[0], 0, inputNeurons );
First thing I would look into is seeing if Math.exp is slowing you down. See this post on a Math.exp approximation for a native alternative.
Replace the expensive floating point sigmoid transfer function with an integer step transfer function.
The sigmoid transfer function is a model of organic analog synaptic learning, which in turn seems to be a model of a step function.
The historical precedent for this is that Hinton designed the back-prop algorithm directly from the first principles of cognitive science theories about real synapses, which in turn were based on real analog measurements, which turn out to be sigmoid.
But the sigmoid transfer function seems to be an organic model of the digital step function, which of course cannot be directly implemented organically.
Rather than model a model, replace the expensive floating point implementation of the organic sigmoid transfer function with the direct digital implementation of a step function (less than zero = -1, greater than zero = +1).
The brain cannot do this, but backprop can!
This not only linearly and drastically improves performance of a single learning iteration, it also reduces the number of learning iterations required to train the network: supporting evidence that learning is inherently digital.
Also supports the argument that Computer Science is inherently cool.
Purely based upon code inspection, your innermost loop has to compute references into a three-dimensional array, and it's being done a lot. Depending upon your array dimensions, you could be having cache issues due to having to jump around memory with each loop iteration. Maybe you could rearrange the dimensions so the inner loop accesses memory elements that are closer to one another than they are now?
In any case, profile your code before making any changes and see where the real bottleneck is.
I suggest using a fixed-point system rather than a floating-point system. On almost all processors, using int is faster than float. The simplest way to do this is to shift everything left by a certain amount (4 or 5 bits are good starting points) and treat the bottom bits (the bottom 4, for a shift of 4) as the fractional part.
Your innermost loop is doing floating point maths so this may give you quite a boost.
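A minimal sketch of what that could look like with 4 fractional bits (FixedPointSketch and its method names are invented for illustration):

```java
final class FixedPointSketch {
    static final int FRACTION_BITS = 4; // resolution of 1/16

    // Converts a float to fixed point, rounding to the nearest 1/16.
    static int toFixed(float value) {
        return Math.round(value * (1 << FRACTION_BITS));
    }

    // Converts a fixed-point value back to a float.
    static float toFloat(int fixed) {
        return fixed / (float) (1 << FRACTION_BITS);
    }

    // Addition works directly on the scaled integers.
    static int add(int a, int b) {
        return a + b;
    }

    // The raw product carries the scale factor twice, so shift once to fix it.
    static int multiply(int a, int b) {
        return (a * b) >> FRACTION_BITS;
    }
}
```

With only 4 fractional bits every value is rounded to the nearest 1/16, and large intermediate products can overflow an int, so this is worth checking against the accuracy the question asks for before relying on it.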
The key to optimization is to first measure where the time is spent. Surround various parts of your algorithm with calls to System.nanoTime():
long start_time = System.nanoTime();
doStuff();
long time_taken = System.nanoTime() - start_time;
I'd guess that while using System.arraycopy() would help a bit, you'll find your real costs in the inner loop.
Depending on what you find, you might consider replacing the float arithmetic with integer arithmetic.