Greetings noble community,
I want to have the following loop:
for(i = 0; i < MAX; i++)
A[i] = B[i] + C[i];
This will run in parallel on a shared-memory quad-core computer using threads. The two alternatives below are being considered for the code to be executed by these threads, where tid is the id of the thread: 0, 1, 2 or 3.
(for simplicity, assume MAX is a multiple of 4)
Option 1:
for(i = tid; i < MAX; i += 4)
A[i] = B[i] + C[i];
Option 2:
for(i = tid*(MAX/4); i < (tid+1)*(MAX/4); i++)
A[i] = B[i] + C[i];
My question is whether one of these is more efficient than the other, and why?
The second one is better than the first. Simple answer: the second one minimizes false sharing.
Modern CPUs do not load bytes into the cache one at a time. They read memory in fixed-size batches called cache lines. When two threads modify different variables that sit on the same cache line, each write by one core invalidates the other core's copy of the line, which must then be reloaded.
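To make the two access patterns from the question concrete, here is a minimal Java sketch of both partitioning schemes. The class name PartitionDemo, the Kernel interface and the runParallel helper are made up for illustration; the sketch only shows which indices each thread touches and that both options compute the same result. It does not measure anything.

```java
import java.util.Arrays;

final class PartitionDemo {
    static final int THREADS = 4;

    // Hypothetical helper type so both options can be passed around.
    interface Kernel { void run(int tid, int[] a, int[] b, int[] c); }

    // Option 1: thread tid handles indices tid, tid+4, tid+8, ...
    // Neighbouring elements of a are written by different threads.
    static void interleaved(int tid, int[] a, int[] b, int[] c) {
        for (int i = tid; i < a.length; i += THREADS) {
            a[i] = b[i] + c[i];
        }
    }

    // Option 2: thread tid handles one contiguous quarter of the array.
    // Threads only meet at the three chunk boundaries.
    static void blocked(int tid, int[] a, int[] b, int[] c) {
        int chunk = a.length / THREADS; // assumes length is a multiple of 4
        for (int i = tid * chunk; i < (tid + 1) * chunk; i++) {
            a[i] = b[i] + c[i];
        }
    }

    // Runs the kernel on four threads and returns the filled output array.
    static int[] runParallel(Kernel k, int[] b, int[] c) {
        int[] a = new int[b.length];
        Thread[] workers = new Thread[THREADS];
        for (int t = 0; t < THREADS; t++) {
            final int tid = t;
            workers[t] = new Thread(() -> k.run(tid, a, b, c));
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return a;
    }

    public static void main(String[] args) {
        int n = 1 << 20;
        int[] b = new int[n], c = new int[n];
        for (int i = 0; i < n; i++) { b[i] = i; c[i] = 2 * i; }
        int[] r1 = runParallel(PartitionDemo::interleaved, b, c);
        int[] r2 = runParallel(PartitionDemo::blocked, b, c);
        System.out.println("same result: " + Arrays.equals(r1, r2));
    }
}
```

With option 1, the four threads write to adjacent array slots all the time; with option 2, that only happens near the chunk boundaries.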
When would this happen?
Basically, elements that are near each other in memory will be in the same cache line. Neighboring elements of an array will therefore share a cache line, since an array is just a contiguous chunk of memory. Likewise, foo1 and foo2 below might share a cache line, since they are declared next to each other in the same class.
class Foo {
private int foo1;
private int foo2;
}
How bad is false sharing?
I refer to Example 6 from the Gallery of Processor Cache Effects:
private static int[] s_counter = new int[1024];
private void UpdateCounter(int position)
{
for (int j = 0; j < 100000000; j++)
{
s_counter[position] = s_counter[position] + 3;
}
}
On my quad-core machine, if I call UpdateCounter with parameters 0,1,2,3 from four different threads, it will take 4.3 seconds until all threads are done.
On the other hand, if I call UpdateCounter with parameters 16,32,48,64 the operation will be done in 0.28 seconds!
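The quoted example is C#. A rough Java port might look like the sketch below. The class name FalseSharingDemo and the run helper are invented here, and AtomicIntegerArray is an assumption on my part: it is used so that the JIT cannot keep the counter in a register, which would hide the effect. Timings depend entirely on the machine and JVM, so none are claimed.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

final class FalseSharingDemo {
    // 1024 ints; with 4-byte ints and 64-byte cache lines (an assumption),
    // indices 16 apart fall on different lines.
    static final AtomicIntegerArray COUNTER = new AtomicIntegerArray(1024);

    static void updateCounter(int position, int iterations) {
        for (int j = 0; j < iterations; j++) {
            // Each thread owns its position, so get-then-set is race free.
            COUNTER.set(position, COUNTER.get(position) + 3);
        }
    }

    // Starts one thread per position and returns elapsed wall-clock nanoseconds.
    static long run(int iterations, int... positions) {
        for (int p : positions) COUNTER.set(p, 0);
        Thread[] workers = new Thread[positions.length];
        long start = System.nanoTime();
        for (int t = 0; t < positions.length; t++) {
            final int pos = positions[t];
            workers[t] = new Thread(() -> updateCounter(pos, iterations));
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int iterations = 10_000_000;
        System.out.println("adjacent ns: " + run(iterations, 0, 1, 2, 3));
        System.out.println("spread   ns: " + run(iterations, 16, 32, 48, 64));
    }
}
```

For a serious measurement a harness such as JMH would be a better choice than hand-rolled timing.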
How to detect false sharing?
Linux perf can be used to count cache misses and so help you analyze this kind of problem.
Referring to the analysis in CPU Cache Effects and Linux Perf, perf reports the following L1 cache misses for almost the same code example as above:
Performance counter stats for './cache_line_test 0 1 2 3':
10,055,747 L1-dcache-load-misses # 1.54% of all L1-dcache hits [51.24%]
Performance counter stats for './cache_line_test 16 32 48 64':
36,992 L1-dcache-load-misses # 0.01% of all L1-dcache hits [50.51%]
This shows that the number of L1 data-cache load misses drops from 10,055,747 to 36,992 without false sharing. The performance overhead is not in the L1 miss itself; it is in the subsequent loads from the L2 and L3 caches and from main memory that follow each miss caused by false sharing.
Is there some good practice in industry?
LMAX Disruptor is a high-performance inter-thread messaging library, and it is the default messaging system for intra-worker communication in Apache Storm.
The underlying data structure is a simple ring buffer, but to make it fast it uses a lot of tricks to reduce false sharing.
For example, it defines the superclass RingBufferPad to create padding between elements in RingBuffer:
abstract class RingBufferPad
{
protected long p1, p2, p3, p4, p5, p6, p7;
}
Also, when it allocates memory for the buffer, it adds padding at both the front and the tail so that the buffer is not affected by data in adjacent memory:
this.entries = new Object[sequencer.getBufferSize() + 2 * BUFFER_PAD];
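The same padding idea can be applied to a single hot field. The sketch below is not Disruptor code; LeftPad, HotValue and PaddedCounter are hypothetical names. It assumes 64-byte cache lines and relies on the JVM laying out superclass fields before subclass fields, which HotSpot does in practice but the language specification does not promise.

```java
// Seven longs (56 bytes) in front of the hot field.
class LeftPad {
    protected long p1, p2, p3, p4, p5, p6, p7;
}

// The field that one thread writes frequently.
class HotValue extends LeftPad {
    protected volatile long value;
}

// Seven more longs behind it, so no unrelated field shares its cache line
// (under the layout assumption stated above).
final class PaddedCounter extends HotValue {
    protected long p9, p10, p11, p12, p13, p14, p15;

    long get() {
        return value;
    }

    // Intended for a single writer thread; this is not an atomic increment.
    void increment() {
        value = value + 1;
    }
}
```

Splitting the padding across a class hierarchy is what stops the JVM from reordering the pad fields away from the value they are meant to isolate.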
You probably want to learn more about all the magic tricks. Take a look at one of the authors' posts: Dissecting the Disruptor: Why it's so fast
There are two different reasons why you should prefer option 2 over option 1. One of these is cache locality / cache contention, as explained in @qqibrow's answer; I won't explain that here as there's already a good answer explaining it.
The other reason is vectorisation. Most high-end modern processors have vector units which are capable of running the same instruction simultaneously on multiple different data (in particular, if the processor has multiple cores, it almost certainly has a vector unit, maybe even multiple vector units, on each core). For example, without the vector unit, the processor has an instruction to do an addition:
A = B + C;
and the corresponding instruction in the vector unit will do multiple additions at the same time:
A1 = B1 + C1;
A2 = B2 + C2;
A3 = B3 + C3;
A4 = B4 + C4;
(The exact number of additions will vary by processor model; on ints, common "vector widths" include 4 and 8 simultaneous additions, and some recent processors can do 16.)
Your for loop looks like an obvious candidate for using the vector unit; as long as none of A, B, and C are pointers into the same array but with different offsets (which is possible in C++ but not Java), the compiler would be allowed to optimise option 2 into
for(i = tid*(MAX/4); i < (tid+1)*(MAX/4); i+=4) {
A[i+0] = B[i+0] + C[i+0];
A[i+1] = B[i+1] + C[i+1];
A[i+2] = B[i+2] + C[i+2];
A[i+3] = B[i+3] + C[i+3];
}
However, one limitation of the vector unit is related to memory accesses: vector units are only fast at accessing memory when they're accessing adjacent locations (such as adjacent elements in an array, or adjacent fields of a C struct). The option 2 code above is pretty much the best case for vectorisation of the code: the vector unit can access all the elements it needs from each array as a single block. If you tried to vectorise the option 1 code, the vector unit would take so long trying to find all the values it's working on in memory that the gains from vectorisation would be negated; it would be unlikely to run any faster than the non-vectorised code, because the memory access wouldn't be any faster, and the addition takes no time by comparison (because the processor can do the addition while it's waiting for the values to arrive from memory).
It isn't guaranteed that a compiler will be able to make use of the vector unit, but it would be much more likely to do so with option 2 than option 1. So you might find that option 2's advantage over option 1 is a factor of 4/8/16 more than you'd expect if you only took cache effects into account.
The arrays below are sorted and contain no duplicates (unique positive integers), are small (fewer than 5000 elements), and the intersection (see below) is computed billions of times, so any micro-optimization matters. This article nicely describes how to speed up the code below in C.
int i = 0, j = 0, c = 0, la = a.length, lb = b.length;
intersection = new int[Math.min(la, lb)];
while (i < la && j < lb) {
if (a[i] < b[j]) i++;
else if (a[i] > b[j]) j++;
else {
intersection[c] = a[i];
i++; j++; c++;
}
}
int[] intersectionZip = new int[c];
System.arraycopy(intersection, 0, intersectionZip, 0, c);
In Java I guess it is impossible to call those low-level instructions. But they mention that "it is possible to improve this approach using branchless implementation". How would one do it? Using switch? Or maybe by replacing the a[i] < b[j], a[i] > b[j] and a[i] == b[j] comparisons with bitwise operations on the integer operands?
A binary search approach (with complexity O(la log(lb))) does not apply here because la is not much smaller than lb. What interests me is how to change the if statements.
I don't think there's much you could do to improve the performance of that Java code. However, I would note that it is not doing the same thing as the C version. The C version puts the intersection into an array that was preallocated by the caller. The Java version allocates the array itself ... and then allocates a smaller array and copies into it when it is finished.
I guess you could change the Java version to make two passes over the input arrays, with the first one working out how big the output array needs to be ... but whether it helps or hinders will depend on the inputs.
There might be other special cases you could optimize for; e.g. if there were likely to be long runs of numbers in one array with nothing in that range in the other array you might be able to "optimistically" try to skip multiple numbers in one go; i.e. increment i or j by a larger number than 1.
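To illustrate the point about the C version writing into a caller-supplied array, here is a sketch of a Java variant that does the same. The class name SortedIntersection and the convention of returning the element count are my own choices for illustration; whether it actually helps will depend on how the results are consumed.

```java
final class SortedIntersection {
    // Writes the intersection of two sorted, duplicate-free arrays into out
    // and returns how many elements were written.
    // out must have room for at least Math.min(a.length, b.length) elements.
    static int intersect(int[] a, int[] b, int[] out) {
        int i = 0, j = 0, c = 0;
        final int la = a.length, lb = b.length;
        while (i < la && j < lb) {
            final int x = a[i], y = b[j];
            if (x < y) {
                i++;
            } else if (x > y) {
                j++;
            } else {
                out[c++] = x;
                i++;
                j++;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int[] a = {1, 3, 5, 7};
        int[] b = {3, 4, 5, 8};
        int[] out = new int[Math.min(a.length, b.length)]; // reused across calls
        int n = intersect(a, b, out);
        for (int k = 0; k < n; k++) System.out.println(out[k]);
    }
}
```

A caller that invokes this billions of times can allocate out once and reuse it, which removes both the per-call allocation and the final copy.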
But they mention that "it is possible to improve this approach using branchless implementation". How one would do it? Using switch?
Not a Java switch ... or a conditional expression because they both involve branches when translated to the native code.
I think he is referring to something like this: Branchless code that maps zero, negative, and positive to 0, 1, 2
FWIW it is a bad idea to try to do this kind of thing in Java. The problem is that the performance of tricky code sequences like that is dependent on details of the hardware architecture, instruction set, clock counts, etc that vary from one platform to the next. The Java JIT compiler's optimizer can do a pretty good job of optimizing your code ... but if you include tricky sequences:
it is not at all obvious or predictable how they will be translated to native code, and
you may well find that the trickiness actually inhibits useful optimizations that the JIT compiler might otherwise be able to do.
Having said that, it is not impossible that some future release of Java might include a superoptimizer ... along the lines of the one mentioned on the linked Q&A above ... that would be able to generate branchless sequences automatically. But bear in mind that superoptimization is very expensive to perform.
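For reference, this is roughly what a branch-free loop body could look like in Java. It is a sketch, not the linked article's code: the name BranchlessIntersection is made up, and the sign-bit trick relies on the question's guarantee that all values are positive, so x - y cannot overflow. The loop condition is still a branch, and per the caveats above there is no promise that the JIT turns this into faster native code.

```java
final class BranchlessIntersection {
    // Same contract as a plain merge-style intersection: sorted, duplicate-free,
    // positive inputs; out needs room for Math.min(a.length, b.length) elements.
    static int intersect(int[] a, int[] b, int[] out) {
        int i = 0, j = 0, c = 0;
        final int la = a.length, lb = b.length;
        while (i < la && j < lb) {
            final int x = a[i], y = b[j];
            final int lt = (x - y) >>> 31; // 1 if x < y, else 0
            final int gt = (y - x) >>> 31; // 1 if x > y, else 0
            out[c] = x;                    // stored every time, kept only on a match
            c += 1 - lt - gt;              // advances only when x == y
            i += 1 - gt;                   // advances when x <= y
            j += 1 - lt;                   // advances when x >= y
        }
        return c;
    }

    public static void main(String[] args) {
        int[] out = new int[4];
        int n = intersect(new int[] {1, 3, 5, 7}, new int[] {3, 4, 5, 8}, out);
        for (int k = 0; k < n; k++) System.out.println(out[k]);
    }
}
```

The unconditional store is safe because c can never exceed the smaller of i and j while the loop is running. Measure before adopting anything like this.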
Maybe using ? : operator:
(a[i] < b[j]) ? i++ : ((a[i] > b[j]) ? j++ : ....
In "Core java 1" I've read
CAUTION: An ArrayList<Integer> is far less efficient than an int[] array because each value is separately wrapped inside an object. You would only want to use this construct for small collections when programmer convenience is more important than efficiency.
But in my software I've already used ArrayList instead of plain arrays due to some requirements, even though the software is supposed to have high performance, and after reading the quoted text I started to panic! One thing I can change is to replace double variables with Double so as to prevent autoboxing, but I don't know whether that is worth it. Take the following sample algorithm:
public void multiply(final double val)
{
final int rows = getSize1();
final int cols = getSize2();
for (int i = 0; i < rows; i++)
{
for (int j = 0; j < cols; j++)
{
this.get(i).set(j, this.get(i).get(j) * val);
}
}
}
My question is: does changing double to Double make a difference, or is that a micro-optimization that won't affect anything? Keep in mind I might be using large matrices. Second, should I consider redesigning the whole program?
The big issue with double versus Double is that the latter adds some amount of memory overhead -- 8 bytes per object on a Sun 32-bit JVM, possibly more or less on others. Then you need another 4 bytes (8 on a 64-bit JVM) to refer to the object.
So, assuming that you have 1,000,000 objects, the differences are as follows:
double[1000000]
8 bytes per entry; total = 8,000,000 bytes
Double[1000000]
16 bytes per object instance + 4 bytes per reference; total = 20,000,000 bytes
Whether or not this matters depends very much on your application. Unless you find yourself running out of memory, assume that it doesn't matter.
It changes the place where autoboxing happens, but nothing else.
And 2nd - no, don't worry about this. It is unlikely to be a bottleneck. You can make some benchmarks to measure it for the size of your data, to prove that the difference is insignificant in regard to your application performance.
Double is dramatically more expensive than double, however in 90% of cases it doesn't matter.
If you wanted an efficient matrix class, I would suggest you use one of the libraries which already do this efficiently. e.g. Jama.
Changing the double argument to Double won't help; it will worsen performance slightly because the value needs to be unboxed for the multiplication.
What will help is preventing multiple calls to get() as in:
for (int i = 0; i < rows; i++)
{
List<Double> row = this.get(i);
for (int j = 0; j < cols; j++)
{
row.set(j, row.get(j) * val);
}
}
(btw, I guessed the type for row.)
Assuming that you use a list of lists, using iterators instead of getting and setting via loop indices will win some more performance.
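A sketch of the iterator-based variant, assuming the matrix is stored as a List<List<Double>>. The class ListMatrix and its constructor are hypothetical stand-ins for the asker's class. For ArrayList the gain over indexed access is something to measure rather than assume; for a LinkedList it avoids the linear-time get(j).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ListIterator;

final class ListMatrix {
    private final List<List<Double>> rows = new ArrayList<>();

    // Hypothetical constructor, only here to make the sketch self-contained.
    ListMatrix(double[][] values) {
        for (double[] r : values) {
            List<Double> row = new ArrayList<>(r.length);
            for (double v : r) row.add(v);
            rows.add(row);
        }
    }

    // Scales every element in place without any index-based get/set calls.
    void multiply(final double val) {
        for (List<Double> row : rows) {
            ListIterator<Double> it = row.listIterator();
            while (it.hasNext()) {
                // next() unboxes the element, set() stores the boxed product
                // back into the slot that next() just returned.
                it.set(it.next() * val);
            }
        }
    }

    double get(int i, int j) {
        return rows.get(i).get(j);
    }

    public static void main(String[] args) {
        ListMatrix m = new ListMatrix(new double[][] {{1, 2}, {3, 4}});
        m.multiply(2.0);
        System.out.println(m.get(1, 1));
    }
}
```

Boxing still happens on every set, so this does not remove the overhead the quoted caution is about; only a primitive double[] representation does that.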
I am trying to make a Java port of a simple feed-forward neural network.
This obviously involves lots of numeric calculations, so I am trying to optimize my central loop as much as possible. The results should be correct within the limits of the float data type.
My current code looks as follows (error handling & initialization removed):
/**
* Simple implementation of a feedforward neural network. The network supports
* including a bias neuron with a constant output of 1.0 and weighted synapses
* to hidden and output layers.
*
* @author Martin Wiboe
*/
public class FeedForwardNetwork {
private final int outputNeurons; // No of neurons in output layer
private final int inputNeurons; // No of neurons in input layer
private int largestLayerNeurons; // No of neurons in largest layer
private final int numberLayers; // No of layers
private final int[] neuronCounts; // Neuron count in each layer, 0 is input
// layer.
private final float[][][] fWeights; // Weights between neurons.
// fWeight[fromLayer][fromNeuron][toNeuron]
// is the weight from fromNeuron in
// fromLayer to toNeuron in layer
// fromLayer+1.
private float[][] neuronOutput; // Temporary storage of output from previous layer
public float[] compute(float[] input) {
// Copy input values to input layer output
for (int i = 0; i < inputNeurons; i++) {
neuronOutput[0][i] = input[i];
}
// Loop through layers
for (int layer = 1; layer < numberLayers; layer++) {
// Loop over neurons in the layer and determine weighted input sum
for (int neuron = 0; neuron < neuronCounts[layer]; neuron++) {
// Bias neuron is the last neuron in the previous layer
int biasNeuron = neuronCounts[layer - 1];
// Get weighted input from bias neuron - output is always 1.0
float activation = 1.0F * fWeights[layer - 1][biasNeuron][neuron];
// Get weighted inputs from rest of neurons in previous layer
for (int inputNeuron = 0; inputNeuron < biasNeuron; inputNeuron++) {
activation += neuronOutput[layer-1][inputNeuron] * fWeights[layer - 1][inputNeuron][neuron];
}
// Store neuron output for next round of computation
neuronOutput[layer][neuron] = sigmoid(activation);
}
}
// Return output from network = output from last layer
float[] result = new float[outputNeurons];
for (int i = 0; i < outputNeurons; i++)
result[i] = neuronOutput[numberLayers - 1][i];
return result;
}
private final static float sigmoid(final float input) {
return (float) (1.0F / (1.0F + Math.exp(-1.0F * input)));
}
}
I am running the JVM with the -server option, and as of now my code is between 25% and 50% slower than similar C code. What can I do to improve this situation?
Thank you,
Martin Wiboe
Edit #1: After seeing the vast amount of responses, I should probably clarify the numbers in our scenario. During a typical run, the method will be called about 50.000 times with different inputs. A typical network would have numberLayers = 3 layers with 190, 2 and 1 neuron, respectively. The innermost loop will therefore have about 2*191+3=385 iterations (when counting the added bias neuron in layers 0 and 1)
Edit #2: After implementing the various suggestions in this thread, our implementation is practically as fast as the C version (within ~2 %). Thanks for all the help! All of the suggestions have been helpful, but since I can only mark one answer as the correct one, I will give it to @Durandal for both suggesting array optimizations and being the only one to precalculate the for loop header.
Some tips.
in your innermost loop, think about how you are traversing the CPU cache, and rearrange your matrix so that you access the outermost array sequentially. This way you access memory in order rather than jumping all over the place. A cache hit can be two orders of magnitude faster than a cache miss.
e.g restructure fWeights so it is accessed as
activation += neuronOutput[layer-1][inputNeuron] * fWeights[layer - 1][neuron][inputNeuron];
don't perform work inside the loop (every time) which can be done outside the loop (once). Don't perform the [layer -1] lookup every time when you can place this in a local variable. Your IDE should be able to refactor this easily.
multi-dimensional arrays in Java are not as efficient as they are in C. They are actually multiple layers of single dimensional arrays. You can restructure the code so you're only using a single dimensional array.
don't return a new array when you can pass the result array as an argument. (Saves creating a new object on each call).
rather than performing layer-1 all over the place, why not use layer1 as layer-1 and use layer1+1 instead of layer?
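Several of the tips above (sequential access, a single one-dimensional array, a caller-supplied result array) can be combined for one layer roughly as follows. This is a sketch under my own assumptions: the class FlatLayer, its method names and the storage order (one row per target neuron, bias weight last) are invented for illustration, and the sigmoid is left out to keep the focus on the memory layout.

```java
final class FlatLayer {
    private final int inputs;      // neurons in the previous layer, bias excluded
    private final int outputs;     // neurons in this layer
    private final float[] weights; // row per output neuron: inputs weights, then bias

    FlatLayer(int inputs, int outputs) {
        this.inputs = inputs;
        this.outputs = outputs;
        this.weights = new float[outputs * (inputs + 1)];
    }

    // fromNeuron == inputs addresses the bias weight of toNeuron.
    void setWeight(int toNeuron, int fromNeuron, float w) {
        weights[toNeuron * (inputs + 1) + fromNeuron] = w;
    }

    // Writes the weighted input sum of every neuron into result,
    // which the caller allocates once and reuses.
    void weightedSums(float[] in, float[] result) {
        final int stride = inputs + 1;
        for (int n = 0; n < outputs; n++) {
            final int base = n * stride;
            float activation = weights[base + inputs]; // bias output is always 1.0
            for (int k = 0; k < inputs; k++) {
                activation += in[k] * weights[base + k]; // walks memory sequentially
            }
            result[n] = activation;
        }
    }

    public static void main(String[] args) {
        FlatLayer layer = new FlatLayer(2, 1);
        layer.setWeight(0, 0, 0.5f);
        layer.setWeight(0, 1, 0.25f);
        layer.setWeight(0, 2, 1.0f); // bias
        float[] result = new float[1];
        layer.weightedSums(new float[] {2f, 4f}, result);
        System.out.println(result[0]);
    }
}
```

The inner loop reads in[k] and weights[base + k] with consecutive indices, which is the access pattern the first tip asks for.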
Disregarding the actual math, the array indexing in Java can be a performance hog in itself. Consider that Java has no real multidimensional arrays, but rather implements them as arrays of arrays. In your innermost loop, you index through multiple levels, some of which are in fact constant in that loop. Part of the array access can be moved outside of the loop:
final float[] neuronOutputSlice = neuronOutput[layer - 1];
final float[][] fWeightsSlice = fWeights[layer - 1];
for (int inputNeuron = 0; inputNeuron < biasNeuron; inputNeuron++) {
activation += neuronOutputSlice[inputNeuron] * fWeightsSlice[inputNeuron][neuron];
}
It is possible that the server JIT already performs a similar loop-invariant code motion; the only way to find out is to make the change and profile it. On the client JIT this should improve performance no matter what.
Another thing you can try is to precalculate the for-loop exit conditions, like this:
for (int neuron = 0; neuron < neuronCounts[layer]; neuron++) { ... }
// transform to precalculated exit condition (move invariant array access outside loop)
for (int neuron = 0, neuronCount = neuronCounts[layer]; neuron < neuronCount; neuron++) { ... }
Again the JIT may already do this for you, so profile if it helps.
Is there a point to multiplying with 1.0F that eludes me here?:
float activation = 1.0F * fWeights[layer - 1][biasNeuron][neuron];
Other things that could potentially improve speed at cost of readability: inline sigmoid() function manually (the JIT has a very tight limit for inlining and the function might be larger).
It can be slightly faster to run a loop backwards (where it doesn't change the outcome, of course), since testing the loop index against zero is a little cheaper than checking against a local variable (the innermost loop is a potential candidate again, but don't expect the output to be 100% identical in all cases, since adding floats a + b + c is potentially not the same as a + c + b).
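The remark about summation order is easy to check with a small example. The values below are picked by me to make the effect obvious: at a magnitude of 1e8 adjacent floats are 8 apart, so adding 1 to -1e8f is lost to rounding.

```java
final class FloatOrderDemo {
    static float leftFirst(float a, float b, float c) {
        return (a + b) + c;
    }

    static float rightFirst(float a, float b, float c) {
        return a + (b + c);
    }

    public static void main(String[] args) {
        float a = 1e8f, b = -1e8f, c = 1f;
        System.out.println(leftFirst(a, b, c));  // prints 1.0
        System.out.println(rightFirst(a, b, c)); // prints 0.0
    }
}
```

In the network code the differences would normally be far smaller than this, but it is the reason a reversed loop may not reproduce the forward loop's result bit for bit.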
For a start, don't do this:
// Copy input values to input layer output
for (int i = 0; i < inputNeurons; i++) {
neuronOutput[0][i] = input[i];
}
But this:
System.arraycopy( input, 0, neuronOutput[0], 0, inputNeurons );
First thing I would look into is seeing if Math.exp is slowing you down. See this post on a Math.exp approximation for a native alternative.
Replace the expensive floating point sigmoid transfer function with an integer step transfer function.
The sigmoid transfer function is a model of organic analog synaptic learning, which in turn seems to be a model of a step function.
The historical precedent for this is that Hinton designed the back-prop algorithm directly from the first principles of cognitive science theories about real synapses, which in turn were based on real analog measurements, which turn out to be sigmoid.
But the sigmoid transfer function seems to be an organic model of the digital step function, which of course cannot be directly implemented organically.
Rather than model a model, replace the expensive floating point implementation of the organic sigmoid transfer function with the direct digital implementation of a step function (less than zero = -1, greater than zero = +1).
The brain cannot do this, but backprop can!
This not only linearly and drastically improves performance of a single learning iteration, it also reduces the number of learning iterations required to train the network: supporting evidence that learning is inherently digital.
Also supports the argument that Computer Science is inherently cool.
Purely based on code inspection, your innermost loop has to compute references into a three-dimensional array, and it is being done a lot. Depending on your array dimensions, you could be having cache issues from having to jump around memory on each loop iteration. Maybe you could rearrange the dimensions so the inner loop accesses memory elements that are closer to one another than they are now?
In any case, profile your code before making any changes and see where the real bottleneck is.
I suggest using a fixed-point system rather than a floating-point system. On almost all processors using int is faster than float. The simplest way to do this is to shift everything left by a certain amount (4 or 5 bits are good starting points) and treat the bits below that point as the fractional part.
Your innermost loop is doing floating point maths so this may give you quite a boost.
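A minimal sketch of what such a fixed-point representation could look like with 4 fractional bits. The class Fixed4 and its helper names are hypothetical. Note that 4 bits give a resolution of only 1/16, so results will differ from the float version; whether that is acceptable, and whether it is faster at all on a given CPU, has to be checked.

```java
final class Fixed4 {
    static final int FRACTION_BITS = 4;
    static final int ONE = 1 << FRACTION_BITS; // 16 represents 1.0

    // Converts to fixed point, rounding to the nearest 1/16.
    static int fromFloat(float x) {
        return Math.round(x * ONE);
    }

    static float toFloat(int f) {
        return f / (float) ONE;
    }

    // Addition is ordinary int addition; multiplication needs one shift
    // to drop the extra fractional bits of the raw product.
    static int mul(int a, int b) {
        return (a * b) >> FRACTION_BITS;
    }

    public static void main(String[] args) {
        int a = fromFloat(1.5f); // 24
        int b = fromFloat(2.0f); // 32
        System.out.println(toFloat(mul(a, b)));     // prints 3.0
        System.out.println(toFloat(fromFloat(0.1f))); // prints 0.125
    }
}
```

The second line of output shows the quantisation error: 0.1 becomes 2/16. More fractional bits reduce the error but leave less headroom before the int product overflows.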
The key to optimization is to first measure where the time is spent. Surround various parts of your algorithm with calls to System.nanoTime():
long start_time = System.nanoTime();
doStuff();
long time_taken = System.nanoTime() - start_time;
I'd guess that while using System.arraycopy() would help a bit, you'll find your real costs in the inner loop.
Depending on what you find, you might consider replacing the float arithmetic with integer arithmetic.
this question is just speculative.
I have the following implementation in C++:
using namespace std;
void testvector(int x)
{
vector<string> v;
char aux[20];
int a = x * 2000;
int z = a + 2000;
string s("X-");
for (int i = a; i < z; i++)
{
sprintf(aux, "%d", i);
v.push_back(s + aux);
}
}
int main()
{
for (int i = 0; i < 10000; i++)
{
if (i % 1000 == 0) cout << i << endl;
testvector(i);
}
}
In my box, this program gets executed in approx. 12 seconds; amazingly, I have a similar implementation in Java [using String and ArrayList] and it runs lot faster than my C++ application (approx. 2 seconds).
I know the Java HotSpot performs a lot of optimizations when translating to native, but I think if such performance can be done in Java, it could be implemented in C++ too...
So, what do you think that should be modified in the program above or, I dunno, in the libraries used or in the memory allocator to reach similar performances in this stuff? (writing actual code of these things can be very long, so, discussing about it would be great)...
Thank you.
You have to be careful with performance tests because it's very easy to deceive yourself or not compare like with like.
However, I've seen similar results comparing C# with C++, and there are a number of well-known blog posts about the astonishment of native coders when confronted with this kind of evidence. Basically a good modern generational compacting GC is very much more optimised for lots of small allocations.
In C++'s default allocator, every block is treated the same, so each one is moderately expensive to allocate and free. In a generational GC, all blocks are very, very cheap to allocate (nearly as cheap as stack allocation) and if they turn out to be short-lived then they are also very cheap to clean up.
This is why the "fast performance" of C++ compared with more modern languages is - for the most part - mythical. You have to hand tune your C++ program out of all recognition before it can compete with the performance of an equivalent naively written C# or Java program.
All your program does is print the numbers 0..9000 in steps of 1000. The calls to testvector() do nothing and can be eliminated. I suspect that your JVM notices this, and is essentially optimising the whole function away.
You can achieve a similar effect in your C++ version by just commenting out the call to testvector()!
Well, this is a pretty useless test that only measures allocation of small objects.
That said, simple changes made me get the running time down from about 15 secs to about 4 secs. New version:
typedef vector<string, boost::pool_allocator<string> > str_vector;
void testvector(int x, str_vector::iterator it, str_vector::iterator end)
{
char aux[25] = "X-";
int a = x * 2000;
for (; it != end; ++a)
{
sprintf(aux+2, "%d", a);
*it++ = aux;
}
}
int main(int argc, char** argv)
{
str_vector v(2000);
for (int i = 0; i < 10000; i++)
{
if (i % 1000 == 0) cout << i << endl;
testvector(i, v.begin(), v.begin()+2000);
}
return 0;
}
real 0m4.089s
user 0m3.686s
sys 0m0.000s
Java version has the times:
real 0m2.923s
user 0m2.490s
sys 0m0.063s
(This is my direct java port of your original program, except it passes the ArrayList as a parameter to cut down on useless allocations).
So, to sum up, small allocations are faster on java, and memory management is a bit more hassle in C++. But we knew that already :)
Hotspot optimises hot spots in code. Typically, anything that gets executed 10000 times it tries to optimise.
For this code, after 5 iterations it will try to optimise the inner loop that adds the strings to the vector. The optimisation will more than likely include escape analysis of the variables in the method. As the vector is a local variable and never escapes the local context, it is very likely that it will remove all of the code in the method and turn it into a no-op. To test this, try returning the results from the method. Even then, be careful to do something meaningful with the result: just getting its length, for example, can be optimised away, as HotSpot can see that the result is always the same as the number of iterations in the loop.
All of this points to the key benefit of a dynamic compiler like hotspot - using runtime analysis you can optimise what is actually being done at runtime and get rid of redundant code. After all, it doesn't matter how efficient your custom C++ memory allocator is - not executing any code is always going to be faster.
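The question does not show its Java version, so the following is only a guess at what a port that returns and uses its result might look like. The class name VectorBench and the checksum are my own additions; the checksum exists purely so that the strings have an observable effect and the JIT cannot discard them.

```java
import java.util.ArrayList;
import java.util.List;

final class VectorBench {
    // Returns the list instead of dropping it, so it escapes the method.
    static List<String> testVector(int x) {
        List<String> v = new ArrayList<>();
        int a = x * 2000;
        int z = a + 2000;
        for (int i = a; i < z; i++) {
            v.add("X-" + i);
        }
        return v;
    }

    public static void main(String[] args) {
        long checksum = 0;
        for (int i = 0; i < 10000; i++) {
            if (i % 1000 == 0) System.out.println(i);
            List<String> v = testVector(i);
            // Depends on the content, not just the size, of the result.
            checksum += v.get(v.size() - 1).hashCode();
        }
        System.out.println("checksum: " + checksum);
    }
}
```

Timing this against a C++ version that also consumes its result gives a fairer comparison than the original pair of programs.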
In my box, this program gets executed in approx. 12 seconds; amazingly, I have a similar implementation in Java [using String and ArrayList] and it runs lot faster than my C++ application (approx. 2 seconds).
I cannot reproduce that result.
To account for the optimization mentioned by Alex, I’ve modified the code so that both the Java and the C++ versions print the last element of the v vector at the end of the testvector method.
Now, the C++ code (compiled with -O3) runs about as fast as yours (12 sec). The Java code (straightforward, uses ArrayList instead of Vector although I doubt that this would impact the performance, thanks to escape analysis) takes about twice that time.
I did not do a lot of testing so this result is by no means significant. It just shows how easy it is to get these tests completely wrong, and how little single tests can say about real performance.
Just for the record, the tests were run on the following configuration:
$ uname -ms
Darwin i386
$ java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03-226)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02-92, mixed mode)
$ g++ --version
i686-apple-darwin9-g++-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5490)
It should help if you use vector::reserve to reserve space for z - a elements in v before the loop (however, the same thing should also speed up the Java equivalent of this code).
To suggest why the performance of the C++ and Java versions differs, it would be essential to see the source for both. I can see a number of performance issues in the C++; for some it would be useful to know whether you were doing the same in the Java version (e.g. flushing the output stream via std::endl: do you call System.out.flush() or just append a '\n'? If the latter, then you've just given the Java version a distinct advantage).
What are you actually trying to measure here? Putting ints into a vector?
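You can start by pre-allocating space in the vector, since its final size is known.
Instead of:
void testvector(int x)
{
    vector<string> v;
    char aux[20];
    int a = x * 2000;
    int z = a + 2000;
    string s("X-");
    for (int i = a; i < z; i++)
    {
        sprintf(aux, "%d", i);
        v.push_back(s + aux);
    }
}
try:
void testvector(int x)
{
    vector<string> v;
    char aux[20];
    int a = x * 2000;
    int z = a + 2000;
    string s("X-");
    // Reserve room for all z - a (2000) elements up front,
    // so push_back never has to reallocate and copy.
    v.reserve(z - a);
    for (int i = a; i < z; i++)
    {
        sprintf(aux, "%d", i);
        v.push_back(s + aux);
    }
}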
In your inner loop, you are pushing ints into a string vector. If you just single-step that at the machine-code level, I'll bet you find that a lot of that time goes into allocating and formatting the strings, and then some time goes into the pushback (not to mention deallocation when you release the vector).
This could easily vary between run-time-library implementations, based on the developer's sense of what people would reasonably want to do.