I am trying to make a Java port of a simple feed-forward neural network.
This obviously involves lots of numeric calculations, so I am trying to optimize my central loop as much as possible. The results should be correct within the limits of the float data type.
My current code looks as follows (error handling & initialization removed):
/**
* Simple implementation of a feedforward neural network. The network supports
* including a bias neuron with a constant output of 1.0 and weighted synapses
* to hidden and output layers.
*
* @author Martin Wiboe
*/
public class FeedForwardNetwork {
private final int outputNeurons; // No of neurons in output layer
private final int inputNeurons; // No of neurons in input layer
private int largestLayerNeurons; // No of neurons in largest layer
private final int numberLayers; // No of layers
private final int[] neuronCounts; // Neuron count in each layer, 0 is input
// layer.
private final float[][][] fWeights; // Weights between neurons.
// fWeight[fromLayer][fromNeuron][toNeuron]
// is the weight from fromNeuron in
// fromLayer to toNeuron in layer
// fromLayer+1.
private float[][] neuronOutput; // Temporary storage of output from previous layer
public float[] compute(float[] input) {
// Copy input values to input layer output
for (int i = 0; i < inputNeurons; i++) {
neuronOutput[0][i] = input[i];
}
// Loop through layers
for (int layer = 1; layer < numberLayers; layer++) {
// Loop over neurons in the layer and determine weighted input sum
for (int neuron = 0; neuron < neuronCounts[layer]; neuron++) {
// Bias neuron is the last neuron in the previous layer
int biasNeuron = neuronCounts[layer - 1];
// Get weighted input from bias neuron - output is always 1.0
float activation = 1.0F * fWeights[layer - 1][biasNeuron][neuron];
// Get weighted inputs from rest of neurons in previous layer
for (int inputNeuron = 0; inputNeuron < biasNeuron; inputNeuron++) {
activation += neuronOutput[layer-1][inputNeuron] * fWeights[layer - 1][inputNeuron][neuron];
}
// Store neuron output for next round of computation
neuronOutput[layer][neuron] = sigmoid(activation);
}
}
// Return output from network = output from last layer
float[] result = new float[outputNeurons];
for (int i = 0; i < outputNeurons; i++)
result[i] = neuronOutput[numberLayers - 1][i];
return result;
}
private final static float sigmoid(final float input) {
return (float) (1.0F / (1.0F + Math.exp(-1.0F * input)));
}
}
I am running the JVM with the -server option, and as of now my code is between 25% and 50% slower than similar C code. What can I do to improve this situation?
Thank you,
Martin Wiboe
Edit #1: After seeing the vast amount of responses, I should probably clarify the numbers in our scenario. During a typical run, the method will be called about 50,000 times with different inputs. A typical network would have numberLayers = 3 layers with 190, 2 and 1 neurons, respectively. The innermost loop will therefore have about 2*191+3 = 385 iterations (when counting the added bias neuron in layers 0 and 1).
Edit #2: After implementing the various suggestions in this thread, our implementation is practically as fast as the C version (within ~2%). Thanks for all the help! All of the suggestions have been helpful, but since I can only mark one answer as the correct one, I will give it to @Durandal for both suggesting array optimizations and being the only one to precalculate the for loop header.
Some tips.
In your innermost loop, think about how you are traversing your CPU cache, and re-arrange your matrix so you are accessing the outermost array sequentially. This will result in you accessing your cache in order rather than jumping all over the place. A cache hit can be two orders of magnitude faster than a cache miss.
E.g. restructure fWeights so it is accessed as
activation += neuronOutput[layer-1][inputNeuron] * fWeights[layer - 1][neuron][inputNeuron];
Don't perform work inside the loop (every time) which can be done outside the loop (once). Don't perform the [layer - 1] lookup every time when you can place this in a local variable. Your IDE should be able to refactor this easily.
Multi-dimensional arrays in Java are not as efficient as they are in C. They are actually multiple layers of single-dimensional arrays. You can restructure the code so you're only using a single-dimensional array.
Don't return a new array when you can pass the result array as an argument. (This saves creating a new object on each call.)
Rather than performing layer - 1 all over the place, why not loop over layer1 = layer - 1 and use layer1 + 1 instead of layer? A sketch combining several of these tips follows.
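Putting these tips together, the hot loop could look something like this (an untested sketch; it assumes fWeights has been transposed to [fromLayer][toNeuron][fromNeuron] order as suggested above, with the bias weight stored last in each row - names are illustrative, not from the original code):
for (int layer1 = 0; layer1 < numberLayers - 1; layer1++) {
    float[] prevOutput = neuronOutput[layer1]; // hoisted [layer - 1] lookups
    float[][] weights = fWeights[layer1]; // transposed: weights[toNeuron][fromNeuron]
    int biasNeuron = neuronCounts[layer1];
    for (int neuron = 0, neuronCount = neuronCounts[layer1 + 1]; neuron < neuronCount; neuron++) {
        float[] w = weights[neuron]; // one row of weights, now traversed sequentially
        float activation = w[biasNeuron]; // bias weight; the bias output is always 1.0
        for (int inputNeuron = 0; inputNeuron < biasNeuron; inputNeuron++) {
            activation += prevOutput[inputNeuron] * w[inputNeuron];
        }
        neuronOutput[layer1 + 1][neuron] = sigmoid(activation);
    }
}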
Disregarding the actual math, the array indexing in Java can be a performance hog in itself. Consider that Java has no real multidimensional arrays, but rather implements them as arrays of arrays. In your innermost loop, you access over multiple indices, some of which are in fact constant in that loop. Part of the array access can be moved outside of the loop:
final float[] neuronOutputSlice = neuronOutput[layer - 1];
final float[][] fWeightSlice = fWeights[layer - 1];
for (int inputNeuron = 0; inputNeuron < biasNeuron; inputNeuron++) {
    activation += neuronOutputSlice[inputNeuron] * fWeightSlice[inputNeuron][neuron];
}
It is possible that the server JIT performs similar loop-invariant code motion; the only way to find out is to change it and profile. On the client JIT this should improve performance no matter what.
Another thing you can try is to precalculate the for-loop exit conditions, like this:
for (int neuron = 0; neuron < neuronCounts[layer]; neuron++) { ... }
// transform to precalculated exit condition (move invariant array access outside loop)
for (int neuron = 0, neuronCount = neuronCounts[layer]; neuron < neuronCount; neuron++) { ... }
Again the JIT may already do this for you, so profile if it helps.
Is there a point to multiplying with 1.0F that eludes me here?:
float activation = 1.0F * fWeights[layer - 1][biasNeuron][neuron];
Other things that could potentially improve speed at the cost of readability: inline the sigmoid() function manually (the JIT has a very tight size limit for inlining, and the function might exceed it).
It can be slightly faster to run a loop backwards (where it doesn't change the outcome, of course), since testing the loop index against zero is a little cheaper than checking against a local variable (the innermost loop is a potential candidate again, but don't expect the output to be 100% identical in all cases, since adding floats a + b + c is potentially not the same as a + c + b).
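For example, the hoisted inner loop from above could be reversed like this (a sketch; as noted, the float sum may differ in the last bits because the addition order changes):
for (int inputNeuron = biasNeuron - 1; inputNeuron >= 0; inputNeuron--) {
    activation += neuronOutputSlice[inputNeuron] * fWeightSlice[inputNeuron][neuron];
}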
For a start, don't do this:
// Copy input values to input layer output
for (int i = 0; i < inputNeurons; i++) {
neuronOutput[0][i] = input[i];
}
But this:
System.arraycopy( input, 0, neuronOutput[0], 0, inputNeurons );
First thing I would look into is seeing if Math.exp is slowing you down. See this post on a Math.exp approximation for a native alternative.
Replace the expensive floating point sigmoid transfer function with an integer step transfer function.
The sigmoid transfer function is a model of organic analog synaptic learning, which in turn seems to be a model of a step function.
The historical precedent for this is that Hinton designed the back-prop algorithm directly from the first principles of cognitive science theories about real synapses, which in turn were based on real analog measurements, which turn out to be sigmoid.
But the sigmoid transfer function seems to be an organic model of the digital step function, which of course cannot be directly implemented organically.
Rather than model a model, replace the expensive floating point implementation of the organic sigmoid transfer function with the direct digital implementation of a step function (less than zero = -1, greater than zero = +1).
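A minimal sketch of such a step function as a drop-in replacement for the question's sigmoid() (my illustration, not the poster's code; note the -1/+1 range differs from the sigmoid's 0..1 range, so training targets would have to be rescaled accordingly):
private static float step(final float input) {
    // direct digital step: less than zero = -1, otherwise +1
    return input < 0.0F ? -1.0F : 1.0F;
}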
The brain cannot do this, but backprop can!
This not only linearly and drastically improves performance of a single learning iteration, it also reduces the number of learning iterations required to train the network: supporting evidence that learning is inherently digital.
Also supports the argument that Computer Science is inherently cool.
Purely based upon code inspection, your innermost loop has to compute references into a three-dimensional array, and it's being done a lot. Depending upon your array dimensions, you could be having cache issues due to having to jump around memory with each loop iteration. Maybe you could rearrange the dimensions so the inner loop tries to access memory elements that are closer to one another than they are now?
In any case, profile your code before making any changes and see where the real bottleneck is.
I suggest using a fixed-point system rather than a floating-point system. On almost all processors using int is faster than float. The simplest way to do this is simply to shift everything left by a certain amount (4 or 5 bits are good starting points) and treat the bottom bits as the fractional part.
Your innermost loop is doing floating point maths so this may give you quite a boost.
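A minimal sketch of the idea with 4 fractional bits (all names are illustrative and overflow handling is omitted):
static final int SHIFT = 4; // 4 bits of fraction, i.e. a resolution of 1/16

static int toFixed(float f) { return Math.round(f * (1 << SHIFT)); }
static float toFloat(int fx) { return fx / (float) (1 << SHIFT); }

// Addition works unchanged: toFixed(a) + toFixed(b) == toFixed(a + b).
// Multiplication accumulates 2*SHIFT fraction bits, so shift back down once.
static int fixedMul(int a, int b) { return (a * b) >> SHIFT; }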
The key to optimization is to first measure where the time is spent. Surround various parts of your algorithm with calls to System.nanoTime():
long start_time = System.nanoTime();
doStuff();
long time_taken = System.nanoTime() - start_time;
I'd guess that while using System.arraycopy() would help a bit, you'll find your real costs in the inner loop.
Depending on what you find, you might consider replacing the float arithmetic with integer arithmetic.
Related
Attempting to write my own hash function in Java. I'm aware that this is the same one that Java implements, but I wanted to test it out myself. I'm getting collisions when I input different values and am not sure why.
public static int hashCodeForString(String s) {
int m = 1;
int myhash = 0;
for (int i = 0; i < s.length(); i++, m++){
myhash += s.charAt(i) * Math.pow(31,(s.length() - m));
}
return myhash;
}
Kindly remember just how a hash-table (in any language ...) actually works: it consists of a (usually, prime) number of "buckets." The purpose of the hash-function is simply to convert any incoming key-value into a bucket-number. (The worst-case scenario is always that 100% of the incoming keys wind-up in a single bucket, leaving you with "a linked list.") You simply strive to devise a hash-function that will "typically" produce a "widely scattered" distribution of values so that, when calculated modulo the (prime ...) number of buckets, "most of the time, most of the buckets" will be "more-or-less equally" filled. (But remember: you can never be sure.)
"Collisions" are entirely to be expected: in fact, "they happen all the time."
In my humble opinion, you're "over-thinking" the hash-function: I see no compelling reason to use Math.pow() at all. Expect that any value which you produce will be converted to a hash-bucket number by taking its absolute value modulo the number of buckets. The best way to see if you came up with a good one (for your data ...) is to observe the resulting distribution of bucket-sizes. (Is it "good enough" for your purposes yet?)
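For comparison, the same polynomial can be computed with Horner's rule using pure int arithmetic, which removes both the Math.pow() call and the double-precision rounding that causes the unexpected collisions (this matches the recurrence String.hashCode() uses; a sketch, not the only possible fix):
public static int hashCodeForString(String s) {
    int myhash = 0;
    for (int i = 0; i < s.length(); i++) {
        // h = h * 31 + c computes sum(c_i * 31^(n-1-i)) with wrap-around
        // int overflow instead of lossy floating point
        myhash = 31 * myhash + s.charAt(i);
    }
    return myhash;
}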
Greetings noble community,
I want to have the following loop:
for(i = 0; i < MAX; i++)
A[i] = B[i] + C[i];
This will run in parallel on a shared-memory quad-core computer using threads. The two alternatives below are being considered for the code to be executed by these threads, where tid is the id of the thread: 0, 1, 2 or 3.
(for simplicity, assume MAX is a multiple of 4)
Option 1:
for(i = tid; i < MAX; i += 4)
A[i] = B[i] + C[i];
Option 2:
for(i = tid*(MAX/4); i < (tid+1)*(MAX/4); i++)
A[i] = B[i] + C[i];
My question is if there's one that is more efficient than the other, and why?
The second one is better than the first one. Simple answer: the second one minimizes false sharing.
Modern CPUs don't load bytes one by one into the cache; they read a whole batch at once, called a cache line. When two threads try to modify different variables on the same cache line, one must reload the cache line after the other modifies it.
When would this happen?
Basically, elements that are nearby in memory will be in the same cache line. So, neighboring elements in an array will be in the same cache line, since an array is just a chunk of memory. And foo1 and foo2 might be in the same cache line as well, since they are defined close together in the same class.
class Foo {
private int foo1;
private int foo2;
}
How bad is false sharing?
I'll refer to Example 6 from the Gallery of Processor Cache Effects:
private static int[] s_counter = new int[1024];
private void UpdateCounter(int position)
{
for (int j = 0; j < 100000000; j++)
{
s_counter[position] = s_counter[position] + 3;
}
}
On my quad-core machine, if I call UpdateCounter with parameters 0,1,2,3 from four different threads, it will take 4.3 seconds until all threads are done.
On the other hand, if I call UpdateCounter with parameters 16,32,48,64 the operation will be done in 0.28 seconds!
How to detect false sharing?
Linux perf can be used to detect cache misses and therefore help you analyze such problems.
Referring to the analysis from CPU Cache Effects and Linux Perf, perf is used to find the L1 cache misses for almost the same code example as above:
Performance counter stats for './cache_line_test 0 1 2 3':
10,055,747 L1-dcache-load-misses # 1.54% of all L1-dcache hits [51.24%]
Performance counter stats for './cache_line_test 16 32 48 64':
36,992 L1-dcache-load-misses # 0.01% of all L1-dcache hits [50.51%]
It shows that the L1 cache misses drop from 10,055,747 to 36,992 without false sharing. And the performance overhead is not here; it's in the series of L2/L3 cache loads and memory loads that follow false sharing.
Is there some good practice in industry?
LMAX Disruptor is a high-performance inter-thread messaging library, and it's the default messaging system for intra-worker communication in Apache Storm.
The underlying data structure is a simple ring buffer. But to make it fast, it uses a lot of tricks to reduce false sharing.
For example, it defines the superclass RingBufferPad to create padding between elements in the RingBuffer:
abstract class RingBufferPad
{
protected long p1, p2, p3, p4, p5, p6, p7;
}
Also, when it allocates memory for the buffer, it creates padding both at the front and at the tail so that the buffer won't be affected by data in adjacent memory space:
this.entries = new Object[sequencer.getBufferSize() + 2 * BUFFER_PAD];
source
You probably want to learn more about all the magic tricks. Take a look at one of the author's post: Dissecting the Disruptor: Why it's so fast
There are two different reasons why you should prefer option 2 over option 1. One of these is cache locality / cache contention, as explained in @qqibrow's answer; I won't explain that here, as there's already a good answer explaining it.
The other reason is vectorisation. Most high-end modern processors have vector units which are capable of running the same instruction simultaneously on multiple different data (in particular, if the processor has multiple cores, it almost certainly has a vector unit, maybe even multiple vector units, on each core). For example, without the vector unit, the processor has an instruction to do an addition:
A = B + C;
and the corresponding instruction in the vector unit will do multiple additions at the same time:
A1 = B1 + C1;
A2 = B2 + C2;
A3 = B3 + C3;
A4 = B4 + C4;
(The exact number of additions will vary by processor model; on ints, common "vector widths" include 4 and 8 simultaneous additions, and some recent processors can do 16.)
Your for loop looks like an obvious candidate for using the vector unit; as long as none of A, B, and C are pointers into the same array but with different offsets (which is possible in C++ but not Java), the compiler would be allowed to optimise option 2 into
for(i = tid*(MAX/4); i < (tid+1)*(MAX/4); i+=4) {
A[i+0] = B[i+0] + C[i+0];
A[i+1] = B[i+1] + C[i+1];
A[i+2] = B[i+2] + C[i+2];
A[i+3] = B[i+3] + C[i+3];
}
However, one limitation of the vector unit is related to memory accesses: vector units are only fast at accessing memory when they're accessing adjacent locations (such as adjacent elements in an array, or adjacent fields of a C struct). The option 2 code above is pretty much the best case for vectorisation of the code: the vector unit can access all the elements it needs from each array as a single block. If you tried to vectorise the option 1 code, the vector unit would take so long trying to find all the values it's working on in memory that the gains from vectorisation would be negated; it would be unlikely to run any faster than the non-vectorised code, because the memory access wouldn't be any faster, and the addition takes no time by comparison (because the processor can do the addition while it's waiting for the values to arrive from memory).
It isn't guaranteed that a compiler will be able to make use of the vector unit, but it would be much more likely to do so with option 2 than option 1. So you might find that option 2's advantage over option 1 is a factor of 4/8/16 more than you'd expect if you only took cache effects into account.
This was a question asked in one of the interviews that I recently attended.
As far as I know a random number between two numbers can be generated as follows
public static int rand(int low, int high) {
return low + (int)(Math.random() * (high - low + 1));
}
But here I am using Math.random() to generate a random number between 0 and 1, and using that to help me generate between low and high. Is there any other way I can do this directly without using external functions?
Typical pseudo-random number generators calculate new numbers based on previous ones, so in theory they are completely deterministic. The only randomness is guaranteed by providing a good seed (initialization of the random number generation algorithm). As long as the random numbers aren't very security critical (this would require "real" random numbers), such a recursive random number generator often satisfies the needs.
The recursive generation can be expressed without any "external" functions, once a seed was provided. There are a couple of algorithms solving this problem. A good example is the Linear Congruential Generator.
A pseudo-code implementation might look like the following:
long a = 25214903917L; // These values for a and c are the actual values found
long c = 11L;          // in the implementation of java.util.Random(), see link
long previous = 0;
void rseed(long seed) {
previous = seed;
}
long rand() {
long r = a * previous + c;
// Note: typically, one chooses only a couple of bits of this value, see link
previous = r;
return r;
}
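Illustrative usage of the sketch above, seeding from the clock as discussed below:
rseed(System.currentTimeMillis());
long r1 = rand();
long r2 = rand();
// As the comment above notes, a real implementation would keep only some
// of the bits of each value, e.g. (int) (r1 >>> 16).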
You still need to seed this generator with some initial value. This can be done by doing one of the following:
Using something like the current time (good in most non-security-critical cases like games)
Using hardware noise (good for security-critical randomness)
Using a constant number (good for debugging, since you get always the same sequence)
If you can't use any function and don't want to use a constant seed, and if you are using a language which allows this, you could also use some uninitialized memory. In C and C++ for example, define a new variable, don't assign something to it and use its value to seed the generator. But note that this is far from being a "good seed" and only a hack to fulfill your requirements. Never use this in real code.
Note that there is no algorithm which can generate different values for different runs with the same inputs without access to some external sources like the system environment. Every well-seeded random number generator makes use of some external sources.
Here I am suggesting some sources, with comments; maybe you will find them helpful:
System time: monotonic within a day, poor randomness. Fast and easy.
Mouse position: random, but not useful on a standalone system.
Raw socket / local network (a packet's info part): good randomness, but technical and time consuming; an attacker could conceivably model the traffic to reduce randomness.
Some input text with permutation: fast, a common way, and good too (in my opinion).
Timing of interrupts from the keyboard, disk drive and other events: a common way; error prone if not used carefully.
Another approach is to feed in an analog noise signal, for example temperature.
/proc file data: on Linux systems. I feel you should use this.
/proc/sys/kernel/random:
This directory contains various parameters controlling the operation of the file /dev/random.
The character special files /dev/random and /dev/urandom (present since Linux
1.3.30) provide an interface to the kernel's random number generator.
Try these commands:
$cat /dev/urandom
and
$cat /dev/random
You can write a file-reading function that reads from these files.
Suggested reading: Is a rand from /dev/urandom secure for a login key?
Does System.currentTimeMillis() count as external? You could always get this and calculate mod by some max value:
int rand = (int) (System.currentTimeMillis() % (high - low + 1)) + low;
You can get near randomness (actually chaotic and definitely not uniform*) from the logistic map x = 4x(1-x) starting with a "non-rational" x between 0 and 1.
The "randomness" appears because of the rounding errors at the edge of the accuracy of the floating point representation.
(*)You can undo the skewing once you know it is there.
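A minimal sketch of the logistic-map idea (the seed is an arbitrary illustrative value in (0, 1)):
double x = 0.7156425251; // arbitrary seed; avoid 0, 0.25, 0.5, 0.75 and 1

double nextChaotic() {
    x = 4.0 * x * (1.0 - x); // chaotic for r = 4, but the distribution is skewed toward 0 and 1
    return x;
}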
You may use the address of a variable, or combine the addresses of several variables to make a more complex one...
You could get the current system time, but that would also require a function in most languages.
You can do it without external functions if you are allowed to use some external state (e.g. a long initialised with the current system time). This is enough for you to implement a simple pseudo-random number generator.
In each call to your random function, you would use the state to create a new random value, and update the state, so that subsequent calls get different results.
You can do this with just regular Java arithmetic and/or bitwise operations, so no external functions are required.
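A sketch of that idea using a xorshift step (the shift constants 13/7/17 are a standard xorshift64 triple; the seed must be non-zero):
long state = System.currentTimeMillis(); // external state, seeded once

long nextRandom() {
    state ^= state << 13;
    state ^= state >>> 7;
    state ^= state << 17;
    return state;
}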
import java.util.HashMap;
import java.util.Map;

public class RandomNumberGenerator {
    int generateRandomNumber(int min, int max) {
        return (int) ((System.currentTimeMillis() % max) + min);
    }

    public static void main(String[] args) {
        RandomNumberGenerator rn = new RandomNumberGenerator();
        int cv = 0;
        int min = 1, max = 4;
        Map<Integer, Integer> hmap = new HashMap<Integer, Integer>();
        int count = min;
        while (count <= max) {
            cv = rn.generateRandomNumber(min, max);
            if ((hmap.get(cv) == null) && cv >= min && cv <= max) {
                System.out.print(cv + ",");
                hmap.put(cv, 1);
                count++;
            }
        }
    }
}
Poisson Random Generator
Let's say we start with an expected value 'v' for the random numbers. To say that a sequence of non-negative integers satisfies a Poisson distribution with expected value v means that the mean (average) over long subsequences approaches 'v'.
The Poisson distribution is part of statistics, and the details can be found on Wikipedia.
But here the main advantages of using this function are:
1. Only integer values are generated.
2. The mean of those integers will be equal to the value we initially provided.
It is helpful in applications where fractional values don't make sense, e.g. the number of planes arriving at an airport in 1 minute being 2.5 doesn't make sense, but it does imply that 5 planes arrive in 2 minutes.
int poissonRandom(double expectedValue) {
    int n = 0; // counter of iterations
    double limit = Math.exp(-expectedValue);
    // rand() / INT_MAX stands in for a uniform value in (0, 1); the cast
    // is needed, since plain integer division would truncate to zero
    double x = (double) rand() / INT_MAX;
    while (x > limit) {
        n++;
        x *= (double) rand() / INT_MAX;
    }
    return n;
}
The line
rand() / INT_MAX
should generate a random number between 0 and 1. So we can use the time of the system: seconds / 60 will serve the purpose.
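For instance (purely illustrative), the seconds-within-a-minute idea could stand in for rand() / INT_MAX above:
double clockUniform() {
    // seconds within the current minute, scaled to [0, 1); very poor
    // randomness, just enough to demonstrate the Poisson generator above
    return (System.currentTimeMillis() / 1000 % 60) / 60.0;
}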
Which function we should use is totally application dependent.
Overview
So I'm trying to get a grasp on the mechanics of neural networks. I still don't totally grasp the math behind it, but I think I understand how to implement it. I currently have a neural net that can learn AND, OR, and NOR training patterns. However, I can't seem to get it to implement the XOR pattern. My feed-forward neural network consists of 2 inputs, 3 hidden, and 1 output. The weights and biases are randomly set between -0.5 and 0.5, and outputs are generated with the sigmoidal activation function.
Algorithm
So far, I'm guessing I made a mistake in my training algorithm which is described below:
1. For each neuron in the output layer, provide an error value that is the desiredOutput - actualOutput. Go to step 3.
2. For each neuron in a hidden or input layer (working backwards), provide an error value that is the sum over all forward connections of the connection weight * the errorGradient of the neuron at the other end of the connection. Go to step 3.
3. For each neuron, using the error value provided, generate an error gradient that equals output * (1 - output) * error. Go to step 4.
4. For each neuron, adjust the bias to equal current bias + LEARNING_RATE * errorGradient. Then adjust each backward connection's weight to equal current weight + LEARNING_RATE * output of neuron at other end of connection * this neuron's errorGradient.
I'm training my neural net online, so this runs after each training sample.
Code
This is the main code that runs the neural network:
private void simulate(double maximumError) {
int errorRepeatCount = 0;
double prevError = 0;
double error; // summed squares of errors
int trialCount = 0;
do {
error = 0;
// loop through each training set
for(int index = 0; index < Parameters.INPUT_TRAINING_SET.length; index++) {
double[] currentInput = Parameters.INPUT_TRAINING_SET[index];
double[] expectedOutput = Parameters.OUTPUT_TRAINING_SET[index];
double[] output = getOutput(currentInput);
train(expectedOutput);
// Subtracts the expected and actual outputs, gets the average of those outputs, and then squares it.
error += Math.pow(getAverage(subtractArray(output, expectedOutput)), 2);
}
    } while(error > maximumError);
}
Now the train() function:
public void train(double[] expected) {
layers.outputLayer().calculateErrors(expected);
for(int i = Parameters.NUM_HIDDEN_LAYERS; i >= 0; i--) {
layers.allLayers[i].calculateErrors();
}
}
Output layer calculateErrors() function:
public void calculateErrors(double[] expectedOutput) {
for(int i = 0; i < numNeurons; i++) {
Neuron neuron = neurons[i];
double error = expectedOutput[i] - neuron.getOutput();
neuron.train(error);
}
}
Normal (Hidden & Input) layer calculateErrors() function:
public void calculateErrors() {
for(int i = 0; i < neurons.length; i++) {
Neuron neuron = neurons[i];
double error = 0;
for(Connection connection : neuron.forwardConnections) {
error += connection.output.errorGradient * connection.weight;
}
neuron.train(error);
}
}
Full Neuron class:
package neuralNet.layers.neurons;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import neuralNet.Parameters;
import neuralNet.layers.NeuronLayer;
public class Neuron {
private double output, bias;
public List<Connection> forwardConnections = new ArrayList<Connection>(); // Forward = layer closer to input -> layer closer to output
public List<Connection> backwardConnections = new ArrayList<Connection>(); // Backward = layer closer to output -> layer closer to input
public double errorGradient;
public Neuron() {
Random random = new Random();
bias = random.nextDouble() - 0.5;
}
public void addConnections(NeuronLayer prevLayer) {
// This is true for input layers. They create their connections differently. (See InputLayer class)
if(prevLayer == null) return;
for(Neuron neuron : prevLayer.neurons) {
Connection.createConnection(neuron, this);
}
}
public void calcOutput() {
output = bias;
for(Connection connection : backwardConnections) {
connection.input.calcOutput();
output += connection.input.getOutput() * connection.weight;
}
output = sigmoid(output);
}
private double sigmoid(double output) {
return 1 / (1 + Math.exp(-1*output));
}
public double getOutput() {
return output;
}
public void train(double error) {
this.errorGradient = output * (1-output) * error;
bias += Parameters.LEARNING_RATE * errorGradient;
for(Connection connection : backwardConnections) {
// for clarification: connection.input refers to a neuron that outputs to this neuron
connection.weight += Parameters.LEARNING_RATE * connection.input.getOutput() * errorGradient;
}
}
}
Results
When I'm training for AND, OR, or NOR, the network can usually converge within about 1000 epochs; however, when I train with XOR, the outputs become fixed and it never converges. So, what am I doing wrong? Any ideas?
Edit
Following the advice of others, I started over and implemented my neural network without classes...and it works. I'm still not sure where my problem lies in the above code, but it's in there somewhere.
This is surprising because you are using a network that is (barely) big enough to learn XOR. Your algorithm looks right, so I don't really know what is going on. It might help to know how you generate your training data: are you just repeating the samples (1,0,1), (1,1,0), (0,1,1), (0,0,0) or something like that over and over? Perhaps the problem is that stochastic gradient descent is causing you to jump around without stabilizing at the minimum. You could try some things to fix this: perhaps randomly sample from your training examples instead of repeating them (if that is what you are doing). Or, alternatively, you could modify your learning algorithm:
currently you have something equivalent to:
weight(epoch) = weight(epoch - 1) + deltaWeight(epoch)
deltaWeight(epoch) = mu * errorGradient(epoch)
where mu is the learning rate
One option is to very slowly decrease the value of mu.
An alternative would be to change your definition of deltaWeight to include a "momentum"
deltaWeight(epoch) = mu * errorGradient(epoch) + alpha * deltaWeight(epoch -1)
where alpha is the momentum parameter (between 0 and 1).
Visually, you can think of gradient descent as trying to find the minimum point of a curved surface by placing an object on that surface, and then step by step moving that object small amounts in whichever direction slopes down, based on where it is currently located. The problem is that you don't really do gradient descent: instead you do stochastic gradient descent, where you pick a direction by sampling from a set of training vectors and move in whatever direction the sample makes look like down. On average over the entire training data, stochastic gradient descent should work, but it isn't guaranteed to, because you can get into a situation where you jump back and forth, never making progress. Slowly decreasing the learning rate means you take smaller and smaller steps each time, so you cannot get stuck in an infinite cycle.
On the other hand, momentum makes the algorithm into something akin to rolling a rubber ball. As the ball rolls, it tends to go in the downwards direction, but it also tends to keep going in the direction it was going before, and if it is ever on a stretch where the downslope is in the same direction for a while, it will speed up. The ball will therefore jump over some local minima, and it will be more resilient against stepping back and forth over the target, because doing so means working against the force of momentum.
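Applied to the Neuron.train() method from the question, the momentum variant might look like this (a sketch; it assumes a hypothetical prevDelta field added to Connection and a MOMENTUM constant in Parameters, neither of which exists in the original code):
public void train(double error) {
    this.errorGradient = output * (1 - output) * error;
    bias += Parameters.LEARNING_RATE * errorGradient;
    for (Connection connection : backwardConnections) {
        double delta = Parameters.LEARNING_RATE * connection.input.getOutput() * errorGradient
                + Parameters.MOMENTUM * connection.prevDelta; // alpha * deltaWeight(epoch - 1)
        connection.weight += delta;
        connection.prevDelta = delta; // remember for the next update
    }
}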
Having seen some code and thought about this some more, it is pretty clear that your problem is in training the early layers. The functions you have successfully learned are all linearly separable, so it would make sense that only a single layer is being properly learned. I agree with LiKao about implementation strategies in general, although your approach should work. My suggestion for how to debug this is to figure out what the progression of the weights on the connections between the input layer and the output layer looks like.
You should post the rest of the implementation of Neuron.
I faced the same problem a short time ago. Finally I found the solution: how to write code solving XOR with the MLP algorithm.
The XOR problem seems to be an easy-to-learn problem, but it isn't for the MLP, because it's not linearly separable. So even if your MLP is OK (I mean there is no bug in your code), you have to find the right parameters to be able to learn the XOR problem.
Two hidden and one output neuron is fine. The two main things you have to set are:
although you have only 4 training samples, you have to run the training for a couple of thousand epochs.
if you use sigmoid hidden layers but a linear output, the network will converge faster
Here is the detailed description and sample code: http://freeconnection.blogspot.hu/2012/09/solving-xor-with-mlp.html
Small hint - if the output of your NN seems to drift toward 0.5, then everything's OK!
The algorithm using just the learning rate and bias is too simple to quickly learn XOR. You can either increase the number of epochs or change the algorithm.
My recommendation is to use momentum:
1000 epochs
learningRate = 0.3
momentum = 0.8
weights drawn from [0,1]
bias drawn from [-0.5, 0.5]
And some crucial pseudo code (assuming backward and forward propagation works):
for every edge:
    previous_edge_weight_change = -1 * learningRate * edge_source_neuron_value * edge_target_neuron_delta + previous_edge_weight_change * momentum
    edge_weight += previous_edge_weight_change
for every neuron:
    previous_neuron_bias_change = -1 * learningRate * neuron_delta + previous_neuron_bias_change * momentum
    bias += previous_neuron_bias_change
I would suggest you generate a grid (say from [-5,-5] to [5,5] with a step of 0.5), train your MLP on XOR, and apply it to the grid. Plotted in color, you would see some kind of frontier.
If you do that at each iteration, you'll see the evolution of the frontier and can control the learning.
It's been a while since I last implemented a neural network myself, but I think your mistake is in the lines:
bias += Parameters.LEARNING_RATE * errorGradient;
and
connection.weight += Parameters.LEARNING_RATE * connection.input.getOutput() * errorGradient;
The first of these lines should not be there at all. Bias is best modeled as the input of a neuron which is fixed at 1. This will serve to make your code a lot simpler and cleaner, because you will not have to treat the bias in any special way.
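For illustration only (a sketch against the question's classes, not code from this answer): the bias can be modeled as an extra connection from a neuron whose output is pinned to 1.0, so its weight trains exactly like every other weight:
// Hypothetical bias-as-input wiring, reusing the question's Neuron/Connection classes
Neuron biasNeuron = new Neuron() {
    @Override
    public double getOutput() { return 1.0; } // constant output, never recomputed
};
Connection.createConnection(biasNeuron, this); // bias weight now updates like any other weight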
The other point is that I think the sign is wrong in both of these expressions. Think about it like this:
Your gradient points in the direction of steepest ascent, so if you go in that direction, your error will get larger.
What you are doing here is adding something to the weights in case the error is already positive, i.e. you are making it more positive. If it is negative, you are subtracting something, i.e. you make it more negative.
Unless I am missing something about your definition of error or the gradient calculation you should change these lines to:
bias -= Parameters.LEARNING_RATE * errorGradient;
and
connection.weight -= Parameters.LEARNING_RATE * connection.input.getOutput() * errorGradient;
I had a similar mistake in one of my early implementations, and it led to exactly the same behaviour, i.e. it resulted in a network that learned in simple cases but not anymore once the training data became more complex.
LiKao's suggestion to simplify my implementation and get rid of the object-oriented aspects solved my problem. The flaw in the algorithm as described above remains unknown; however, I now have a working neural network that is a lot smaller.
Feel free to continue to provide insight on the problem with my previous implementation, as others may have the same problem in the future.
I'm a bit rusty on neural networks, but I think there was a problem implementing XOR with one perceptron: basically a neuron is able to separate two groups of solutions with a straight line, but one straight line is not sufficient for the XOR problem...
Here should be the answer!
I couldn't see anything wrong with the code, but I was having a similar problem with my network not converging for XOR, so figured I'd post my working configuration.
3 input neurons (one of them being a fixed bias of 1.0)
3 hidden neurons
1 output neuron
Weights randomly chosen between -0.5 and 0.5.
Sigmoid activation function.
Learning rate = 0.2
Momentum = 0.4
Epochs = 50,000
Converged 10/10 times.
One of the mistakes I was making was not connecting the bias input to the output neuron. This meant that, for the same configuration, the network only converged 2 out of 10 times, with the other eight times failing because 1 and 1 would output 0.5.
Another mistake was not doing enough epochs. If I only did 1000, then the outputs tended to be around 0.5 for every test case. With epochs >= 8000 (so 2000 per test case), it started to look like it might be working (but only if using momentum).
When doing 50000 epochs it did not matter whether momentum was used or not.
Another thing I tried was to not apply the sigmoid function to the output neuron's output (which I think is what an earlier post had suggested), but this wrecked the network, because the output*(1-output) part of the error equation could now be negative, meaning weights were updated in a way that caused the error to increase.
I am working on a program to convert Non-deterministic finite state automata (NFAs) to Deterministic finite state automata (DFAs). To do this, I have to compute the epsilon closure of every state in the NFA that has an epsilon transition. I have already figured out a way to do this, but I always assume that the first thing I think of is usually the least efficient way to do something.
Here is an example of how I would compute a simple epsilon closure:
Input strings for transition function: format is startState, symbol = endState
EPS is an epsilon transition
1, EPS = 2
Results in the new state { 12 }
Now obviously this is a very simple example. I would need to be able to compute any number of epsilon transitions from any number of states. To this end, my solution is a recursive function that computes the epsilon closure on the given state by looking at the state it has an epsilon transition into. If that state has (an) epsilon transition(s) then the function is called recursively within a for loop for as many epsilon transitions as it has. This will get the job done but probably isn't the fastest way. So my question is this: what is the fastest way to compute an epsilon closure in Java?
Depth-first search (or breadth-first search - it doesn't really matter) over the graph whose edges are your epsilon transitions. So in other words, your solution is optimal, provided you efficiently track which states you've already added to the closure.
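A minimal sketch of that search (State and epsilonTransitions() are hypothetical placeholders for however your automaton is represented):
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

static Set<State> epsilonClosure(State start) {
    Set<State> closure = new HashSet<>(); // doubles as the "already added" check
    Deque<State> stack = new ArrayDeque<>();
    closure.add(start);
    stack.push(start);
    while (!stack.isEmpty()) {
        State s = stack.pop();
        for (State t : s.epsilonTransitions()) {
            if (closure.add(t)) { // add() returns false if t is already in the closure
                stack.push(t);
            }
        }
    }
    return closure;
}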
JFLAP does this. You can see their source - specifically ClosureTaker.java. It's a depth-first search (which is what Peter Taylor suggested), and since JFLAP uses it I assume that's the near-optimal solution.
Did you look into an algorithms book? I doubt you'll find a significantly better approach, but the actual performance of this algorithm may very well depend on the concrete data structure you use to implement your graph. And you can share work, depending on the order in which you simplify your graph: think about subgraphs which are epsilon-connected and are referenced from two different nodes.
I am not sure whether this can be done in an optimal way, or whether you have to resort to heuristics.
Scan the literature on algorithms.
Just so that people looking only for the specific snippet of code referenced by @Xodarap's answer don't find themselves needing to download both the source code and an application to view the code of the jar file, I took the liberty of attaching said snippet.
public static State[] getClosure(State state, Automaton automaton) {
List<State> list = new ArrayList<>();
list.add(state);
for (int i = 0; i < list.size(); i++) {
state = (State) list.get(i);
Transition transitions[] = automaton.getTransitionsFromState(state);
for (int k = 0; k < transitions.length; k++) {
Transition transition = transitions[k];
LambdaTransitionChecker checker = LambdaCheckerFactory
.getLambdaChecker(automaton);
/** if lambda transition */
if (checker.isLambdaTransition(transition)) {
State toState = transition.getToState();
if (!list.contains(toState)) {
list.add(toState);
}
}
}
}
return (State[]) list.toArray(new State[0]);
}
It goes without saying that all credit goes to @Xodarap and the JFLAP project.