(Java) Partial Derivatives for Back Propagation of Hidden Layer

Yesterday I posted a question about the first piece of the backpropagation algorithm.
Today I'm working to understand the hidden layer.
Sorry for all the questions; I've read several websites and papers on the subject, but no matter how much I read, I still have a hard time applying it to actual code.
This is the code that I'm analyzing (I'm working in Java, so it's nice to look at a Java example):
// update weights for the hidden layer
for (Neuron n : hiddenLayer) {
    ArrayList<Connection> connections = n.getAllInConnections();
    for (Connection con : connections) {
        double output = n.getOutput();
        double ai = con.leftNeuron.getOutput();
        double sumKoutputs = 0;
        int j = 0;
        for (Neuron out_neu : outputLayer) {
            double wjk = out_neu.getConnection(n.id).getWeight();
            double desiredOutput = (double) expectedOutput[j];
            double ak = out_neu.getOutput();
            j++;
            sumKoutputs = sumKoutputs
                    + (-(desiredOutput - ak) * ak * (1 - ak) * wjk);
        }
        double partialDerivative = output * (1 - output) * ai * sumKoutputs;
        double deltaWeight = -learningRate * partialDerivative;
        double newWeight = con.getWeight() + deltaWeight;
        con.setDeltaWeight(deltaWeight);
        con.setWeight(newWeight + momentum * con.getPrevDeltaWeight());
    }
}
One real problem, here, is that I don't know how all of the methods work exactly.
This code is going through all neurons in the hidden layer, and going through each connection to each neuron in the hidden layer one by one. It grabs each of the connection's output? So, this is the summation of incoming connections (run through a Sig function probably) and then * by a connection weight? Then "double ai" is getting the input connection values to this particular node? Is it getting just one or the sum of the input to the neuron?
Then a third for loop pretty much sums up a "out_neu.getConnection(n.id).getWeight()" which I don't quite understand. Then, the desired output is the desiredOutput for the final layer node? Then ak is the actual output (summation and activation function) of each node or is it the summation+activation*weight?
EDIT
I started working on my own code, can anyone take a look at it?
public class BackProp {
public int layers = 3;
public int hiddenNeuronsNum = 5;
public int outputNeuronsNum = 1;
public static final double eta = .1;
public double[][][] weights; //holds the network -- weights[layer][neuron][forwardConnetion]
public void Back(){
for(int neuron = 0; neuron < outputNeuronsNum; neuron++){
for(int connection = 0; connection < hiddenNeuronsNum; connection++){
double expOutput = expectedOutput[neuron]; //the expected output from the neuron we're on
double actOutput = actualOutput[neuron];
double previousLayerOutput = holdNeuronValues[layers-1][neuron];
double delta = eta *(actOutput * (1-actOutput) *(expOutput - actOutput)* previousLayerOutput);
weights[layers-1][neuron][connection] += delta; //OKAY M&M said YOU HAD THIS MESSED UP, 3rd index means end neuron, 2nd means start.. moving from left to right
}
}
//Hidden Layer..
for(int neuron = 0; neuron < outputNeuronsNum; neuron++){
for(int connection = 0; connection < hiddenNeuronsNum; connection++){
double input = holdNeuronValues[layers-3][connection]; //what this neuron sends on, -2 for the next layer
double output = holdNeuronValues[layers-2][connection];
double sumKoutputs = 0;
//for the output layer
for (int outputNeurons = 0; outputNeurons < weights[layers].length; outputNeurons++) {
double wjk = weights[layers-2][neuron][outputNeurons]; //get the weight
double expOutput = expectedOutput[outputNeurons];
double out = actualOutput[outputNeurons];
sumKoutputs += (-(expOutput - out) * wjk);
}
double partialDerivative = -eta * output * (1 - output) * input * sumKoutputs;
}
}
}
}

This is the standard backpropagation algorithm: it propagates the error backward through the hidden layers.
Unless we are in the output layer, the error for a neuron in a hidden layer is dependent on the succeeding layer. Let's assume that we have a particular neuron a with synapses that connect it to neurons i, j, and k in the next layer. Let us also assume that the output of neuron a is oa. Then the error for neuron a is equal to the following expression (assuming we are using the logistic function as the activation function):
δa = oa(1 - oa) × (δiwai + δjwaj + δkwak)
Here, oa(1 - oa) is the value of the derivative of the activation function. δi is the error of neuron i and wai is the weight assigned to the synapse (connection) from a to i; the same applies to the remaining terms.
Notice how we are taking into account the error for each neuron in the next layer that a is connected to. Also notice that we are taking into account the weight accorded to each synapse. Without going into the math, it makes sense intuitively that the error for a depends not only on the errors of the neurons that a connects to, but also on the weights of the synapses (connections) between a and the neurons in the next layer.
Once we have the errors, we need to update the weights of the synapses (connections) of every neuron in the previous layer that connects to a (i.e., we backpropagate the error). Let us assume that we have a single neuron z that connects to a. Then we have to adjust wza as follows:
wza = wza + (α × δa × oz)
If there are other neurons (and there probably are) in the previous layer that connect to a, we will update their weights using the same formula as well. Now if you look at your code, you will see that this is exactly what is happening.
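As a rough illustration of the two formulas above, here is a minimal Java sketch. The class and method names (HiddenDeltaSketch, hiddenDelta, updatedWeight) are made up for this example and do not come from the code in the question. It assumes a logistic activation and error terms whose sign is chosen so that the update adds α × δa × oz.

```java
class HiddenDeltaSketch {

    // Error term of a hidden neuron a with logistic activation:
    // delta_a = o_a * (1 - o_a) * sum over next-layer neurons k of (delta_k * w_ak)
    public static double hiddenDelta(double output, double[] nextDeltas, double[] outgoingWeights) {
        double weightedSum = 0.0;
        for (int k = 0; k < nextDeltas.length; k++) {
            weightedSum += nextDeltas[k] * outgoingWeights[k];
        }
        return output * (1.0 - output) * weightedSum;
    }

    // Weight update for the connection z -> a:
    // w_za = w_za + alpha * delta_a * o_z
    public static double updatedWeight(double weight, double learningRate,
                                       double delta, double previousOutput) {
        return weight + learningRate * delta * previousOutput;
    }

    public static void main(String[] args) {
        // Hypothetical numbers, chosen only to exercise the formulas.
        double deltaA = hiddenDelta(0.5, new double[] {0.1, -0.2}, new double[] {0.4, 0.3});
        System.out.println("delta_a = " + deltaA);
        System.out.println("new w_za = " + updatedWeight(1.0, 0.1, deltaA, 2.0));
    }
}
```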
You are doing the following for each neuron in the hidden layer:
You are getting a list of synapses (connections) that connect this neuron to the previous layer. This is the connections = n.getAllInConnections() part.
For each connection, the code then does the following:
It gets the output of the neuron (this is the oa term in the formulas above).
It gets the output of the neuron that connects to this neuron (this is the oz term).
Then, for each neuron in the output layer, it accumulates the error of that output neuron times the weight from our hidden neuron to that output neuron. Here, sumKoutputs is the same as the expression (δiwai + δjwaj + δkwak). The value of δi comes from -(desiredOutput - ak) * ak * (1 - ak), since this is how you calculate the error of the output layer: you multiply the derivative of the activation function for the output-layer neuron by the difference between the actual and expected output. Note the leading minus sign: the code's term is the negative of δi as used in the formulas above, and this is compensated later by the minus sign in -learningRate * partialDerivative. Finally, you can see that the whole thing is multiplied by wjk; this is the same as the wai term in our formula.
We now have all the values we need to plug into our formula to adjust the weights for every synapse that connects to our neuron from the preceding layer. The problem with the code is that it calculates some things a little differently:
In our formula we have oa(1 - oa) × (δiwai + δjwaj + δkwak) for the error for neuron a. But the code calculates partialDerivative by including ai. In our terms, this would be equivalent to oa(1 - oa) × oz × (δiwai + δjwaj + δkwak). Mathematically it works out, because later we multiply by oz and the learning rate anyway (α × δa × oz), so the result is exactly the same; the difference is just that the code performs the multiplication by oz earlier.
It then calculates deltaWeight, which is (α × δa × oz) in our formula. In the code, α is learningRate.
We then update the weight by adding the delta to the current weight. This is the same as wza + (α × δa × oz).
Now things are a little different. The code doesn't simply set the weight; it also applies momentum: a fraction of the previous delta is added to the new weight. This is a technique used in neural networks to reduce the chance that the network gets stuck in a local minimum. The momentum term gives us a little "push" to get out of a local minimum (a "well" in the error surface; with a neural network we are traversing the error surface to find the point with the lowest error, but we could get stuck in a "well" that isn't as "deep" as the optimal solution), and helps us "converge" on a solution. But you have to be careful, because if you set the momentum too high, you can overshoot your optimal solution. Once the code has calculated the new weight using the momentum, it sets it on the connection (synapse).
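To make the momentum step concrete, here is a small sketch of one common formulation, in which the stored change includes the momentum contribution. The class MomentumSketch and its fields are hypothetical. The code in the question appears to store only the gradient part via setDeltaWeight, so treat this as an illustration of the idea and not as a description of that code.

```java
class MomentumSketch {
    private final double learningRate;
    private final double momentum;
    private double previousChange = 0.0; // change applied on the previous step

    public MomentumSketch(double learningRate, double momentum) {
        this.learningRate = learningRate;
        this.momentum = momentum;
    }

    // One update of a single weight: the usual alpha * delta * o_z term
    // plus a fraction of the previous change.
    public double step(double weight, double delta, double previousOutput) {
        double change = learningRate * delta * previousOutput + momentum * previousChange;
        previousChange = change;
        return weight + change;
    }

    public static void main(String[] args) {
        MomentumSketch m = new MomentumSketch(0.1, 0.9);
        double w = 1.0;
        w = m.step(w, 1.0, 1.0); // first step: no previous change yet
        w = m.step(w, 1.0, 1.0); // second step: previous change adds a push
        System.out.println(w);
    }
}
```

With a momentum of 0, step reduces to the plain update wza + (α × δa × oz).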
I hope this explanation made it clearer for you. The math is a little hard to get into, but once you figure it out, it makes sense. I think the main problem here is that the code is written in a slightly different manner. You can take a look at some code here that I wrote that implements the backpropagation algorithm; I did this as a class project. It runs pretty much along the same lines as the formulas I described above and so you should be able to follow through it easily. You can also take a look at this video I made where I explain the backpropagation algorithm.

Related

When do you implement the Sigmoid function in a neural network?

I am getting into neural networks because it seemed fun. I translated the Python code to Java and I think it works like it should. It gives me the correct values every time. However, I want to know where you implement the sigmoid function in the code. I implemented it after I calculated the output, but even without the sigmoid function it works the same way.
Website I learned from: https://towardsdatascience.com/first-neural-network-for-beginners-explained-with-code-4cfd37e06eaf
This is my Perceptron function:
public void Perceptron(int input1,int input2,int output) {
double outputP = input1*weights[0]+input2*weights[1]+bias*weights[2];
outputP = Math.floor((1/(1+Math.exp(-outputP))));
if(outputP > 0 ) {
outputP = 1;
}else {
outputP = 0;
}
double error = output - outputP;
weights[0] += error * input1 * learningRate;
weights[1] += error * input2 * learningRate;
weights[2] += error * bias * learningRate;
System.out.println("Output:" + outputP);
}
Also if I don't add the Math.floor() it just gives me a lot of decimals.

Not an expert, but it is used instead of your conditional where you output 1 or 0. That's your threshold function. In that case, you are using a step function; you could replace the whole conditional with your sigmoid function.
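A minimal sketch of the two activation choices mentioned above, with made-up names (ActivationSketch, sigmoid, step). Because the sigmoid is greater than 0.5 exactly when its input is greater than 0, thresholding the sigmoid output at 0.5 classifies the same way as a step function on the raw weighted sum, which may be why the behavior looked identical with and without it.

```java
class ActivationSketch {

    // Logistic sigmoid: maps any real input into the open interval (0, 1).
    public static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Step (threshold) function: what the if/else in the question amounts to.
    public static int step(double x) {
        return x > 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        double[] weightedSums = {-2.0, -0.1, 0.3, 4.0}; // hypothetical weighted sums
        for (double s : weightedSums) {
            System.out.println(s + " -> step " + step(s) + ", sigmoid " + sigmoid(s));
        }
    }
}
```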

Neural networks and large data sets

I have a basic framework for a neural network to recognize numeric digits, but I'm having some problems with training it. My backpropagation works for small data sets, but when I have more than 50 data points, the return value starts converging to 0. And when I have data sets in the thousands, I get NaNs for costs and returns.
Basic structure: 3 layers: 784 : 15 : 1
784 is the number of pixels per data set, 15 neurons in my hidden layer, and one output neuron which returns a value from 0 to 1 (when you multiply by 10 you get a digit).
public class NetworkManager {
int inputSize;
int hiddenSize;
int outputSize;
public Matrix W1;
public Matrix W2;
public NetworkManager(int input, int hidden, int output) {
inputSize = input;
hiddenSize = hidden;
outputSize = output;
W1 = new Matrix(inputSize, hiddenSize);
W2 = new Matrix(hiddenSize, output);
}
Matrix z2, z3;
Matrix a2;
public Matrix forward(Matrix X) {
z2 = X.dot(W1);
a2 = sigmoid(z2);
z3 = a2.dot(W2);
Matrix yHat = sigmoid(z3);
return yHat;
}
public double costFunction(Matrix X, Matrix y) {
Matrix yHat = forward(X);
Matrix cost = yHat.sub(y);
cost = cost.mult(cost);
double returnValue = 0;
int i = 0;
while (i < cost.m.length) {
returnValue += cost.m[i][0];
i++;
}
return returnValue;
}
Matrix yHat;
public Matrix[] costFunctionPrime(Matrix X, Matrix y) {
yHat = forward(X);
Matrix delta3 = (yHat.sub(y)).mult(sigmoidPrime(z3));
Matrix dJdW2 = a2.t().dot(delta3);
Matrix delta2 = (delta3.dot(W2.t())).mult(sigmoidPrime(z2));
Matrix dJdW1 = X.t().dot(delta2);
return new Matrix[]{dJdW1, dJdW2};
}
}
There's the code for network framework. I pass double arrays of length 784 into the forward method.
int t = 0;
while (t < 10000) {
dJdW = Nn.costFunctionPrime(X, y);
Nn.W1 = Nn.W1.sub(dJdW[0].scalar(3));
Nn.W2 = Nn.W2.sub(dJdW[1].scalar(3));
t++;
}
I call this to adjust the weights. With small sets, the cost converges to 0 pretty well, but larger sets don't (the cost associated with 100 characters converges to 13, always). And if the set is too large, the first adjustment works (and costs go down) but after the second all I can get is NaN.
Why does this implementation fail with larger data sets (specifically training) and how can I fix this? I tried a similar structure with 10 outputs instead of 1 where each would return a value near 0 or 1 acting like boolean values, but the same thing was happening.
I'm also doing this in Java, by the way, and I'm wondering if that has something to do with the problem. I was wondering if it was a problem with running out of space, but I haven't been getting any heap space messages. Is there a problem with how I'm backpropagating, or is something else happening?
EDIT: I think I know what's happening. I think my backpropagation function is getting caught in local minima. Sometimes the training succeeds and sometimes it fails for large data sets. Because I'm starting with random weights, I get random initial costs. What I've noticed is that when the cost initially exceeds a certain amount (it depends on the number of data sets involved), the costs converge to a clean number (sometimes 27, others 17.4) and the outputs converge to 0 (which makes sense).
I was warned about relative minima in the cost function when I began, and I'm beginning to realize why. So now the question becomes: how do I go about my gradient descent so that I'll actually find the global minimum? I'm working in Java, by the way.

This seems like a problem with weight initialization.
As far as I can see, you never initialize the weights to any specific value. Therefore the network diverges. You should at least use random initialization.
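A minimal sketch of random initialization on plain arrays. The Matrix class from the question is not shown, so whether its constructor already randomizes the weights is unknown; the helper randomWeights and its parameters are assumptions made for this example.

```java
import java.util.Random;

class WeightInitSketch {

    // Fills a rows x cols weight matrix with values drawn uniformly
    // from [-scale, scale). A fixed seed makes runs reproducible.
    public static double[][] randomWeights(int rows, int cols, double scale, long seed) {
        Random rng = new Random(seed);
        double[][] weights = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                weights[i][j] = (rng.nextDouble() * 2.0 - 1.0) * scale;
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // Layer sizes taken from the question: 784 inputs, 15 hidden units.
        double[][] w1 = randomWeights(784, 15, 0.1, 42L);
        System.out.println(w1.length + " x " + w1[0].length);
    }
}
```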

If your backprop works on a small data set, it is a reasonable assumption that the implementation itself is not the problem. If you are suspicious of it, you can test your backprop on the XOR problem.
Do your units have a bias?
I once had a discussion with someone who was doing exactly the same thing: handwritten digit recognition with 15 units in the hidden layer. I have seen a network that does this task well. Its topology was:
Input: 784
First hidden: 500
Second hidden: 500
Third hidden: 2000
Output: 10
You have a set of images, and you nonlinearly transform the 784 pixels of each image into 15 numbers in the interval <0, 1>; you do this for every image in your set. You hope that you can correctly separate the digits based on these 15 numbers. From my point of view, 15 hidden units is too few for such a task, assuming you have a data set with thousands of examples. Please try, for example, 500 hidden units.
The learning rate also influences backprop and can cause problems with convergence.
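One way the learning rate can interact with the data set size: in the question's costFunctionPrime, the products a2.t().dot(delta3) and X.t().dot(delta2) sum the gradient over all examples, so the step grows with the number of examples. The sketch below divides the step by the number of examples. This is an assumption about a possible remedy, not a confirmed diagnosis, and the class and method names are hypothetical.

```java
class GradientStepSketch {

    // Returns weights - (learningRate / numExamples) * summedGradient,
    // i.e. a step along the average gradient instead of the summed one.
    public static double[][] step(double[][] weights, double[][] summedGradient,
                                  double learningRate, int numExamples) {
        double scale = learningRate / numExamples;
        double[][] updated = new double[weights.length][];
        for (int i = 0; i < weights.length; i++) {
            updated[i] = new double[weights[i].length];
            for (int j = 0; j < weights[i].length; j++) {
                updated[i][j] = weights[i][j] - scale * summedGradient[i][j];
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        double[][] w = {{1.0}};
        double[][] grad = {{100.0}}; // hypothetical gradient summed over 100 examples
        System.out.println(step(w, grad, 0.5, 100)[0][0]);
    }
}
```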

Minimization of number of iterations by adjusting input parameters in Java

I was inspired by this question: XOR Neural Network in Java
Briefly, an XOR neural network is trained, and the number of iterations required to complete the training depends on seven parameters (alpha, gamma3_min_cutoff, gamma3_max_cutoff, gamma4_min_cutoff, gamma4_max_cutoff, gamma4_min_cutoff, gamma4_max_cutoff). I would like to minimize the number of iterations required for training by tweaking these parameters.
So, I want to rewrite the program from
private static double alpha=0.1, g3min=0.2, g3max=0.8;
int iteration= 0;
loop {
do_something;
iteration++;
if (error < threshold){break}
}
System.out.println( "iterations: " + iteration)
to
for (double alpha = 0.01; alpha < 10; alpha+=0.01){
for (double g3min = 0.01; g3min < 0.4; g3min += 0.01){
//Add five more loops to optimize other parameters
int iteration = 1;
loop {
do_something;
iteration++;
if (error < threshold){break}
}
System.out.println( inputs );
//number of iterations, alpha, cutoffs,etc
//Close five more loops here
}
}
But this brute-force method is not going to be efficient. With 7 parameters and hundreds of iterations for each run, even 10 points per parameter translates into billions of operations. A nonlinear fit should do, but those typically require partial derivatives, which I wouldn't have in this case.
Is there a Java package for this sort of optimization?
Thank you in advance,
Stepan

You have some alternatives - depending on the equations that govern the error parameter.
Pick a point in parameter space and use an iterative process to walk towards a minimum. Essentially, add a delta to each parameter and pick whichever reduces the error the most; rinse and repeat.
Pick each parameter and perform a binary-chop search between its limits to find its minimum. This will only work if the parameter's effect is linear.
Solve the system using some form of Operations Research technique to track down a minimum.
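The first alternative above can be sketched as follows. This is a minimal, derivative-free search under stated assumptions: the cost function is passed in as a ToDoubleFunction (in the asker's case it would run the training and return the iteration count), and the names CoordinateSearchSketch and minimize are made up for this example. It can stop in a local minimum.

```java
import java.util.function.ToDoubleFunction;

class CoordinateSearchSketch {

    // Repeatedly tries +delta and -delta on every parameter and applies
    // the single move that lowers the cost the most. Stops when no move helps.
    public static double[] minimize(ToDoubleFunction<double[]> cost, double[] start,
                                    double delta, int maxRounds) {
        double[] current = start.clone();
        double best = cost.applyAsDouble(current);
        for (int round = 0; round < maxRounds; round++) {
            int bestIndex = -1;
            double bestValue = 0.0;
            double bestCost = best;
            for (int i = 0; i < current.length; i++) {
                double original = current[i];
                for (double candidate : new double[] {original + delta, original - delta}) {
                    current[i] = candidate;
                    double c = cost.applyAsDouble(current);
                    if (c < bestCost) {
                        bestCost = c;
                        bestIndex = i;
                        bestValue = candidate;
                    }
                }
                current[i] = original; // restore before trying the next parameter
            }
            if (bestIndex < 0) {
                break; // no single move improves the cost
            }
            current[bestIndex] = bestValue;
            best = bestCost;
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in cost with its minimum at (1, -2).
        ToDoubleFunction<double[]> cost =
                p -> (p[0] - 1) * (p[0] - 1) + (p[1] + 2) * (p[1] + 2);
        double[] result = minimize(cost, new double[] {0.0, 0.0}, 0.25, 100);
        System.out.println(result[0] + ", " + result[1]);
    }
}
```

Each round costs two evaluations per parameter, so seven parameters need 14 training runs per round instead of a full grid.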

Java: calculating velocity of a skydiver

In Java, I am trying to implement the following equation for calculating the current velocity of a skydiver not neglecting air resistance.
v(t) = v(t-∆t) + (g - [(drag x crossArea x airDensity) / (2*mass)] * v[(t-∆t)^2] ) * (∆t)
My problem is that I am not sure how to translate "v(t - ∆t)" into code. Right now I have the method below, where, as you can see, I am using the method within itself to find the previous velocity. This has continued to result in a stack overflow error message, understandably.
(timeStep = ∆t)
public double calculateVelocity(double time){
double velocity;
velocity = calculateVelocity(time - timeStep)
+ (acceleration - ((drag * crossArea * airDensity)
/ (2 * massOfPerson))
* (calculateVelocity(time - timeStep)*(time * timeStep)))
* timeStep;
}
return velocity;
}
I am calling the above method in the method below. Assume that endingTime is an int; it will come from user input, but it is written this way to be dynamic.
public void assignVelocitytoArrays(){
double currentTime = 0;
while(currentTime <= endingTime){
this.vFinal = calculateVelocity(currentTime);
currentTime += timeStep;
}
}
I would like to figure this out on my own, could someone give me a general direction? Is using a method within itself the right idea or am I completely off track?

The formula you want to implement is the recursive representation of a sequence, mathematically speaking.
Recursive sequences need a starting point, e.g.
v(0) = 0 (because a negative time does not make sense)
and a rule to calculate the next elements, e.g.
v(t) = v(t-∆t) + (g - [(drag x crossArea x airDensity) / (2*mass)] * v[(t-∆t)^2] ) * (∆t)
(btw: are you sure it has to be v([t-∆t]^2) instead of v([t-∆t])^2?)
So your approach to use recursion (calling a function within itself) to calculate a recursive sequence is correct.
In your implementation, you only forgot one detail: the starting point. How should your program know that v(0) is not defined by the rule, but by a definite value? So you must include it:
if(input value == starting point){
return starting point
}
else{
follow the rule
}
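A minimal Java sketch of that pattern. It counts time in whole steps instead of comparing doubles, assumes the rule squares the previous velocity, and uses made-up constants; none of the numbers come from the question. Deep recursion can still overflow the stack for a very large number of steps.

```java
class SkydiverRecursionSketch {
    // Illustrative constants (assumptions, not values from the question).
    static final double G = 9.81;           // gravitational acceleration, m/s^2
    static final double DRAG = 0.5;         // drag coefficient
    static final double CROSS_AREA = 0.7;   // m^2
    static final double AIR_DENSITY = 1.2;  // kg/m^3
    static final double MASS = 80.0;        // kg
    static final double TIME_STEP = 0.1;    // s

    // Velocity after `step` time steps, i.e. at time step * TIME_STEP.
    public static double velocity(int step) {
        if (step <= 0) {
            return 0.0; // starting point: v(0) = 0
        }
        // Follow the rule; the previous velocity is computed once and reused.
        double previous = velocity(step - 1);
        double k = (DRAG * CROSS_AREA * AIR_DENSITY) / (2 * MASS);
        return previous + (G - k * previous * previous) * TIME_STEP;
    }

    public static void main(String[] args) {
        for (int step = 0; step <= 50; step += 10) {
            System.out.println("t = " + step * TIME_STEP + " s, v = " + velocity(step));
        }
    }
}
```

Calling the recursive method only once per level matters: the method in the question calls itself twice, which doubles the work at every step.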
On a side note: you seem to be creating an ascending array of velocities. It would make sense to use the already calculated values in the array instead of recursion, so you don't have to calculate every step again and again.
This only works if you did indeed make a mistake in the rule, i.e. if the previous velocity is meant to be squared, as in v(t-∆t)^2.
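// g, drag, crossArea, airDensity, mass, maxTime and timeStep are assumed to be defined elsewhere
int maxSteps = (int) (maxTime / timeStep);
double[] v = new double[maxSteps];
v[0] = 0; // starting point
for (int t = 1; t < maxSteps; t++) {
    v[t] = v[t - 1] + (g - (drag * crossArea * airDensity) / (2 * mass) * v[t - 1] * v[t - 1]) * timeStep;
}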

Java code/library to calculate the earth mover's distance

I'm looking for Java code (or a library) that calculates the earth mover's distance (EMD) between two histograms. This could be directly or indirectly (e.g. using the Hungarian algorithm). I found several implementations of this in C/C++ (e.g. "Fast and Robust Earth Mover's Distances"), but I'm wondering if there is a Java version readily available.
I will be using the EMD calculation to evaluate the approach given by this paper in the context of a science project I'm working on.
Update
Using a variety of resources I estimate that the code below should do the trick. determineMinCostAssignment is the calculation of the optimal assignment as determined by the Hungarian algorithm. For this I will be using the code from http://konstantinosnedas.com/dev/soft/munkres.htm
My main concern is the calculated flow: I am not sure if this is correct. Is there someone who can verify that this is correct or not?
/**
 * Determines the Earth Mover's Distance between two histograms assuming an equal distance between two buckets of a histogram. The distance between
 * two buckets is equal to the differences in the indexes of the buckets.
 *
 * @param threshold
 *            The maximum distance to use between two buckets.
 */
public static double determineEarthMoversDistance(double[] histogram1, double[] histogram2, int threshold) {
if (histogram1.length != histogram2.length)
throw new InvalidParameterException("Each histogram must have the same number of elements");
double[][] groundDistances = new double[histogram1.length][histogram2.length];
for (int i = 0; i < histogram1.length; ++i) {
for (int j = 0; j < histogram2.length; ++j) {
int abs_diff = Math.abs(i - j);
groundDistances[i][j] = Math.min(abs_diff, threshold);
}
}
int[][] assignment = determineMinCostAssignment(groundDistances);
double costSum = 0, flowSum = 0;
for (int i = 0; i < assignment.length; i++) {
double cost = groundDistances[assignment[i][0]][assignment[i][1]];
double flow = histogram2[assignment[i][1]];
costSum += cost * flow;
flowSum += flow;
}
return costSum / flowSum;
}

Here's a pure Java port of the FastEMD algorithm, that I just released:
https://github.com/telmomenezes/JFastEMD

The website "Fast and Robust Earth Mover's Distances" has a Java wrapper for the C/C++ code with compiled binary for Linux and Windows.

This is what I use for Java/Scala:
import org.apache.commons.math3.ml.distance.EarthMoversDistance
new EarthMoversDistance().compute(observed, expected)
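For a dependency-free alternative in the simplest case, here is a sketch that assumes one-dimensional histograms of equal length and equal total mass, with a ground distance equal to the index difference and no threshold. Under those assumptions the EMD reduces to a running sum of absolute cumulative differences. The names EmdSketch and emd1d are made up; this is not a port of FastEMD, and it neither applies the threshold nor normalizes by the flow as the code in the question does.

```java
class EmdSketch {

    // EMD between two 1-D histograms with equal total mass.
    // `carried` is the mass that still has to move past bucket i;
    // moving it one bucket further costs |carried|.
    public static double emd1d(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Histograms must have the same number of buckets");
        }
        double carried = 0.0;
        double total = 0.0;
        for (int i = 0; i < a.length; i++) {
            carried += a[i] - b[i];
            total += Math.abs(carried);
        }
        return total;
    }

    public static void main(String[] args) {
        double[] observed = {1.0, 0.0, 0.0};
        double[] expected = {0.0, 0.0, 1.0};
        System.out.println(emd1d(observed, expected));
    }
}
```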

https://github.com/wihoho/VideoRecognition
This project adapts the author's C implementation to a Python module through a file interface.
The modified C code is under the folder EarthMoverDistance SourceCode.
I am pretty sure that you can do the same thing with Java. Just add a file interface to connect the C implementation of EMD with your Java code.
