I am working on a program to convert nondeterministic finite automata (NFAs) to deterministic finite automata (DFAs). To do this, I have to compute the epsilon closure of every state in the NFA that has an epsilon transition. I have already figured out a way to do this, but the first thing I think of is usually the least efficient way to do something.
Here is an example of how I would compute a simple epsilon closure:
Input strings for transition function: format is startState, symbol = endState
EPS is an epsilon transition
1, EPS = 2
Results in the new state { 1, 2 }
Now obviously this is a very simple example; I need to be able to compute any number of epsilon transitions from any number of states. To this end, my solution is a recursive function that computes the epsilon closure of the given state by looking at the state it has an epsilon transition into. If that state has epsilon transitions of its own, the function is called recursively within a for loop, once per epsilon transition. This will get the job done but probably isn't the fastest way. So my question is this: what is the fastest way to compute an epsilon closure in Java?
Depth-first search (or breadth-first search - it doesn't really matter) over the graph whose edges are your epsilon transitions. In other words, your solution is optimal provided you efficiently track which states you've already added to the closure.
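For illustration, a minimal sketch of that iterative DFS, assuming a hypothetical State type with an epsilonTargets() method returning the states reachable by one EPS edge (names are not from the question's code):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

static Set<State> epsilonClosure(State start) {
    Set<State> closure = new LinkedHashSet<>(); // doubles as the visited set
    Deque<State> stack = new ArrayDeque<>();
    stack.push(start);
    while (!stack.isEmpty()) {
        State s = stack.pop();
        if (closure.add(s)) { // O(1) membership check, so no state is expanded twice
            for (State t : s.epsilonTargets()) {
                stack.push(t);
            }
        }
    }
    return closure;
}

The set-based visited check is what keeps this linear in the number of epsilon edges.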
JFLAP does this. You can see their source - specifically ClosureTaker.java. It's a depth-first search (which is what Peter Taylor suggested), and since JFLAP uses it I assume that's the near-optimal solution.
Did you look into an algorithms book? I doubt you'll find a significantly better approach, but the actual performance of this algorithm may very well depend on the concrete data structure you use to implement your graph. You can also share work, depending on the order in which you simplify your graph: think about subgraphs which are epsilon-connected and referenced from two different nodes.
I am not sure whether this can be done in an optimal way, or whether you have to resort to some heuristics.
Scan the literature on algorithms.
Just so that people looking only for the specific snippet of code referenced by @Xodarap's answer don't need to download both the source code and an application to view the code in the jar file, I took the liberty of attaching said snippet.
public static State[] getClosure(State state, Automaton automaton) {
    List<State> list = new ArrayList<>();
    list.add(state);
    for (int i = 0; i < list.size(); i++) {
        state = list.get(i);
        Transition[] transitions = automaton.getTransitionsFromState(state);
        for (int k = 0; k < transitions.length; k++) {
            Transition transition = transitions[k];
            LambdaTransitionChecker checker = LambdaCheckerFactory
                    .getLambdaChecker(automaton);
            /* if lambda transition */
            if (checker.isLambdaTransition(transition)) {
                State toState = transition.getToState();
                if (!list.contains(toState)) {
                    list.add(toState);
                }
            }
        }
    }
    return list.toArray(new State[0]);
}
It goes without saying that all credit goes to @Xodarap and the JFLAP project.
I use ojAlgo to solve a system of linear equations.
In one case I get a RecoverableCondition exception, probably because the matrix is ill-conditioned; the condition number is about 1e15.
I use ojAlgo to solve it as seen in the code below. It usually works, but not in this case.
Is there any other solver I could use for a symmetric indefinite (ill-conditioned) matrix?
The present failing size is 18x18, but later 1000x1000 might be needed. Since it's part of an iterative algorithm, the accuracy is not super important.
SolverTask<Double> equationSolver = SolverTask.PRIMITIVE.make(KKT, rhs.negate());
MatrixStore<Double> deltaX = null;
try {
    deltaX = equationSolver.solve(KKT, rhs.negate());
} catch (RecoverableCondition ex) {
    int i = 0; // debug breakpoint placeholder
}
I tried to reproduce this in a self-contained example but failed, because it works there. Maybe I do not get exactly the same matrix down to the last bit.
In your case, that method would use a Cholesky decomposition as the solver.
If there's a problem, then try to pick another decomposition by instantiating a suitable alternative directly. An SVD can usually handle anything, but that would be very expensive. Perhaps QR would be OK:
QR<Double> qr = QR.PRIMITIVE.make(templateBody);
qr.decompose(body);
qr.getSolution(rhs, x);
This way you can reuse the decomposition instance as well as the solution vector x.
Another alternative is to precondition the body/KKT matrix. Perhaps add a small diagonal - just enough to make the Cholesky decomposition solvable:
if (!cholesky.isSolvable()) {
    // Fix that
}
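For instance, a minimal sketch of that diagonal nudge, assuming kkt is a mutable PhysicalStore and using ojAlgo's Cholesky factory (exact type names vary between ojAlgo versions):

// Tikhonov-style regularization: add a small value to the diagonal
// until the Cholesky factorization succeeds.
double eps = 1.0e-8;
for (long i = 0L; i < kkt.countRows(); i++) {
    kkt.add(i, i, eps);
}
Cholesky<Double> cholesky = Cholesky.PRIMITIVE.make(kkt);
cholesky.decompose(kkt);
if (cholesky.isSolvable()) {
    MatrixStore<Double> deltaX = cholesky.getSolution(rhs.negate());
}

Since the asker's algorithm is iterative and accuracy is not critical, a small eps should not hurt convergence much.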
Or perhaps try something in the org.ojalgo.matrix.task.iterative package.
I have translated code from Matlab (array-oriented) to Java (OOP). The problem appeared when I had to translate this Matlab feature:
min_omegaF_restricted=min(omegaF(p>cost));
Here, omegaF is a vector with the net worth of each firm,
p is a vector of prices for each firm, and
cost is a vector of costs for each firm.
The line above computes the minimum net worth among surviving firms whose demanded price is higher than their cost.
In Java this can be translated to:
double min_omegaF_restricted = Double.POSITIVE_INFINITY;
for (int f = 0; f < F; f++) {
    Firm fo = firms.get(f);
    if (fo.p > fo.cost)
        min_omegaF_restricted = Math.min(min_omegaF_restricted, fo.omegaF);
}
Is there a way to generalize this kind of statement to an arbitrary condition in place of (fo.p > fo.cost)?
Yes. Functional interfaces and lambdas in Java 8 make this simple and easy on the eyes. Create a Predicate which tests an object and returns a boolean value.
Predicate<Firm> predicate = firm -> firm.p > firm.cost;
Then you can defer to the predicate in the loop like so:
double min_omegaF_restricted = Double.POSITIVE_INFINITY;
for (int f = 0; f < F; f++) {
    Firm fo = firms.get(f);
    if (predicate.test(fo))
        min_omegaF_restricted = Math.min(min_omegaF_restricted, fo.omegaF);
}
What's more, with the new streaming API you can express the whole computation functionally without the explicit for loop.
double min_omegaF_restricted = firms.stream()
        .filter(predicate)
        .mapToDouble(f -> f.omegaF)
        .min()
        .orElse(Double.POSITIVE_INFINITY);
I made a 2D n-body simulation using brute force at first, but then, following this guide (http://arborjs.org/docs/barnes-hut), I implemented a Barnes-Hut approximation algorithm. However, it didn't give me the speedup I was looking for.
Ex:
Barnes-Hut -> 2000 bodies: avg. frame time 32 ms; 5000 bodies: 164 ms
Brute force -> 2000 bodies: avg. frame time 31 ms; 5000 bodies: 195 ms
These values are with rendering turned off.
Am I correct to assume that I haven't correctly implemented the algorithm and am thus not getting a substantial increase in performance?
Theta is currently set to s/d < 0.5. Changing this value to e.g. 1 does increase performance, but it's quite obvious why this isn't preferred.
The simulation is single-threaded.
My code along general lines:
while (!close)
{
    long newTime = System.currentTimeMillis();
    long frameTime = newTime - currentTime;
    System.out.println(frameTime);
    currentTime = newTime;
    // update the bodies
}
Within the function that updates the bodies:
first insert all bodies into the quadtree with all its subnodes
for all bodies
{
    compute the physics using Barnes-Hut, which yields a net force per planet (doPhysics(body))
    calculate instantaneous acceleration from the net force
    update the instantaneous velocity
}
The Barnes-Hut function:
doPhysics(body)
{
    if (node is external (contains 1 body) and that body is not itself)
    {
        calculate the force between those two bodies
    }
    else if (node is internal and s/d < 0.5)
    {
        create a pseudobody at the COM with the node's total mass
        calculate the force between the body and the pseudobody
    }
    else (it is internal but s/d >= 0.5)
    {
        (this is where recursion comes in)
        doPhysics on the same body but on the NorthEast subnode
        doPhysics on the same body but on the NorthWest subnode
        doPhysics on the same body but on the SouthEast subnode
        doPhysics on the same body but on the SouthWest subnode
    }
}
Actually calculating the force:
calculateforce(body, otherbody)
{
    if (body is not at exactly the same position (avoid division by 0))
    {
        calculate the force using Newton's law of gravitation in vector form
        add the force to the body's net force for this frame
    }
}
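In Java, that force routine might look like the following sketch (the Body type with x, y, mass, fx, fy fields and the constant G are assumed names, not taken from the actual code):

static final double G = 6.674e-11;

static void addGravity(Body body, double otherX, double otherY, double otherMass) {
    double dx = otherX - body.x;
    double dy = otherY - body.y;
    double distSq = dx * dx + dy * dy;
    if (distSq == 0.0) {
        return; // exactly the same position: avoid division by zero
    }
    double dist = Math.sqrt(distSq);
    double f = G * body.mass * otherMass / distSq; // Newton's law of gravitation
    body.fx += f * dx / dist; // accumulate the net force for this frame
    body.fy += f * dy / dist;
}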
Your code is still incomplete (read up on SSCCEs), and in-depth debugging of incomplete code is not the purpose of the site. However, this is how I would approach the next steps of figuring out what, if anything, is wrong:
Time only the function that you are worried about (let us call it barnesHutUpdate()), and not the whole update loop. Compare that to the equivalent non-B-H code, not to the whole update loop without B-H. This results in a much more meaningful comparison.
You seem to have hard-coded s/d < 0.5 into your algorithm. Leaving it as an argument, you should be able to notice speedups when it is set higher, and performance very similar to a naive non-B-H implementation when it is set to 0. The speedup in B-H comes from evaluating fewer nodes (because far-away nodes are lumped together); do you know how many nodes you are managing to skip? No skipped nodes, no speedup. On the other hand, skipping nodes introduces small errors in the calculation - have you quantified those?
Have a look at other implementations of B-H online. D3's force layout uses it internally and is quite readable. There are multiple existing quadtree implementations; if you have built your own, it may be sub-optimal (or even buggy). Unless you are trying to learn by doing, it is always better to use a tested library instead of rolling your own.
The slowdown may be due to the use of quadtrees, rather than to the force addition itself. It would be useful to know how long building and updating the quadtree takes, compared to the B-H force approximation itself - because quadtrees are, in this case, pure overhead. B-H needs quadtrees, but the naive, non-B-H implementation does not. For small numbers of bodies, naive will be faster (but will slow down very quickly as you add more and more). How does the performance scale as you add more and more bodies?
Are you creating and discarding large numbers of objects? You can make your algorithm avoid the associated overhead (yes, lots of news plus garbage collection can result in significant slowdowns) by using a memory pool, as sketched below.
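A minimal sketch of such a pool (the Vec2 type is hypothetical, standing in for whatever temporary objects the simulation allocates per frame):

final class Vec2 {
    double x, y;
}

final class Vec2Pool {
    private final java.util.ArrayDeque<Vec2> free = new java.util.ArrayDeque<>();

    Vec2 acquire() {
        Vec2 v = free.poll();
        return (v != null) ? v : new Vec2(); // reuse if available, else allocate
    }

    void release(Vec2 v) {
        v.x = 0.0;
        v.y = 0.0; // reset before returning to the pool
        free.push(v);
    }
}

Acquire at the start of a frame, release at the end, and the garbage collector stays out of the hot loop.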
My question addresses both mathematical and CS issues, but since I need a performant implementation I am posting it here.
Problem:
I have an estimated bivariate normal distribution, defined as a Python matrix, and I will later need to port the same computation to Java (dummy values here):
mean = numpy.matrix([[0],[0]])
cov = numpy.matrix([[1,0],[0,1]])
When I receive as input a column vector of integer values (x, y), I want to compute the probability of that given tuple.
value = numpy.matrix([[4],[3]])
probability_of_value_given_the_distribution = ???
Now, from a mathematical point of view, this would be the integral for 3.5 < x < 4.5 and 2.5 < y < 3.5 of the probability density function of my normal.
What I want to know:
Is there a way to avoid actually implementing this, which would mean dealing with expressions defined over matrices and with double integrals? Besides taking me a while to implement myself, it would be computationally expensive. An approximate solution would be perfectly fine for me.
My reasonings:
For a univariate normal, one could simply use the cumulative distribution function (or even store its values for the standard normal and then normalize), but unfortunately there appears to be no closed-form CDF for the multivariate case.
Another approach for the univariate case is to use the inverse of the binomial approximation (that is, approximate the normal as a binomial), but extending this to the multivariate case, I can't figure out how to account for the covariances.
I really hope someone has already implemented this; I need it soon (I am finishing my thesis) and I couldn't find anything.
OpenTURNS provides an efficient implementation of the CDF of a multinormal distribution (see the code).
import numpy as np
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.0],[0.0, 1.0]])
Let us create the multinormal distribution with these parameters.
import openturns as ot
multinormal = ot.Normal(mean, ot.CovarianceMatrix(cov))
Now let us compute the probability of the square [3.5, 4.5] x [2.5, 3.5]:
prob = multinormal.computeProbability(ot.Interval([3.5,2.5], [4.5,3.5]))
print(prob)
The computed probability is
1.3701244220201715e-06
If you are looking for the probability density function of a bivariate normal distribution, below are a few lines that could do the job:
import numpy as np

def multivariate_pdf(vector, mean, cov):
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    quadratic_form = np.dot(np.dot(vector - mean, np.linalg.inv(cov)), np.transpose(vector - mean))
    # Bivariate normalization constant is 2*pi*sqrt(det(Sigma))
    return np.exp(-0.5 * quadratic_form) / (2 * np.pi * np.sqrt(np.linalg.det(cov)))

mean = np.array([0, 0])
cov = np.array([[1, 0], [0, 1]])
vector = np.array([4, 3])
pdf = multivariate_pdf(vector, mean, cov)
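Since the question mentions porting this to Java, here is a hedged sketch of the same density, hard-coded for the 2x2 case so no linear-algebra library is needed (the method and variable names are made up):

static double bivariatePdf(double[] v, double[] mean, double[][] cov) {
    double dx = v[0] - mean[0];
    double dy = v[1] - mean[1];
    double det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0];
    // Quadratic form (x - mu)^T Sigma^{-1} (x - mu), using the closed-form 2x2 inverse
    double q = (cov[1][1] * dx * dx - (cov[0][1] + cov[1][0]) * dx * dy + cov[0][0] * dy * dy) / det;
    return Math.exp(-0.5 * q) / (2.0 * Math.PI * Math.sqrt(det));
}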
I am trying to make a Java port of a simple feed-forward neural network.
This obviously involves lots of numeric calculations, so I am trying to optimize my central loop as much as possible. The results should be correct within the limits of the float data type.
My current code looks as follows (error handling & initialization removed):
/**
 * Simple implementation of a feedforward neural network. The network supports
 * including a bias neuron with a constant output of 1.0 and weighted synapses
 * to hidden and output layers.
 *
 * @author Martin Wiboe
 */
public class FeedForwardNetwork {
    private final int outputNeurons;    // No of neurons in output layer
    private final int inputNeurons;     // No of neurons in input layer
    private int largestLayerNeurons;    // No of neurons in largest layer
    private final int numberLayers;     // No of layers
    private final int[] neuronCounts;   // Neuron count in each layer, 0 is input layer
    private final float[][][] fWeights; // Weights between neurons.
                                        // fWeight[fromLayer][fromNeuron][toNeuron]
                                        // is the weight from fromNeuron in
                                        // fromLayer to toNeuron in layer
                                        // fromLayer+1.
    private float[][] neuronOutput;     // Temporary storage of output from previous layer

    public float[] compute(float[] input) {
        // Copy input values to input layer output
        for (int i = 0; i < inputNeurons; i++) {
            neuronOutput[0][i] = input[i];
        }

        // Loop through layers
        for (int layer = 1; layer < numberLayers; layer++) {
            // Loop over neurons in the layer and determine weighted input sum
            for (int neuron = 0; neuron < neuronCounts[layer]; neuron++) {
                // Bias neuron is the last neuron in the previous layer
                int biasNeuron = neuronCounts[layer - 1];

                // Get weighted input from bias neuron - output is always 1.0
                float activation = 1.0F * fWeights[layer - 1][biasNeuron][neuron];

                // Get weighted inputs from rest of neurons in previous layer
                for (int inputNeuron = 0; inputNeuron < biasNeuron; inputNeuron++) {
                    activation += neuronOutput[layer - 1][inputNeuron] * fWeights[layer - 1][inputNeuron][neuron];
                }

                // Store neuron output for next round of computation
                neuronOutput[layer][neuron] = sigmoid(activation);
            }
        }

        // Return output from network = output from last layer
        float[] result = new float[outputNeurons];
        for (int i = 0; i < outputNeurons; i++)
            result[i] = neuronOutput[numberLayers - 1][i];
        return result;
    }

    private final static float sigmoid(final float input) {
        return (float) (1.0F / (1.0F + Math.exp(-1.0F * input)));
    }
}
I am running the JVM with the -server option, and as of now my code is between 25% and 50% slower than similar C code. What can I do to improve this situation?
Thank you,
Martin Wiboe
Edit #1: After seeing the vast number of responses, I should probably clarify the numbers in our scenario. During a typical run, the method will be called about 50,000 times with different inputs. A typical network would have numberLayers = 3 layers with 190, 2 and 1 neurons, respectively. The innermost loop will therefore have about 2*191 + 3 = 385 iterations (counting the added bias neuron in layers 0 and 1).
Edit #2: After implementing the various suggestions in this thread, our implementation is practically as fast as the C version (within ~2%). Thanks for all the help! All of the suggestions have been helpful, but since I can only mark one answer as the correct one, I will give it to @Durandal for both suggesting array optimizations and being the only one to precalculate the for-loop header.
Some tips.
In your innermost loop, think about how you are traversing your CPU cache and re-arrange your matrix so you are accessing the outermost array sequentially. This will result in you accessing your cache in order rather than jumping all over the place. A cache hit can be two orders of magnitude faster than a cache miss.
e.g. restructure fWeights so it is accessed as
activation += neuronOutput[layer-1][inputNeuron] * fWeights[layer - 1][neuron][inputNeuron];
Don't perform work inside the loop (every time) which can be done outside the loop (once). Don't perform the [layer - 1] lookup every time when you can place it in a local variable. Your IDE should be able to refactor this easily.
Multi-dimensional arrays in Java are not as efficient as they are in C. They are actually multiple layers of single-dimensional arrays. You can restructure the code so you're only using a single-dimensional array (see the sketch after these tips).
Don't return a new array when you can pass the result array as an argument. (Saves creating a new object on each call.)
Rather than performing layer - 1 all over the place, introduce a local variable layer1 for layer - 1 and use layer1 + 1 instead of layer.
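As a hedged illustration of the flattening and cache-order tips combined (the names here are invented, not the asker's): store one layer's weights transposed and row-major in a single float[], so the inner loop becomes a stride-1 walk through memory.

// flatWeights holds one layer's weights laid out as [toNeuron][fromNeuron]:
// entry (to, from) lives at index to * fromCount + from.
static float weightedSum(float[] outputs, float[] flatWeights, int fromCount, int neuron) {
    float activation = 0f;
    int base = neuron * fromCount; // start of this neuron's weight row
    for (int from = 0; from < fromCount; from++) {
        activation += outputs[from] * flatWeights[base + from]; // stride-1 access
    }
    return activation;
}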
Disregarding the actual math, array indexing in Java can be a performance hog in itself. Consider that Java has no real multidimensional arrays, but rather implements them as arrays of arrays. In your innermost loop, you access multiple indices, some of which are in fact constant in that loop. Part of the array access can be moved outside the loop:
final float[] neuronOutputSlice = neuronOutput[layer - 1];
final float[][] fWeightsSlice = fWeights[layer - 1];
for (int inputNeuron = 0; inputNeuron < biasNeuron; inputNeuron++) {
    activation += neuronOutputSlice[inputNeuron] * fWeightsSlice[inputNeuron][neuron];
}
It is possible that the server JIT performs similar loop-invariant code motion; the only way to find out is to change it and profile. On the client JIT this should improve performance no matter what.
Another thing you can try is to precalculate the for-loop exit conditions, like this:
for (int neuron = 0; neuron < neuronCounts[layer]; neuron++) { ... }
// transform to precalculated exit condition (move invariant array access outside loop)
for (int neuron = 0, neuronCount = neuronCounts[layer]; neuron < neuronCount; neuron++) { ... }
Again the JIT may already do this for you, so profile if it helps.
Is there a point to multiplying by 1.0F that eludes me here?
float activation = 1.0F * fWeights[layer - 1][biasNeuron][neuron];
Other things that could potentially improve speed at the cost of readability: inline the sigmoid() function manually (the JIT has a very tight size limit for inlining, and the function might exceed it).
It can be slightly faster to run a loop backwards (where it doesn't change the outcome, of course), since testing the loop index against zero is a little cheaper than checking against a local variable. The innermost loop is a potential candidate again, but don't expect the output to be 100% identical in all cases, since adding floats as a + b + c is potentially not the same as a + c + b.
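For example, the inner loop above rewritten to count down to zero (a sketch; as noted, the float sum may differ in the last bits because the addition order changes):

for (int inputNeuron = biasNeuron - 1; inputNeuron >= 0; inputNeuron--) {
    activation += neuronOutputSlice[inputNeuron] * fWeightsSlice[inputNeuron][neuron];
}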
For a start, don't do this:
// Copy input values to input layer output
for (int i = 0; i < inputNeurons; i++) {
    neuronOutput[0][i] = input[i];
}
But this:
System.arraycopy(input, 0, neuronOutput[0], 0, inputNeurons);
First thing I would look into is seeing if Math.exp is slowing you down. See this post on a Math.exp approximation for a native alternative.
Replace the expensive floating point sigmoid transfer function with an integer step transfer function.
The sigmoid transfer function is a model of organic analog synaptic learning, which in turn seems to be a model of a step function.
The historical precedent for this is that Hinton designed the back-prop algorithm directly from the first principles of cognitive science theories about real synapses, which in turn were based on real analog measurements, which turn out to be sigmoid.
But the sigmoid transfer function seems to be an organic model of the digital step function, which of course cannot be directly implemented organically.
Rather than model a model, replace the expensive floating point implementation of the organic sigmoid transfer function with the direct digital implementation of a step function (less than zero = -1, greater than zero = +1).
The brain cannot do this, but backprop can!
This not only linearly and drastically improves performance of a single learning iteration, it also reduces the number of learning iterations required to train the network: supporting evidence that learning is inherently digital.
Also supports the argument that Computer Science is inherently cool.
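A minimal sketch of such a step transfer function as a drop-in replacement for sigmoid() (the exact thresholding scheme here is one assumption among several possible):

// Hard digital step: strictly negative input maps to -1, everything else to +1.
private static float step(final float input) {
    return (input < 0.0F) ? -1.0F : 1.0F;
}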
Purely based upon code inspection, your innermost loop has to compute references into a three-dimensional array, and it is doing so a lot. Depending upon your array dimensions, you could be having cache issues due to having to jump around in memory with each loop iteration. Maybe you could rearrange the dimensions so the inner loop accesses memory elements that are closer to one another than they are now?
In any case, profile your code before making any changes and see where the real bottleneck is.
I suggest using a fixed-point system rather than a floating-point system. On almost all processors, using int is faster than float. The simplest way to do this is simply to shift everything left by a certain amount (4 or 5 are good starting points) and treat the bottom 4 bits as the fraction.
Your innermost loop is doing floating-point maths, so this may give you quite a boost.
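A hedged sketch of that fixed-point scheme with a 4-bit fractional part (the helper names are invented, and overflow handling is omitted):

static final int SHIFT = 4; // bottom 4 bits hold the fraction

static int toFixed(float f) {
    return Math.round(f * (1 << SHIFT));
}

static float toFloat(int q) {
    return q / (float) (1 << SHIFT);
}

static int mulFixed(int a, int b) {
    // The raw product carries 2 * SHIFT fractional bits; shift back once.
    return (a * b) >> SHIFT;
}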
The key to optimization is to first measure where the time is spent. Surround various parts of your algorithm with calls to System.nanoTime():
long start_time = System.nanoTime();
doStuff();
long time_taken = System.nanoTime() - start_time;
I'd guess that while using System.arraycopy() would help a bit, you'll find your real costs in the inner loop.
Depending on what you find, you might consider replacing the float arithmetic with integer arithmetic.