How should I implement a Mahalanobis distance function in Java?

I am working on a project in Java and have two 2D int arrays, both 10x15. I want to compute the Mahalanobis distance between them. They are grouped into categories along the x axis of the array (size 10). I understand that you must find the mean value in these groups and redistribute the data so that it is centered. My problem now is generating the covariance matrix necessary for the calculation. If anyone knows a good way to do this, or can point me to a useful guide that steps me through the process in 3D, it would be a great help. Thanks.

A covariance matrix contains the covariance between every pair of variables. Given a statistical distribution on a vector x, with statistical mean avg:
covariance(i,j) = expected value of [ (x[i] - avg[i])(x[j] - avg[j]) ]
Given a statistical set of N vectors v_1 ... v_N, with mean vector avg, you can estimate the covariance of the distribution they were taken from as follows:
sample_covariance(i,j) = sum[for k=1..N]( (v_k[i] - avg[i])*(v_k[j] - avg[j]) ) / (N-1)
This last is the covariance matrix you're looking for. I recommend you also read the wiki link above.
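If you want to compute this in Java directly, here is a minimal sketch under the assumption that your data is given as N row vectors of dimension d in a double[N][d] array (the method name sampleCovariance is just illustrative):
// Estimate the sample covariance matrix of N vectors of dimension d.
// Assumes the data has at least two rows (the formula divides by N-1).
static double[][] sampleCovariance(double[][] data) {
    int n = data.length;
    int d = data[0].length;
    // mean vector avg[i]
    double[] avg = new double[d];
    for (double[] row : data) {
        for (int i = 0; i < d; i++) {
            avg[i] += row[i] / n;
        }
    }
    // sample_covariance(i,j) = sum over k of (v_k[i]-avg[i])*(v_k[j]-avg[j]) / (N-1)
    double[][] cov = new double[d][d];
    for (double[] row : data) {
        for (int i = 0; i < d; i++) {
            for (int j = 0; j < d; j++) {
                cov[i][j] += (row[i] - avg[i]) * (row[j] - avg[j]) / (n - 1);
            }
        }
    }
    return cov;
}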

Related

Java: Reduce array to specific number of averages

The main issue which needs to be solved is:
Let's say I have an array with 8 numbers, e.g. [2,4,8,3,5,4,9,2], and I use them as values for my x axis in a coordinate system to draw a line. But I can only display 3 of these points.
What I need to do now is reduce the number of points (8) to 3 without changing the line too much, so using an average should be an option.
I am NOT looking for the average of the array as a whole - I still need 3 points out of the 8 in total.
For an array like [2,4,2,4,2,4,2,4] and 4 numbers out of that array, I could simply use the average "3" of each pair - but that's not possible if the number is odd.
But how would I do that? Do you know what this process is called mathematically?
To give you some more realistic details about this issue: I have an x axis which is 720px long, and let's say I get 1000 points. Now I have to reduce these 1000 points (2 arrays, one for x and one for y values) to a maximum of 720 points.
I have thought about interpolation and the like, but I'm still not quite sure if that is what I am looking for.
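To illustrate the averaging idea above, here is a rough sketch in plain Java that splits the input into as many buckets as there are output slots and averages each bucket (the method name downsample is just illustrative, and it assumes there are at least as many input values as output slots):
// Reduce values.length points to 'target' points by bucket averaging.
static double[] downsample(double[] values, int target) {
    double[] out = new double[target];
    for (int i = 0; i < target; i++) {
        // each output slot covers values[start..end)
        int start = i * values.length / target;
        int end = (i + 1) * values.length / target;
        double sum = 0;
        for (int j = start; j < end; j++) {
            sum += values[j];
        }
        out[i] = sum / (end - start);
    }
    return out;
}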
Interpolation is a good idea. You input your points and get a polynomial function as output. Then you can use it to draw your line. Check out more here: Interpolation over an array (or two)
I would recommend that you fit all the points you have in some fashion and then evaluate at the particular points you need for the display.
There are a myriad of choices for fitting:
Least squares
Piecewise using polynomials or splines
You should consult a text or find a library to help you - something like Apache Commons Math.
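As a rough sketch of the "fit, then evaluate at the display points" approach, assuming a commons-math3 dependency (any fitting or interpolation library would do, and the class/method names here are just illustrative):
import org.apache.commons.math3.analysis.interpolation.SplineInterpolator;
import org.apache.commons.math3.analysis.polynomials.PolynomialSplineFunction;

public class Resampler {
    // xs must be strictly increasing and width at least 2;
    // returns 'width' y-values, one per display pixel.
    static double[] resample(double[] xs, double[] ys, int width) {
        // fit a cubic spline through all of the original points
        PolynomialSplineFunction f = new SplineInterpolator().interpolate(xs, ys);
        double[] out = new double[width];
        for (int px = 0; px < width; px++) {
            // evaluate the fitted curve at evenly spaced x positions
            double x = xs[0] + (xs[xs.length - 1] - xs[0]) * px / (width - 1.0);
            out[px] = f.value(x);
        }
        return out;
    }
}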
It sounds like you are looking for a more advanced mathematical function than a simple average.
I would suggest trying to identify potential algorithms via Mathematica Stack Exchange and then trying to find a Java library that implements any of the potential choices (maybe a new question here).
Since it's for an X-axis, why not use the
MIN, MAX and (MIN+MAX)/2
for your three points?

Working of K-apriori Algorithm

I am trying to develop Java code for a data mining algorithm, the k-apriori algorithm, which improves the performance of the apriori algorithm. I have already developed 1) apriori and 2) apriori based on a boolean matrix. The thing I am not able to understand is how the Wiener function helps to transform the data, and why we use it in this algorithm. I tried searching Google for an example of the K-apriori algorithm but was not able to find any. I know how the K-means algorithm works. If anyone has an example of K-apriori, especially of how it works, it would be helpful.
Here is the link from which I am referring to the K-apriori algorithm.
I never implemented k-apriori myself, but if I am right it is just Apriori working in K clusters found by K-means.
As you know, K-means is based on the concept of cluster centroids. Usually binary data clustering is done by using 0 and 1 as numerical values, but that is very problematic when it comes to calculating centroids from the data. If you have binary data, the distance between two points is just the number of bits that differ between them. You can read more about this problem in this link.
To get any meaningful clusters, K-means should operate on real values. That's why you use the Wiener function to transform binary values into real values, which helps K-means get satisfying results.
Wiener function - they perform it on each binary vector as follows:
Calculate the mean µ of the input vector Xi around each element
Calculate the variance σ^2 around each element
Perform the Wiener transformation for each element in the vector using the equation for Y below, based on its neighborhood
Assume you have a binary matrix X of size pxq and a vector V which is the n-th row of that matrix. Let's choose a neighborhood window of 3. For the n-th position of vector V:
µ = 1/3 * ( V[n-1] + V[n] + V[n+1] )
σ^2 = 1/3 * ( ( V[n-1]-µ )^2 + ( V[n]-µ )^2 + ( V[n+1]-µ )^2 )
Y[n] = µ + (σ^2 - λ^2)/σ^2 * ( V[n] - µ )
where λ^2 is the average of all the local estimated variances, so e.g. assuming the length of vector V is 5:
λ^2 = ( σ^2[0] + σ^2[1] + σ^2[2] + σ^2[3] + σ^2[4] ) / 5
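As a rough illustration in Java for a single binary row vector with a window of 3 (the border handling here simply clamps the neighborhood at the vector ends, which is an assumption, and wienerTransform is just an illustrative name):
// Transform one binary row vector into real values using the local
// mean/variance formulas above (window of 3, borders clamped).
static double[] wienerTransform(int[] v) {
    int n = v.length;
    double[] mu = new double[n];
    double[] var = new double[n];
    double lambda2 = 0.0;
    for (int i = 0; i < n; i++) {
        int a = v[Math.max(i - 1, 0)];
        int b = v[i];
        int c = v[Math.min(i + 1, n - 1)];
        mu[i] = (a + b + c) / 3.0;
        var[i] = ((a - mu[i]) * (a - mu[i]) + (b - mu[i]) * (b - mu[i]) + (c - mu[i]) * (c - mu[i])) / 3.0;
        lambda2 += var[i];
    }
    lambda2 /= n; // average of all the local estimated variances
    double[] y = new double[n];
    for (int i = 0; i < n; i++) {
        // Y[n] = µ + (σ^2 - λ^2)/σ^2 * (V[n] - µ), guarding against zero local variance
        y[i] = var[i] == 0.0 ? mu[i] : mu[i] + (var[i] - lambda2) / var[i] * (v[i] - mu[i]);
    }
    return y;
}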

Random multivariate normal distribution

I've run into a problem where I have to generate a set of randomly chosen numbers from a multivariate normal distribution with mean 0 and a given 3x3 variance-covariance matrix in Java.
Is there an easy way to do this?
1) Use a library implementation, as suggested by Dima.
Or, if you really feel a burning need to do this yourself:
2) Assuming you want to generate normals with a mean vector M and variance/covariance matrix V, perform Cholesky Decomposition on V to come up with lower triangular matrix L such that V=LLt (where the superscript t indicates transpose). Generate a vector Z of three independent standard normals (using Random.nextGaussian() to get the individual elements). Then LZ + M will have the desired multivariate normal distribution.
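If you do go the do-it-yourself route, here is a minimal sketch of step 2 that leans on commons-math3 for the Cholesky decomposition (the class and method names MvnSampler/sample are just illustrative):
import java.util.Random;
import org.apache.commons.math3.linear.CholeskyDecomposition;
import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;

public class MvnSampler {
    // draw one sample from N(mean, cov); cov must be symmetric positive definite
    static double[] sample(double[] mean, double[][] cov, Random rng) {
        // V = L * L^t
        RealMatrix l = new CholeskyDecomposition(MatrixUtils.createRealMatrix(cov)).getL();
        // Z: vector of independent standard normals
        double[] z = new double[mean.length];
        for (int i = 0; i < z.length; i++) {
            z[i] = rng.nextGaussian();
        }
        // L * Z + M
        double[] sample = l.operate(z);
        for (int i = 0; i < sample.length; i++) {
            sample[i] += mean[i];
        }
        return sample;
    }
}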
Apache Commons has what you are looking for:
MultivariateNormalDistribution mnd = new MultivariateNormalDistribution(means, covariances);
double[] vals = mnd.sample();

Fourier Transform and Fourier Descriptors to extract shapes features on Java

I am trying to build a simple system to recognize simple shapes using Fourier descriptors:
I am using this implementation of the Fast Fourier Transform in my program: (link below)
http://www.wikijava.org/wiki/The_Fast_Fourier_Transform_in_Java_%28part_1%29
fft(double[] inputReal, double[] inputImag, boolean direction)
inputs are: the real and imag parts (which are essentially the x,y coordinates of the boundary points I have)
and the outputs are the transformed real and imag numbers.
Question: How can I use the output (transformed real, imag) as invariant descriptors of my simple shapes?
This is what I thought:
calculate R = sqrt( real^2 + imag^2 ) for each of the N steps.
divide each R by R[1], the normalization factor, to make it invariant.
The problem is I get very different R values for slightly different images (such as slight rotations applied, etc)
In other words :
My descriptors are not invariant... I think I am doing something wrong with getting the R value.
There is some theory you need to know first about Fourier Descriptors: it's an extremely interesting technique, but should be devised correctly. What you want is invariance; invariance for rotation, translation, maybe even affine transforms. To allow good comparison with other sets of Fourier descriptors you should take following things in consideration:
if you want invariance to translation, do not use the DC-term, that is the first element in your resulting array of Fourier coefficients
if you want invariance to scaling, make the comparison ratio-like, for example by dividing every Fourier coefficient by the DC-coefficient: f*[1] = f[1]/f[0], f*[2] = f[2]/f[0], and so on.
if you want invariance to the start point of your contour, only use absolute values of the resulting Fourier coefficients.
Only the first 5 to 8 Fourier coefficients are useful when comparing the coefficients of two different objects; higher coefficients only go into the details of your contour which mostly isn't very useful information. (it's the global form that matters)
Let's say you have 2 objects, and their Fourier descriptors. The resulting array of Fourier coefficients can be of a different size, meaning that the 'frequency interval' of the resulting frequency content is different for both shapes. You can't compare apples with pears. Zero-pad your shortest contour to match the size of the longest contour, and then calculate the Fourier descriptors. Now you have analogy between coefficients and a good comparison.
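Put together, a minimal sketch in Java of those normalization steps, assuming outputReal and outputImag are the arrays produced by the fft call from the question; it normalizes by the first non-DC magnitude (the question's R[1]) rather than by the DC term, since the DC term is dropped for translation invariance:
// Turn raw FFT output into descriptors that are invariant to translation,
// scaling and contour start point, following the steps above.
// Assumes the first non-DC coefficient is non-zero.
static double[] invariantDescriptors(double[] outputReal, double[] outputImag) {
    // drop the DC term (index 0) -> invariance to translation
    double[] d = new double[outputReal.length - 1];
    for (int k = 1; k < outputReal.length; k++) {
        // keep magnitudes only -> invariance to the start point of the contour
        d[k - 1] = Math.hypot(outputReal[k], outputImag[k]);
    }
    // divide by the first non-DC magnitude -> invariance to scaling
    double norm = d[0];
    for (int k = 0; k < d.length; k++) {
        d[k] /= norm;
    }
    // compare only the first 5 to 8 entries between shapes
    return d;
}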
Hope this helps. Btw, user-made FFT solutions are not to be trusted in my opinion. Go for the solutions that established libraries offer. If working with images, OpenCV provides Fourier transform utilities.
If you are looking to match different shapes, try using different shape descriptors from MPEG-7 standard. You will probably need a classifier, take a look at SVM, Boosting, Neural Networks ...: http://docs.opencv.org/modules/ml/doc/ml.html

Mapping Pixels to Data

I've written some basic graphing software in Clojure/Java using drawLine() on the graphics context of a modified JPanel. The plotting itself is working nicely, but I've come to an impasse while trying to convert a clicked pixel to the nearest data point.
I have a simple bijection between the list of all pixels that mark end points of my lines and my actual raw data. What I need is a surjection from all the pixels (say, 1200x600 px) of my graph window to the pixels in my pixel list, giving me a trivial mapping from that to my actual data points.
e.g.
<x,y>(px) ----> <~x,~y>(pixel points) ----> <x,y>(data)
This is the situation as I'm imagining it now:
A pixel is clicked in the main graph window, and the MouseListener catches that event and gives me the <x,y> coordinates of the action.
That information is passed to a function that returns a predicate which determines whether or not a value passed to it is "good enough"; I filter through the list with that predicate and take the first value it okays.
Possibly, instead of a predicate, it returns a function which is passed the list of the pixel-points and returns a list of tuples (x index) indicating how good the point is via the magnitude of x, and where that point is via index. I'd do this with both the x points and the y points. I then filter through that and find the one with the max x, and take that one to be the point the user most likely meant.
Are these reasonable solutions to this problem? It seems that the solution involving confidence ratings (distance from the pixel point, perhaps) may be too processor heavy, and a bit memory heavy if I'm holding all the points in memory again. The other solution, using just the predicate, doesn't seem like it would always be accurate.
This is a solved problem, as other graphing libraries have shown, but it's hard to find information about it other than in the source of some of these programs, and there's got to be a better way than digging through thousands of lines of Java to find it.
I'm looking for better solutions, or just general pointers and advice on the ones I've offered, if possible.
So I'm guessing something like JFreeChart just wasn't cutting it for your app? If you haven't gone down that road yet, I'd suggest checking it out before attempting to roll your own.
Anyway, if you're looking for the nearest point to a mouse event, getting the point with the minimum Euclidean distance (if it's below some threshold) and presenting that will give the most predictable behavior for the user. The downside is that Euclidean distance is relatively slow for large data sets. You can use tricks like ignoring the square root or using BSP trees to speed it up a bit. Whether those optimizations are even necessary really depends on how many data points you're working with. Profile a somewhat naive solution in a typical case before going into optimization mode.
I think your approach is decent. This basically only requires one iteration through your data array, a little simple maths, and no allocations at each step, so it should be very fast.
It's probably as good as you are going to get unless you start using some form of spatial partitioning scheme like a quadtree, which would only really make sense if your data array is very large.
Some Clojure code which may help:
;; squared Euclidean distance between the click (x,y) and a point object
;; with public x/y fields (e.g. java.awt.Point); skipping the square root
;; is fine because we only compare distances
(defn squared-distance [x y point]
  (let [dx (- x (.x point))
        dy (- y (.y point))]
    (+ (* dx dx) (* dy dy))))

;; linear scan for the point in 'points' closest to (x,y)
(defn closest
  ([x y points]
    (let [v (first points)]
      (closest x y (rest points) (squared-distance x y v) v)))
  ([x y points bestdist best]
    (if (empty? points)
      best
      (let [v (first points)
            dist (squared-distance x y v)]
        (if (< dist bestdist)
          (recur x y (rest points) dist v)
          (recur x y (rest points) bestdist best))))))
