I am trying to implement a data mining algorithm in Java: the K-Apriori algorithm, which improves on the performance of the Apriori algorithm. I have already implemented 1) Apriori and 2) Apriori based on a Boolean matrix. What I cannot understand is how the Wiener function helps to transform the data, and why it is used in this algorithm. I searched Google for an example of the K-Apriori algorithm but could not find one. I know how the K-means algorithm works. If anyone has an example of K-Apriori, especially one that shows how it works, it would be helpful.
Here is the link to the description of the K-Apriori algorithm that I am referring to.
I have never implemented K-Apriori myself, but if I am right it is just Apriori applied to the K clusters found by K-means.
As you know, K-means is based on the concept of cluster centroids. Binary data is usually clustered by treating 0 and 1 as numerical values, but that is very problematic when it comes to calculating centroids from the data. With binary data, the distance between two points is just the number of bits that differ between them. You can read more about this problem in this link
To get any meaningful clusters, K-means should operate on real values. That is why you use the Wiener function: it transforms the binary values into real values, which helps K-means produce satisfying results.
Wiener function - it is applied to each binary vector as follows:
Calculate the local mean µ around each element of the input vector Xi
Calculate the local variance σ^2 around each element
Perform the Wiener transformation for each element in the vector using the equation for Y, based on its neighbourhood
Assume you have a binary matrix X of size p×q and a vector V which is one row of that matrix. Let's choose a neighbourhood window of 3. For the n-th position of the vector V:
µ = 1/3 * ( V[n-1] + V[n] + V[n+1] )
σ^2 = 1/3 * ( ( V[n-1]-µ )^2 + ( V[n]-µ )^2 + ( V[n+1]-µ )^2 )
Y[n] = µ + (σ^2 - λ^2)/σ^2 * ( V[n] - µ )
where λ^2 is the average of all the local estimated variances. For example, assuming the length of vector V is 5:
λ^2 = ( σ^2[0] + σ^2[1] + σ^2[2] + σ^2[3] + σ^2[4] ) / 5
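The three steps above could be sketched in Java roughly as follows. This is only a minimal sketch under assumptions that the description does not settle: neighbours that fall outside the vector are treated as 0, and an element whose local variance is 0 is simply set to its local mean (to avoid dividing by zero). The class and method names are made up for illustration.

```java
final class WienerTransform {
    /** Transforms one binary row into real values using a neighbourhood window of 3. */
    static double[] transform(int[] v) {
        int n = v.length;
        double[] mean = new double[n];
        double[] variance = new double[n];
        double lambdaSq = 0.0;
        for (int i = 0; i < n; i++) {
            // Assumption: positions outside the vector count as 0 (zero padding).
            double left = i > 0 ? v[i - 1] : 0.0;
            double right = i < n - 1 ? v[i + 1] : 0.0;
            double mu = (left + v[i] + right) / 3.0;
            double var = ((left - mu) * (left - mu)
                    + (v[i] - mu) * (v[i] - mu)
                    + (right - mu) * (right - mu)) / 3.0;
            mean[i] = mu;
            variance[i] = var;
            lambdaSq += var;
        }
        lambdaSq /= n; // λ^2: average of all local variances

        double[] y = new double[n];
        for (int i = 0; i < n; i++) {
            if (variance[i] == 0.0) {
                // Assumption: a flat neighbourhood keeps its local mean.
                y[i] = mean[i];
            } else {
                y[i] = mean[i] + (variance[i] - lambdaSq) / variance[i] * (v[i] - mean[i]);
            }
        }
        return y;
    }
}
```

Calling WienerTransform.transform on each row of the binary matrix would give the real-valued rows that are then passed to K-means.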
Whilst searching on Google about genetic algorithms, I came across the OneMax problem. My search showed that it is one of the very first problems that genetic algorithms were applied to. However, I am not exactly sure what the OneMax problem is. Can anyone explain?
Any help is appreciated
The goal of the OneMax problem is to create a binary string of length n in which every gene is a 1. The fitness function is very simple: you just iterate through your binary string counting all the ones. This is what the sum represents in the formula you provided with your post; it is just the number of ones in the binary string. You could also represent the fitness as a percentage by dividing the number of ones by n * 0.01. A fitter string has a higher percentage. Eventually, at some generation, you will get a string of n ones with a fitness of 100%.
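import java.util.List;

// Returns the fitness as a percentage: the number of ones divided by n * 0.01.
double fitness(List<Integer> chromosome) {
    long ones = chromosome.stream().filter(g -> g == 1).count();
    return ones / (chromosome.size() * 0.01);
}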
I've run into a problem where I have to generate random samples from a multivariate normal distribution with mean 0 and a given 3×3 variance-covariance matrix in Java.
Is there an easy way to do this?
1) Use a library implementation, as suggested by Dima.
Or, if you really feel a burning need to do this yourself:
2) Assuming you want to generate normals with a mean vector M and variance/covariance matrix V, perform Cholesky Decomposition on V to come up with lower triangular matrix L such that V=LLt (where the superscript t indicates transpose). Generate a vector Z of three independent standard normals (using Random.nextGaussian() to get the individual elements). Then LZ + M will have the desired multivariate normal distribution.
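If you do go with option 2, the Cholesky approach could be sketched like this. It is a minimal sketch rather than a tuned implementation: it assumes V is symmetric and positive definite, and the class and method names are invented for illustration.

```java
import java.util.Random;

final class MvnSampler {
    /** Returns the lower-triangular L with V = L * L^t; V must be symmetric positive definite. */
    static double[][] cholesky(double[][] v) {
        int n = v.length;
        double[][] l = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j <= i; j++) {
                double sum = v[i][j];
                for (int k = 0; k < j; k++) {
                    sum -= l[i][k] * l[j][k];
                }
                if (i == j) {
                    if (sum <= 0.0) {
                        throw new IllegalArgumentException("matrix is not positive definite");
                    }
                    l[i][j] = Math.sqrt(sum);
                } else {
                    l[i][j] = sum / l[j][j];
                }
            }
        }
        return l;
    }

    /** Draws one sample L*Z + M, where Z holds independent standard normals. */
    static double[] sample(double[] m, double[][] l, Random rng) {
        int n = m.length;
        double[] z = new double[n];
        for (int i = 0; i < n; i++) {
            z[i] = rng.nextGaussian();
        }
        double[] x = new double[n];
        for (int i = 0; i < n; i++) {
            double s = m[i];
            for (int k = 0; k <= i; k++) { // L is lower triangular
                s += l[i][k] * z[k];
            }
            x[i] = s;
        }
        return x;
    }
}
```

Compute L once with MvnSampler.cholesky(V) and reuse it for every call to sample, since the decomposition is the expensive part.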
Apache Commons has what you are looking for:
MultivariateNormalDistribution mnd = new MultivariateNormalDistribution(means, covariances);
double vals[] = mnd.sample();
I'm working on a clustering task and I've built the dataset similarity matrix M by repeating the clustering algorithm N times and choosing as element m_{ij} the number of times the elements i and j have been in the same cluster, divided by N.
Now I'd like to have a graphical way to check my results, so I was wondering if there is any library that, given an array of doubles (i.e. the values in the upper triangular part of my matrix), plots the data distribution and the histogram? All the doubles are in the [0,1] interval, and most of them will be around 0 and 1.
For plotting, the best tool that I know is Matlab. If you have the chance, I would definitely suggest Matlab.
Write your results into a txt file and just import it into Matlab; then you can play with the numbers and plot them with almost no effort.
I am working on a project in Java and have two 2D int arrays, both 10x15. I want to compute the Mahalanobis distance between them. They are grouped in categories along the x axis of the array (size 10). I understand that you must find the mean value in these groups and redistribute the data so that it is centered. My problem now is generating the covariance matrix necessary for the calculation. If anyone knows a good way to do this, or can point to a useful guide that steps me through the process in 3D, it would be a great help. Thanks.
A covariance matrix contains the expected relationship between any two variables. Given a statistical distribution on a vector x, with statistical mean avg:
covariance(i,j) = expected value of [ (x[i] - avg[i])(x[j] - avg[j]) ]
Given a statistical set of N vectors v_1 ... v_N, with mean vector avg, you can estimate the covariance of the distribution they were taken from as follows:
sample_covariance(i,j) = sum[for k=1..N]( (v_k[i] - avg[i])*(v_k[j] - avg[j]) ) / (N-1)
This last is the covariance matrix you're looking for. I recommend you also read the wiki link above.
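In Java, the sample covariance formula above might be written like this. It is a rough sketch that assumes each row of the input is one observation vector v_k and that all rows have the same length; the class and method names are just for illustration.

```java
final class SampleCovariance {
    /** data[k] is the k-th observation vector; returns the d x d sample covariance matrix. */
    static double[][] covariance(double[][] data) {
        int n = data.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two observations");
        }
        int d = data[0].length;

        // Mean vector avg.
        double[] avg = new double[d];
        for (double[] row : data) {
            for (int i = 0; i < d; i++) {
                avg[i] += row[i];
            }
        }
        for (int i = 0; i < d; i++) {
            avg[i] /= n;
        }

        // sample_covariance(i,j) = sum over k of (v_k[i]-avg[i])*(v_k[j]-avg[j]) / (N-1)
        double[][] cov = new double[d][d];
        for (int i = 0; i < d; i++) {
            for (int j = 0; j <= i; j++) {
                double sum = 0.0;
                for (double[] row : data) {
                    sum += (row[i] - avg[i]) * (row[j] - avg[j]);
                }
                cov[i][j] = sum / (n - 1);
                cov[j][i] = cov[i][j]; // the matrix is symmetric
            }
        }
        return cov;
    }
}
```

For the Mahalanobis distance you would then need the inverse of this matrix, which a linear algebra library can compute.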
Having looked around this site for similar issues, I found this: http://math.nist.gov/javanumerics/jama/ and this: http://sujitpal.blogspot.com/2008/09/ir-math-with-java-similarity-measures.html
However, it seems these run in O(n^2). I've been doing some document clustering and noticed that this level of complexity isn't feasible even for small document sets. Given that, for the dot product, we only need the terms contained in both vectors, it should be possible to put the vectors in a tree and thus compute the dot product with n log n complexity, where n is the smaller number of unique terms among the 2 documents.
Am I missing something? Is there a java library which does this?
thanks
If you store the vector elements in a hash table, each lookup is expected O(1) (O(log n) if you use a tree-based map instead), no? Loop over all keys in the smaller document and see if they exist in the larger one.
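A minimal sketch of that idea, assuming each document is stored as a map from term to weight (the class name and the choice of String keys are only for illustration):

```java
import java.util.Map;

final class HashDot {
    /** Dot product of two sparse vectors stored as term -> weight maps. */
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        // Iterate over the smaller map and probe the larger one.
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = small == a ? b : a;
        double result = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) { // the term occurs in both documents
                result += e.getValue() * other;
            }
        }
        return result;
    }
}
```

With HashMap instances the loop does one expected constant-time lookup per term of the smaller document.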
Hashmap is good, but it might take a lot of memory.
If your vectors are stored as key-value pairs sorted by key then vector multiplication can be done in O(n): you just have to iterate in parallel over both vectors (the same iteration is used e.g. in merge sort algorithm). The pseudocode for multiplication:
i = 0
j = 0
result = 0
while i < length(vec1) && j < length(vec2):
    if vec1[i].key == vec2[j].key:
        result = result + vec1[i].value * vec2[j].value
        i = i + 1
        j = j + 1
    else if vec1[i].key < vec2[j].key:
        i = i + 1
    else:
        j = j + 1
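In Java, the same merge-style iteration could look like the sketch below. It assumes each sparse vector is held as two parallel arrays, keys sorted in ascending order and the matching values; that layout and the names are my own choice for illustration.

```java
final class SparseDot {
    /** Dot product of two sparse vectors whose keys are sorted in ascending order. */
    static double dot(int[] keysA, double[] valsA, int[] keysB, double[] valsB) {
        int i = 0;
        int j = 0;
        double result = 0.0;
        while (i < keysA.length && j < keysB.length) {
            if (keysA[i] == keysB[j]) {
                // Key present in both vectors: multiply and advance both sides.
                result += valsA[i] * valsB[j];
                i++;
                j++;
            } else if (keysA[i] < keysB[j]) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }
}
```

Each step advances at least one index, so the loop runs at most length(vec1) + length(vec2) times.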
If you are planning on using cosine similarity as a way of finding clusters of similar documents, you may want to consider looking into locality-sensitive hashing, a hash-based approach that was designed specifically with this in mind. Intuitively, LSH hashes the vectors in a way that with high probability places similar elements into the same bucket and distant elements into different buckets. There are LSH schemes that use cosine similarity as their underlying distance, so to find clusters you use LSH to drop things into buckets and then only compute the pairwise distances of elements in the same bucket. In the worst case this will be quadratic (if everything falls in the same bucket), but it's much more likely that you'll have a significant dropoff in work.
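To make the bucketing idea concrete, here is a rough sketch of one well-known cosine LSH scheme, random hyperplanes: each bit of the signature records which side of a random hyperplane the vector falls on, so vectors separated by a small angle tend to agree on most bits. The class name, the dense double[] representation and the limit of 32 bits are assumptions made for this illustration only.

```java
import java.util.Random;

final class CosineLsh {
    private final double[][] planes; // one random hyperplane normal per signature bit

    /** bits must be between 1 and 32 so that the signature fits into an int. */
    CosineLsh(int bits, int dim, long seed) {
        if (bits < 1 || bits > 32) {
            throw new IllegalArgumentException("bits must be in 1..32");
        }
        Random rng = new Random(seed);
        planes = new double[bits][dim];
        for (double[] plane : planes) {
            for (int i = 0; i < dim; i++) {
                plane[i] = rng.nextGaussian();
            }
        }
    }

    /** Bucket id of v: bit b is set when v lies on the non-negative side of plane b. */
    int signature(double[] v) {
        int sig = 0;
        for (int b = 0; b < planes.length; b++) {
            double dot = 0.0;
            for (int i = 0; i < v.length; i++) {
                dot += planes[b][i] * v[i];
            }
            if (dot >= 0.0) {
                sig |= 1 << b;
            }
        }
        return sig;
    }
}
```

Documents with the same signature land in the same bucket, and the exact cosine similarity is then computed only within a bucket. Fewer bits give larger buckets and fewer missed pairs; more bits give smaller buckets and less pairwise work.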
Hope this helps!
If you only want to recommend a limited number of items, say m items, for every item in a set of size n, the complexity need not be n^2 but m*n. Since m is a constant, the complexity is linear.
You can check out the project simbase, https://github.com/guokr/simbase ; it is a vector similarity NoSQL database.
Simbase uses the following concepts:
Vector set: a set of vectors
Basis: the basis for vectors, vectors in one vector set have same basis
Recommendation: a one-direction binary relationship between two vector sets which have the same basis