How to implement k-means for simple grouping in Java

I would like to know a simple k-means algorithm in Java. I want to use k-means only for grouping a one-dimensional array, not a multi-dimensional one.
For example,
Before grouping, the array consists of 2, 4, 7, 5, 12, 34, 18, 25.
If we want four groups, then we get:
group 1: 2,4,5
group 2: 7,12
group 3: 18,25
group 4: 34

You can take a look at the Weka implementation or simply use the Weka API if all you need are the clusters and not the implementation.

The standard (heuristic) algorithm for K-means clustering is presented on the Wikipedia page, together with links to variations and some existing implementations.
(This is a programming forum, so it is reasonable to assume that you are capable of writing the Java code yourself ... if you cannot find an existing implementation that is suitable.)
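For concreteness, here is a minimal sketch of that standard heuristic (Lloyd's algorithm) specialized to a one-dimensional array. The class and method names, and the naive seeding, are my own choices rather than anything from a library:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A minimal sketch of one-dimensional k-means (Lloyd's algorithm).
public class OneDimKMeans {

    public static List<List<Double>> cluster(double[] data, int k, int maxIterations) {
        // naive seeding: spread the initial centroids over the sorted data
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double[] centroids = new double[k];
        for (int c = 0; c < k; c++) {
            centroids[c] = sorted[c * (data.length - 1) / Math.max(1, k - 1)];
        }

        int[] assignment = new int[data.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // assignment step: each point goes to the nearest centroid
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(data[i] - centroids[c]) < Math.abs(data[i] - centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            // update step: move each centroid to the mean of its assigned points
            double[] sum = new double[k];
            int[] count = new int[k];
            for (int i = 0; i < data.length; i++) {
                sum[assignment[i]] += data[i];
                count[assignment[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
            }
            if (!changed) break; // no assignment moved, so we have converged
        }

        List<List<Double>> groups = new ArrayList<>();
        for (int c = 0; c < k; c++) groups.add(new ArrayList<>());
        for (int i = 0; i < data.length; i++) groups.get(assignment[i]).add(data[i]);
        return groups;
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 7, 5, 12, 34, 18, 25};
        List<List<Double>> groups = cluster(data, 4, 100);
        for (int c = 0; c < groups.size(); c++) {
            System.out.println("group " + (c + 1) + ": " + groups.get(c));
        }
    }
}

Running it on the example array with four clusters produces a reasonable grouping, though the exact groups depend on the seeding, since k-means only converges to a local optimum.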

You can implement k-means with Weka's SimpleKMeans as follows (assuming that instances has already been loaded, e.g. from an ARFF file, and that numberOfClusters is set):
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;

SimpleKMeans kmeans = new SimpleKMeans();
kmeans.setSeed(10);
// This is the important parameter to set: it is needed so that getAssignments() works
kmeans.setPreserveInstancesOrder(true);
kmeans.setNumClusters(numberOfClusters);
kmeans.buildClusterer(instances);
// This array holds the cluster number (starting with 0) for each instance;
// it has as many elements as there are instances
int[] assignments = kmeans.getAssignments();
int i = 0;
for (int clusterNum : assignments) {
    System.out.printf("Instance %d -> Cluster %d%n", i, clusterNum);
    i++;
}

You can check my software: the SPMF data mining software.
It offers an efficient implementation of k-means in just three files, so it should be easy to understand.
The software also offers many other algorithms, but you don't need them for this.
There is also a graphical user interface for launching k-means and the other algorithms.

Related

Calculating the principal axis/eigenvalues and -vectors of large dataset in Java

I have a large dataset (>500,000 elements) that contains the stress values (σ_xx, σ_yy, σ_zz, τ_xy, τ_yz, τ_xz) of FEM elements. These stress values are given in the global xyz coordinate space of the model. I want to calculate the principal stress values and directions from those. If you're not that familiar with the physics behind it, this means taking the symmetric matrix
| σ_xx τ_xy τ_xz |
| τ_xy σ_yy τ_yz |
| τ_xz τ_yz σ_zz |
and calculating its eigenvalues and eigenvectors. Calculating each set of eigenvalues and -vectors on its own is too slow. I'm looking for a library, an algorithm or something in Java that would allow me to do this as array calculations. As an example, in python/numpy I could just take all my 3x3-matrices, stack them along a third dimension to get a nx3x3-array, and pass that to np.linalg.eig(arr), and it automatically gives me an nx3-array for the three eigenvalues and an nx3x3-array for the three eigenvectors.
Things I tried:
nd4j has an Eigen-module for calculating eigenvalues and -vectors, but only supports a single square array at a time.
Calculate the characteristic polynomial and use Cardano's formula to get the roots/eigenvalues - possible to do for the whole array at once, but I'm stuck now on how to get the corresponding eigenvectors. Is there maybe a general, simple algorithm to get from those to the eigenvectors?
Looking for an analytical form of the eigenvalues and -vectors that can be calculated directly: It does exist, but just no.
You'll need to write a little code.
I'd create or use a Matrix class as a dependency and find methods to give you eigenvalues and eigenvectors. The ones you found in nd4j sound like great candidates. You might also consider the Linear Algebra For Java (LA4J) dependency.
Load the dataset into a List<Matrix>.
Use functional Java methods to apply a map to give you a List of eigenvalues as a vector per stress matrix and a List of eigenvectors as a matrix per stress matrix.
You can optimize this calculation by applying the map function to a parallel stream; Java will then parallelize the work under the covers to leverage the available cores.
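To make that recipe concrete, here is a minimal sketch using Apache Commons Math for the per-matrix eigendecomposition and a parallel stream for the map. Commons Math is only one option (the answer above suggests nd4j or LA4J for the same role), and the wrapper class and method names below are my own:

import java.util.List;
import java.util.stream.Collectors;
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.EigenDecomposition;
import org.apache.commons.math3.linear.RealMatrix;

public class PrincipalStress {

    // Build the symmetric 3x3 stress tensor from one element's values and eigen-decompose it.
    // s = {sxx, syy, szz, txy, tyz, txz}
    static EigenDecomposition decompose(double[] s) {
        RealMatrix m = new Array2DRowRealMatrix(new double[][] {
                {s[0], s[3], s[5]},
                {s[3], s[1], s[4]},
                {s[5], s[4], s[2]}
        });
        return new EigenDecomposition(m);
    }

    // Map every element's stress values to its principal stresses (eigenvalues).
    // ed.getEigenvector(i) would give the corresponding principal direction.
    static List<double[]> principalStresses(List<double[]> stressRows) {
        return stressRows.parallelStream()
                .map(PrincipalStress::decompose)
                .map(EigenDecomposition::getRealEigenvalues)
                .collect(Collectors.toList());
    }
}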
Follow-up: This is the way that worked best for me, as I can do all operations without iterating over every element. As stated above, I'm using Nd4j, which seems to be limited in its possibilities compared to numpy (or maybe I just didn't read the documentation thoroughly enough). The following method uses only basic array operations:
From the given stress values, calculate the eigenvalues using Cardano's formula. Only element-wise instructions (add, sub, mul, div, pow) are needed to do that. The result should be three vectors of size n, each containing one eigenvalue for all elements (a scalar sketch of this step is given after these steps).
Use the formula given here to calculate the matrix S for each eigenvalue. Like step 1, this can obviously also be done using only element-wise operations on the stress-value and eigenvalue vectors, in order to avoid specifying complicated instructions about which array to multiply along which axis while keeping whatever other axis.
Take one column from S and normalize it to get a normalized eigenvector for the given eigenvalue.
Note that this method only works if you have a real symmetric matrix. You also should make sure to properly deal with cases where the same eigenvalue appears multiple times.
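To illustrate step 1, here is the scalar form of the closed-form (trigonometric/Cardano-style) eigenvalue computation for a symmetric 3x3 matrix; in the vectorized Nd4j version each of these scalar operations becomes the corresponding element-wise operation applied to whole vectors. The method name and the clamping of the acos argument are my own choices:

// Closed-form eigenvalues of a symmetric 3x3 matrix (scalar version of step 1).
static double[] symmetricEigenvalues3x3(double a11, double a22, double a33,
                                        double a12, double a23, double a13) {
    double p1 = a12 * a12 + a13 * a13 + a23 * a23;
    if (p1 == 0.0) {
        // the matrix is already diagonal
        return new double[] {a11, a22, a33};
    }
    double q = (a11 + a22 + a33) / 3.0;
    double p2 = (a11 - q) * (a11 - q) + (a22 - q) * (a22 - q)
              + (a33 - q) * (a33 - q) + 2.0 * p1;
    double p = Math.sqrt(p2 / 6.0);
    // B = (A - q*I) / p; r = det(B) / 2, clamped to [-1, 1] against rounding error
    double b11 = (a11 - q) / p, b22 = (a22 - q) / p, b33 = (a33 - q) / p;
    double b12 = a12 / p, b23 = a23 / p, b13 = a13 / p;
    double detB = b11 * (b22 * b33 - b23 * b23)
                - b12 * (b12 * b33 - b23 * b13)
                + b13 * (b12 * b23 - b22 * b13);
    double r = Math.max(-1.0, Math.min(1.0, detB / 2.0));
    double phi = Math.acos(r) / 3.0;
    double e1 = q + 2.0 * p * Math.cos(phi);
    double e3 = q + 2.0 * p * Math.cos(phi + 2.0 * Math.PI / 3.0);
    double e2 = 3.0 * q - e1 - e3; // from the trace invariant
    return new double[] {e1, e2, e3};
}

The returned values are ordered e1 >= e2 >= e3, which matches the usual convention for principal stresses.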

Fast string retrieval based on metric distance in Java

Given an arbitrary string s, I would like a method to quickly retrieve all strings S ⊆ M from a large set of strings M (where |M| > 1 million), where all strings of S have minimal edit distance < t (some minimum threshold) from s.
At worst, S may be empty if no strings in M match this criteria, and at best, S = {s} (an exact match). For any case in between, I completely expect that S may be quite large.
In general, I expect to have the maximum edit distance threshold fixed (e.g., 2), and need to perform this operation very many times over arbitrary strings s, thus the need for an efficient method, as naively iterating and testing all strings would be too expensive.
While I have used edit distance as an example metric, I would like to use other metrics as well, such as the Jaccard index.
Can anyone make a suggestion about an existing Java implementation which can achieve this, or point me to the right algorithms and data structures for solving this problem?
UPDATE #1
I have since learned that metric trees are precisely the kind of structure I am after: they exploit the distance metric to organise subsets of strings in M based on their distance from each other. Vantage-point trees, BK-trees, and other similar metric tree data structures and algorithms all seem ideal for this kind of problem. Now, to find easy-to-use implementations in Java...
UPDATE #2
Using a combination of this bk-tree and this Levenshtein distance implementation, I'm successfully able to retrieve subsets against arbitrary strings from a set (M) of one million strings with retrieval times of around 10ms.
BK-trees are designed for exactly this case. They work with metric distances, such as the Levenshtein distance or the Jaccard distance.
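To make that concrete, here is a minimal BK-tree sketch over Levenshtein distance; the class and method names are my own, and it is not tuned for a million-string set, but it shows the structure and the triangle-inequality pruning that makes retrieval fast:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal BK-tree over Levenshtein distance (sketch only).
public class BkTree {

    private static final class Node {
        final String word;
        final Map<Integer, Node> children = new HashMap<>();
        Node(String word) { this.word = word; }
    }

    private Node root;

    public void add(String word) {
        if (root == null) { root = new Node(word); return; }
        Node node = root;
        while (true) {
            int d = levenshtein(word, node.word);
            if (d == 0) return; // already present
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(word)); return; }
            node = child;
        }
    }

    // All words within edit distance <= t of the query.
    public List<String> search(String query, int t) {
        List<String> result = new ArrayList<>();
        if (root != null) search(root, query, t, result);
        return result;
    }

    private void search(Node node, String query, int t, List<String> result) {
        int d = levenshtein(query, node.word);
        if (d <= t) result.add(node.word);
        // triangle inequality: only children keyed in [d - t, d + t] can contain matches
        for (int k = Math.max(0, d - t); k <= d + t; k++) {
            Node child = node.children.get(k);
            if (child != null) search(child, query, t, result);
        }
    }

    // Standard two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }
}

Insertion uses the distance to the current node as the child key; search only descends into children whose key lies within [d - t, d + t], which is what prunes most of the set.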
Although I never tried it myself, it might be worth looking at a Levenshtein Automaton. I once bookmarked this article, which looks rather elaborate and provides several code snippets:
Damn Cool Algorithms: Levenshtein Automata
As already mentioned by H W, you will not be able to avoid checking each word in your dictionary. However, the automaton will speed up calculating the distance. Combine this with an efficient data structure for your dictionary (e.g. a Trie, as mentioned in the Wikipedia article), and you might be able to accelerate your current approach.

Java library that has matlab equivalent functions

I am looking for a Java library that closely mirrors MATLAB's matrix functions and possibly other functions in areas such as polynomial interpolation.
If such a library does not exist, I was toying with the idea of building my own, using an existing matrix or scientific computing library to do the heavy lifting. If I were to do that, which libraries would be candidates to serve as backends for such an effort?
Eigen, one of the most used (and fastest) libraries for matrix computation in C++, has a Java wrapper: jeigen.
It allows one to manipulate full and sparse matrices and perform operations on them. It can also be worth trying.
Check out the following resources/packages
http://math.nist.gov/javanumerics/jama/
http://www.jscience.org/
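As a quick illustration of how MATLAB-style operations look with JAMA (the first link above), here is a minimal sketch; the matrices are arbitrary example data:

import Jama.EigenvalueDecomposition;
import Jama.Matrix;

// A few MATLAB-style operations with JAMA (sketch only).
public class JamaDemo {
    public static void main(String[] args) {
        Matrix a = new Matrix(new double[][] {{4, 1}, {1, 3}});   // A = [4 1; 1 3]
        Matrix b = new Matrix(new double[][] {{1}, {2}});         // b = [1; 2]

        Matrix x = a.solve(b);                 // x = A \ b
        Matrix at = a.transpose();             // A'
        Matrix prod = a.times(at);             // A * A'

        EigenvalueDecomposition eig = a.eig(); // eig(A)
        double[] eigenvalues = eig.getRealEigenvalues();
        Matrix eigenvectors = eig.getV();

        x.print(8, 4);                         // formatted output, roughly like disp(x)
        System.out.println(eigenvalues[0] + ", " + eigenvalues[1]);
        eigenvectors.print(8, 4);
        prod.print(8, 4);
    }
}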
Take a look at la4j (Linear Algebra for Java). It supports dense matrices as well as sparse ones. Here is just a brief example of using the functional features of la4j:
// reads a dense matrix from a CSV file
Matrix a = new Basic2DMatrix(Matrices.asSymbolSeparatedSource("matrix.csv", ","));
// calculates the sum of all elements of the matrix 'a'
double sum = a.fold(Matrices.asSumAccumulator(0));
// creates a new matrix 'b' that contains the elements of matrix 'a' multiplied by '2'
Matrix b = a.transform(Matrices.asMulFunction(2));
The best way to get the latest version of la4j is to visit its GitHub page.
I use the Colt library for matrix operations.
See: http://acs.lbl.gov/software/colt/api/index.html
I think it's really good and easy to use, and it is better than Apache Commons Math and EJML, which I have already tried.
I suggest you try all of the libraries mentioned and choose the one that is closest to your needs.

Lucene Scoring Function - bias towards shorter documents

I want Lucene Scoring function to have no bias based on the length of the document. This is really a follow up question to Calculate the score only based on the documents have more occurance of term in lucene
I was wondering how Field.setOmitNorms(true) works? I see that there are two factors that make short documents get a high score:
"boost" that shorter length posts - using doc.getBoost()
"lengthNorm" in the definition of norm(t,d)
Here is the documentation
I was wondering - if I wanted no bias towards shorter documents, is Field.setOmitNorms(true) enough?
Using BM25Similarity, you could reduce the b parameter to 0f:
@param b Controls to what degree document length normalizes tf values
or adjust
@param k1 Controls non-linear term frequency normalization (saturation).
Both parameters will affect the SimWeight.
indexSearcher.setSimilarity(new BM25Similarity(1.2f,0f));
More explanation can be found here : http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
Shorter docs are meant to be more relevant when you use TF-IDF scoring.
You can use your own custom scoring function in Lucene. It's easy to customize the scoring algorithm: subclass DefaultSimilarity and override the method you want to customize.
There's a code sample here that will help you implement it
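To illustrate that approach: assuming a Lucene 4.x-style API (where DefaultSimilarity exposes lengthNorm), a sketch of a similarity with no length bias might look like the following; the class name is my own:

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// A similarity that ignores document length when computing norms (sketch only).
public class NoLengthNormSimilarity extends DefaultSimilarity {
    @Override
    public float lengthNorm(FieldInvertState state) {
        // Return a constant so that short and long documents are normalized identically.
        return 1.0f;
    }
}

You would then set it on both the IndexWriterConfig and the IndexSearcher (norms are baked in at index time), e.g. indexWriterConfig.setSimilarity(new NoLengthNormSimilarity()) and indexSearcher.setSimilarity(new NoLengthNormSimilarity()), and reindex.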

Sorting large data using MapReduce/Hadoop

I am reading about MapReduce and the following thing is confusing me.
Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood to go about it is as follows:
Write a mapper function that sorts integers. So the framework will divide the input file into multiple chunks and give them to different mappers. Each mapper will sort its chunk of data independently of the others. Once all the mappers are done, we will pass each of their results to the reducer, and it will combine the results and give me the final output.
My doubt is: if we have one reducer, how does this leverage the distributed framework if, eventually, we have to combine the result in one place? The problem boils down to merging 1 million entries in one place. Is that so, or am I missing something?
Thanks,
Chander
Check out merge-sort.
It turns out that sorting partially sorted lists is much more efficient in terms of operations and memory consumption than sorting the complete list.
If the reducer gets 4 sorted lists it only needs to look for the smallest element of the 4 lists and pick that one. If the number of lists is constant this reducing is an O(N) operation.
Also, the reducers are typically "distributed" in something like a tree, so the work can be parallelized too.
As others have mentioned, merging is much simpler than sorting, so there's a big win there.
However, doing an O(N) serial operation on a giant dataset can be prohibitive, too. As you correctly point out, it's better to find a way to do the merge in parallel, as well.
One way to do this is to replace the partitioning function, moving from the random partitioner (which is what's normally used) to something a bit smarter. What Pig does for this, for example, is sample your dataset to come up with a rough approximation of the distribution of your values, and then assign ranges of values to different reducers. Reducer 0 gets all elements < 1000, reducer 1 gets all elements >= 1000 and < 5000, and so on. Then you can do the merge in parallel, and the end result is sorted because you know which key range each reducer task handled.
So the simplest way to sort using map-reduce (though not the most efficient one) is to do the following:
During the map phase:
for each (Input_Key, Input_Value), emit (Input_Value, Input_Key)
The reducer is an identity reducer.
So, for example, if our data is a (student, age) database, then your mapper input would be
('A', 1) ('B', 2) ('C', 10) ... and the output would be
(1, A) (2, B) (10, C)
I haven't tried this logic out, but it is a step in a homework problem I am working on. I will post an update with source code and a link to the logic.
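For reference, a minimal sketch of that mapper/identity-reducer pair with the Hadoop mapreduce API might look like this; the input format (comma-separated "student,age" lines) and the class names are assumptions:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SortByAge {

    // Mapper: swap (student, age) to (age, student) so the shuffle sorts by age.
    public static class SwapMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // assumes each input line looks like "A,1"
            String[] parts = line.toString().split(",");
            context.write(new IntWritable(Integer.parseInt(parts[1].trim())),
                          new Text(parts[0].trim()));
        }
    }

    // Identity reducer: the keys arrive sorted, so just write them back out.
    public static class IdentityReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable age, Iterable<Text> students, Context context)
                throws IOException, InterruptedException {
            for (Text student : students) {
                context.write(age, student);
            }
        }
    }
}

With a single reducer the final output is globally sorted by age; with several reducers each part file is sorted, but you need a total-order partitioner (as discussed below) for a globally sorted result.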
Sorry for being late, but for future readers: yes, Chander, you are missing something.
The logic is that a reducer can handle only the shuffled and then sorted data of the node on which it is running. I mean, a reducer that runs on one node can't look at another node's data; it applies the reduce algorithm on its own data only. So the merging procedure of merge sort can't be applied across nodes.
So for big data we use TeraSort, which is nothing but an identity mapper and reducer with a custom partitioner. You can read more about it in Hadoop's implementation of TeraSort. It states:
"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1."
I think combining multiple sorted items is more efficient than combining multiple unsorted items. So the mappers do the task of sorting chunks and the reducer merges them. Had the mappers not done the sorting, the reducer would have a tough time doing the sorting.
Sorting can be efficiently implemented using MapReduce. But you seem to be thinking about implementing merge-sort using mapreduce to achieve this purpose. It may not be the ideal candidate.
Like you alluded to, the merge sort (with map-reduce) would involve the following steps:
Partition the elements into small groups and assign each group to the mappers in a round-robin manner.
Each mapper will sort its subset and return {K, {subset}}, where K is the same for all the mappers.
Since the same K is used across all mappers, there is only one reduce key and hence only one reducer. The reducer can then merge the data and return the sorted result.
The problem here is that, like you mentioned, there can be only one reducer, which precludes parallelism during the reduce phase. Like it was mentioned in other replies, mapreduce-specific implementations like TeraSort can be considered for this purpose.
Found the explanation at http://www.chinacloud.cn/upload/2014-01/14010410467139.pdf
Coming back to merge sort, this would be feasible if the Hadoop (or equivalent) tool provided a hierarchy of reducers, where the output of one level of reducers goes to the next level of reducers or loops back to the same set of reducers.
