How to handle k-means when a cluster has zero elements in it - Java

I'm trying to implement k-means in Java and have encountered a case that throws all of my results off. This happens when, given some randomly chosen initial centroids, the data gets into a state where one of the centroids doesn't actually define a cluster. For example, if K=3, it could be that 2 of the centroids are closer to all of the data points, in which case during that iteration I will only have 2 clusters instead of 3.
As I understand k-means, though, when we reset the centroids we need to sum up all of the data points per cluster and divide by the size of the cluster (to get the average). So in this case we would have a cluster of size 0, and the new centroid would be
[0/0, 0/0, ... 0/0]
I have 2 questions about handling this case:
(1) How would we possibly recover from this if we've lost one of our clusters?
(2) Is there some way to account for the division by 0?
The code I have for this logic is as follows:
// do the sums
for (int i = 0; i < numDocuments; i++) {
    int value = label[i]; // get the document's label (i.e. 0, 1, 2)
    for (int j = 0; j < numWords; j++) {
        tempCentroids[value][j] += data[i][j];
    }
    tally[value]++;
}
// get the average
for (int i = 0; i < k; i++) {
    for (int j = 0; j < numWords; j++) {
        tempCentroids[i][j] /= (double) tally[i]; // could have division by zero
    }
    System.out.println("tally for centroid " + i + " is " + tally[i]);
}
Thanks in advance,

“For example, if K=3, it could be that 2 of the centroids are closer to all of the data points, in which case during that iteration, I will only have 2 clusters instead of 3”
I think you can always keep the centroid you chose for the third cluster in the third cluster, rather than letting it fall into some other cluster. That way you maintain the number of clusters and never run into the weird case you mentioned. (I am assuming you chose the K random centroids to be actual data points from your dataset.)
You might also want to look at the k-means++ algorithm, which is the same as k-means except for the cluster-center initialization step. It will (probably) lead to better clusterings.
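In code, one way to guard the averaging step from question (2) is a sketch like the following: when tally[i] is zero, skip the division and keep the previous centroid. It assumes the previous centroids are still available in an array called centroids (that name is an assumption here); re-seeding the empty centroid with a randomly chosen data point is another common option.
// get the average, but guard against empty clusters
for (int i = 0; i < k; i++) {
    if (tally[i] == 0) {
        // cluster i is empty: keep the previous centroid for it
        // (alternatively, re-seed it with a randomly chosen data point)
        tempCentroids[i] = centroids[i].clone();
        continue;
    }
    for (int j = 0; j < numWords; j++) {
        tempCentroids[i][j] /= (double) tally[i];
    }
}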

Related

Time complexity of iterative and recursive solutions

I'm trying to solve the following problem:
I feel like I've given it a lot of thought and tried a lot of approaches. I managed to solve it and produce correct values, but the problem is that it isn't time efficient enough: it passes 2 of the Kattis tests and fails on the third because the 1 second time limit was exceeded. There is no way for me to see the input they tested with, I'm afraid.
I started out with a recursive solution and finished that, but then I realised it wasn't time efficient enough, so I switched to an iterative solution instead.
I start by reading the input and adding the values to an ArrayList. Then I call the following method with target set to 1000.
public static int getCorrectWeight(List<Integer> platesArr, int target) {
    /* Creates two lists, one for storing completed values after each iteration,
       one for storing new values during iteration. */
    List<Integer> vals = new ArrayList<>();
    List<Integer> newVals = new ArrayList<>();
    // Inserts 0 as a first value so that we can start the first iteration.
    int best = 0;
    vals.add(best);
    for (int i = 0; i < platesArr.size(); i++) {
        for (int j = 0; j < vals.size(); j++) {
            int newVal = vals.get(j) + platesArr.get(i);
            if (newVal <= target) {
                newVals.add(newVal);
                if (newVal > best) {
                    best = newVal;
                }
            } else if ((Math.abs(target - newVal) < Math.abs(target - best))
                    || (Math.abs(target - newVal) == Math.abs(target - best) && newVal > best)) {
                best = newVal;
            }
        }
        vals.addAll(newVals);
    }
    return best;
}
My question is: is there some way I can reduce the time complexity of this for large inputs?
The main problem is that the size of vals and newVals can grow very quickly, since each iteration can double their size. You only ever need to store about 1000 distinct values, which is manageable; you are limiting the values, but because they're stored in an ArrayList you end up with a lot of duplicates.
If you use a HashSet instead, it should help the efficiency a lot.
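A minimal sketch of that change, keeping the same signature and logic as the method in the question, but storing the reachable sums in a set so duplicates are dropped:
import java.util.*;

public static int getCorrectWeight(List<Integer> platesArr, int target) {
    // Same idea as above, but a HashSet removes the duplicate sums
    // that make the ArrayList version blow up.
    Set<Integer> vals = new HashSet<>();
    vals.add(0);
    int best = 0;
    for (int plate : platesArr) {
        List<Integer> newVals = new ArrayList<>();
        for (int val : vals) {
            int newVal = val + plate;
            if (newVal <= target) {
                newVals.add(newVal);
                if (newVal > best) {
                    best = newVal;
                }
            } else if (Math.abs(target - newVal) < Math.abs(target - best)
                    || (Math.abs(target - newVal) == Math.abs(target - best) && newVal > best)) {
                best = newVal;
            }
        }
        vals.addAll(newVals);
    }
    return best;
}
The set can never hold more than target + 1 distinct values, so each plate costs at most about 1000 inner iterations.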
You only need to store a DP table of size 2001 (indices 0 to 2000).
Let dp[i] represent whether it is possible to form i kg of weights. If a weight goes over the array bounds, ignore it.
For example:
dp[0] = 1;
for (int i = 0; i < values.size(); i++) {
    for (int j = 2000; j >= values.get(i); j--) {
        dp[j] = Math.max(dp[j], dp[j - values.get(i)]);
    }
}
Here, values is the list holding the original weights. All entries of dp start at 0 except dp[0].
Then check whether 1000 can be made. If not, check 1001 and 999, then 1002 and 998, and so on (preferring the heavier sum on ties).
This runs in O(2000n) time; since n is at most 1000, it fits comfortably in the time limit.
By the way, this is a modified knapsack algorithm; you might want to look up some of its other variants.
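Putting the fill loop and the search step together, one possible complete method looks like this. This is only a sketch: the method name closestTo1000 is made up, it uses a boolean[] instead of the 0/1 ints above, and the weights are assumed to already be in a List<Integer>.
import java.util.List;

static int closestTo1000(List<Integer> values) {
    boolean[] dp = new boolean[2001]; // dp[i] == true if some subset sums to exactly i kg
    dp[0] = true;
    for (int v : values) {
        // weights above 2000 never enter this loop, i.e. they are ignored
        for (int j = 2000; j >= v; j--) {
            if (dp[j - v]) {
                dp[j] = true;
            }
        }
    }
    // check 1000 first, then 1001/999, 1002/998, ..., preferring the heavier sum on ties
    for (int d = 0; d <= 1000; d++) {
        if (dp[1000 + d]) {
            return 1000 + d;
        }
        if (dp[1000 - d]) {
            return 1000 - d;
        }
    }
    return 0; // unreachable: dp[0] is always true, so the loop returns by d = 1000
}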
If you think too generally about this type of problem, you may think you have to check all possible combinations of input (each weight can be included or excluded), giving you 2^n combinations to test if you have n inputs. This is, however, beside the point: the key here is that all weights are integers and that the goal is 1000.
Let's examine corner cases first, because that limits the search space.
If all weights are >= 1000, pick the smallest.
If there is at least one weight < 1000, it is always better than any weight >= 2000, so you can ignore any weight >= 2000 for combination purposes.
Then, apply dynamic programming. Keep a set (you got HashSet as a suggestion from another poster, but a BitSet is even better since the maximum value in it is so small) of all sums reachable with the first k inputs, and grow k by combining all previous sums with the (k+1)-th input.
When you have considered all possibilities, just search the bit vector for the best response.
static int count() {
    int[] weights = new int[]{900, 500, 498, 4};
    // Check for corner case to limit search later
    int min = Integer.MAX_VALUE;
    for (int weight : weights) min = Math.min(min, weight);
    if (min >= 1000) {
        return min;
    }
    // Get all interesting combinations
    BitSet combos = new BitSet();
    for (int weight : weights) {
        if (weight < 2000) {
            for (int t = combos.previousSetBit(2000 - weight); t >= 0; t = combos.previousSetBit(t - 1)) {
                combos.set(weight + t);
            }
            combos.set(weight);
        }
    }
    // Pick best combo
    for (int distance = 0; distance <= 1000; distance++) {
        if (combos.get(1000 + distance)) {
            return 1000 + distance;
        }
        if (combos.get(1000 - distance)) {
            return 1000 - distance;
        }
    }
    return 0;
}

Dynamic nested loops with dynamic bounds

I have a LinkedList<Point> points with values like:
10,20
15,30
13,43
.
.
I want to perform this kind of loop:
for (int i = points.get(0).x; i < points.get(0).y; i++) {
    for (int j = points.get(1).x; j < points.get(1).y; j++) {
        for (int k = points.get(2).x; k < points.get(2).y; k++) {
            ...
        }
    }
}
How can I do that if I don't know the size of the list?
There's probably a better way to solve equations like that with less CPU and memory consumption, but a brute-force approach like yours can be implemented via recursion, or with a helper structure to keep track of the state.
With recursion you could do it like this:
void permutate(List<Point> points, int pointIndex, int[] values) {
    Point p = points.get(pointIndex);
    for (int x = p.x; x < p.y; x++) {
        values[pointIndex] = x;
        // this assumes pointIndex to be between 0 and points.size() - 1
        if (pointIndex < points.size() - 1) {
            permutate(points, pointIndex + 1, values);
        } else { // pointIndex is equal to points.size() - 1 here
            // you have collected all intermediate values, so solve the equation
            // this is simplified, since you'd probably want to collect all values where the result is correct,
            // as well as pass the equation in somehow
            int result = solveEquation(values);
        }
    }
}
// initial call
List<Point> points = ...;
int[] values = new int[points.size()];
permutate(points, 0, values);
This first walks the points list using recursive calls, advancing the point index by one until the end of the list is reached. Each recursive call iterates over that point's value range and writes the current value into the array at the corresponding position. That array is then used to calculate the equation result.
Note that this might result in a stack overflow for huge equations (what counts as "huge" depends on the environment, but it is normally several thousand points). Performance can also be very poor if you check all permutations in any non-trivial case.
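If you would rather avoid recursion entirely, the "helper structure" approach can be sketched as an odometer over the indices. This assumes the same Point fields and solveEquation placeholder as above, and that every range is non-empty (p.x < p.y):
List<Point> points = ...;
int n = points.size();
int[] values = new int[n];
// start every counter at its lower bound
for (int i = 0; i < n; i++) {
    values[i] = points.get(i).x;
}
outer:
while (true) {
    int result = solveEquation(values); // innermost loop body
    // advance the "odometer": increment the last counter, carrying on overflow
    for (int pos = n - 1; ; pos--) {
        if (pos < 0) {
            break outer; // every counter rolled over: all combinations visited
        }
        values[pos]++;
        if (values[pos] < points.get(pos).y) {
            break; // no carry needed
        }
        values[pos] = points.get(pos).x; // reset and carry to the previous position
    }
}
This avoids the stack-depth limit, though the total amount of work is of course the same.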

Java Vectors: how to quickly "symmetrify" a large chunk of a huge sparse matrix

I have a huge sparse matrix (about 500K x 500K, with approximately 1% of the values being non-zero).
I'm using #mikera's Vectorz library.
t is a SparseRowMatrix composed of SparseIndexedVector rows.
For this chunk of the matrix, I compute weights for (i, j) where j > i, put them into a double array, and then create the SparseIndexedVector for the row from that array. I tried caching the weights so that for the parts of the row where j < i I could look up the previously computed value for (j, i) and reuse it for (i, j), but that took too much memory. So I am now trying to compute and fill in only the upper triangle for that chunk of the matrix, and then "symmetrify" it afterwards. The chunk runs from n1 x n1 to n2 x n2 (where n2 - n1 ≈ 100K).
Conceptually, this is what I need to do:
for (int i = n1; i < n2; i++) {
    for (int j = i + 1; j < n2; j++) {
        double w = t.get(i, j);
        if (w > 0) {
            t.set(j, i, w);
        }
    }
}
But the "random access" get and set operations are quite slow. I assume unsafeGet would be faster.
Would it improve performance to make the j-loop the outer loop, convert each row back to a double array, fill in the missing elements, and then create a new SparseIndexedVector from that array and replaceRow it back in? Something like:
for (int j = n1 + 1; j < n2; j++) {
    double[] jRowData = t.getRow(j).asDoubleArray();
    for (int i = 1; i < j - 1; i++) {
        double w = t.unsafeGet(i, j);
        if (w > 0) {
            jRowData[i] = w;
        }
    }
    SparseIndexedVector jRowVector = SparseIndexedVector.createLength(n);
    jRowVector.setElements(jRowData);
    t.replaceRow(j, jRowVector);
}
Would something like that likely be more efficient? (I haven't tried it yet, as testing things on such large arrays takes a long time, so I'm trying to get an idea of what is "likely" to work well first. I've tried various incarnations on a smaller array (1K x 1K), but I've found that what is faster on a small array is not necessarily the same as what is faster on a large array.)
Is there another approach I should take instead?
Also, since memory is also a large concern for me, would it be helpful at the end of the outer loop to release the array memory explicitly? Can I do that by adding jRowData = null;? I assume that would save time on GC but I'm not all that clear on how memory management works in Java (7 if it matters).
Thanks in advance for any suggestions you can provide.
A matrix is symmetric if m[x,y] is equal to m[y,x] for all x and y, so if you know that the matrix must be symmetric, storing both m[x,y] and m[y,x] is redundant.
You can avoid storing both by rearranging the inputs if they meet a certain condition:
void set(int row, int col, double value) {
    if (col < row) {
        // call set again with transposed col/row
        set(col, row, value);
        return;
    }
    // we now know that we're in the top half of the matrix, proceed like normal
    ...
}
Do something similar for the get method:
double get(int row, int col) {
    if (col < row) {
        // call get again with transposed col/row
        return get(col, row);
    }
    // we now know that we're in the top half of the matrix, proceed like normal
    ...
    return value;
}
This technique will allow you to both avoid storing redundant values, and force the matrix to be symmetric without having to do any extra processing.
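A tiny self-contained illustration of the same idea, using a HashMap as the backing store purely to keep the example independent of Vectorz (the class name and the storage choice are just for illustration):
import java.util.HashMap;
import java.util.Map;

// Minimal illustration of the index-swapping idea (not Vectorz): only entries with
// row <= col are ever stored, so the matrix is symmetric by construction.
class SymmetricSparse {
    private final Map<Long, Double> upper = new HashMap<>();
    private final int n;

    SymmetricSparse(int n) {
        this.n = n;
    }

    void set(int row, int col, double value) {
        if (col < row) {            // transpose into the upper triangle
            set(col, row, value);
            return;
        }
        upper.put((long) row * n + col, value);
    }

    double get(int row, int col) {
        if (col < row) {            // read from the upper triangle
            return get(col, row);
        }
        return upper.getOrDefault((long) row * n + col, 0.0);
    }
}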
The la4j library guarantees O(log n) performance (where n is the matrix dimension) for both get and set operations on sparse matrices like CRSMatrix or CCSMatrix. You can run a small experiment: just take your conceptual code from above and run it with la4j:
for (int i = n1; i < n2; i++) {
    for (int j = i + 1; j < n2; j++) {
        double w = t.get(i, j);
        if (w > 0) {
            t.set(j, i, w);
        }
    }
}
You said you have to handle 500K x 500K matrices. For that size you can expect roughly log2(500,000) ≈ 19 internal binary-search iterations to be performed on each call to get/set.
Just think about what ~19 loop iterations cost on a modern JVM and a modern CPU.
Hope this helps. Have fun.

Reservoir Sampling Algorithm

I want to understand the reservoir sampling algorithm, where we select k elements out of a given set S of elements such that k <= |S|.
In the algorithm given on Wikipedia:
array R[k];    // result
integer i, j;

// fill the reservoir array
for each i in 1 to k do
    R[i] := S[i]
done;

// replace elements with gradually decreasing probability
for each i in k+1 to length(S) do
    j := random(1, i);    // important: inclusive range
    if j <= k then
        R[j] := S[i]
    fi
done
If I understand this correctly, we first copy the first k elements of S into the reservoir, and then for each subsequent element i of S we generate a random number j in the range 1 to i and, if j <= k, replace element j of the reservoir with S[i].
This looks fine when the set to be sampled is very large, but if I want to pick just 1 element at random from a linked list of effectively infinite (or at least unknown) size, how do I do it with this algorithm?
The reservoir sampling algorithm works on a linked list of any size, even one whose length is unknown in advance. In fact, one of the main selling points of reservoir sampling is that it works on data streams whose size is not known in advance.
If you set k = 1 and then run the normal reservoir sampling algorithm, then you should correctly get a uniformly random element from the list.
Hope this helps!
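For the k = 1 case the whole algorithm collapses to a few lines. A sketch in Java (sampleOne is a made-up name; the data is consumed as an Iterable, so its length never has to be known):
import java.util.Random;

// Reservoir sampling with k = 1: after the whole stream has been read,
// each element ends up as the chosen one with probability 1/n.
static <T> T sampleOne(Iterable<T> stream) {
    Random rand = new Random();
    T chosen = null;
    int seen = 0;
    for (T item : stream) {
        seen++;
        // keep the new item with probability 1/seen
        if (rand.nextInt(seen) == 0) {
            chosen = item;
        }
    }
    return chosen; // null if the stream was empty
}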
I have implemented a different algorithm to solve this problem; here is my code:
static char[] solution2(String stream, int K) {
    HashSet<Integer> set = new HashSet<>();
    char[] list = new char[K];
    Random ran = new Random();
    for (int i = 0; i < K; i++) {
        int y = ran.nextInt(stream.length());
        if (set.add(y)) {
            list[i] = stream.charAt(y);
        } else {
            i--; // skip this iteration since it's a duplicate index
        }
    }
    return list;
}
Instead of iterating over all of the stream's values, I just pick a random index j and read the element at that position.

Do running times match with O(nlogn)?

I have written a class (greedy strategy) in which I first used the sort method, which is O(n log n):
Collections.sort(array, new SortingObjectsWithProbabilityField());
and then used the insert method of a binary search tree, which takes O(h), where h is the tree height.
For different n, the running times are:
n      running time (ns)
17     515428
33     783340
65     540572
129    1285080
257    2052216
513    4299709
which I think is not correct, because the running time should increase more or less consistently as n increases.
This is the method that measures the running time:
int exponent = -1;
for (int n = 2; n < 1000; n += Math.pow(2, exponent)) {
    for (int j = 1; j <= 3; j++) {
        Random rand = new Random();
        for (int i = 0; i < n; i++) {
            Element e = new Element(rand.nextInt(100) + 1, rand.nextInt(100) + 1, 0);
            for (int k = 0; k < i; k++) {
                if (e.getDigit() == randList.get(k).getDigit()) {
                    e.setDigit(e.getDigit() + 1);
                }
            }
            randList.add(e);
        }
        double sum = 0.0;
        for (int i = 0; i < randList.size(); i++) {
            sum += randList.get(i).getProbability();
        }
        for (Element i : randList) {
            i.setProbability(i.getProbability() / sum);
        }
        // Get time.
        long t2 = System.nanoTime();
        GreedyVersion greedy = new GreedyVersion((ArrayList<Element>) randList);
        long t3 = System.nanoTime();
        timeForGreedy = timeForGreedy + t3 - t2;
    }
    System.out.println(n + "," + timeForGreedy / 3);
    exponent++;
}
thanks
Your data appears to roughly fit an order of n log n, as a fitted plot shows. Notice that the curve is almost linear, since for large values of n, log n grows very slowly (for your largest value, n = 513, log n is only about 9.003).
There are ways to achieve more accurate timings, which would likely make the curve fit the data points better: take a larger sample of random inputs (at least 10, 100 if possible) and run multiple iterations per dataset (5 is an acceptable number) to smooth out the inaccuracies of the timer. You can use a single start/stop timer to time all iterations for the same n and then divide by the number of runs to get more accurate data points. Just be sure to generate all the data sets first, store them all, and then run them all.
Sampling n at powers of 2 is a good choice. You might just want to subtract 1 to make them exactly powers of 2, not that it makes any real impact.
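A rough sketch of that measurement scheme, reusing the Element and GreedyVersion classes from the question (averageNanosFor is a made-up helper name):
// Time `runs` constructions with a single start/stop pair, then divide by
// the number of runs to get the average time for this data set.
static long averageNanosFor(ArrayList<Element> dataset, int runs) {
    long start = System.nanoTime();
    for (int run = 0; run < runs; run++) {
        GreedyVersion greedy = new GreedyVersion(dataset);
    }
    long end = System.nanoTime();
    return (end - start) / runs;
}

// usage: one data point per n, averaged over e.g. 5 runs
System.out.println(n + "," + averageNanosFor(randList, 5));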
For reference, here's the gnuplot script used to generate the plot:
set terminal png
set output 'graph.png'
set xrange [0:5000000]
set yrange [0:600]
f1(x) = a1*x*log(x)/log(2)
a1 = 1000
plot 'time.dat' title 'Actual runtimes', \
a1*x*log(x)/log(2) title 'Fitted curve: O(nlogn)'
fit f1(x) 'time.dat' via a1
It's not that easy to relate asymptotic complexity to running times. When the sample is this small, there are lots of things that affect your timing.
To get more accurate timings you should run your algorithm K times per instance (e.g. K times with n = 17, K times with n = 33, and so forth) and take the average time as the sample point (e.g. K = 100).
That said, it looks about right. You can plot n log(n) against your timings and you'll see that, despite the different scales, they grow similarly. There are still too few sample points to be sure, though...
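One quick way to do that check without plotting (a sketch; printRatios is a made-up helper) is to look at the ratio time / (n log2 n). If the ratio stays roughly constant across your measurements, the growth is consistent with O(n log n):
// If time / (n * log2(n)) stays roughly constant, the timings are consistent with n log n growth.
static void printRatios(int[] ns, long[] times) {
    for (int i = 0; i < ns.length; i++) {
        double nLogN = ns[i] * (Math.log(ns[i]) / Math.log(2));
        System.out.printf("n=%d  time=%d  time/(n log n)=%.1f%n", ns[i], times[i], times[i] / nLogN);
    }
}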
