I am implementing k-means and I want to create the new centroids. But the mapping leaves one element out! When K is smaller, e.g. 15, it works fine.
This is the code I have:
val K = 25 // number of clusters
val data = sc.textFile("dense.txt")
  .map(t => (t.split("#")(0), parseVector(t.split("#")(1)))).cache()
val count = data.count()
println("Number of records " + count)
var centroids = data.takeSample(false, K, 42).map(x => x._2)
do {
  var closest = data.map(p => (closestPoint(p._2, centroids), p._2))
  var pointsGroup = closest.groupByKey()
  println(pointsGroup)
  pointsGroup.foreach { println }
  var newCentroids = pointsGroup.mapValues(ps => average(ps.toSeq)).collectAsMap()
  //var newCentroids = pointsGroup.mapValues(ps => average(ps)).collectAsMap() this will produce an error
  println(centroids.size)
  println(newCentroids.size)
  for (i <- 0 until K) {
    tempDist += centroids(i).squaredDist(newCentroids(i))
  }
  ..
In the for loop, I get an error that a key cannot be found (the missing key is not always the same and depends on K):
java.util.NoSuchElementException: key not found: 2
Output before the error comes up:
Number of records 27776
ShuffledRDD[5] at groupByKey at kmeans.scala:72
25
24 <- IT SHOULD BE 25
What is the problem?
>>> println(newCentroids)
Map(23 -> (-0.0050852959701492536, 0.005512245104477607, -0.004460964477611937), 17 -> (-0.005459583045685268, 0.0029015278781725795, -8.451635532994901E-4), 8 -> (-4.691649213483123E-4, 0.0025375451685393366, 0.0063490755505617585), 11 -> (0.30361112034069937, -0.0017342255382385204, -0.005751167731061906), 20 -> (-5.839587918939964E-4, -0.0038189763756820145, -0.007067070459859708), 5 -> (-0.3787612396704685, -0.005814121628643806, -0.0014961713117870657), 14 -> (0.0024755681263616547, 0.0015191503267973836, 0.003411769193899781), 13 -> (-0.002657690932944597, 0.0077671050923225635, -0.0034652379980563263), 4 -> (-0.006963114731610361, 1.1751361829025871E-4, -0.7481135105367823), 22 -> (0.015318187079953534, -1.2929035958285013, -0.0044176372190034684), 7 -> (-0.002321059060773483, -0.006316359116022083, 0.006164669723756913), 16 -> (0.005341800955165691, -0.0017540737037037035, 0.004066574093567247), 1 -> (0.0024547379611650484, 0.0056298656504855955, 0.002504618082524296), 10 -> (3.421068671121009E-4, 0.0045169004751299275, 5.696239049740164E-4), 19 -> (-0.005453716071428539, -0.001450277556818192, 0.003860007248376626), 9 -> (-0.0032921685273631807, 1.8477108457711313E-4, -0.003070412228855717), 18 -> (-0.0026803160958904053, 0.00913904078767124, -0.0023528013698630146), 3 -> (0.005750011594202901, -0.003607098309178754, -0.003615918896940412), 21 -> (0.0024925166025641056, -0.0037607353461538507, -2.1588444871794858E-4), 12 -> (-7.920202960526356E-4, 0.5390774232894769, -4.928884539473694E-4), 15 -> (-0.0018608492323232324, -0.006973787272727284, -0.0027266663434343404), 24 -> (6.151173211963486E-4, 7.081812613784045E-4, 5.612962808842611E-4), 6 -> (0.005323933953732931, 0.0024014750473186123, -2.969338590956889E-4), 0 -> (-0.0015991676750160377, -0.003001317289659613, 0.5384176139563245))
Question with relevant error: spark scala throws java.util.NoSuchElementException: key not found: 0 exception
EDIT:
After zero323 observed that two centroids were the same, I changed the code so that all the initial centroids are unique. However, the behaviour remained the same. For that reason, I suspect that closestPoint() may return the same index for two different centroids. Here is the function:
def closestPoint(p: Vector, centers: Array[Vector]): Int = {
  var bestIndex = 0
  var closest = Double.PositiveInfinity
  for (i <- 0 until centers.length) {
    val tempDist = p.squaredDist(centers(i))
    if (tempDist < closest) {
      closest = tempDist
      bestIndex = i
    }
  }
  bestIndex
}
How can I get around this? I am running the code as described above on a Spark cluster.
It can happen in the "E-step" (the assignment of points to cluster indices, analogous to the E-step of the EM algorithm) that one of your indices is not assigned any points. If this happens, you need a way of associating that index with some point, otherwise you're going to wind up with fewer clusters after the "M-step" (the assignment of centroids to the indices, analogous to the M-step of the EM algorithm). Something like this should probably work:
val newCentroids = {
  val temp = pointsGroup.mapValues(ps => average(ps.toSeq)).collectAsMap()
  val nMissing = K - temp.size
  // keep only the vectors, matching how the initial centroids are built
  val sample = data.takeSample(false, nMissing, seed).map(_._2)
  var c = -1
  (for (i <- 0 until K) yield {
    val point = temp.getOrElse(i, { c += 1; sample(c) })
    (i, point)
  }).toMap
}
Just substitute that code for the line you are currently using to compute newCentroids.
There are other ways of dealing with this issue, and the approach above is probably not the best (is it a good idea to be calling takeSample multiple times, once for each iteration of the k-means algorithm? what if data contains a lot of repeated values? etc.), but it is a simple starting point.
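For instance, one simpler alternative (just a sketch, reusing the centroids array from the previous iteration) is to keep the old centroid for any index that received no points, which avoids resampling entirely:
// Sketch: reuse the previous centroid whenever a cluster ends up empty.
val newCentroids = {
  val temp = pointsGroup.mapValues(ps => average(ps.toSeq)).collectAsMap()
  (0 until K).map(i => (i, temp.getOrElse(i, centroids(i)))).toMap
}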
By the way, you might want to think about how you can replace the groupByKey with a reduceByKey.
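For example, here is a minimal sketch of that idea, assuming your Vector type supports element-wise + and scalar * (as in the classic Spark k-means example):
// Sum the points and count them per cluster in a single pass, then divide.
// This avoids materialising the per-key collections that groupByKey builds.
val stats = closest
  .mapValues(v => (v, 1))
  .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
val newCentroids = stats
  .mapValues { case (sum, n) => sum * (1.0 / n) }
  .collectAsMap()
// newCentroids would still need the empty-cluster handling shown above.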
Note: For the curious, here's a reference describing the similarities between the EM-algorithm and the k-means algorithm: http://papers.nips.cc/paper/989-convergence-properties-of-the-k-means-algorithms.pdf.
Related
I have the sample data below, but in real life this dataset is huge.
A B 1-1-2018 10
A B 2-1-2018 20
C D 1-1-2018 15
C D 2-1-2018 25
I need to group the data above by date and generate key-value pairs:
1-1-2018->key
-----------------
A B 1-1-2018 10
C D 1-1-2018 15
2-1-2018->key
-----------------
A B 2-1-2018 20
C D 2-1-2018 25
Can anyone please tell me how we can do that in Spark in the most optimized way (using Java if possible)?
Not Java, but looking at your data above it seems you want to split your DataFrame recursively into sub-groups by key. The best way I know how to do it is with a while loop, and it's not the easiest thing on the planet.
// You will also need to import the DataFrame and Array data types in Scala;
// I don't know if you need to do the same for Java for the code below.
// Inputting your DF, with columns as Value_1, Value_2, Key, Output_Amount
val inputDF = // DF from above
// Need to get an empty DF, I just like doing it this way
val testDF = spark.sql("select 'foo' as bar")
var arrayOfDataFrames: Array[DataFrame] = Array(testDF)
val arrayOfKeys = inputDF.selectExpr("Key").distinct.rdd.map(x => x.mkString).collect
var keyIterator = 1
// Need to overwrite the foo bar first DF
arrayOfDataFrames = Array(inputDF.where($"Key" === arrayOfKeys(keyIterator - 1)))
keyIterator = keyIterator + 1
// Loop through, find each key, and place its rows into the DataFrames array
while (keyIterator <= arrayOfKeys.length) {
  arrayOfDataFrames = arrayOfDataFrames ++ Array(inputDF.where($"Key" === arrayOfKeys(keyIterator - 1)))
  keyIterator = keyIterator + 1
}
At the end you will have two arrays of the same length, one of DataFrames and one of keys, that match up: the 3rd element of the keys array corresponds to the 3rd element of the DataFrames array.
Since this isn't Java and doesn't directly answer your question, I hope it at least pushes you in a direction that might help (I built it in Spark Scala).
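For example, a quick way to sanity-check that pairing (a sketch using the two arrays built above):
// Walk the paired arrays and show a few rows of each key's sub-DataFrame.
arrayOfKeys.zip(arrayOfDataFrames).foreach { case (key, df) =>
  println(s"Key: $key")
  df.show(5)
}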
I have an old style for loop to do some load tests:
for (int i = 0; i < 1000; ++i) {
    if ((i + 1) % 100 == 0) {
        System.out.println("Test number " + i + " started.");
    }
    // The test itself...
}
How can I use the new Java 8 stream API to do this without the for loop?
Also, using a stream would make it easy to switch to a parallel stream. How can I switch to a parallel stream?
* I'd like to keep the reference to i.
IntStream.range(0, 1000)
    /* .parallel() */
    .filter(i -> (i + 1) % 100 == 0)
    .peek(i -> System.out.println("Test number " + i + " started."))
    /* other operations on the stream, including a terminal one */;
If the test is running on each iteration regardless of the condition (take the filter out):
IntStream.range(0, 1000)
    .peek(i -> {
        if ((i + 1) % 100 == 0) {
            System.out.println("Test number " + i + " started.");
        }
    }).forEach(i -> { /* the test */ });
Another approach (if you want to iterate over an index with a predefined step, as @Tunaki mentioned) is:
IntStream.iterate(0, i -> i + 100)
.limit(1000 / 100)
.forEach(i -> { /* the test */ });
There is an awesome overloaded method Stream.iterate(seed, condition, unaryOperator) in JDK 9 which fits your situation perfectly: it is designed to make a stream finite, and it can replace a plain for loop:
Stream<Integer> stream = Stream.iterate(0, i -> i < 1000, i -> i + 100);
You can use IntStream as shown below and explained in the comments:
(1) Iterate the IntStream range from 1 (inclusive) to 1000 (exclusive)
(2) Convert to a parallel stream
(3) Apply the predicate (i+1) % 100 == 0 to keep integers like 99, 199, etc.
(4) Map each integer to the string "Test number " + i + " started."
(5) Output to the console
IntStream.range(1, 1000)                             // iterates 1 to 999
    .parallel()                                      // converts to a parallel stream
    .filter(i -> (i + 1) % 100 == 0)                 // keeps numbers like 99, 199, etc.
    .mapToObj(i -> "Test number " + i + " started.") // maps each int to a String
    .forEach(System.out::println);                   // prints to the console
I am working on a formulation of the transportation problem through linear programming. I first searched the web and found code for it written in Java, but I have to write the whole thing in Python, so I am converting it. I don't claim to be good at Java, nor at Python. I have converted part of it. Everything is fine, but I don't know how to convert the snippet below; it deals with LinkedLists and the Stream functions of Java.
// assumes: import static java.util.Arrays.stream;
//          import static java.util.stream.Collectors.toCollection;
static LinkedList<Shipment> matrixToList() {
    return stream(matrix)
        .flatMap(row -> stream(row))
        .filter(s -> s != null)
        .collect(toCollection(LinkedList::new));
}
If you are interested in how I converted the Java code linked above: below is my (incomplete) Python code, including the Shipment class:
import sys

class TransportationProblem:
    demand = list()
    supply = list()
    costs = list(list())
    matrix = list(list())

    def __init__(self):
        pass

    class Shipment:
        costPerUnit = 0.0
        quantity = 0.0
        r = 0
        c = 0

        # Defaults allow an "empty" Shipment, mirroring the Java original.
        def __init__(self, quantity=0.0, costPerUnit=0.0, r=0, c=0):
            self.quantity = quantity
            self.costPerUnit = costPerUnit
            self.r = r
            self.c = c

    def init(self, f_name=""):
        try:
            with open(f_name) as f:
                val = [int(x) for x in f.readline().strip().split(' ')]
                numSources, numDestinations = val[0], val[1]
                src = list()
                dst = list()
                val = [int(x) for x in f.readline().strip().split(' ')]
                for i in range(0, numSources):
                    src.append(val[i])
                val = [int(x) for x in f.readline().strip().split(' ')]
                for i in range(0, numDestinations):
                    dst.append(val[i])
                totalSrc = sum(src)
                totalDst = sum(dst)
                # Balance the problem with a dummy destination or source.
                if totalSrc > totalDst:
                    dst.append(totalSrc - totalDst)
                elif totalDst > totalSrc:
                    src.append(totalDst - totalSrc)
                self.supply = src
                self.demand = dst
                self.costs = [[0 for j in range(len(dst))] for i in range(len(src))]
                # None marks an empty cell, like null in the Java version.
                self.matrix = [[None for j in range(len(dst))] for i in range(len(src))]
                for i in range(0, len(src)):
                    val = [int(x) for x in f.readline().strip().split(' ')]
                    for j in range(0, len(dst)):
                        self.costs[i][j] = val[j]
                print self.costs
        except IOError:
            print "Error: can't find file or read data"

    def northWestCornerRule(self):
        northwest = 0
        for r in range(0, len(self.supply)):
            for c in range(northwest, len(self.demand)):
                quantity = min(self.supply[r], self.demand[c])
                if quantity > 0:
                    self.matrix[r][c] = self.Shipment(quantity=quantity, costPerUnit=self.costs[r][c], r=r, c=c)
                    self.supply[r] = self.supply[r] - quantity
                    self.demand[c] = self.demand[c] - quantity
                    if self.supply[r] == 0:
                        northwest = c
                        break

    def steppingStone(self):
        maxReduction = 0
        move = None
        leaving = None
        self.fixDegenerateCase()
        for r in range(0, len(self.supply)):
            for c in range(0, len(self.demand)):
                if self.matrix[r][c] is not None:
                    continue  # only build trails from empty cells
                trail = self.Shipment(quantity=0, costPerUnit=self.costs[r][c], r=r, c=c)
                path = self.getClosedPath(trail)
                reduction = 0
                lowestQuantity = sys.maxint
                leavingCandidate = None
                plus = True
                for s in path:
                    if plus:
                        reduction = reduction + s.costPerUnit
                    else:
                        reduction = reduction - s.costPerUnit
                        if s.quantity < lowestQuantity:
                            leavingCandidate = s
                            lowestQuantity = s.quantity
                    plus = not plus
                if reduction < maxReduction:
                    move = path
                    leaving = leavingCandidate
                    maxReduction = reduction
        if move is not None:
            q = leaving.quantity
            plus = True
            for s in move:
                s.quantity = s.quantity + q if plus else s.quantity - q
                self.matrix[s.r][s.c] = None if s.quantity == 0 else s
                plus = not plus
            self.steppingStone()

    def fixDegenerateCase(self):
        pass

    def getClosedPath(self, trail):
        pass

    def matrixToList(self):
        pass
We can break this into steps. You start with a matrix variable, which is some iterable that contains iterables of type Shipment.
To stream an object means that you perform an action on each element of the stream.
A map on a stream means that you take each object, say of type A, and transform it to some type B. A flatMap is a special case used when a map would produce Stream<B>. flatMap lets you concatenate these streams into a single stream.
Say each A maps to a stream of 3 objects {A1, A2} -> {{B11, B12, B13}, {B21, B22, B23}}
flatMap will make this one stream {A1, A2} -> {B11, B12, B13, B21, B22, B23}
In this case a matrix produces a stream of row objects. Each row is mapped into a stream of Shipment and flatMap is used to concatenate them.
Finally filter is used to remove empty shipments (ie value is null) and the collect method is called to transform the stream of Shipment into a List.
Recreating this without streams might look like below:
static LinkedList<Shipment> matrixToList() {
    LinkedList<Shipment> result = new LinkedList<>();
    for (Shipment[] row : matrix) {
        for (Shipment shipment : row) {
            if (shipment != null) {
                result.add(shipment);
            }
        }
    }
    return result;
}
I am just practicing Java 8 lambdas. My problem is as follows:
Sum the digits of an integer repeatedly until the result is less than 10 (i.e. a single digit is left), then check whether that digit is 1.
Sample Input 1
100
Sample Output 1
1 // true because it is 1
Sample Input 2
55
Sample Output 2
1 // i.e. 5+5 = 10, then 1+0 = 1, so true
I wrote this code:
System.out.println(Arrays.asList(String.valueOf(number).split("")).stream()
        .map(Integer::valueOf)
        .mapToInt(i -> i)
        .sum() == 1);
It works for input 1 (100) but not for input 2 (55), which I understand: in the second case the output is 10, because the summation is not applied recursively.
So how can I make this lambda expression recursive so that it works in the second case as well? I could put the lambda in a method and call it each time until the return value is < 10, but I was wondering whether there is an approach within lambdas.
Thanks
If you want a pure lambda solution, you should forget about making it recursive, as there is absolutely no reason to implement an iterative process as a recursion:
Stream.iterate(String.valueOf(number),
        n -> String.valueOf(n.codePoints().map(Character::getNumericValue).sum()))
    .filter(s -> s.length() == 1)
    .findFirst().ifPresent(System.out::println);
Demo
Making lambdas recursive in Java is not easy because of the "variable may be uninitialized" error, but it can be done. Here is a link to an answer describing one way of doing it.
When applied to your task, this can be done as follows:
// This comes from the answer linked above
class Recursive<I> {
    public I func;
}

public static void main(String[] args) throws java.lang.Exception {
    Recursive<Function<Integer, Integer>> sumDigits = new Recursive<>();
    sumDigits.func = (Integer number) -> {
        int s = Arrays.asList(String.valueOf(number).split(""))
                .stream()
                .map(Integer::valueOf)
                .mapToInt(i -> i)
                .sum();
        return s < 10 ? s : sumDigits.func.apply(s);
    };
    System.out.println(sumDigits.func.apply(100) == 1);
    System.out.println(sumDigits.func.apply(101) == 1);
    System.out.println(sumDigits.func.apply(55) == 1);
    System.out.println(sumDigits.func.apply(56) == 1);
}
I took your code, wrapped it in { ... }s, and added a recursive invocation on the return line.
Demo.
I'm writing a small program which will convert a very large file into multiple smaller files; each file will contain 100 lines.
I'm iterating over a lines iterator:
while (lines.hasNext) {
  val line = lines.next()
}
I want to introduce a counter, and when it reaches a certain value, reset it and proceed. In Java I would do something like:
int counter = 0;
while (lines.hasNext()) {
    String line = lines.next();
    if (counter == 100) {
        counter = 0;
    }
    ++counter;
}
Is there something similar in Scala, or an alternative method?
Traditionally in Scala you use .zipWithIndex:
scala> List("foo","bar")
res0: List[java.lang.String] = List(foo, bar)
scala> for((x,i) <- res0.zipWithIndex) println(i + " : " +x)
0 : foo
1 : bar
(this will work with your lines too, as long as they are an Iterator, i.e. something with hasNext and next() methods, or some other Scala collection)
But if you need more complicated logic, like resetting the counter, you can write it the same way as in Java:
var counter = 0
while (lines.hasNext) {
  val line = lines.next()
  if (counter % 100 == 0) {
    // now write to another file
  }
  counter += 1
}
Maybe you can tell us why you want to reset the counter, so we can suggest how to do it better?
EDIT
According to your update, this is better done using the grouped method, as @pr1001 proposed:
lines.grouped(100).foreach(l => l.foreach(/* write line to file*/))
If your resetting counter represents the fact that there are repeated groups of data in the original list, you might want to use the grouped method:
scala> val l = List("one", "two", "three", "four")
l: List[java.lang.String] = List(one, two, three, four)
scala> l.grouped(2).toList
res0: List[List[java.lang.String]] = List(List(one, two), List(three, four))
Update: Since you're reading from a file, you should be able to pretty efficiently iterate over the file:
val bigFile = io.Source.fromFile("/tmp/verybigfile")
val groupedLines = bigFile.getLines.grouped(2).zipWithIndex
groupedLines.foreach(group => {
  val (lines, index) = group
  val p = new java.io.PrintWriter("/tmp/" + index)
  lines.foreach(p.println)
  p.close()
})
Of course this could also be written as a for comprehension...
You might even be able to get better performance by converting groupedLines to a parallel collection with .par before writing out each group of lines to its own file.
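A minimal sketch of that idea (note that .par is not defined on an Iterator, so the groups must be materialised first, which trades memory for parallelism on a very big file):
// Materialise the grouped iterator, then write the groups in parallel.
groupedLines.toSeq.par.foreach { case (lines, index) =>
  val p = new java.io.PrintWriter("/tmp/" + index)
  lines.foreach(p.println)
  p.close()
}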
This would work:
lines grouped 100 flatMap (_.zipWithIndex) foreach {
  case (line, count) => // whatever
}
You may use zipWithIndex along with some transformation.
scala> List(10, 20, 30, 40, 50).zipWithIndex.map(p => (p._1, p._2 % 3))
res0: List[(Int, Int)] = List((10,0), (20,1), (30,2), (40,0), (50,1))
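Applied to the original file-splitting task, a sketch of this approach (assuming lines is the iterator from the question) might look like:
// Open a new writer whenever the index crosses a 100-line boundary.
var writer: java.io.PrintWriter = null
lines.zipWithIndex.foreach { case (line, i) =>
  if (i % 100 == 0) {
    if (writer != null) writer.close()
    writer = new java.io.PrintWriter("/tmp/part-" + (i / 100))
  }
  writer.println(line)
}
if (writer != null) writer.close()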