Here is a simple Project Reactor code snippet:
Consumer<String> slowConsumer = x -> {
    try {
        TimeUnit.MILLISECONDS.sleep(50);
    } catch (Exception ignore) {
    }
};
Flux<String> publisher = Flux
        .just(1, 2, 3)
        .parallel(2)
        .runOn(Schedulers.newParallel("writer", 2))
        .flatMap(rail -> Flux.range(1, 300)
                .map(i -> String.format("Rail %d -> Row %d", rail, i))
                .log())
        .sequential()
        .publishOn(Schedulers.newSingle("reader"))
        .doOnNext(slowConsumer);

publisher.subscribe();
My expectation is that everything inside flatMap should be executed on the "writer" threads. However, this is what is logged:
onNext(Rail 2 -> Row 300) [Flux.MapFuseable.1] [writer-2]
...
onNext(Rail 1 -> Row 300) [Flux.MapFuseable.2] [writer-1]
...
onNext(Rail 3 -> Row 256) [Flux.MapFuseable.3] [writer-1]
request(256) [Flux.MapFuseable.3] [reader-1]
onNext(Rail 3 -> Row 257) [Flux.MapFuseable.3] [reader-1]
...
onNext(Rail 3 -> Row 300) [Flux.MapFuseable.3] [reader-1]
Can someone explain this? How come the "reader" thread is processing the tail of the last rail? What am I missing?
I wrote this code:
Flux.range(0, 300)
    .doOnNext(i -> System.out.println("i = " + i))
    .flatMap(i -> Mono.just(i)
        .subscribeOn(Schedulers.elastic())
        .delayElement(Duration.ofMillis(1000))
    )
    .doOnNext(i -> System.out.println("end " + i))
    .blockLast();
When running it, the first System.out.println shows that the Flux stops emitting numbers at the 256th element, then waits for the earlier ones to complete before emitting new ones.
Why is this happening?
Why 256?
Why is this happening?
The flatMap operator can be characterized as an operator that (rephrased from the javadoc):
subscribes to its inners eagerly,
does not preserve the ordering of elements,
lets values from different inners interleave.
For this question the first point is important. Project Reactor restricts the number of in-flight inner sequences via the concurrency parameter.
While flatMap(mapper) uses the default value, the flatMap(mapper, concurrency) overload accepts this parameter explicitly.
flatMap's javadoc describes the parameter as:
The concurrency argument allows to control how many Publisher can be subscribed to and merged in parallel
Consider the following code using concurrency = 500:
Flux.range(0, 300)
    .doOnNext(i -> System.out.println("i = " + i))
    .flatMap(i -> Mono.just(i)
        .subscribeOn(Schedulers.elastic())
        .delayElement(Duration.ofMillis(1000)),
        500
        // ^^^
    )
    .doOnNext(i -> System.out.println("end " + i))
    .blockLast();
In this case there is no waiting:
i = 297
i = 298
i = 299
end 0
end 1
end 2
In contrast, if you pass 1 as the concurrency, the output will be similar to:
i = 0
end 0
i = 1
end 1
It waits one second before emitting each element.
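For completeness, that variant is the same pipeline as above with the concurrency argument set to 1:
Flux.range(0, 300)
    .doOnNext(i -> System.out.println("i = " + i))
    .flatMap(i -> Mono.just(i)
        .subscribeOn(Schedulers.elastic())
        .delayElement(Duration.ofMillis(1000)),
        1
    )
    .doOnNext(i -> System.out.println("end " + i))
    .blockLast();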
Why 256?
256 is the default value for concurrency of flatMap.
Take a look at Queues.SMALL_BUFFER_SIZE:
public static final int SMALL_BUFFER_SIZE = Math.max(16,
        Integer.parseInt(System.getProperty("reactor.bufferSize.small", "256")));
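As a side note, since the default comes from a system property, you can change it JVM-wide. SMALL_BUFFER_SIZE is a static final field, so the property has to be set before the Queues class is initialized, e.g. as the first statement in main or via -Dreactor.bufferSize.small=512 on the command line. A minimal sketch (the class name is mine):
public class BufferSizeDemo {
    public static void main(String[] args) {
        // Must run before reactor.util.concurrent.Queues is loaded;
        // equivalent to passing -Dreactor.bufferSize.small=512 to the JVM.
        System.setProperty("reactor.bufferSize.small", "512");
        System.out.println(reactor.util.concurrent.Queues.SMALL_BUFFER_SIZE); // prints 512
    }
}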
Say I have this:
Flux<GroupedFlux<Integer, Integer>> intsGrouped = Flux.range(0, 12)
        .groupBy(i -> i % 3);
and say I have a method:
Mono<Integer> getFromService(Integer key, Integer i);
I want to call getFromService in parallel for each of the groups, but make sure the calls are serial within each group.
For the above example that would be three parallel streams with these input values:
stream 1: 0 -> 3 -> 6 -> 9
stream 2: 1 -> 4 -> 7 -> 10
stream 3: 2 -> 5 -> 8 -> 11
I tried this, but it's not doing what I want:
Flux.range(0, 12)
    .groupBy(i -> i % 3)
    .flatMap(g -> g.flatMap(i -> getFromService(g.key(), i)))
This is calling the service in parallel for all the ints at once. How do I proceed?
Use either concatMap or flatMapSequential instead of the inner .flatMap.
If you want sequential execution within each group (i.e. only one subscription to getFromService at a time within each group), then use .concatMap, like this:
Flux.range(0, 12)
    .groupBy(i -> i % 3)
    .flatMap(g -> g.concatMap(i -> getFromService(g.key(), i)))
If parallel execution within a group is OK, but you care about the order in which the sequence is emitted, then use flatMapSequential, like this:
Flux.range(0, 12)
    .groupBy(i -> i % 3)
    .flatMap(g -> g.flatMapSequential(i -> getFromService(g.key(), i)))
Another option is to use .flatMap with the concurrency argument set to 1, but I'd recommend one of the above instead.
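For reference, a sketch of that last option; it effectively behaves like the concatMap version but goes through the merge machinery:
Flux.range(0, 12)
    .groupBy(i -> i % 3)
    .flatMap(g -> g.flatMap(i -> getFromService(g.key(), i), 1))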
I have code where I'm polling on an interval until a condition is met, and then sending back the result in the subscribe.
But since it is an interval, the subscription continues.
I was wondering if there's any way to unsubscribe from an Observable interval once it emits something.
Here is the code:
Subscription subscriber = Observable.interval(0, 5, TimeUnit.MILLISECONDS)
        .map(i -> eventHandler.getProcessedEvents())
        .filter(eventsProcessed -> eventsProcessed >= 10)
        .doOnNext(eventsProcessed -> eventHandler.initProcessedEvents())
        .doOnNext(eventsProcessed -> logger.info(null, "Total number of events processed:" + eventsProcessed))
        .subscribe(t -> resumeRequest(asyncResponse));

new TestSubscriber((Observer) subscriber).awaitTerminalEvent(10, TimeUnit.SECONDS);
subscriber.unsubscribe();
For now, as a hack, I use a timer and then unsubscribe, but it's bad!
Regards
You can use the first operator
Subscription subscriber = Observable.interval(0, 5, TimeUnit.MILLISECONDS)
        .map(i -> eventHandler.getProcessedEvents())
        .first(eventsProcessed -> eventsProcessed >= 10)
        .doOnNext(eventsProcessed -> eventHandler.initProcessedEvents())
        .doOnNext(eventsProcessed -> logger.info(null, "Total number of events processed:" + eventsProcessed))
        .subscribe(t -> resumeRequest(asyncResponse));
instead of the filter. This ensures that you only get a single emission if your condition is met. Note that you will get an exception if the interval Observable terminates without your condition being met.
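If you would rather complete silently in that case, a hedged alternative sketch is to keep the filter and add take(1), which unsubscribes from the interval after the first match and simply completes (without an error) if the source ends beforehand:
Subscription subscriber = Observable.interval(0, 5, TimeUnit.MILLISECONDS)
        .map(i -> eventHandler.getProcessedEvents())
        .filter(eventsProcessed -> eventsProcessed >= 10)
        .take(1)
        .doOnNext(eventsProcessed -> eventHandler.initProcessedEvents())
        .subscribe(t -> resumeRequest(asyncResponse));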
I am implementing k-means and I want to create the new centroids, but the mapping leaves one element out! However, when K has a smaller value, like 15, it works fine.
This is the code I have:
val K = 25 // number of clusters
val data = sc.textFile("dense.txt")
  .map(t => (t.split("#")(0), parseVector(t.split("#")(1))))
  .cache()
val count = data.count()
println("Number of records " + count)

var centroids = data.takeSample(false, K, 42).map(x => x._2)
do {
  var closest = data.map(p => (closestPoint(p._2, centroids), p._2))
  var pointsGroup = closest.groupByKey()
  println(pointsGroup)
  pointsGroup.foreach { println }
  var newCentroids = pointsGroup.mapValues(ps => average(ps.toSeq)).collectAsMap()
  // var newCentroids = pointsGroup.mapValues(ps => average(ps)).collectAsMap() // this will produce an error
  println(centroids.size)
  println(newCentroids.size)
  for (i <- 0 until K) {
    tempDist += centroids(i).squaredDist(newCentroids(i))
  }
  ..
and in the for loop I get an error that a key cannot be found (the missing key is not always the same; it depends on K):
java.util.NoSuchElementException: key not found: 2
Output before the error comes up:
Number of records 27776
ShuffledRDD[5] at groupByKey at kmeans.scala:72
25
24 <- IT SHOULD BE 25
What is the problem?
Output of println(newCentroids):
Map(23 -> (-0.0050852959701492536, 0.005512245104477607, -0.004460964477611937), 17 -> (-0.005459583045685268, 0.0029015278781725795, -8.451635532994901E-4), 8 -> (-4.691649213483123E-4, 0.0025375451685393366, 0.0063490755505617585), 11 -> (0.30361112034069937, -0.0017342255382385204, -0.005751167731061906), 20 -> (-5.839587918939964E-4, -0.0038189763756820145, -0.007067070459859708), 5 -> (-0.3787612396704685, -0.005814121628643806, -0.0014961713117870657), 14 -> (0.0024755681263616547, 0.0015191503267973836, 0.003411769193899781), 13 -> (-0.002657690932944597, 0.0077671050923225635, -0.0034652379980563263), 4 -> (-0.006963114731610361, 1.1751361829025871E-4, -0.7481135105367823), 22 -> (0.015318187079953534, -1.2929035958285013, -0.0044176372190034684), 7 -> (-0.002321059060773483, -0.006316359116022083, 0.006164669723756913), 16 -> (0.005341800955165691, -0.0017540737037037035, 0.004066574093567247), 1 -> (0.0024547379611650484, 0.0056298656504855955, 0.002504618082524296), 10 -> (3.421068671121009E-4, 0.0045169004751299275, 5.696239049740164E-4), 19 -> (-0.005453716071428539, -0.001450277556818192, 0.003860007248376626), 9 -> (-0.0032921685273631807, 1.8477108457711313E-4, -0.003070412228855717), 18 -> (-0.0026803160958904053, 0.00913904078767124, -0.0023528013698630146), 3 -> (0.005750011594202901, -0.003607098309178754, -0.003615918896940412), 21 -> (0.0024925166025641056, -0.0037607353461538507, -2.1588444871794858E-4), 12 -> (-7.920202960526356E-4, 0.5390774232894769, -4.928884539473694E-4), 15 -> (-0.0018608492323232324, -0.006973787272727284, -0.0027266663434343404), 24 -> (6.151173211963486E-4, 7.081812613784045E-4, 5.612962808842611E-4), 6 -> (0.005323933953732931, 0.0024014750473186123, -2.969338590956889E-4), 0 -> (-0.0015991676750160377, -0.003001317289659613, 0.5384176139563245))
Question with relevant error: spark scala throws java.util.NoSuchElementException: key not found: 0 exception
EDIT:
After zero323's observation that two of the centroids were the same, I changed the code so that all the centroids are unique. However, the behaviour remains the same. For that reason, I suspect that closestPoint() may return the same index for two different centroids. Here is the function:
def closestPoint(p: Vector, centers: Array[Vector]): Int = {
  var bestIndex = 0
  var closest = Double.PositiveInfinity
  for (i <- 0 until centers.length) {
    val tempDist = p.squaredDist(centers(i))
    if (tempDist < closest) {
      closest = tempDist
      bestIndex = i
    }
  }
  bestIndex
}
How can I get around this? I am running the code on a Spark cluster, as described above.
It can happen in the "E-step" (the assignment of points to cluster indices, analogous to the E-step of the EM algorithm) that one of your indices is not assigned any points. If this happens, you need a way of associating that index with some point; otherwise you will wind up with fewer clusters after the "M-step" (the assignment of centroids to the indices, analogous to the M-step of the EM algorithm). Something like this should probably work:
val newCentroids = {
  val temp = pointsGroup.mapValues(ps => average(ps.toSeq)).collectAsMap()
  val nMissing = K - temp.size
  // seed: any Long, e.g. the 42 already used for takeSample in the question
  val sample = data.takeSample(false, nMissing, seed)
  var c = -1
  (for (i <- 0 until K) yield {
    // keep the computed centroid when the index has one,
    // otherwise fall back to a freshly sampled point
    val point = temp.getOrElse(i, { c += 1; sample(c) })
    (i, point)
  }).toMap
}
Just substitute that code for the line you are currently using to compute newCentroids.
There are other ways of dealing with this issue and the approach above is probably not the best (is it a good idea to be calling takeSample multiple times, once for each iteration of the k-means algorithm? what if data contains a lot of repeated values?, etc.), but it is a simple starting point.
By the way, you might want to think about how you can replace the groupByKey with a reduceByKey.
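A hedged sketch of that idea, using Spark's Java API and plain double[] points instead of the question's Vector type (all names here are illustrative, not the poster's): attach a count of 1 to each point, sum (vector, count) pairs component-wise with reduceByKey, and divide once at the end, so the per-key collections that groupByKey materializes are never built:
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ReduceByKeyAverages {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "averages");
        // (clusterIndex, point) pairs, like the "closest" RDD in the question
        List<Tuple2<Integer, double[]>> assignments = Arrays.asList(
                new Tuple2<>(0, new double[]{1.0, 2.0}),
                new Tuple2<>(0, new double[]{3.0, 4.0}),
                new Tuple2<>(1, new double[]{5.0, 6.0}));
        JavaPairRDD<Integer, double[]> closest = sc.parallelizePairs(assignments);
        Map<Integer, double[]> newCentroids = closest
                // attach a count of 1 to every point
                .mapValues(p -> new Tuple2<>(p, 1L))
                // combine partial (sum, count) pairs; no per-key lists are built
                .reduceByKey((a, b) -> {
                    double[] sum = new double[a._1().length];
                    for (int d = 0; d < sum.length; d++) {
                        sum[d] = a._1()[d] + b._1()[d];
                    }
                    return new Tuple2<>(sum, a._2() + b._2());
                })
                // divide the component-wise sums by the counts once at the end
                .mapValues(t -> {
                    double[] avg = new double[t._1().length];
                    for (int d = 0; d < avg.length; d++) {
                        avg[d] = t._1()[d] / t._2();
                    }
                    return avg;
                })
                .collectAsMap();
        newCentroids.forEach((k, v) -> System.out.println(k + " -> " + Arrays.toString(v)));
        sc.stop();
    }
}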
Note: For the curious, here's a reference describing the similarities between the EM-algorithm and the k-means algorithm: http://papers.nips.cc/paper/989-convergence-properties-of-the-k-means-algorithms.pdf.