I'm messing around with the rx operators and am curious why just(null).repeat() doesn't work as a parameter to any of the built-in operators:
Observable.interval(1, TimeUnit.SECONDS)
    .sample(Observable.just(null).repeat())
    .subscribe(System.out::println);
I would have expected this to print 0 1 2 3 ..., but it just hangs. I imagine it's because the repeat is hogging the default Scheduler; however, if you swap the roles of interval and the just-repeat, then it works as expected, printing null once per second:
Observable.just(null).repeat()
    .sample(Observable.interval(1, TimeUnit.SECONDS))
    .subscribe(System.out::println);
What's going on here?
If you don't specify a scheduler (and no operator is setting one), then all processing happens on the same thread. just(null).repeat() will hog 100% of a CPU core, so nothing else gets a chance to proceed.
In your case, the interval gets produced on the Schedulers.computation() Scheduler, and because it's at the start and no scheduler changes happen afterwards, your repeat is also working on the same thread.
In the second case, everything gets subscribed on the same thread, except the interval, which is on its own scheduler; the rest depends on the internal implementation of sample.
If you use a specific scheduler, it should work:
.sample(Observable.just(null).repeat().subscribeOn(Schedulers.computation()))
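For context, here is the whole chain with that fix spelled out, as a minimal runnable sketch (assuming RxJava 1.x, where just(null) is still allowed; note the busy-spinning sampler still burns a CPU core, it just no longer starves the rest of the chain):

import java.util.concurrent.TimeUnit;

import rx.Observable;
import rx.schedulers.Schedulers;

public class SampleOnOwnScheduler {
    public static void main(String[] args) throws InterruptedException {
        Observable.interval(1, TimeUnit.SECONDS)
                // The sampler now busy-loops on its own computation thread,
                // so the interval's values can actually get through.
                .sample(Observable.just(null).repeat().subscribeOn(Schedulers.computation()))
                .subscribe(System.out::println);

        Thread.sleep(5000); // keep the JVM alive long enough to see a few emissions
    }
}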
Note that if you just want to use nulls instead of the numbers that interval produces, a much more efficient way is to use map instead of sample:
.map(any -> (Object) null)
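Spelled out, that whole chain is just (same RxJava 1.x assumptions as in the sketch above):

Observable.interval(1, TimeUnit.SECONDS)
        .map(any -> (Object) null)   // replace each tick with null; no second stream, no busy-spinning
        .subscribe(System.out::println);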
I am still trying to understand the difference between the reactor map() and flatMap() method.
First I took a look at the API docs, but they aren't really helpful; they confused me even more.
Then I googled a lot, but it seems like nobody has an example to make the differences understandable, if there are any differences.
Therefore I tried to write two tests to see the different behaviour of each method.
But unfortunately it isn't working as I hoped it would...
First test method is testing the reactive flatMap() method:
@Test
void fluxFlatMapTest() {
    Flux.just(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
        .window(2)
        .flatMap(fluxOfInts -> fluxOfInts.map(this::processNumber).subscribeOn(Schedulers.parallel()))
        .doOnNext(System.out::println)
        .subscribe();
}
The output is as expected and explainable; it looks like this:
9 - parallel-2
1 - parallel-1
4 - parallel-1
25 - parallel-3
36 - parallel-3
49 - parallel-4
64 - parallel-4
81 - parallel-5
100 - parallel-5
16 - parallel-2
The second method should test the output of the map() method, to compare with the above results of the flatMap() method.
@Test
void fluxMapTest() {
    final int start = 1;
    final int stop = 100;
    Flux.just(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
        .window(2)
        .map(fluxOfInts -> fluxOfInts.map(this::processNumber).subscribeOn(Schedulers.parallel()))
        .doOnNext(System.out::println)
        .subscribe();
}
This test method has output I didn't expect at all; it looks like this:
FluxSubscribeOn
FluxSubscribeOn
FluxSubscribeOn
FluxSubscribeOn
FluxSubscribeOn
There is a little helper method which looks like this:
private String processNumber(Integer x) {
    String squaredValueAsString = String.valueOf(x * x);
    return squaredValueAsString.concat(" - ").concat(Thread.currentThread().getName());
}
Nothing special here.
I am using Spring Boot 2.3.4 with Java 11 and the reactor implementation for Spring.
Do you have a good explanatory example, or do you know how to change the above tests so that they make sense?
If so, please help me out with that.
Thanks a lot in advance!
Reactor, which is the underlying library in WebFlux, is built around something called the event loop, which in turn I believe is based on an architecture called the LMAX architecture.
This means that the event loop is a single-threaded event processor. Everything up to the event loop can be multithreaded, but the events themselves are processed by a single thread: the event loop.
Regular Spring Boot applications are usually run on the Tomcat or Undertow server, while WebFlux by default runs on the event-driven server Netty, which in turn uses this event loop to process events for us.
So now that we understand what is underneath everything, we can start talking about map and flatMap.
Map
If we look at the API docs, the marble diagram and the accompanying text say:
Transform the items emitted by this Flux by applying a synchronous function to each item.
Which is pretty self-explanatory: we have a Flux of items, and map won't ask for the next item until it has finished processing the current one. Hence synchronous.
The marble diagram shows that the green circle needs to be converted to a green square before we can ask for the yellow circle to be converted to a yellow square, and so on.
Here is a code example:
Flux.just("a", "b", "c")
    .map(value -> value.toUpperCase())
    .subscribe(s -> System.out.println(s + " - " + Thread.currentThread().getName()));
// Output
A - main
B - main
C - main
Each item is run on the main thread and processed one after another, synchronously.
flatMap
If we look at the API docs, the marble diagram and the accompanying text say:
Transform the elements emitted by this Flux asynchronously into Publishers, then flatten these inner publishers into a single Flux through merging, which allow them to interleave.
It does this using basically three steps:
Generation of inners and subscription: this operator is eagerly subscribing to its inners.
Ordering of the flattened values: this operator does not necessarily preserve original ordering, as inner element are flattened as they arrive.
Interleaving: this operator lets values from different inners interleave (similar to merging the inner sequences).
So what does this mean? Well, it basically means that:
it will take each item in the Flux and transform it into an individual Publisher (here a Mono) with one item in each;
it emits items as they get processed (flatMap does NOT preserve order, as items can take different amounts of time to process on the event loop);
and it merges all the processed items back into a Flux for further processing down the line.
Here is a code example:
Flux.just("a", "b", "c")
    .flatMap(value -> Mono.just(value.toUpperCase()))
    .subscribe(s -> System.out.println(s + " - " + Thread.currentThread().getName()));
// Output
A - main
B - main
C - main
Wait, flatMap is printing the same thing as map!
Well, it all comes back to the threading model we talked about earlier. There is really only one thread, the event loop, that handles all events.
Reactor is concurrency agnostic, meaning that any worker can schedule jobs to be handled by the event loop.
So what is a worker? A worker is something a scheduler can spawn. One important thing is that a worker doesn't have to be a thread; it can be, but it doesn't have to be.
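As a tiny illustration of that idea (a hypothetical sketch, not something the examples above need), a Scheduler hands out Workers, and a Worker schedules tasks; which thread actually runs them is the Scheduler's business:

import reactor.core.scheduler.Scheduler;
import reactor.core.scheduler.Schedulers;

public class WorkerSketch {
    public static void main(String[] args) throws InterruptedException {
        Scheduler.Worker worker = Schedulers.parallel().createWorker();
        try {
            // We only hand the worker a task; the scheduler decides which thread runs it.
            worker.schedule(() ->
                    System.out.println("running on " + Thread.currentThread().getName()));
            Thread.sleep(100); // give the scheduled task time to run
        } finally {
            worker.dispose(); // release the worker back to the scheduler
        }
    }
}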
In the map and flatMap code examples above, the main thread subscribes to our Flux, which means that the main thread will process this for us and schedule work for the event loop to handle.
In a server environment this doesn't necessarily have to be the case. The important thing to understand here is that Reactor can switch workers (that is, possibly threads) whenever it needs to.
In my above code examples there is only a main thread, so there is no need to run things on multiple threads, or have parallel execution.
If I wish to force it, I can use one of the different schedulers, which all have their uses. With Netty, the server will start up with the same number of event loop threads as there are cores on your machine, so under heavy load it can switch workers and cores freely to maximize the usage of all event loops.
flatMap being async does NOT mean parallel: it means that it will schedule everything to be processed by the event loop at the same time, but it is still only one thread executing the tasks.
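To make that concrete, here is a small hypothetical sketch (the 100 ms delay and the use of Schedulers.single() are my own additions for illustration): flatMap subscribes to all three inner Monos eagerly, so the whole flux completes after roughly one delay rather than three, yet every value is delivered by the same single scheduler thread.

import java.time.Duration;

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

public class FlatMapIsNotParallel {
    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();

        Flux.just("a", "b", "c")
            // All inner Monos are subscribed to up front, so their delays run concurrently.
            .flatMap(v -> Mono.just(v.toUpperCase())
                    .delayElement(Duration.ofMillis(100), Schedulers.single()))
            .subscribe(s -> System.out.println(s + " - " + Thread.currentThread().getName()
                    + " after " + (System.currentTimeMillis() - start) + " ms"));

        Thread.sleep(500); // keep the JVM alive long enough to see the output
    }
}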
Parallel execution
If I really want to execute something in parallel, I can for instance place it on a parallel Scheduler. This guarantees multiple workers on multiple cores. But remember there is a setup cost when your program runs, and this is usually only beneficial if you have computationally heavy work that needs a lot of single-core CPU power.
Code example:
Flux.just("a", "b", "c")
    .flatMap(value -> Mono.just(value.toUpperCase()))
    .subscribeOn(Schedulers.parallel())
    .subscribe(s -> System.out.println(s + " - " + Thread.currentThread().getName()));
// Output
A - parallel-1
B - parallel-1
C - parallel-1
Here we are still running on just one thread, because subscribeOn means that when something subscribes, the Scheduler will pick one thread from its pool and then stick with it throughout execution.
If we absolutely feel the need to force execution on multiple threads, we can for instance use a ParallelFlux:
Flux.range(1, 10)
    .parallel(2)
    .runOn(Schedulers.parallel())
    .subscribe(i -> System.out.println(Thread.currentThread().getName() + " -> " + i));
// Output
parallel-3 -> 2
parallel-2 -> 1
parallel-3 -> 4
parallel-2 -> 3
parallel-3 -> 6
parallel-2 -> 5
parallel-3 -> 8
parallel-2 -> 7
parallel-3 -> 10
parallel-2 -> 9
But remember this is in most cases not necessary. There is a setup cost, and this type of execution is usually only beneficial if you have a lot of CPU-heavy tasks. Otherwise the default single-threaded event loop will in most cases "probably" be faster.
Dealing with a lot of I/O tasks is usually more about orchestration than raw CPU power.
Most of the information here comes from the Flux and Mono API docs.
The Reactor documentation is an amazing and interesting source of information.
Simon Baslé's blog series Flight of the Flux is also a wonderful and interesting read; it exists in YouTube format as well.
There may be some faults here and there, and I have made some assumptions, especially when it comes to the inner workings of Reactor. But hopefully this will at least clear up some thoughts.
If someone feels something is directly wrong, feel free to edit.
I've encountered a problem I'm not sure how to solve.
I'm trying to parallelize a part of the code that we've done sequentially up until now. To do so I've divided the task into several smaller orthogonal tasks.
I've created an executorService and I'm running:
executorService.invokeAll(callableList, timeBudget, TimeUnit.NANOSECONDS);
Each callable has several IO tasks within it (like going to a database or external services); the overall time budget is roughly 200ms. The reason to use invokeAll is that I have an overall timeBudget for all of the requests. Thus, I need a way to limit all the futures with a single budget.
In order to test myself I've added different metrics that report back to some logging visualisation tool that we have. I've noticed that:
The median (and 75th percentile) latency of that part of the code has gotten faster.
The 95th+ percentiles have actually gotten worse.
After thorough investigation (where I've benchmarked different parts of the code) I've noticed that invokeAll's 99th-percentile running time was actually 500ms, and sometimes even more. This really undermines the optimization. Any ideas on what may cause this? Any other suggestions? Are there alternatives to invokeAll?
While I don't have an answer as to why invokeAll with a timeout sometimes takes much more time than the given budget, I do have an answer to the question: how do you run a list of futures simultaneously with a given budget?
ListeningExecutorService executorService = <init>;
List<ListenableFuture<G>> futures = new ArrayList<>();
for (T chunk : chunks) {
    futures.add(executorService.submit(() -> function(chunk, param1, param2, ...)));
}
Futures.allAsList(futures).get(budgetInNanos, TimeUnit.NANOSECONDS);
The code above uses the Guava library.
The problem with this approach is that I'm not getting the status of each future, because I get a timeout exception if the time is up, but at least in terms of budgeting the behaviour is as expected.
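For completeness, here is a minimal, hypothetical sketch of that pattern with concrete types (the fixed thread pool, the chunk values and the processChunk helper are all stand-ins), plus one way to at least tell completed futures from unfinished ones once the budget expires: cancel(true) has no effect on futures that already completed, so isCancelled() distinguishes them.

import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.ListeningExecutorService;
import com.google.common.util.concurrent.MoreExecutors;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BudgetedFutures {
    public static void main(String[] args) throws Exception {
        ListeningExecutorService executorService =
                MoreExecutors.listeningDecorator(Executors.newFixedThreadPool(4));
        List<String> chunks = Arrays.asList("a", "b", "c");

        List<ListenableFuture<String>> futures = new ArrayList<>();
        for (String chunk : chunks) {
            futures.add(executorService.submit(() -> processChunk(chunk)));
        }

        try {
            // Wait for all results, but never longer than the overall budget.
            List<String> results = Futures.allAsList(futures).get(200, TimeUnit.MILLISECONDS);
            System.out.println("all done within budget: " + results);
        } catch (TimeoutException e) {
            // Budget exhausted: cancel what is left and check which futures finished in time.
            for (ListenableFuture<String> f : futures) {
                f.cancel(true); // no-op for futures that already completed
                System.out.println("completed = " + (f.isDone() && !f.isCancelled()));
            }
        } finally {
            executorService.shutdownNow();
        }
    }

    // Stand-in for the real per-chunk work (database and external service calls).
    private static String processChunk(String chunk) throws InterruptedException {
        Thread.sleep(50); // simulate an IO call
        return chunk.toUpperCase();
    }
}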
I want to use an accumulator to gather some stats about the data I'm manipulating in a Spark job. Ideally, I would do that while the job computes the required transformations, but since Spark can re-compute tasks in various cases, the accumulators would not reflect true metrics. Here is how the documentation describes this:
For accumulator updates performed inside actions only, Spark
guarantees that each task’s update to the accumulator will only be
applied once, i.e. restarted tasks will not update the value. In
transformations, users should be aware of that each task’s update may
be applied more than once if tasks or job stages are re-executed.
This is confusing, since most actions do not allow running custom code (where accumulators can be used); they mostly take the results from previous transformations (lazily). The documentation also shows this:
val acc = sc.accumulator(0)
data.map { x => acc += x; f(x) }
// Here, acc is still 0 because no actions have caused the `map` to be computed.
But if we add data.count() at the end, would this be guaranteed to be correct (have no duplicates) or not? Clearly acc is not used "inside actions only", as map is a transformation. So it should not be guaranteed.
On the other hand, discussions on related Jira tickets talk about "result tasks" rather than "actions", for instance here and here. This seems to indicate that the result would indeed be guaranteed to be correct, since we are using acc immediately before an action, and it should thus be computed as a single stage.
I'm guessing that this concept of a "result task" has to do with the type of operations involved, it being the last one that includes an action, like in this example, which shows how several operations are divided into stages (in magenta, image taken from here).
So hypothetically, a count() action at the end of that chain would be part of the same final stage, and I would be guaranteed that accumulators used in the last map will not include any duplicates?
Clarification around this issue would be great! Thanks.
To answer the question "When are accumulators truly reliable?":
Answer: when they are used inside an action operation.
As per the documentation, inside an action, even if there are restarted tasks, the accumulator will only be updated once.
For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
And actions do allow you to run custom code.
For example:
val accNotEmpty = sc.accumulator(0)
ip.foreach(x => {
  if (x != "") {
    accNotEmpty += 1
  }
})
But why are Map + Action (i.e. result task) operations not reliable for accumulator updates?
Task failure due to some exception in the code: Spark will try 4 times (the default number of tries). If the task fails every time, it will give an exception. If by chance it succeeds, then Spark will continue and just update the accumulator value for the successful state; the failed states' accumulator values are ignored. Verdict: handled properly.
Stage failure: an executor node crashes, through no fault of the user but because of a hardware failure, and the node goes down during a shuffle stage. As shuffle output is stored locally, if a node goes down, that shuffle output is gone. So Spark goes back to the stage that generated the shuffle output, looks at which tasks need to be rerun, and executes them on one of the nodes that is still alive. After we regenerate the missing shuffle output, the stage which generated the map output has executed some of its tasks multiple times, and Spark counts accumulator updates from all of them. Verdict: not handled in result tasks; the accumulator will give wrong output.
Speculative execution: if a task is running slowly, Spark can launch a speculative copy of that task on another node. Verdict: not handled; the accumulator will give wrong output.
Cache eviction: a cached RDD is huge and can't stay in memory, so whenever the RDD is used, the map operation is re-run to regenerate it and the accumulator is updated by it again. Verdict: not handled; the accumulator will give wrong output.
So the same function may run multiple times on the same data, and Spark does not provide any guarantee about accumulators being updated from within a map operation.
So it is better to use accumulators in action operations in Spark.
To learn more about accumulators and their issues, refer to this blog post by Imran Rashid.
Accumulator updates are sent back to the driver when a task is successfully completed. So your accumulator results are guaranteed to be correct when you are certain that each task will have been executed exactly once and each task did as you expected.
I prefer relying on reduce and aggregate instead of accumulators because it is fairly hard to enumerate all the ways tasks can be executed.
An action starts tasks.
If an action depends on an earlier stage and the results of that stage are not (fully) cached, then tasks from the earlier stage will be started.
Speculative execution starts duplicate tasks when a small number of slow tasks are detected.
That said, there are many simple cases where accumulators can be fully trusted.
val acc = sc.accumulator(0)
val rdd = sc.parallelize(1 to 10, 2)
val accumulating = rdd.map { x => acc += 1; x }
accumulating.count
assert(acc.value == 10)
Would this be guaranteed to be correct (have no duplicates)?
Yes, if speculative execution is disabled. The map and the count will be a single stage, so like you say, there is no way a task can be successfully executed more than once.
But an accumulator is updated as a side-effect. So you have to be very careful when thinking about how the code will be executed. Consider this instead of accumulating.count:
// Same setup as before.
accumulating.mapPartitions(p => Iterator(p.next)).collect
assert(acc.value == 2)
This will also create one task for each partition, and each task will be guaranteed to execute exactly once. But the code in map will not get executed on all elements, just the first one in each partition.
The accumulator is like a global variable. If you share a reference to the RDD that can increment the accumulator then other code (other threads) can cause it to increment too.
// Same setup as before.
val x = new X(accumulating) // We don't know what X does.
// It may trigger the calculation
// any number of times.
accumulating.count
assert(acc.value >= 10)
I think Matei answered this in the referred documentation:
As discussed on https://github.com/apache/spark/pull/2524 this is
pretty hard to provide good semantics for in the general case
(accumulator updates inside non-result stages), for the following
reasons:
An RDD may be computed as part of multiple stages. For
example, if you update an accumulator inside a MappedRDD and then
shuffle it, that might be one stage. But if you then call map() again
on the MappedRDD, and shuffle the result of that, you get a second
stage where that map is pipelined. Do you want to count this
accumulator update twice or not?
Entire stages may be resubmitted if
shuffle files are deleted by the periodic cleaner or are lost due to a
node failure, so anything that tracks RDDs would need to do so for
long periods of time (as long as the RDD is referenceable in the user
program), which would be pretty complicated to implement.
So I'm going
to mark this as "won't fix" for now, except for the part for result
stages done in SPARK-3628.
Consider the following Flux
Flux.range(1, 5)
    .parallel(10)
    .runOn(Schedulers.parallel())
    .map(i -> "https://www.google.com")
    .flatMap(uri -> Mono.fromCallable(new HttpGetTask(httpClient, uri)))
HttpGetTask is a Callable whose actual implementation is irrelevant in this case; it makes an HTTP GET call to the given URI and returns the content if successful.
Now, I'd like to slow down the emission by introducing an artificial delay, such that up to 10 threads are started simultaneously, but each one doesn't complete as soon as HttpGetTask is done. For example, say no thread must finish in less than 3 seconds. How do I achieve that?
If the requirement is really "not less than 3s" you could add a delay of 3 seconds to the Mono inside the flatMap by using Mono.fromCallable(...).delayElement(Duration.ofSeconds(3)).
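Plugged into the pipeline from the question, that might look like this (HttpGetTask and httpClient are the asker's types and are assumed to exist; java.time.Duration is assumed to be imported):

Flux.range(1, 5)
    .parallel(10)
    .runOn(Schedulers.parallel())
    .map(i -> "https://www.google.com")
    // Each inner Mono completes no earlier than 3 seconds after the HTTP call finishes,
    // so every element takes at least 3 seconds end to end.
    .flatMap(uri -> Mono.fromCallable(new HttpGetTask(httpClient, uri))
            .delayElement(Duration.ofSeconds(3)))
    .sequential()
    .subscribe(System.out::println);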
We are using Elasticsearch 0.90.7 in our Scala Play Framework application, where the end of our "doSearch" method looks like:
def doSearch(...) = {
...
  val actionRequestBuilder: ActionRequestBuilder // constructed earlier in the method
  val executedFuture: ListenableActionFuture[Response] = actionRequestBuilder.execute
  return executedFuture.actionGet
}
where ListenableActionFuture extends java.util.concurrent.Future, and ListenableActionFuture#actionGet is basically the same as Future#get
This all works fine when we execute searches sequentially, however when we try to execute multiple searches in parallel:
val search1 = scala.concurrent.Future(doSearch(...))
val search2 = scala.concurrent.Future(doSearch(...))
return Await.result(search1, defaultDuration) -> Await.result(search2, defaultDuration)
we're sometimes (less than 1 or 2% of the time) getting unexpected timeouts on our Scala futures, even when using an extremely long timeout during QA (5 seconds, where a search always executes in less than 200ms). This also occurs when using the Scala global execution context as well as when using the Play default execution context.
Is there some sort of unexpected interaction going on here as a result of having a java future wrapped in a scala future? I would have thought that the actionGet call on the java future at the end of doSearch would have prevented the two futures from interfering with each other, but evidently that may not be the case.
I thought it was established somewhere that blocking is evil. Evil!
In this case, Await.result will block the current thread, because it's waiting for a result.
Await wraps the call in blocking, in an attempt to notify the thread pool that it might want to grow some threads to maintain its desired parallelism and avoid deadlock.
If the current thread is not a Scala BlockContext, then you get mere blockage.
Whatever your precise configuration, presumably you're holding onto a thread while blocked, and the thunk you're running for search wants to run something and can't because the pool is exhausted.
What's relevant is what pool produced the current Thread: whether the go-between Future is on a different pool doesn't matter if, at bottom, you need to use more threads from the current pool and it is exhausted.
Of course, that's just a guess.
It makes more sense to have a single future that gets the value from both searches, with a timeout.
But if you wind up with multiple Futures, it makes sense to use Future.sequence and wait on that.