Reactive Pull with multi-threaded RxJava


I am trying to build a reactive pull Observable in RxJava.
My Observable looks like this:
Observable<Command> myObs = Observable.create(s -> {
    Command command;
    int i = 0;
    do {
        command = NetworkOperation1.call(i);
        logger.info("Init command " + i);
        s.onNext(command);
        i++;
    } while (!command.isLast() && i < MAX);
    s.onCompleted();
});
And I want to process it in 4 concurrent batches (buffer), like so:
myObs
    .buffer(10)
    .flatMap(batch -> {
        return Observable
            .from(batch)
            .subscribeOn(Schedulers.io())
            .map(c -> {
                Intermediate m = NetworkOperation2.call(c);
                logger.info("Done intermediate " + m.id);
                return m;
            });
    }, 4)
And then, I need to batch the results in a different size, like so:
    .buffer(25)
    .subscribeOn(Schedulers.newThread())
    .subscribe(list ->
        logger.info("Finished batch with " + list.size()));
The problem is that the Commands in the Observable are processed all at once, while I want them to be processed as they are needed.
Here is the log of what happens: (notice all 1000 commands are run at once, instead of called as needed)
Init command 0
Init command 1
Init command 2
...
Init command 999
Done intermediate 0
Done intermediate 1
...
Done intermediate 24
Finished batch with 25
Done intermediate 25
Done intermediate 26
...
Done intermediate 49
Finished batch with 25
...
QUESTION: Is there a way to pause the thread of the Observable so it doesn't emit all the commands at once, or something like this? I have tried the request() operator but I can't get it to work.
Thank you.

You need backpressure-aware sources and operators. The operators you are using support backpressure, but your source does not.
Do this instead:
myObs = Observable.range(1, 1000)
    .map(i -> NetworkOperation1.call(i));
Observable.range supports backpressure, so it will only emit when requested to do so.
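What "emit only when requested" means can be sketched without RxJava at all. The following is a minimal single-threaded model built on the JDK's java.util.concurrent.Flow interfaces (Java 9+). PullRangeSketch, RangePublisher and run are names made up for this illustration, and the code is not how Observable.range is implemented. The subscriber asks for a batch of items at a time, and the source never emits more than the outstanding demand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Flow;

class PullRangeSketch {
    /** Emits start..start+count-1, but never more items than were requested (assumes count > 0). */
    static final class RangePublisher implements Flow.Publisher<Integer> {
        final int start;
        final int count;

        RangePublisher(int start, int count) {
            this.start = start;
            this.count = count;
        }

        @Override
        public void subscribe(Flow.Subscriber<? super Integer> s) {
            s.onSubscribe(new Flow.Subscription() {
                int next = start;   // next value to emit
                long demand;        // items requested but not yet emitted
                boolean draining;   // true while the emit loop is running
                boolean done;       // completed or cancelled

                @Override
                public void request(long n) {
                    if (done || n <= 0) return;
                    demand += n;
                    if (draining) return; // re-entrant call from onNext: the outer loop picks it up
                    draining = true;
                    while (demand > 0 && !done) {
                        demand--;
                        s.onNext(next++);
                        if (next == start + count && !done) {
                            done = true;
                            s.onComplete();
                        }
                    }
                    draining = false;
                }

                @Override
                public void cancel() {
                    done = true;
                }
            });
        }
    }

    /** Subscribes with a consumer that requests `batch` items at a time and records what happens. */
    static List<String> run(int total, int batch) {
        List<String> log = new ArrayList<>();
        new RangePublisher(0, total).subscribe(new Flow.Subscriber<Integer>() {
            Flow.Subscription sub;
            int inBatch;

            public void onSubscribe(Flow.Subscription s) {
                sub = s;
                log.add("request " + batch);
                s.request(batch);
            }

            public void onNext(Integer i) {
                log.add("item " + i);
                if (++inBatch == batch) { // batch consumed: ask for the next one
                    inBatch = 0;
                    log.add("request " + batch);
                    sub.request(batch);
                }
            }

            public void onError(Throwable t) { log.add("error"); }

            public void onComplete() { log.add("done"); }
        });
        return log;
    }

    public static void main(String[] args) {
        run(4, 2).forEach(System.out::println);
    }
}
```

With run(4, 2) the log reads: request 2, item 0, item 1, request 2, item 2, item 3, request 2, done. The source in the question has no such demand check, which is why it runs through all 1000 commands before anything downstream is processed.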

Related

Complete all tasks, but no more than K tasks at the same time via Project Reactor

I'm a beginner in Project Reactor and think this should be pretty easy, but I can't find the solution.
I have N expensive tasks to do, and I want to implement something like a bounded semaphore in Java (do not request the next element until the current count of running tasks is less than K).
In short: complete all tasks, but run no more than K tasks at the same time.
Flux.range(1, 100)
    .parallel()
    .limit(K) // Something like this
    .doOnNext(i -> expensiveWork(i))
    .subscribe()
Found this post on SO, but it's not for Reactor. But the meaning is the same. Please, help.
Close to my real case:
httpClient.getMainPageAsMono()
    .flatMapMany(html -> Flux.fromIterable(getLinksFromPage(html)))
    .parallel(k)
    .runOn(Schedulers.boundedElastic())
    .flatMap(link -> {
        // THIS PART EXECUTES ALL LINKS AT THE SAME TIME
        // INSTEAD OF THROTTLING
        return client.getAnotherPageByLink(link);
    })
    .....
    .subscribe()
That is, if getLinksFromPage returns 1000 links, the next link should not be taken until one of the running client.getAnotherPageByLink(link) calls has finished.

Using just .parallel() will give you a ParallelFlux, but in order to tell the resulting ParallelFlux where to run each rail (and, by extension, to run rails in parallel) you have to use .runOn(Scheduler scheduler).
So we should use .parallel(int parallelism) with .runOn(Scheduler scheduler):
public static void main(String[] args) throws InterruptedException {
    int k = 3;
    Flux.range(1, 100)
        .parallel(k) // k rails
        .runOn(Schedulers.boundedElastic()) // the rails will run on this scheduler
        .doOnNext(i -> expensiveWork(i))
        .subscribe();
    Thread.currentThread().join(); // Just so program won't finish
}
private static void expensiveWork(Integer i) {
    Instant start = Instant.now();
    while (Duration.between(start, Instant.now()).getSeconds() < 5) ;
    System.out.println(Instant.now()+" - "+i+" - Done expensive work");
}
Output:
2021-06-12T13:46:58.445Z - 3 - Done expensive work
2021-06-12T13:46:58.445Z - 1 - Done expensive work
2021-06-12T13:46:58.445Z - 2 - Done expensive work
2021-06-12T13:47:03.453Z - 5 - Done expensive work
2021-06-12T13:47:03.453Z - 6 - Done expensive work
2021-06-12T13:47:03.453Z - 4 - Done expensive work
2021-06-12T13:47:08.453Z - 8 - Done expensive work
2021-06-12T13:47:08.453Z - 7 - Done expensive work
2021-06-12T13:47:08.453Z - 9 - Done expensive work
...
As you can see, we limited the number of tasks that are executed in parallel to k.

What about this solution? I removed parallel from the Flux in order to buffer 10 elements at a time. The elements of each buffer can then be handled in parallel.
public static final void main(String... args) {
    Flux.range(1, 1000)
        .buffer(10)
        .doOnNext(grp -> grp.parallelStream().forEach(p -> System.out.println(Instant.now() + " : " + p)))
        .doOnNext(grp -> sleep(1000)) // Wait for 1 second to see how the algorithm is working
        .doOnNext(grp -> System.out.println("####"))
        .subscribe();
}
private static void sleep(int millis) {
    try {
        Thread.sleep(millis);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}
Output is:
2021-06-12T14:16:23.760298200Z : 8
2021-06-12T14:16:23.760298200Z : 4
2021-06-12T14:16:23.760298200Z : 10
2021-06-12T14:16:23.760298200Z : 1
2021-06-12T14:16:23.760298200Z : 3
2021-06-12T14:16:23.760298200Z : 5
2021-06-12T14:16:23.760298200Z : 7
2021-06-12T14:16:23.760298200Z : 2
2021-06-12T14:16:23.760298200Z : 6
2021-06-12T14:16:23.760298200Z : 9
####
2021-06-12T14:16:24.784628Z : 17
2021-06-12T14:16:24.784628Z : 16
2021-06-12T14:16:24.784628Z : 20
2021-06-12T14:16:24.784628Z : 14
2021-06-12T14:16:24.784628Z : 11
2021-06-12T14:16:24.784628Z : 13
2021-06-12T14:16:24.784628Z : 18
2021-06-12T14:16:24.784628Z : 19
2021-06-12T14:16:24.784628Z : 12
2021-06-12T14:16:24.785801500Z : 15
As you can see, the elements are processed in parallel in groups of 10, one group per second.

This can be easily accomplished without parallel using an overloaded version of flatMap where you can specify concurrency:
flatMap(Function<? super T,? extends Publisher<? extends V>> mapper, int concurrency)
httpClient.getMainPageAsMono()
    .flatMapMany(html -> Flux.fromIterable(getLinksFromPage(html)))
    .flatMap(link -> client.getAnotherPageByLink(link), k)
    .....
    .subscribe()
Based on the code, this operation is expensive in terms of IO rather than CPU, so using ParallelFlux is not necessary.
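The "bounded semaphore" idea from the question can also be modeled with the plain JDK, which may help when reasoning about what the concurrency argument guarantees. This is only a sketch under assumptions: BoundedConcurrencySketch and runAll are hypothetical names, Thread.sleep stands in for the expensive call, and Reactor's flatMap does not block a thread the way acquire() does here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedConcurrencySketch {
    /** Runs `tasks` tasks with at most k in flight; returns {completed, peak concurrency}. */
    static int[] runAll(int tasks, int k) throws Exception {
        Semaphore permits = new Semaphore(k);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        AtomicInteger done = new AtomicInteger();
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                permits.acquire(); // do not hand out the next task until fewer than k are running
                futures.add(pool.submit(() -> {
                    try {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max); // remember the highest concurrency seen
                        Thread.sleep(20); // stand-in for the expensive work
                        done.incrementAndGet();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.decrementAndGet();
                        permits.release(); // lets the submitting loop start the next task
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for every task to finish
            }
        } finally {
            pool.shutdown();
        }
        return new int[] { done.get(), peak.get() };
    }

    public static void main(String[] args) throws Exception {
        int[] r = runAll(12, 3);
        System.out.println("completed=" + r[0] + ", peak concurrency=" + r[1]);
    }
}
```

All tasks complete, and the recorded peak never exceeds k, because a task only starts after a permit was acquired and releases it after it has stopped counting as in flight.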

Does flink streaming have cache/persist feature? (like spark)

I have a Flink streaming program that has branching logic after a long transformation. Will the long transformation be executed multiple times? Pseudo code:
env = getEnvironment();
DataStream<Event> inputStream = getInputStream();
tempStream = inputStream.map(very_heavy_computation_func)
output1 = tempStream.map(func1);
output1.addSink(sink1);
output2 = tempStream.map(func2);
output2.addSink(sink2);
env.execute();
Questions:
1. How many times would inputStream.map(very_heavy_computation_func) be executed? Once or twice?
2. If twice, how can I cache tempStream (or use another method) to avoid the previous transformation being executed multiple times?

You can actually answer (1) easily by just trying out more or less exactly your example:
public class TestProgram {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        SingleOutputStreamOperator<Integer> stream = env.fromElements(1, 2, 3)
            .map(i -> {
                System.out.println("Executed expensive computation for: " + i);
                return i;
            });
        stream.map(i -> i).addSink(new PrintSinkFunction<>());
        stream.map(i -> i).addSink(new PrintSinkFunction<>());
        env.execute();
    }
}
produces (on my machine, for example):
Executed expensive computation for: 3
Executed expensive computation for: 1
Executed expensive computation for: 2
9> 3
8> 2
8> 2
9> 3
7> 1
7> 1
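You can also find a more technical answer here which explains how records are replicated to downstream operators, rather than running the source/operator multiple times. In the output above, the expensive computation is logged once per element, while each sink line appears twice.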

Flux.range waits to emit more elements once 256 elements are reached

I wrote this code:
Flux.range(0, 300)
    .doOnNext(i -> System.out.println("i = " + i))
    .flatMap(i -> Mono.just(i)
        .subscribeOn(Schedulers.elastic())
        .delayElement(Duration.ofMillis(1000))
    )
    .doOnNext(i -> System.out.println("end " + i))
    .blockLast();
When running it, the first System.out.println shows that the Flux stops emitting numbers at the 256th element; it then waits for the older ones to complete before emitting new ones.
Why is this happening?
Why 256?

Why is this happening?
The flatMap operator can be characterized as an operator that (rephrased from the javadoc):
subscribes to its inners eagerly,
does not preserve the ordering of elements,
lets values from different inners interleave.
For this question the first point is the important one. Project Reactor restricts the number of in-flight inner sequences via the concurrency parameter.
While flatMap(mapper) uses the default value, the flatMap(mapper, concurrency) overload accepts this parameter explicitly.
The flatMap javadoc describes the parameter as:
The concurrency argument allows to control how many Publisher can be subscribed to and merged in parallel
Consider the following code using concurrency = 500:
Flux.range(0, 300)
    .doOnNext(i -> System.out.println("i = " + i))
    .flatMap(i -> Mono.just(i)
            .subscribeOn(Schedulers.elastic())
            .delayElement(Duration.ofMillis(1000)),
        500
    //  ^^^^^^^^^^
    )
    .doOnNext(i -> System.out.println("end " + i))
    .blockLast();
In this case there is no waiting:
i = 297
i = 298
i = 299
end 0
end 1
end 2
In contrast, if you pass 1 as the concurrency, the output will be similar to:
i = 0
end 0
i = 1
end 1
with a wait of one second before the next element is emitted.
Why 256?
256 is the default value for concurrency of flatMap.
Take a look at Queues.SMALL_BUFFER_SIZE:
public static final int SMALL_BUFFER_SIZE = Math.max(16,
    Integer.parseInt(System.getProperty("reactor.bufferSize.small", "256")));
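To see what that expression does for different settings of the reactor.bufferSize.small system property, here is a small stand-alone sketch. The class and method names (SmallBufferSizeSketch, smallBufferSize) are invented for the example; only the Math.max / parseInt expression is taken from the snippet above, and passing null here plays the role of an unset property.

```java
class SmallBufferSizeSketch {
    /** Mirrors the quoted expression; a null argument models the property not being set. */
    static int smallBufferSize(String propertyValue) {
        String effective = (propertyValue == null) ? "256" : propertyValue;
        return Math.max(16, Integer.parseInt(effective));
    }

    public static void main(String[] args) {
        System.out.println(smallBufferSize(null));   // property unset: default applies
        System.out.println(smallBufferSize("8"));    // values below 16 are raised to 16
        System.out.println(smallBufferSize("1024")); // larger values are used as given
    }
}
```

So the default of 256 is what applies when the property is not set, and the expression never yields less than 16.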

Why is CompletableFuture join/get faster in separate streams than using one stream

For the following program, I am trying to figure out why using two separate streams parallelizes the tasks, while using a single stream and calling join/get on the CompletableFuture makes them take as long as if they were processed sequentially.
public class HelloConcurrency {
    private static Integer sleepTask(int number) {
        System.out.println(String.format("Task with sleep time %d", number));
        try {
            TimeUnit.SECONDS.sleep(number);
        } catch (InterruptedException e) {
            e.printStackTrace();
            return -1;
        }
        return number;
    }
    public static void main(String[] args) {
        List<Integer> sleepTimes = Arrays.asList(1,2,3,4,5,6);
        System.out.println("WITH SEPARATE STREAMS FOR FUTURE AND JOIN");
        ExecutorService executorService = Executors.newFixedThreadPool(6);
        long start = System.currentTimeMillis();
        List<CompletableFuture<Integer>> futures = sleepTimes.stream()
            .map(sleepTime -> CompletableFuture.supplyAsync(() -> sleepTask(sleepTime), executorService)
                .exceptionally(ex -> { ex.printStackTrace(); return -1; }))
            .collect(Collectors.toList());
        executorService.shutdown();
        List<Integer> result = futures.stream()
            .map(CompletableFuture::join)
            .collect(Collectors.toList());
        long finish = System.currentTimeMillis();
        long timeElapsed = (finish - start)/1000;
        System.out.println(String.format("done in %d seconds.", timeElapsed));
        System.out.println(result);
        System.out.println("WITH SAME STREAM FOR FUTURE AND JOIN");
        ExecutorService executorService2 = Executors.newFixedThreadPool(6);
        start = System.currentTimeMillis();
        List<Integer> results = sleepTimes.stream()
            .map(sleepTime -> CompletableFuture.supplyAsync(() -> sleepTask(sleepTime), executorService2)
                .exceptionally(ex -> { ex.printStackTrace(); return -1; }))
            .map(CompletableFuture::join)
            .collect(Collectors.toList());
        executorService2.shutdown();
        finish = System.currentTimeMillis();
        timeElapsed = (finish - start)/1000;
        System.out.println(String.format("done in %d seconds.", timeElapsed));
        System.out.println(results);
    }
}
Output
WITH SEPARATE STREAMS FOR FUTURE AND JOIN
Task with sleep time 6
Task with sleep time 5
Task with sleep time 1
Task with sleep time 3
Task with sleep time 2
Task with sleep time 4
done in 6 seconds.
[1, 2, 3, 4, 5, 6]
WITH SAME STREAM FOR FUTURE AND JOIN
Task with sleep time 1
Task with sleep time 2
Task with sleep time 3
Task with sleep time 4
Task with sleep time 5
Task with sleep time 6
done in 21 seconds.
[1, 2, 3, 4, 5, 6]

The two approaches are quite different; let me try to explain it clearly.
1st approach: you start the async requests for all 6 tasks first, and only then call join on each of them to get the results.
2nd approach: you call join immediately after starting the async request for each task. For example, after starting the async request for task 1, calling join blocks until that task completes, and only then is the second task started.
Note: if you look at the output closely, in the 1st approach the output appears in random order, since all six tasks were executed asynchronously. In the 2nd approach all tasks were executed sequentially, one after another.
I believe you have an idea of how the stream map operation is performed; if not, you can get more information from here or here.
To perform a computation, stream operations are composed into a stream pipeline. A stream pipeline consists of a source (which might be an array, a collection, a generator function, an I/O channel, etc), zero or more intermediate operations (which transform a stream into another stream, such as filter(Predicate)), and a terminal operation (which produces a result or side-effect, such as count() or forEach(Consumer)). Streams are lazy; computation on the source data is only performed when the terminal operation is initiated, and source elements are consumed only as needed.
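The element-by-element behavior described in this answer can be observed directly with a sequential stream. The sketch below uses invented names (StreamOrderSketch, trace) and plain log entries in place of supplyAsync and join; it shows what OpenJDK's sequential streams do in practice for this pipeline shape, which is an observation about the implementation rather than a guarantee of the Stream specification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamOrderSketch {
    /** Records the order in which two chained map stages see each element. */
    static List<String> trace() {
        List<String> log = new ArrayList<>();
        Stream.of(1, 2, 3)
            .map(i -> { log.add("start " + i); return i; }) // stands in for supplyAsync(...)
            .map(i -> { log.add("join " + i); return i; })  // stands in for join()
            .collect(Collectors.toList());
        return log;
    }

    public static void main(String[] args) {
        System.out.println(trace());
    }
}
```

The log comes out as start 1, join 1, start 2, join 2, start 3, join 3: each element passes through both map stages before the next element is pulled from the source. With a real join in the second stage, task 2 is therefore not even submitted until task 1 has finished.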

The stream framework does not define the order in which map operations are executed on stream elements, because it is not intended for use cases in which that might be a relevant issue. As a result, the particular way your second version is executing is equivalent, essentially, to
List<Integer> results = new ArrayList<>();
for (Integer sleepTime : sleepTimes) {
    results.add(CompletableFuture
        .supplyAsync(() -> sleepTask(sleepTime), executorService2)
        .exceptionally(ex -> { ex.printStackTrace(); return -1; })
        .join());
}
...which is itself essentially equivalent to
List<Integer> results = new ArrayList<>();
for (Integer sleepTime : sleepTimes) {
    results.add(sleepTask(sleepTime));
}

@Deadpool answered it pretty well; I'm just adding my answer, which may help someone understand it better.
I was able to get an answer by adding more printing to both methods.
TLDR
2 stream approach: We start all 6 tasks asynchronously and then call join on each one of them, in a separate stream, to get the results.
1 stream approach: We call join immediately after starting each task. For example, after starting a thread for task 1, calling join makes the calling thread wait for task 1 to complete, and only then is the second task started asynchronously.
Note: Also, if we look at the output closely, in the 1 stream approach the output appears in sequential order, since all six tasks were executed in order. In the 2 stream approach all tasks were executed in parallel, hence the random order.
Note 2: If we replace stream() with parallelStream() in the 1 stream approach, it can behave like the 2 stream approach, provided the common fork-join pool has enough threads to start all six tasks at once.
More proof
I added more printing to the streams which gave the following outputs and confirmed the note above :
1 stream:
List<Integer> results = sleepTimes.stream()
    .map(sleepTime -> CompletableFuture.supplyAsync(() -> sleepTask(sleepTime), executorService2)
        .exceptionally(ex -> { ex.printStackTrace(); return -1; }))
    .map(f -> {
        int num = f.join();
        System.out.println(String.format("doing join on task %d", num));
        return num;
    })
    .collect(Collectors.toList());
WITH SAME STREAM FOR FUTURE AND JOIN
Task with sleep time 1
doing join on task 1
Task with sleep time 2
doing join on task 2
Task with sleep time 3
doing join on task 3
Task with sleep time 4
doing join on task 4
Task with sleep time 5
doing join on task 5
Task with sleep time 6
doing join on task 6
done in 21 seconds.
[1, 2, 3, 4, 5, 6]
2 streams:
List<CompletableFuture<Integer>> futures = sleepTimes.stream()
    .map(sleepTime -> CompletableFuture.supplyAsync(() -> sleepTask(sleepTime), executorService)
        .exceptionally(ex -> { ex.printStackTrace(); return -1; }))
    .collect(Collectors.toList());
List<Integer> result = futures.stream()
    .map(f -> {
        int num = f.join();
        System.out.println(String.format("doing join on task %d", num));
        return num;
    })
    .collect(Collectors.toList());
WITH SEPARATE STREAMS FOR FUTURE AND JOIN
Task with sleep time 2
Task with sleep time 5
Task with sleep time 3
Task with sleep time 1
Task with sleep time 4
Task with sleep time 6
doing join on task 1
doing join on task 2
doing join on task 3
doing join on task 4
doing join on task 5
doing join on task 6
done in 6 seconds.
[1, 2, 3, 4, 5, 6]

Only consume latest item with onBackpressureLatest()

I have a producer which emits items periodically and a consumer which is sometimes quite slow. It is important that the consumer only works with recent items. I thought onBackpressureLatest() is the perfect solution for this problem. So I wrote the following test code:
PublishProcessor<Integer> source = PublishProcessor.create();
source
    .onBackpressureLatest()
    .observeOn(Schedulers.from(Executors.newCachedThreadPool()))
    .subscribe(i -> {
        System.out.println("Consume: " + i);
        Thread.sleep(100);
    });
for (int i = 0; i < 10; i++) {
    System.out.println("Produce: " + i);
    source.onNext(i);
}
I expected it to log something like:
Produce: 0
...
Produce: 9
Consume: 0
Consume: 9
Instead, I get
Produce: 0
...
Produce: 9
Consume: 0
Consume: 1
...
Consume: 9
Neither onBackpressureLatest() nor onBackpressureDrop() has any effect. Only onBackpressureBuffer(i) causes an exception.
I use RxJava 2.1.9. Any ideas what the problem or my misunderstanding could be?

observeOn has an internal buffer (128 elements by default) that immediately picks up all of the source items, so onBackpressureLatest never has anything to drop and every item is consumed.
Edit:
The smallest buffer you can create is 1, which should provide the required pattern:
source.onBackpressureLatest()
    .observeOn(Schedulers.from(Executors.newCachedThreadPool()), false, 1)
    .subscribe(v -> { /* ... */ });
(the earlier delay + rebatchRequests combination is practically equivalent to this).
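The "keep only the latest" behavior that only shows up once nothing downstream is absorbing items can be pictured as a single overwritable slot. The sketch below is a plain-JDK model with made-up names (LatestSlotSketch, offer, poll); it is not RxJava's implementation, only an illustration of what a slow consumer sees when unconsumed items are overwritten.

```java
import java.util.concurrent.atomic.AtomicReference;

class LatestSlotSketch<T> {
    private final AtomicReference<T> slot = new AtomicReference<>();

    /** Producer side: overwrite whatever has not been consumed yet. */
    void offer(T item) {
        slot.set(item);
    }

    /** Consumer side: take the most recent item, or null if nothing new arrived. */
    T poll() {
        return slot.getAndSet(null);
    }

    public static void main(String[] args) {
        LatestSlotSketch<Integer> latest = new LatestSlotSketch<>();
        for (int i = 0; i < 10; i++) {
            latest.offer(i); // the consumer is busy, so older values are overwritten
        }
        System.out.println("Consume: " + latest.poll());
        System.out.println("Nothing left: " + (latest.poll() == null));
    }
}
```

A 128-element buffer between producer and consumer behaves like 128 such slots that are all free at the start, so with only ten items nothing is ever overwritten.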

I think the following is supposed to work, but I'm not entirely sure:
PublishProcessor<Integer> source = PublishProcessor.create();
source
    .onBackpressureLatest()
    .switchMap(item -> Flowable.just(item)) // <--
    .observeOn(
        Schedulers.from(Executors.newCachedThreadPool()))
    .subscribe(i -> {
        System.out.println("Consume: " + i);
        Thread.sleep(100);
    });
