How to parallelize a sequential stream in Java

I need to build a sort of parallel pipeline. The pipeline consists of 3 steps:
readLines(): read 100,000 lines from a file
firstQuery(): build and send a query to the DB using the 100,000 lines from readLines()
secondQuery(): build and send another query using the firstQuery() results
readLines() takes 1 second and returns Stream<List<Line>>
firstQuery() takes 10 seconds and returns Stream<List<FirstResult>>
secondQuery() takes 2 minutes and returns Stream<List<SecondResult>>
No matter what I do with Executors and futures, all methods are executed sequentially.
I suppose it's because readLines() returns Stream<List<Line>>, which can't be parallelized since reading from the file is sequential.
What is the right way to do something like this:
readLines() // limit concurrency to 2 (so that no more than two batches of lines are in memory at the same time)
// as soon as lines are read from the file, hand them off asynchronously:
1: List<Line> -> firstQuery() -> secondQuery()
2: List<Line> -> firstQuery() -> secondQuery()
3: List<Line> -> firstQuery() -> secondQuery()
4: List<Line> -> firstQuery() -> secondQuery()
... more here
10: List<Line> -> firstQuery() -> secondQuery()
How could it look like?
What I've tried:
adding parallel() to readLines(), which returns Stream<List<Line>>: no luck
adding parallel() and custom executors with Futures to firstQuery() -> secondQuery(): no luck, still sequential
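One possible shape is sketched below with CompletableFuture and a Semaphore to cap the number of batches in flight at 2. It assumes hypothetical per-batch overloads firstQuery(List<Line>) and secondQuery(...) instead of the whole-stream signatures above, so treat it as an outline rather than a drop-in implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

ExecutorService pool = Executors.newFixedThreadPool(2);
Semaphore inFlight = new Semaphore(2);                   // at most 2 batches in memory / in flight
List<CompletableFuture<?>> pending = new ArrayList<>();

readLines().forEach(batch -> {                           // Stream<List<Line>>, consumed sequentially
    try {
        inFlight.acquire();                              // block reading until a slot frees up
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new CompletionException(e);
    }
    pending.add(CompletableFuture
            .supplyAsync(() -> firstQuery(batch), pool)  // hypothetical per-batch overload
            .thenApply(first -> secondQuery(first))      // hypothetical per-batch overload
            .whenComplete((r, ex) -> inFlight.release())); // free the slot, even on failure
});

CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
pool.shutdown();

The Semaphore is just one way to bound memory; a bounded BlockingQueue feeding a small worker pool would achieve the same effect.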

Related

Bug in parallelStream in Java

Can someone tell me why this is happening, and whether it's expected behaviour or a bug?
List<Integer> a = Arrays.asList(1, 1, 3, 3);
a.parallelStream().filter(Objects::nonNull)
    .filter(value -> value > 2)
    .reduce(1, Integer::sum);
Answer: 10
But if I use stream() instead of parallelStream(), I get the right and expected answer, 7.
The first argument to reduce is called "identity", not "initialValue".
1 is not an identity element for addition; it is the identity for multiplication.
If you want to sum the elements, you need to provide 0.
Java uses "identity" rather than "initialValue" because this little trick makes it easy to parallelize reduce.
In parallel execution, each thread runs the reduce on a part of the stream, and when the threads are done, their partial results are combined using the very same reduce function.
It will look something like this:
mainThread:
start thread1;
start thread2;
wait till both are finished;
thread1:
return sum(1, 3); // your reduce function applied to a part of the stream
thread2:
return sum(1, 3);
// when thread1 and thread2 are finished:
mainThread:
return sum(sum(1, resultOfThread1), sum(1, resultOfThread2));
= sum(sum(1, 4), sum(1, 4))
= sum(5, 5)
= 10
I hope you can see what happens and why the result is not what you expected.
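For completeness, here is how the corrected call could look with 0 as the additive identity; only then is the result the same sequentially and in parallel (a small sketch reusing the list from the question):

import java.util.Arrays;
import java.util.List;
import java.util.Objects;

List<Integer> a = Arrays.asList(1, 1, 3, 3);
int sum = a.parallelStream()
        .filter(Objects::nonNull)
        .filter(value -> value > 2)
        .reduce(0, Integer::sum);   // 3 + 3 = 6, whether sequential or parallel
// Often clearer still: a.stream().filter(...).mapToInt(Integer::intValue).sum()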

How to handle concurrency in API?

Context:
I have a storage/DB where I store a <String, List<Integer>> mapping, i.e. <uniqueName, numbersPresentForTheName>.
I am writing an API that, for each request, reads the current list for a name, adds the number from the request, and writes the updated list back. With concurrent requests, one update can overwrite another.
How can we make sure this does not happen?
The usual answer is that you use a compare-and-update operation, rather than just an update operation.
t = 1 -> call readStorage
t = 10 -> get response from readStorage: <ABCD, [1,2,3]>
t = 11 -> compareAndUpdateStorage ABCD, [1,2,3] -> [1,2,3,4]
t = 5 -> Request 1 to the API: <ABCD, [5]>
t = 6 -> call readStorage
t = 11 -> get response from readStorage: <ABCD, [1,2,3]> (note that we don't see 4, because the first call hasn't committed it yet)
t = 13 -> compareAndUpdateStorage ABCD, [1,2,3] -> [1,2,3,5] (this call will fail, because the current value is now [1,2,3,4])
In other words, we're trying to address the lost edit problem by ensuring that the first edit always wins.
That's the first piece; the rest of the work is choosing an appropriate retry strategy.
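As an in-memory illustration only (a real database would use a conditional UPDATE, an optimistic-locking version column, or the store's own compare-and-set primitive; the map and method names here are assumptions), the compare-and-update-with-retry loop could look like this:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

ConcurrentHashMap<String, List<Integer>> storage = new ConcurrentHashMap<>();

void addNumber(String name, int number) {
    while (true) {
        List<Integer> current = storage.get(name);                             // read
        List<Integer> updated = (current == null) ? new ArrayList<>() : new ArrayList<>(current);
        updated.add(number);                                                   // modify a copy
        boolean swapped = (current == null)
                ? storage.putIfAbsent(name, updated) == null                   // only if still absent
                : storage.replace(name, current, updated);                     // only if unchanged since the read
        if (swapped) {
            return;                                                            // our edit won
        }
        // someone else updated the entry in between: re-read and retry
    }
}

The loser of the race retries with the fresh value, which is exactly the retry strategy mentioned above.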
If your two calls are on separate threads, and the steps of the two calls are interleaved over time, then you are seeing correct behavior.
If both threads retrieved a set of three values, and both threads replaced the existing set in storage with a new set of four values, then whichever thread saves their set last wins.
The key idea to understand here is atomic actions. The retrieval of the data and the update of the data are separate actions in your system. To achieve your aim, you must combine those two actions into one. This combining is the purpose of transactions and locks in databases.
See also the correct answer by VoiceOfUnreason along the same lines.

Reactor Pattern for Consuming Whole Stream and Starting New Flux from Result

Is there a reactor pattern to consume a whole stream and then create a new flux from the result? I have a flux sourced from a file, which may be larger than memory. In the middle of the stream, I need to perform some aggregate operation, such as sorting or grouping, which cannot be done entirely in-memory. I then need to continue processing the stream from the result of the grouping/sorting operation. Is there a reactor-ish way of doing it in a single Flux?
Flux<T> -> file -> grouping / sorting -> Flux<T>
The only way I can think of is through multiple fluxes
flux.map(t -> ...).doOnNext(t -> writeToFile(...));
externalProcessFile();
Flux<T> f = Flux.using(() -> Files.lines(Paths.get("/path/to/file")), Flux::fromStream, BaseStream::close).map(t -> nextFunc(t))...
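If the goal is to express all three phases as one chained Flux, something along these lines might work (a sketch only: source, writeToFile, externalProcessFile and nextFunc stand for the pieces already shown above, and the blocking file work would still need to be scheduled appropriately):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.BaseStream;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

Flux<?> result = source
        .doOnNext(t -> writeToFile(t))                          // 1. spill the stream to disk
        .then(Mono.fromRunnable(() -> externalProcessFile()))   // 2. external sort / group, off-heap
        .thenMany(Flux.defer(() ->                              // 3. re-read the processed file lazily
                Flux.using(
                        () -> Files.lines(Paths.get("/path/to/file")),
                        Flux::fromStream,
                        BaseStream::close)))
        .map(t -> nextFunc(t));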

Java Reactor StepVerifier.withVirtualTime loop: repeatedly check with "expectNoEvent()", "expectNext()" and "thenAwait()"

I am learning Java 11 reactor. I have seen this example:
StepVerifier.withVirtualTime(() -> Flux.interval(Duration.ofSeconds(1)).take(3600))
.expectSubscription()
.expectNextCount(3600);
This example just checks that, for a Flux<Long> that emits an incrementing value every second for one hour, the final count is 3600.
But, is there any way to check the counter repeatedly after every second?
I know this:
.expectNoEvent(Duration.ofSeconds(1))
.expectNext(0L)
.thenAwait(Duration.ofSeconds(1))
But I have seen no way to repeatedly check this after every second, like:
.expectNoEvent(Duration.ofSeconds(1))
.expectNext(i)
.thenAwait(Duration.ofSeconds(1))
where i increments up to 3600. Is there?
PS:
I tried adding verifyComplete() at the end, but in such a long-running test it never finishes. Do I have to add it, or can I just leave it out?
You can achieve what you want by using expectNextSequence. You have to pass an Iterable containing every element you expect to arrive. See my example below:
var longRange = LongStream.range(0, 3600)
.boxed()
.collect(Collectors.toList());
StepVerifier
.withVirtualTime(() -> Flux.interval(Duration.ofSeconds(1)).take(3600))
.expectSubscription()
.thenAwait(Duration.ofHours(1))
.expectNextSequence(longRange)
.expectComplete().verify();
If you don't add verifyComplete() or expectComplete().verify(), the JUnit test won't wait for elements to arrive from the flux and will just terminate.
For further reference see the JavaDoc of verify():
this method will block until the stream has been terminated
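If you really do want to assert the counter tick by tick, the fluent builder can also be driven from a plain loop; this is only a sketch of that idea, using the same virtual-time setup as above:

import java.time.Duration;
import reactor.core.publisher.Flux;
import reactor.test.StepVerifier;

StepVerifier.Step<Long> step = StepVerifier
        .withVirtualTime(() -> Flux.interval(Duration.ofSeconds(1)).take(3600))
        .expectSubscription()
        .expectNoEvent(Duration.ofSeconds(1))
        .expectNext(0L);
for (long i = 1; i < 3600; i++) {
    step = step.thenAwait(Duration.ofSeconds(1))   // advance virtual time by one second
               .expectNext(i);                     // and expect exactly the next value
}
step.expectComplete().verify();                    // the test still needs a terminal verification step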

Hadoop Recursive Map

I have a requirement that my mapper may, in some cases, produce a new key/value pair for another mapper to handle. Is there a sane way to do this? I've thought about writing my own custom input format (a queue?) to achieve this. Any ideas? Thanks!
EDIT: I should clarify
Method 1
Map Step 1
(foo1, bar1) -> out1
(foo2, bar2) -> out2
(foo3, bar3) -> (fooA, barA), (fooB, barB)
(foo4, bar4) -> (fooC, barC)
Reduction Step 1:
(out1) -> ok
(out2) -> ok
((fooA, barA), (fooB, barB)) -> create Map Step 2
((fooC, barC)) -> also send this to Map Step 2
Map Step 2:
(fooA, barA) -> out3
(fooB, barB) -> (fooD, barD)
(fooC, barC) -> out4
Reduction Step 2:
(out3) -> ok
((fooD, barD)) -> create Map Step 3
(out4) -> ok
Map Step 3:
(fooD, barD) -> out5
Reduction Step 3:
(out5) -> ok
-- no more map steps. finished --
So it's fully recursive. Some key/values emit output for reduction, some generate new key/values for mapping. I don't really know how many map or reduce steps I may encounter on a given run.
Method 2
Map Step 1
(foo1, bar1) -> out1
(foo2, bar2) -> out2
(foo3, bar3) -> (fooA, barA), (fooB, barB)
(foo4, bar4) -> (fooC, barC)
(fooA, barA) -> out3
(fooB, barB) -> (fooD, barD)
(fooC, barC) -> out4
(fooD, barD) -> out5
Reduction Step 1:
(out1) -> ok
(out2) -> ok
(out3) -> ok
(out4) -> ok
(out5) -> ok
This method would have the mapper feed its own input list. I'm not sure which way would be simpler to implement in the end.
The "Method 1" way of doing recursion through Hadoop forces you to run the full dataset through both Map and reduce for each "recursion depth". This implies that you must be sure how deep this can go AND you'll suffer a massive performance impact.
Can you say for certain that the recursion depth is limited?
If so then I would definitely go for "Method 2" and actually build the mapper in such a way that does the required recursion within one mapper call.
It's simpler and saves you a lot of performance.
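A rough sketch of that idea, using the standard org.apache.hadoop.mapreduce API; the Text key/value types and the expand() hook are placeholders for whatever the actual job uses:

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecursiveMapper extends Mapper<Text, Text, Text, Text> {

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // Instead of emitting (fooA, barA) for another map step, push it onto a
        // local queue and keep processing until the queue drains.
        Deque<String[]> queue = new ArrayDeque<>();
        queue.add(new String[] { key.toString(), value.toString() });
        while (!queue.isEmpty()) {
            String[] pair = queue.poll();
            List<String[]> generated = expand(pair[0], pair[1]);     // your existing map logic
            if (generated.isEmpty()) {
                context.write(new Text(pair[0]), new Text(pair[1])); // final pair goes to the reducer
            } else {
                queue.addAll(generated);                             // secondary pairs, handled in the same call
            }
        }
    }

    // Placeholder: return newly generated pairs, or an empty list if this pair is final output.
    private List<String[]> expand(String k, String v) {
        return List.of();
    }
}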
Use Oozie [a grid workflow definition language] to string together two M/R jobs, with the first one having only a mapper.
http://yahoo.github.com/oozie
To the best of my understanding, the Hadoop MR framework plans at the beginning of the job which map tasks will be executed, and is not prepared for new map tasks to appear dynamically.
I would suggest two possible solutions:
a) If you emit additional pairs during the map phase, feed them back to the same mapper. The mapper takes its usual arguments and, after processing them, looks into some kind of internal local queue for additional pairs to process. This works well if the sets of secondary pairs are small and data locality is not that important.
b) If you are indeed processing directories or something similar, you can iterate over the structure in the main() of the job and build all the splits you need right away.
