Reactor Flux: how to parse file with header - java

Recently I started using Project Reactor 3.3 and I don't know the best way to handle a flux of lines where the first line holds the column names, which are then used to process/convert all the other lines. Right now I'm doing it this way:
Flux<String> lines = ....;
Mono<String[]> columns = Mono.from(lines.take(1).map(header -> header.split(";"))); // getting first line
Flux<SomeDto> objectFlux = lines.skip(1) // skip first line
    .flatMapIterable(row -> // iterating over lines
        columns.map(cols -> convert(cols, row))); // convert line into SomeDto object
So is it the right way?

So is it the right way?
There's always more than one way to cook an egg - but the code you have there seems odd / suboptimal for two main reasons:
I'd assume it's one line per record / DTO you want to extract, so it's a bit odd you're using flatMapIterable() rather than flatMap()
You're going to resubscribe to lines once for each line, when you re-evaluate that Mono. That's almost certainly not what you want to do. (Caching the Mono helps, but you'd still resubscribe at least twice.)
Instead you may want to look at using switchOnFirst(), which will enable you to dynamically transform the Flux based on the first element (the header in your case.) This means you can do something like so:
lines
    .switchOnFirst((signal, flux) -> flux.zipWith(Flux.<String[]>just(signal.get().split(";")).repeat()))
    .map(row -> convert(row.getT1(), row.getT2()))
Note this is a bare-bones example; in real-world use you'll need to check whether the signal actually has a value, as per the docs:
Note that the source might complete or error immediately instead of emitting, in which case the Signal would be onComplete or onError. It is NOT necessarily an onNext Signal, and must be checked accordingly.
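For completeness, here's a hedged sketch of what that check might look like, reusing the convert(cols, row) helper, SomeDto type, and ";" delimiter from the question (and skipping the header row before mapping):
Flux<SomeDto> objectFlux = lines.switchOnFirst((signal, flux) -> {
    if (!signal.hasValue()) {
        // Source completed or errored before emitting a header line:
        // just propagate the termination signal downstream.
        return flux.ofType(SomeDto.class);
    }
    String[] cols = signal.get().split(";");
    return flux.skip(1)                      // drop the header line itself
               .map(row -> convert(cols, row));
});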

How to handle concurrency in API?

Context:
I have a storage/DB where I store a <String, List<Integer>> -> <uniqueName, numbersPresentForTheName> mapping each time:
I am writing an API that does the following:
How can we make sure this does not happen?
The usual answer is that you use a compare-and-update operation, rather than just an update operation.
t = 1 -> call readStorage
t = 10 -> get response from readStorage <ABCD, [1,2,3]>
t = 11 -> compareAndUpdateStorage ABCD, [1,2,3] -> [1,2,3,4]
t = 5 -> Request 2 to the API: <ABCD, [5]>
t = 6 -> call readStorage
t = 11 -> get response from readStorage <ABCD, [1,2,3]> (note that we didn't get 4, as the first call hadn't stored 4 yet)
t = 13 -> compareAndUpdateStorage ABCD, [1,2,3] -> [1,2,3,5] (this call will fail, because the current value is [1,2,3,4])
In other words, we're trying to address the lost edit problem by ensuring that the first edit always wins.
That's the first piece; the rest of the work is choosing an appropriate retry strategy.
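As a rough in-memory illustration (not from the question), here is a minimal sketch of the compare-and-update-then-retry pattern using ConcurrentHashMap.replace, which only succeeds if the stored value is still the one that was read:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

class Storage {
    private final ConcurrentHashMap<String, List<Integer>> storage = new ConcurrentHashMap<>();

    void addNumber(String name, int number) {
        while (true) { // retry until the compare-and-update succeeds
            List<Integer> current = storage.get(name);
            if (current == null) {
                // No entry yet: first writer wins, otherwise loop and retry.
                if (storage.putIfAbsent(name, List.of(number)) == null) {
                    return;
                }
                continue;
            }
            List<Integer> updated = new ArrayList<>(current);
            updated.add(number);
            // Atomic compare-and-update: succeeds only if the stored value
            // is still `current`; if another request got there first, retry.
            if (storage.replace(name, current, updated)) {
                return;
            }
        }
    }
}
In a real storage/DB the same idea usually shows up as optimistic locking, for example with a version column or an ETag.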
If your two calls are on separate threads, and the steps of the two calls are interleaved over time, then you are seeing correct behavior.
If both threads retrieved a set of three values, and both threads replaced the existing set in storage with a new set of four values, then whichever thread saves their set last wins.
The key idea to understand here is atomic actions. The retrieval of the data and the update of the data in your system are separate actions. To achieve your aim, you must combine those two actions into one. This combining is the purpose of transactions and locks in databases.
See also the correct Answer by VoiceOfUnreason along the same lines.
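As a hedged in-memory illustration of combining the two actions into one (the class and method names are assumptions): ConcurrentHashMap.compute runs the whole read-modify-write as a single atomic step per key, which is the role a transaction or lock plays in a real database.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

class AtomicStorage {
    private final ConcurrentHashMap<String, List<Integer>> storage = new ConcurrentHashMap<>();

    void addNumber(String name, int number) {
        storage.compute(name, (key, current) -> {
            List<Integer> updated = (current == null) ? new ArrayList<>() : new ArrayList<>(current);
            updated.add(number);
            return updated; // read, modify, and write happen as one atomic step for this key
        });
    }
}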

Sample all but first elements from flux in project reactor

In Project Reactor's Flux there is a sample method (Flux#sample, see the javadoc). It changes the flux so that it emits elements only at the end of each specified period.
Is it possible to tweak this behaviour and achieve this: emit the first element instantly, then start sampling with a delay from the 2nd element up to the end. Basically I want to exclude the first (and only the first) element from sampling so that it is emitted without the initial wait.
Would it be possible to achieve using built-in operators ? If not then does anybody have an idea how to approach this problem ?
Here is a simplest example of what I want to achieve :
Flux<String> inputFlux = Flux.just("first", "second", "third").delayElements(Duration.ofMillis(400));
Flux<String> transformed = /*do some magic with input flux*/;
StepVerifier.create(transformed)
    .expectNext("first") // first should always be emitted instantly
    // second arrives 400ms after first
    // third arrives 400ms after second
    .expectNoEvent(Duration.ofSeconds(1))
    .expectNext("third") // after the sample period the last received element should be emitted
    .verifyComplete();
By turning the source flux myFlux into a hot flux, you can easily achieve this:
Flux<T> myFlux;
// publish().refCount(2) makes the flux hot and only connects to the upstream
// once both downstream branches below have subscribed.
Flux<T> sharedFlux = myFlux.publish().refCount(2);
Flux<T> first = sharedFlux.take(1);                 // first element, emitted immediately
Flux<T> sampledRest = sharedFlux.skip(1).sample(Duration.ofMillis(whatever)); // the rest, sampled
return Flux.merge(first, sampledRest);
You could achieve it with the Flux#sample(org.reactivestreams.Publisher<U>) method:
yourFlux.take(1)
.mergeWith(yourFlux.sample(Flux.interval(yourInterval)
.delaySubscription(yourFlux.take(1))))

Go back 'n' lines in file using Stream.lines

I need to build an application which scans through a large number of files. These files contain blocks with some data about a session, in which each line has a different value. E.g.: "=ID: 39487".
At that point I have that line, but the problem I now face is that I need the value n lines above that ID. I was thinking about an Iterator, but it only has forward methods. I also thought about saving the results in a List, but that defeats the purpose of using a Stream, and some files are huge, so that would cause memory problems.
I was wondering if something like this is possible using the Stream API (Files)? Or perhaps a better question, is there a better way to approach this?
Stream<String> lines = Files.lines(Paths.get(file.getName()));
Iterator<String> search = lines.iterator();
You can't arbitrarily read backwards and forwards through the file with the same reader (no matter if you're using streams, iterators, or a plain BufferedReader.)
If you need:
m lines before a given line
n lines after the given line
You don't know the value of m and n in advance, until you reach that line
...then you essentially have three options:
Read the whole file once, keep it in memory, and then your task is trivial (but this uses the most memory.)
Read the whole file once, mark the line numbers that you need, then do a second pass where you extract the lines you require.
Read the whole file once, storing some form of metadata about line lengths as you go, then use a RandomAccessFile to extract the specific bits you need without having to read the whole file again.
I'd suggest given the files are huge, the second option here is probably the most realistic. The third will probably give you better performance, but will require much more in the way of development effort.
As an alternative if you can guarantee that both n and m are below a certain value, and that value is a reasonable size - you could also just keep a certain number of lines in a buffer as you're processing the file, and read through that buffer when you need to read lines "backwards".
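A hedged sketch of that buffered approach (the m parameter, marker string, and class name are illustrative assumptions):
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.stream.Stream;

class SessionScanner {

    static void scan(String file, int m) {
        Deque<String> previous = new ArrayDeque<>(m); // last m lines seen so far
        try (Stream<String> lines = Files.lines(Paths.get(file))) {
            lines.forEach(line -> {
                if (line.contains("=ID: 39487")) {
                    // `previous` holds up to m lines that came before this one
                    previous.forEach(System.out::println);
                }
                if (previous.size() == m) {
                    previous.removeFirst(); // drop the oldest buffered line
                }
                previous.addLast(line);
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}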
Try my library, abacus-util:
try (Reader reader = new FileReader(yourFile)) {
    StreamEx.of(reader)
        .sliding(n, n, ArrayList::new)
        .filter(l -> l.get(l.size() - 1).contains("=ID: 39487"))
        ./* then do your work */
}
No matter how big your file is, as long as n is a small number, not millions.

RxJava 2 operator combination that delivers only the first Value from a bunch

The idea is that when I call publishSubject.onNext(someValue) multiple times, I need to get only one value, like the debounce operator does, but debounce delivers the last value, and I need to skip all values except the first in a bunch until I stop calling onNext() for 1 sec.
I've tried something like throttleFirst(1000, TimeUnit.MILLISECONDS), but it doesn't work like debounce: it just opens a window after every delivery and, after 1 sec, immediately delivers the next value.
Try this:
// Observable<T> stream = ...;
stream.window(stream.debounce(1, TimeUnit.SECONDS))
      .flatMap(w -> w.take(1));
Explanation: If I understand you correctly, you want to emit an item only if none has arrived during the previous 1 second. This is equivalent to getting the first element following an item debounced by 1 second.
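A hedged, self-contained usage sketch of that approach against a PublishSubject (RxJava 2; the values and timings are illustrative):
import io.reactivex.subjects.PublishSubject;
import java.util.concurrent.TimeUnit;

class FirstInBunchDemo {
    public static void main(String[] args) throws InterruptedException {
        PublishSubject<Integer> subject = PublishSubject.create();

        subject.window(subject.debounce(1, TimeUnit.SECONDS)) // close a window after 1s of silence
               .flatMap(w -> w.take(1))                       // keep only the first item per window
               .subscribe(v -> System.out.println("emitted: " + v));

        subject.onNext(1); // emitted: first of the bunch
        subject.onNext(2); // skipped
        subject.onNext(3); // skipped
        Thread.sleep(1500); // 1s of silence ends the bunch

        subject.onNext(4); // emitted: first of the next bunch
        Thread.sleep(1500);
        subject.onComplete();
    }
}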
You can use the first operator. Like:
Observable.first()
It will take only the first value

Enhance the degree of parallelization of groupReduce transformation

In my Flink program I transform my data using a flatMap operation which divides several blocks of data into multiple smaller blocks. These blocks have a "position" attribute which describes their position in the respective original block. Now I use a groupReduce which needs to transform all small blocks that share the same "position" attribute, so it should be easily distributable over multiple nodes. But when I run my program on multiple nodes, the groupReduce is executed with a dop of 1.
I guess this is because I have only one DataSet, but it seems that a GroupedDataSet is not available in the Flink Java API. Is there another way to increase the dop of my groupReduce transformation?
Here is the code I am using (dummy code ignoring "details"):
DataSet<SlicedTile> slicedTiles = tiles.flatMap()
    .groupBy(position)
    .sortGroup(time)
    .getDataSet();
// Until here the dop is correct
DataSet<SlicedTile> processedSlicedTiles = slicedTiles.reduceGroup;
The problem with your code is the getDataSet() call. It returns the input of the grouping operation. Hence, the dataset represented by slicedTiles is neither grouped nor are its groups sorted but instead it is the result of the flatMap transformation and the groupBy and sortGroup calls are not considered in the program at all.
Applying a groupReduce (or reduce) operation on a non-grouped dataset is always a non-parallel operation because all elements of the input data set are processed as a single group.
Logically, the three transformations groupBy().sortGroup().reduceGroup() belong together and are translated into a single groupReduce operator (maybe with an additional combiner if the GroupReduceFunction is combinable).
If you change your implementation as follows, it should work as expected.
DataSet<SlicedTile> slicedTiles = tiles.flatMap()
    .groupBy(position)
    .sortGroup(time)
    .reduceGroup(yourFunction);
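For illustration, a hedged sketch of what yourFunction might look like as a Flink GroupReduceFunction (the per-group processing body is an assumption):
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.util.Collector;

public class ProcessSlicedTiles implements GroupReduceFunction<SlicedTile, SlicedTile> {

    @Override
    public void reduce(Iterable<SlicedTile> tilesAtPosition, Collector<SlicedTile> out) {
        // All SlicedTiles sharing the same "position" arrive here together,
        // sorted by "time" thanks to groupBy(position).sortGroup(time).
        for (SlicedTile tile : tilesAtPosition) {
            out.collect(tile); // replace with the actual per-group processing
        }
    }
}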
I will open a JIRA issue to add JavaDocs to the Grouping.getDataSet() method to document the behavior of this function.
