Context:
I have a storage/DB where I store a <String, List<Integer>> -> <uniqueName, numbersPresentForTheName> mapping.
I am writing an API that, for each request, reads the current list for a name, adds the request's numbers to it, and writes the result back.
How can we make sure that two concurrent requests don't overwrite each other's updates?
The usual answer is that you use a compare-and-update operation, rather than just an update operation.
t = 1 -> call readStorage
t = 10 -> get response from readStorage <ABCD, [1,2,3]>
t = 11 -> compareAndUpdateStorage ABCD, [1,2,3] -> [1,2,3,4]
t = 5 -> Request 1 to the API: <ABCD, [5]>
t = 6 -> call readStorage
t = 11 -> get response from readStorage <ABCD, [1,2,3]> (note that we didn't get 4, as the first call hadn't updated it yet)
t = 13 -> compareAndUpdateStorage ABCD, [1,2,3] -> [1,2,3,5] (this call will fail, because the current value is [1,2,3,4])
In other words, we're trying to address the lost edit problem by ensuring that the first edit always wins.
That's the first piece; the rest of the work is choosing an appropriate retry strategy.
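As a minimal sketch of that loop, assuming a hypothetical storage client whose readStorage and compareAndUpdateStorage methods behave like the calls in the timeline above (compareAndUpdateStorage returning false when the stored value no longer matches the expected one):

void addNumbers(String name, List<Integer> newNumbers) {
    while (true) {
        List<Integer> current = storage.readStorage(name);   // e.g. [1, 2, 3]
        List<Integer> updated = new ArrayList<>(current);
        updated.addAll(newNumbers);                           // e.g. [1, 2, 3, 5]
        // Succeeds only if the stored value is still exactly `current`;
        // otherwise another writer got there first, so re-read and retry.
        if (storage.compareAndUpdateStorage(name, current, updated)) {
            return;
        }
        // A real implementation would cap the retries or back off here.
    }
}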
If your two calls are on separate threads, and the steps of the two calls are interleaved over time, then you are seeing correct behavior.
If both threads retrieved a set of three values, and both threads replaced the existing set in storage with a new set of four values, then whichever thread saves their set last wins.
The key idea to understand here is atomic actions. The retrieval of the data and the update of the data in your system are separate actions. To achieve your aim, you must combine those two actions into one. This combining is the purpose of transactions and locks in databases.
See also the correct Answer by VoiceOfUnreason along the same line.
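If the storage happens to be an in-process map rather than an external database, the same "combine read and update into one action" idea can be sketched with an atomic per-key compute; a rough illustration using ConcurrentHashMap (for a real database the equivalent would be a transaction or a row lock):

ConcurrentHashMap<String, List<Integer>> store = new ConcurrentHashMap<>();

void addNumber(String name, int number) {
    // compute() runs the remapping function atomically for that key, so the
    // read-modify-write below cannot interleave with another writer's update.
    store.compute(name, (key, current) -> {
        List<Integer> updated = (current == null) ? new ArrayList<>() : new ArrayList<>(current);
        updated.add(number);
        return updated;
    });
}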
Recently I started using Project Reactor 3.3 and I don't know the best way to handle a Flux of lines where the first line contains the column names, which are then used to process/convert all the other lines. Right now I'm doing it this way:
Flux<String> lines = ....;
Mono<String[]> columns = Mono.from(lines.take(1).map(header -> header.split(";"))); // getting first line
Flux<SomeDto> objectFlux = lines.skip(1)                // skip first line
    .flatMapIterable(row ->                             // iterating over lines
        columns.map(cols -> convert(cols, row)));       // convert line into SomeDto object
So is it the right way?
So is it the right way?
There's always more than one way to cook an egg - but the code you have there seems odd / suboptimal for two main reasons:
I'd assume it's one line per record / DTO you want to extract, so it's a bit odd you're using flatMapIterable() rather than flatMap()
You're going to resubscribe to lines once for each line, when you re-evaluate that Mono. That's almost certainly not what you want to do. (Caching the Mono helps, but you'd still resubscribe at least twice.)
Instead, you may want to look at using switchOnFirst(), which will enable you to dynamically transform the Flux based on the first element (the header, in your case). This means you can do something like so:
lines
    .switchOnFirst((signal, flux) -> flux.zipWith(Flux.<String[]>just(signal.get().split(";")).repeat()))
    .map(row -> convert(row.getT1(), row.getT2()))
Note this is a bare-bones example; in real-world use you'll need to check whether the signal actually has a value, as per the docs:
Note that the source might complete or error immediately instead of emitting, in which case the Signal would be onComplete or onError. It is NOT necessarily an onNext Signal, and must be checked accordingly.
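For illustration, a slightly more defensive sketch along those lines, reusing convert() and SomeDto from the question and also skipping the header row itself before converting (assumptions: semicolon-separated header, and an empty or failed source should simply terminate without converting anything):

Flux<SomeDto> objectFlux = lines.switchOnFirst((signal, flux) -> {
    if (!signal.hasValue()) {
        // The source completed or errored before emitting a header line;
        // propagate that termination without trying to convert anything.
        return flux.thenMany(Flux.<SomeDto>empty());
    }
    String[] cols = signal.get().split(";");
    return flux.skip(1)                          // drop the header line itself
               .map(row -> convert(cols, row));  // convert each remaining line
});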
I am learning Java 11 reactor. I have seen this example:
StepVerifier.withVirtualTime(() -> Flux.interval(Duration.ofSeconds(1)).take(3600))
    .expectSubscription()
    .expectNextCount(3600);
This example just checks that, for a Flux<Long> that emits an incrementing value every second for one hour, the final count is 3600.
But, is there any way to check the counter repeatedly after every second?
I know this:
.expectNoEvent(Duration.ofSeconds(1))
.expectNext(0L)
.thenAwait(Duration.ofSeconds(1))
But I have seen no way to repeatedly check this after every second, like:
.expectNoEvent(Duration.ofSeconds(1))
.expectNext(i)
.thenAwait(Duration.ofSeconds(1))
where i increments up to 3600. Is there?
PS:
I tried adding verifyComplete() at the end, but in such a long-running test it never finishes. Do I have to add it, or can I just ignore it?
You can achieve what you want by using expectNextSequence. You have to pass an Iterable that includes every element you expect to arrive. See my example below:
var longRange = LongStream.range(0, 3600)
        .boxed()
        .collect(Collectors.toList());

StepVerifier
        .withVirtualTime(() -> Flux.interval(Duration.ofSeconds(1)).take(3600))
        .expectSubscription()
        .thenAwait(Duration.ofHours(1))
        .expectNextSequence(longRange)
        .expectComplete().verify();
If you don't add verifyComplete() or expectComplete().verify(), then the JUnit test won't wait for elements to arrive from the flux and will just terminate.
For further reference see the JavaDoc of verify():
this method will block until the stream has been terminated
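If you really do want the per-second style of assertion from the question, one option (a sketch, not the only way) is to build the repeated thenAwait/expectNext steps in a plain loop over StepVerifier.Step:

StepVerifier.Step<Long> step = StepVerifier
        .withVirtualTime(() -> Flux.interval(Duration.ofSeconds(1)).take(3600))
        .expectSubscription();

for (long i = 0; i < 3600; i++) {
    step = step.thenAwait(Duration.ofSeconds(1))  // advance virtual time by one tick
               .expectNext(i);                    // assert the counter value for that tick
}

step.verifyComplete();  // take(3600) completes the flux, so this returns promptly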
I came across the following code snippet using Apache Spark:
JavaRDD<String> lines = new JavaSparkContext(sparkSession.sparkContext()).textFile("src\\main\\resources\\data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2(s, 1));
System.out.println(pairs.collect());
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
System.out.println("Reduced data: " + counts.collect());
My data.txt is as follows:
Mahesh
Mahesh
Ganesh
Ashok
Abnave
Ganesh
Mahesh
The output is:
[(Mahesh,1), (Mahesh,1), (Ganesh,1), (Ashok,1), (Abnave,1), (Ganesh,1), (Mahesh,1)]
Reduced data: [(Ganesh,2), (Abnave,1), (Mahesh,3), (Ashok,1)]
While I understand how the first line of output is obtained, I don't understand how the second line is obtained, that is, how the JavaPairRDD<String, Integer> counts is formed by reduceByKey().
I found that the signature of reduceByKey() is as follows:
public JavaPairRDD<K,V> reduceByKey(Function2<V,V,V> func)
The [signature](http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/api/java/function/Function2.html#call(T1, T2)) of Function2.call() is as follows:
R call(T1 v1, T2 v2) throws Exception
The explanation of reduceByKey() reads as follows:
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/ parallelism level.
Now this explanation sounds somewhat confusing to me. Maybe there is something more to the functionality of reduceByKey(). By looking at the input and output of reduceByKey() and Function2.call(), I feel that reduceByKey() somehow sends values of the same key to call() in pairs. But that simply does not sound clear. Can anyone explain precisely how reduceByKey() and Function2.call() work together?
As its name implies, reduceByKey() reduces data based on the lambda function you pass to it.
In your example, this function is a simple adder: for a and b, return a + b.
The best way to understand how the result is formed is to imagine what happens internally. The ByKey() part groups your records based on their key values. In your example, you'll have 4 different sets of pairs:
Set 1: ((Mahesh, 1), (Mahesh, 1), (Mahesh, 1))
Set 2: ((Ganesh, 1), (Ganesh, 1))
Set 3: ((Ashok, 1))
Set 4: ((Abnave, 1))
Now, the reduce part will try to reduce the previous 4 sets using the lambda function (the adder):
For Set 1: (Mahesh, 1 + 1 + 1) -> (Mahesh, 3)
For Set 2: (Ganesh, 1 + 1) -> (Ganesh, 2)
For Set 3: (Ashok, 1) -> (Ashok, 1) (nothing to add)
For Set 4: (Abnave, 1) -> (Abnave, 1) (nothing to add)
Function signatures can sometimes be confusing, as they tend to be quite generic.
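For instance, the generic Function2<V, V, V> in the signature above is, in the question's example, nothing more than an Integer adder; spelled out without the lambda shorthand it would look roughly like this:

Function2<Integer, Integer, Integer> adder = new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer a, Integer b) {
        return a + b;  // merges two partial counts for the same key into one
    }
};
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(adder);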
I'm thinking that you probably understand groupByKey? groupByKey groups all values for a certain key into a list (or iterable) so that you can do something with that - like, say, sum (or count) the values. Basically, what sum does is to reduce a list of many values into a single value. It does so by iteratively adding two values to yield one value and that is what Function2 needs to do when you write your own. It needs to take in two values and return one value.
reduceByKey does the same as groupByKey, BUT it does what is called a "map-side reduce" before shuffling data around. Because Spark distributes data across many different machines to allow for parallel processing, there is no guarantee that data with the same key is placed on the same machine. Spark thus has to shuffle data around, and the more data that needs to be shuffled the longer our computations will take, so it's a good idea to shuffle as little data as possible.
In a map-side reduce, Spark will first sum all the values for a given key locally on the executors before it sends (shuffles) the result around for the final sum to be computed. This means that much less data - a single value instead of a list of values - needs to be sent between the different machines in the cluster, and for this reason reduceByKey is most often preferable to groupByKey.
For a more detailed description, I can recommend this article :)
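To make the difference concrete, here is a rough sketch of the two approaches using the pairs RDD from the question; both produce the same counts, but the groupByKey version ships every (name, 1) pair across the network before summing, while reduceByKey sums locally first:

// groupByKey: shuffle all values for a key, then sum them on the reducer side
JavaPairRDD<String, Integer> viaGroup = pairs
        .groupByKey()                      // JavaPairRDD<String, Iterable<Integer>>
        .mapValues(values -> {
            int sum = 0;
            for (Integer v : values) sum += v;
            return sum;
        });

// reduceByKey: pre-sum on each partition (map-side reduce), then merge the partial sums
JavaPairRDD<String, Integer> viaReduce = pairs.reduceByKey((a, b) -> a + b);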
The idea is that when I call publishSubject.onNext(someValue) multiple times, I need to get only one value, like the debounce operator does, but debounce delivers the last value, and I need to skip all values except the first in a burst, until I stop calling onNext() for 1 second.
I've tried to use something like throttleFirst(1000, TimeUnit.MILLISECONDS), but it doesn't work like debounce: it just opens a window after every delivery and, after 1 second, immediately delivers the next value.
Try this:
// Observable<T> stream = ...;
stream.window(stream.debounce(1, TimeUnit.SECONDS))
      .flatMap(w -> w.take(1));
Explanation: if I understand you correctly, you want to emit an item only if none has been emitted in the 1 second prior. This is equivalent to getting the first element following an item debounced by 1 second.
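Applied to the PublishSubject from the question, that might look roughly like this (RxJava 2 style; the element type Integer is just an example):

PublishSubject<Integer> publishSubject = PublishSubject.create();

Observable<Integer> firstOfEachBurst = publishSubject
        // debounce() fires once the subject has been quiet for 1 second,
        // which closes the current window and opens the next one
        .window(publishSubject.debounce(1, TimeUnit.SECONDS))
        // keep only the first item of each window, i.e. of each burst
        .flatMap(window -> window.take(1));

firstOfEachBurst.subscribe(value -> System.out.println("delivered: " + value));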
You can use the first operator. Like:
Observable.first()
It will take only the first value
I have a requirement where my mapper may, in some cases, produce a new key/value pair for another mapper to handle. Is there a sane way to do this? I've thought about writing my own custom input format (a queue?) to achieve this. Any ideas? Thanks!
EDIT: I should clarify
Method 1
Map Step 1:
(foo1, bar1) -> out1
(foo2, bar2) -> out2
(foo3, bar3) -> (fooA, barA), (fooB, barB)
(foo4, bar4) -> (fooC, barC)
Reduction Step 1:
(out1) -> ok
(out2) -> ok
((fooA, barA), (fooB, barB)) -> create Map Step 2
((fooC, barC)) -> also send this to Map Step 2
Map Step 2:
(fooA, barA) -> out3
(fooB, barB) -> (fooD, barD)
(fooC, barC) -> out4
Reduction Step 2:
(out3) -> ok
((fooD, barD)) -> create Map Step 3
(out4) -> ok
Map Step 3:
(fooD, barD) -> out5
Reduction Step 3:
(out5) -> ok
-- no more map steps. finished --
So it's fully recursive. Some key/values emit output for reduction, some generate new key/values for mapping. I don't really know how many map or reduce steps I may encounter on a given run.
Method 2
Map Step 1:
(foo1, bar1) -> out1
(foo2, bar2) -> out2
(foo3, bar3) -> (fooA, barA), (fooB, barB)
(foo4, bar4) -> (fooC, barC)
(fooA, barA) -> out3
(fooB, barB) -> (fooD, barD)
(fooC, barC) -> out4
(fooD, barD) -> out5
Reduction Step 1:
(out1) -> ok
(out2) -> ok
(out3) -> ok
(out4) -> ok
(out5) -> ok
This method would have the mapper feed its own input list. I'm not sure which way would be simpler to implement in the end.
The "Method 1" way of doing recursion through Hadoop forces you to run the full dataset through both Map and reduce for each "recursion depth". This implies that you must be sure how deep this can go AND you'll suffer a massive performance impact.
Can you say for certain that the recursion depth is limited?
If so then I would definitely go for "Method 2" and actually build the mapper in such a way that does the required recursion within one mapper call.
It's simpler and saves you a lot of performance.
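A rough sketch of what such a mapper could look like; Pair, Result, parse() and process() are hypothetical placeholders for your own types and logic:

public class RecursiveMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        // Local work queue: derived pairs are pushed back and handled in the same map() call.
        Deque<Pair> queue = new ArrayDeque<>();
        queue.push(parse(value));                    // the original input record

        while (!queue.isEmpty()) {
            Result result = process(queue.pop());    // hypothetical domain logic
            if (result.isFinal()) {
                context.write(new Text(result.key()), new Text(result.value()));
            } else {
                queue.addAll(result.derivedPairs()); // e.g. (fooA, barA), (fooB, barB)
            }
        }
    }
}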
Use Oozie [a grid workflow definition language] to string together two M/R jobs, with the first one having only a mapper.
http://yahoo.github.com/oozie
To the best of my understanding, the Hadoop MR framework plans at the beginning of a job which map tasks should be executed, and it is not prepared for new map tasks to appear dynamically.
I would suggest two possible solutions:
a) If you emit additional pairs during the map phase, feed them back to the same mapper. The mapper takes its usual arguments and, after processing them, looks into some kind of internal local queue for additional pairs to process. This works well if the sets of secondary pairs are small and data locality is not that important.
b) If you are indeed processing directories or something similar, you can iterate over the structure in the main method of the job and build all the splits you need right away.
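As a rough illustration of (b), assuming the "recursion" is really just walking a directory tree that is already known when the job is configured (the /data/root path is a placeholder):

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "recursive-input");

// Enumerate the whole structure up front and register every file as input,
// so all splits exist before the job starts and no map tasks have to appear later.
FileSystem fs = FileSystem.get(conf);
RemoteIterator<LocatedFileStatus> files = fs.listFiles(new Path("/data/root"), true); // recursive listing
while (files.hasNext()) {
    FileInputFormat.addInputPath(job, files.next().getPath());
}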