Hadoop Recursive Map - java

I have a requirement that my mapper may in some cases produce a new key/value for another mapper to handle. Is there a sane way to do this? I've thought about writing my own custom input format (queue?) to achieve this. Any ideas? Thanks!
EDIT: I should clarify
Method 1
Map Step 1
(foo1, bar1) -> out1
(foo2, bar2) -> out2
(foo3, bar3) -> (fooA, barA), (fooB, barB)
(foo4, bar4) -> (fooC, barC)
Reduction Step 1:
(out1) -> ok
(out2) -> ok
((fooA, barA), (fooB, barB)) -> create Map Step 2
((fooC, barC)) -> also send this to Map Step 2
Map Step 2:
(fooA, barA) -> out3
(fooB, barB) -> (fooD, barD)
(fooC, barC) -> out4
Reduction Step 2:
(out3) -> ok
((fooD, barD)) -> create Map Step 3
(out4) -> ok
Map Step 3:
(fooD, barD) -> out5
Reduction Step 3:
(out5) -> ok
-- no more map steps. finished --
So it's fully recursive. Some key/values emit output for reduction, some generate new key/values for mapping. I don't really know how many Map or Reduction steps I may encounter on a given run.
Method 2
Map Step 1
(foo1, bar1) -> out1
(foo2, bar2) -> out2
(foo3, bar3) -> (fooA, barA), (fooB, barB)
(foo4, bar4) -> (fooC, barC)
(fooA, barA) -> out3
(fooB, barB) -> (fooD, barD)
(fooC, barC) -> out4
(fooD, barD) -> out5
Reduction Step 1:
(out1) -> ok
(out2) -> ok
(out3) -> ok
(out4) -> ok
(out5) -> ok
This method would have the mapper feed its own input list. I'm not sure which way would be simpler to implement in the end.

The "Method 1" way of doing recursion through Hadoop forces you to run the full dataset through both Map and reduce for each "recursion depth". This implies that you must be sure how deep this can go AND you'll suffer a massive performance impact.
Can you say for certain that the recursion depth is limited?
If so then I would definitely go for "Method 2" and actually build the mapper in such a way that does the required recursion within one mapper call.
It's simpler and saves you a lot of performance.
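A minimal sketch of what that can look like, assuming Text keys/values; expand() is a hypothetical stand-in for your own logic, and for brevity a pair that expands to nothing is simply emitted as-is (your real code would emit its computed output instead):

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecursiveMapper extends Mapper<Text, Text, Text, Text> {

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // Local work queue: the mapper feeds its own input list instead of
        // asking the framework for another map step.
        Deque<String[]> work = new ArrayDeque<>();
        work.push(new String[] { key.toString(), value.toString() });

        while (!work.isEmpty()) {
            String[] kv = work.pop();
            List<String[]> generated = expand(kv[0], kv[1]);
            if (generated.isEmpty()) {
                // Nothing new produced: this pair is final output for the reducer.
                context.write(new Text(kv[0]), new Text(kv[1]));
            } else {
                // New (key, value) pairs: handle them within this same map() call.
                generated.forEach(work::push);
            }
        }
    }

    // Hypothetical: returns the new pairs a (key, value) expands into,
    // or an empty list if it should be emitted as-is.
    private List<String[]> expand(String key, String value) {
        return List.of();
    }
}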

Use Oozie [a grid workflow definition language] to string together two M/R jobs, with the first one having only a mapper.
http://yahoo.github.com/oozie

To the best of my understanding, the Hadoop MR framework plans at the beginning of the job which map tasks should be executed, and it is not prepared for new map tasks to appear dynamically.
I would suggest two possible solutions:
a) If you emit additional pairs during the map phase, feed them to the same mapper. The mapper takes its usual arguments and, after processing them, looks into some kind of internal local queue for additional pairs to process. This works well if the sets of secondary pairs are small and data locality is not that important.
b) If you are indeed processing directories or something similar, you can iterate over the structure in the main of the job package and build all the splits you need right away.
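Either way, because map tasks are planned up front, an unbounded recursion depth has to be handled by chaining whole jobs from the driver. A rough, hedged sketch of such a driver loop (the mapper class, the counter name "NEEDS_ANOTHER_PASS", and the paths are hypothetical; the mapper or reducer would bump that counter whenever it writes a pair that still needs another map pass):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RecursiveDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String input = args[0];
        long pending;
        int pass = 0;
        do {
            String output = args[1] + "/pass-" + pass;
            Job job = Job.getInstance(conf, "recursive-pass-" + pass);
            job.setJarByClass(RecursiveDriver.class);
            job.setMapperClass(RecursiveMapper.class);   // e.g. a mapper like the sketch above
            // reducer, key/value classes and input/output formats omitted for brevity
            FileInputFormat.addInputPath(job, new Path(input));
            FileOutputFormat.setOutputPath(job, new Path(output));
            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }
            // Hypothetical counter: how many pairs still need another map pass.
            pending = job.getCounters()
                    .findCounter("recursion", "NEEDS_ANOTHER_PASS")
                    .getValue();
            input = output;   // the next pass consumes this pass's output
            pass++;
        } while (pending > 0);
    }
}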

Related

How to parallelize sequential stream in Java

I need to build a sort of parallel pipeline. The pipeline consists of 3 steps:
readLines(): read 100.000 lines from a file
firstQuery(): build and send query to DB using 100.000 lines from readLines()
secondQuery(): build and send another query using firstQuery() results
readLines() takes 1 second, returns Stream<List<Line>>
firstQuery() takes 10 seconds, returns Stream<List<FirstResult>>
secondQuery() takes 2 minutes, returns Stream<List<SecondResult>>
No matter what I do with Executors and futures, all methods are executed sequentially.
I suppose it's because of readLines(), which returns Stream<List<Line>>, but it can't be parallelized since reading from the file is sequential.
What is the right way to do something like this:
readLines() // limit concurrency to 2 (no more than 2 x 100.000 lines in memory at the same time)
// as soon as lines are read from file, pass them to async
1: List<Line> -> firstQuery() -> secondQuery()
2: List<Line> -> firstQuery() -> secondQuery()
3: List<Line> -> firstQuery() -> secondQuery()
4: List<Line> -> firstQuery() -> secondQuery()
... more here
10: List<Line> -> firstQuery() -> secondQuery()
How could it look like?
What I've tried
add parallel to readLines(), which returns Stream<List<Line>>: no luck
add parallel and custom executors with Futures to firstQuery() -> secondQuery(): no luck, still sequential
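A rough sketch of one way this could be wired (an assumption, not a drop-in answer: it supposes firstQuery and secondQuery can be called once per batch, which is not quite what the original stream-returning signatures say). The reading stream stays sequential, a semaphore caps how many batches are in memory, and each batch's firstQuery -> secondQuery chain runs asynchronously:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

ExecutorService pool = Executors.newFixedThreadPool(8);
Semaphore inFlight = new Semaphore(2);                 // at most 2 batches in memory at once

readLines().forEach(batch -> {                         // the Stream<List<Line>> itself stays sequential
    try {
        inFlight.acquire();                            // pause reading while 2 batches are in flight
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
    }
    CompletableFuture
            .supplyAsync(() -> firstQuery(batch), pool)          // ~10 s, off the reading thread
            .thenApplyAsync(first -> secondQuery(first), pool)   // ~2 min, pipelined per batch
            .whenComplete((result, error) -> inFlight.release()); // free a slot either way
});

pool.shutdown();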

How to handle concurrency in API?

Context:
I have a storage/DB where I store a <String, List<Integer>> -> <uniqueName, numbersPresentForTheName> mapping each time:
I am writing an API that does the following:
How can we make sure this does not happen?
The usual answer is that you use a compare-and-update operation, rather than just an update operation.
t = 1 -> call readStorage
t = 10 -> get response from readStorage <ABCD, [1,2,3]>
t = 11 -> compareAndUpdateStorage ABCD, [1,2,3] -> [1,2,3,4]
t = 5 -> Request 1 to the API: <ABCD, [5]>
t = 6 -> call readStorage
t = 11 -> get response from readStorage <ABCD, [1,2,3]> (note that we didn't get 4, as the first call hasn't updated it yet)
t = 13 -> compareAndUpdateStorage ABCD, [1,2,3] -> [1,2,3,5] (This call will fail, because the current value is [1,2,3,4])
In other words, we're trying to address the lost edit problem by ensuring that the first edit always wins.
That's the first piece; the rest of the work is choosing an appropriate retry strategy.
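To make the compare-and-update plus retry idea concrete, here is a minimal in-memory sketch in Java (all names are hypothetical; a real store would use a version column, an ETag, or a conditional write instead of an AtomicReference, and these lines would live inside whatever class wraps the storage):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// In-memory stand-in for the stored List<Integer> of one uniqueName.
AtomicReference<List<Integer>> stored = new AtomicReference<>(List.of(1, 2, 3));

void addNumber(int number) {
    while (true) {                                    // retry strategy: try again until our edit wins
        List<Integer> current = stored.get();         // read
        List<Integer> updated = new ArrayList<>(current);
        updated.add(number);                          // modify a copy
        if (stored.compareAndSet(current, updated)) { // compare-and-update
            return;                                   // nobody changed the value in between: we won
        }
        // Someone else updated in between: loop, re-read, and try again.
    }
}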
If your two calls are on separate threads, and the steps of the two calls are interleaved over time, then you are seeing correct behavior.
If both threads retrieved a set of three values, and both threads replaced the existing set in storage with a new set of four values, then whichever thread saves their set last wins.
The key idea here to understand is atomic actions. The retrieval of the data and the update of the data in your system are separate actions. To achieve your aim, you must combine those two actions into one. This combining is the purpose of transactions and locks in databases.
See also the correct Answer by VoiceOfUnreason along the same line.
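As an in-process illustration of "making the two actions one": ConcurrentHashMap.compute() runs the read-modify-write for a key atomically. This is only a sketch of the idea; the real storage/DB would need its own transactional or conditional-update primitive, and the names here are made up:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// In-process stand-in for the <uniqueName, numbersPresentForTheName> storage.
ConcurrentHashMap<String, List<Integer>> storage = new ConcurrentHashMap<>();

void addNumbers(String name, List<Integer> numbers) {
    storage.compute(name, (key, existing) -> {
        // Read and update run as one atomic action for this key: no other thread
        // can interleave between "get the current list" and "store the new list".
        List<Integer> updated = existing == null ? new ArrayList<>() : new ArrayList<>(existing);
        updated.addAll(numbers);
        return updated;
    });
}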

Reactor Flux: how to parse file with header

Recently I started using Project Reactor 3.3 and I don't know the best way to handle a flux of lines, where the first line contains the column names that are then used to process/convert all the other lines. Right now I'm doing it this way:
Flux<String> lines = ....;
Mono<String[]> columns = Mono.from(lines.take(1).map(header -> header.split(";"))); //getting first line
Flux<SomeDto> objectFlux = lines.skip(1)                  // skip first line
        .flatMapIterable(row ->                           // iterating over lines
                columns.map(cols -> convert(cols, row))); // convert line into SomeDto object
So is it the right way?
There's always more than one way to cook an egg - but the code you have there seems odd / suboptimal for two main reasons:
I'd assume it's one line per record / DTO you want to extract, so it's a bit odd you're using flatMapIterable() rather than flatMap()
You're going to resubscribe to lines once for each line, when you re-evaluate that Mono. That's almost certainly not what you want to do. (Caching the Mono helps, but you'd still resubscribe at least twice.)
Instead you may want to look at using switchOnFirst(), which will enable you to dynamically transform the Flux based on the first element (the header in your case.) This means you can do something like so:
lines
.switchOnFirst((signal, flux) -> flux.zipWith(Flux.<String[]>just(signal.get().split(";")).repeat()))
.map(row -> convert(row.getT1(), row.getT2()))
Note this is a bare-bones example; in real-world use you'll need to check whether the signal actually has a value, as per the docs:
Note that the source might complete or error immediately instead of emitting, in which case the Signal would be onComplete or onError. It is NOT necessarily an onNext Signal, and must be checked accordingly.
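For completeness, a slightly fuller (still hedged) variant that does that check, keeping the question's SomeDto and convert(cols, row) names and dropping the header row before converting:

Flux<SomeDto> objectFlux = lines.switchOnFirst((signal, flux) -> {
    if (!signal.hasValue()) {
        // Source completed or errored before emitting a header: nothing to convert;
        // an error still propagates through flux.
        return flux.ofType(SomeDto.class);
    }
    String[] cols = signal.get().split(";");
    // flux re-emits the header as its first element, so skip it before converting.
    return flux.skip(1).map(row -> convert(cols, row));
});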

Understanding JavaPairRDD.reduceByKey function

I came across the following code snippet using Apache Spark:
JavaRDD<String> lines = new JavaSparkContext(sparkSession.sparkContext()).textFile("src\\main\\resources\\data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2<>(s, 1));
System.out.println(pairs.collect());
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
System.out.println("Reduced data: " + counts.collect());
My data.txt is as follows:
Mahesh
Mahesh
Ganesh
Ashok
Abnave
Ganesh
Mahesh
The output is:
[(Mahesh,1), (Mahesh,1), (Ganesh,1), (Ashok,1), (Abnave,1), (Ganesh,1), (Mahesh,1)]
Reduced data: [(Ganesh,2), (Abnave,1), (Mahesh,3), (Ashok,1)]
While I understand how the first line of output is obtained, I don't understand how the second line is obtained, that is, how the JavaPairRDD<String, Integer> counts is formed by reduceByKey().
I found that the signature of reduceByKey() is as follows:
public JavaPairRDD<K,V> reduceByKey(Function2<V,V,V> func)
The [signature](http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/api/java/function/Function2.html#call(T1, T2)) of Function2.call() is as follows:
R call(T1 v1, T2 v2) throws Exception
The explanation of reduceByKey() reads as follows:
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/ parallelism level.
Now this explanation sounds somewhat confusing to me. Maybe there is something more to the functionality of reduceByKey(). By looking at the input and output of reduceByKey() and Function2.call(), I feel that reduceByKey() somehow sends values of the same key to call() in pairs, but that is still not clear to me. Can anyone explain precisely how reduceByKey() and Function2.call() work together?
As its name implies, reduceByKey() reduces data based on the lambda function you pass to it.
In your example, this function is a simple adder: for a and b, return a + b.
The best way to understand how the result is formed is to imagine what happens internally. The ByKey() part groups your records based on their key values. In your example, you'll have 4 different sets of pairs:
Set 1: ((Mahesh, 1), (Mahesh, 1), (Mahesh, 1))
Set 2: ((Ganesh, 1), (Ganesh, 1))
Set 3: ((Ashok, 1))
Set 4: ((Abnave, 1))
Now, the reduce part will try to reduce the previous 4 sets using the lambda function (the adder):
For Set 1: (Mahesh, 1 + 1 + 1) -> (Mahesh, 3)
For Set 2: (Ganesh, 1 + 1) -> (Ganesh, 2)
For Set 3: (Ashok , 1) -> (Ashok, 1) (nothing to add)
For Set 4: (Abnave, 1) -> (Abnave, 1) (nothing to add)
Function signatures can sometimes be confusing as they tend to be more generic.
I'm thinking that you probably understand groupByKey? groupByKey groups all values for a certain key into a list (or iterable) so that you can do something with that - like, say, sum (or count) the values. Basically, what sum does is to reduce a list of many values into a single value. It does so by iteratively adding two values to yield one value and that is what Function2 needs to do when you write your own. It needs to take in two values and return one value.
reduceByKey does the same as groupByKey, BUT it does what is called a "map-side reduce" before shuffling data around. Because Spark distributes data across many different machines to allow for parallel processing, there is no guarantee that data with the same key is placed on the same machine. Spark thus has to shuffle data around, and the more data that needs to be shuffled the longer our computations will take, so it's a good idea to shuffle as little data as possible.
In a map-side reduce, Spark will first sum all the values for a given key locally on the executors before it sends (shuffles) the result around for the final sum to be computed. This means that much less data - a single value instead of a list of values - needs to be sent between the different machines in the cluster, and for this reason reduceByKey is most often preferable to groupByKey.
For a more detailed description, I can recommend this article :)
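To see that relationship in the question's own Java API terms, here is a small hedged comparison building on the pairs RDD from the question; both produce the same counts, but reduceByKey ships far less data during the shuffle:

// groupByKey: ships every (key, 1) pair across the network, then sums per key
JavaPairRDD<String, Integer> viaGroup = pairs.groupByKey().mapValues(values -> {
    int sum = 0;
    for (Integer v : values) {
        sum += v;
    }
    return sum;
});

// reduceByKey: sums locally on each partition first, then ships one value per key
JavaPairRDD<String, Integer> viaReduce = pairs.reduceByKey((a, b) -> a + b);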

Difference between combinebykey and aggregatebykey

I am very new to Apache Spark, so this question might not be well posed, but I am not getting the difference between combineByKey and aggregateByKey and when to use which operation.
aggregateByKey takes an initial accumulator, a first lambda function to merge a value to an accumulator and a second lambda function to merge two accumulators.
combineByKey is more general and adds an initial lambda function to create the initial accumulator.
Here is an example:
val pairs = sc.parallelize(List(("prova", 1), ("ciao", 2),
  ("prova", 2), ("ciao", 4),
  ("prova", 3), ("ciao", 6)))

pairs.aggregateByKey(List[Any]())(
  (aggr, value) => aggr ::: (value :: Nil),
  (aggr1, aggr2) => aggr1 ::: aggr2
).collect().toMap

pairs.combineByKey(
  (value) => List(value),
  (aggr: List[Any], value) => aggr ::: (value :: Nil),
  (aggr1: List[Any], aggr2: List[Any]) => aggr1 ::: aggr2
).collect().toMap
combineByKey is more general than aggregateByKey. Actually, the implementation of aggregateByKey, reduceByKey and groupByKey is achieved by combineByKey. aggregateByKey is similar to reduceByKey, but you can provide initial values when performing the aggregation.
As the name suggests, aggregateByKey is suitable for computing aggregations per key, such as sum, avg, etc. The rule here is that the extra computation spent on the map-side combine should reduce the size of the data sent out to other nodes and the driver. If your function satisfies this rule, you probably should use aggregateByKey.
combineByKey is more general, and you have the flexibility to specify whether you'd like to perform a map-side combine. However, it is more complex to use. At a minimum, you need to implement three functions: createCombiner, mergeValue, and mergeCombiners.
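Since the rest of this thread uses the Java API, here is a rough Java equivalent of the list-building combineByKey call above, spelling out the three functions by name (a sketch, assuming a JavaPairRDD<String, Integer> pairs like the one in the reduceByKey question):

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;

JavaPairRDD<String, List<Integer>> combined = pairs.combineByKey(
    value -> {                                    // createCombiner: first value seen for a key
        List<Integer> list = new ArrayList<>();
        list.add(value);
        return list;
    },
    (List<Integer> acc, Integer value) -> {       // mergeValue: fold another value into the accumulator
        acc.add(value);
        return acc;
    },
    (List<Integer> acc1, List<Integer> acc2) -> { // mergeCombiners: merge accumulators from partitions
        acc1.addAll(acc2);
        return acc1;
    }
);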
