Aligning number of elements in partitions in Java Apache Spark

I have two JavaRDD<Double> called rdd1 and rdd2 over which I'd like to evaluate some correlation, e.g. with Statistics.corr(). The two RDDs are built through many transformations and actions, but at the end of the process they both have the same number of elements. I know that two conditions must be met in order to evaluate the correlation, and they are related (as far as I understood) to the zip method used inside the correlation function. The conditions are:
The RDDs must be split over the same number of partitions
Every partition must have the same number of elements
Moreover, according to the Spark documentation, I'm using methods over the RDDs which preserve ordering, so that the final correlation will be correct (a wrong ordering wouldn't raise any exception). Now, the problem is that even if I'm able to keep the number of partitions consistent, for example with the code
JavaRDD<Double> rdd1Repartitioned = rdd1.repartition(rdd2.getNumPartitions());
what I don't know how to do (and what is giving me exceptions) is control the number of entries in every partition. I found a workaround that works for now, which is re-initializing the two RDDs I want to correlate:
// Collect both RDDs to the driver...
List<Double> rdd1Array = rdd1.collect();
List<Double> rdd2Array = rdd2.collect();
// ...and re-parallelize them so they share the default partitioning
JavaRDD<Double> newRdd1 = sc.parallelize(rdd1Array);
JavaRDD<Double> newRdd2 = sc.parallelize(rdd2Array);
but I'm not sure this guarantees anything about consistency, and it might be computationally expensive in some situations. Is there a way to control the number of elements in each partition, or, more generally, to realign the partitions of two or more RDDs (I know roughly how the partitioning system works, and I understand that this might be complicated from the distribution point of view)?

Ok, this worked for me:
Statistics.corr(rdd1.repartition(8), rdd2.repartition(8))
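For reference, a minimal sketch of the repartition-then-correlate approach (the partition count of 8 is an assumption, adjust it to your data and cluster; note that repartition shuffles the data):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.stat.Statistics;

// Repartitioning both RDDs to the same partition count lets the zip
// inside Statistics.corr pair the partitions up.
int numPartitions = 8; // assumption: tune to your cluster
JavaRDD<Double> x = rdd1.repartition(numPartitions);
JavaRDD<Double> y = rdd2.repartition(numPartitions);
double corr = Statistics.corr(x, y); // Pearson correlation by default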

Related

Apache Beam count of unique elements

I have an Apache Beam job which ingests data from PubSub and then loads it into BigQuery.
I transform each PubSub message into a POJO with the fields id, name, and count.
Here count means the number of non-unique elements in a single ingest.
If I load 3 elements from PubSub, two of which are the same, then I need to load 2 elements into BigQuery, and one of them will have count 2.
I wonder how to do this easily in Apache Beam.
I tried to do it via DoFn or MapElements, but there I can process only a single element.
I also tried to convert each element to a KV and then count, but I get a non-deterministic coder.
In a usual Java app I could simply use equals or a Map, but here in Apache Beam everything is different.
The simple and right approach would be to use Count.<T>perElement(), like this:
Pipeline p = ...;
PCollection<T> elements = p.apply(...); // read elements
PCollection<KV<T, Long>> elementsCounts =
elements.apply(Count.<T>perElement());
PCollection<TableRow> results = elementsCounts.apply(ParDo.of(
new FormatOutputFn()));
Though, right, you need a deterministic element coder for that. So if that's not the case (as I understand from what you said above), you need to add a step before Count that transforms each element into a different representation for which a deterministic coder is possible (AvroCoder, for example).
If that's not possible for some reason, another workaround could be to calculate a unique hash for every element (the hash value must be deterministic as well), create a KV for every element with the hash as the key and the element as the value, and use GroupByKey downstream to get grouped tuples of the same values.
Also, please note that since PubSub is an unbounded source, you need to "window" your input with some windowing strategy (other than the global one), since all your group/combine operations should be done inside a window. Take a look at WindowedWordCount as an example solution for a similar problem.
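As an illustration only, here is a minimal sketch of the windowing plus Count.perElement() pattern, counting the raw message strings (the topic name, window size and Beam 2.x APIs are assumptions; your POJO would additionally need a deterministic coder such as AvroCoder, as discussed above):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

// Strings have a deterministic coder, so Count.perElement() is safe here.
PCollection<String> messages =
    p.apply(PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"));

PCollection<KV<String, Long>> counts =
    messages
        // Unbounded input must be windowed before any grouping/combining.
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(Count.<String>perElement());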

special case of constraint satisfaction efficient solution

I'm trying to solve a special case of the general constraint satisfaction problem in Java.
Basically I have multiple variables, each one taking discrete values, and every variable is defined by the set of all possible values it can take (think of it like an enum in Java).
I also have multiple groups of conditions (think of a condition as a conjunction of equations on the variables; they are all unary constraints, in other words of the form variable = value). The goal is to find whether there is an assignment of values to the variables that satisfies at least one condition from each group (it might satisfy several conditions from the same group). I will call such an assignment a solution. What I'm looking for is all possible solutions.
The only idea I have so far is basically brute force.
Here is a concrete example to make things clearer:
s = {a,b,c}, v = {1,2,3}, n = {p,k,m}.
First condition group:
c1 = {s=a and v=2}, c2 = {s=b}.
Second condition group:
c1={n=p and v=2}.
Third condition group:
c1={s=a and n=p}, c2 = {s=c}.
In this situation, if we take (s=a,v=2,n=p): it satisfies the first condition of all three groups, and is, therefore, a solution to the problem.
(s=b,v=2,n=p) however is not a solution, because it doesn't verify any of the third group's conditions. In fact, the number of possible solutions here is 1.
Please note that the conditions within a group are not necessarily mutually exclusive.
Any insight into a way to do this more efficiently than brute force, be it a data structure or an algorithm, would be great, since I will have to solve millions of such systems with quite a number of variables (thirty variables tops, around 15 values each, and a hundred such conditions tops).
Edit1: Data Constraints
If N is the number of variables each problem will have then N<=30.
If |Vi| is the number of possible values of a variable Vi, then I know that max(|Vi|) <= 15 for every variable Vi in a problem.
I also know that if C is the number of constraints per problem, then C<100.
Lastly, I know that, statistically speaking, the number of solutions per problem will be small: most problems will have a single solution, and more than 99% of the time there will be no more than 8 solutions. For the sake of optimization, we can even assume that I'm never interested in any problem that has more than 10 solutions.
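To pin down the semantics, here is a minimal brute-force sketch over the concrete example above (this is just the naive enumeration the question wants to improve on; it assumes Java 9+ for List.of / Map.of):

import java.util.*;

public class BruteForceCsp {
    public static void main(String[] args) {
        // Domains from the example: s = {a,b,c}, v = {1,2,3}, n = {p,k,m}
        Map<String, List<String>> domains = new LinkedHashMap<>();
        domains.put("s", List.of("a", "b", "c"));
        domains.put("v", List.of("1", "2", "3"));
        domains.put("n", List.of("p", "k", "m"));

        // Each group is a disjunction of conditions; each condition is a
        // conjunction of unary constraints (variable -> required value).
        List<List<Map<String, String>>> groups = List.of(
            List.of(Map.of("s", "a", "v", "2"), Map.of("s", "b")),
            List.of(Map.of("n", "p", "v", "2")),
            List.of(Map.of("s", "a", "n", "p"), Map.of("s", "c")));

        List<String> vars = new ArrayList<>(domains.keySet());
        enumerate(vars, 0, new LinkedHashMap<>(), domains, groups);
    }

    // Enumerate every assignment and keep those satisfying at least one
    // condition from each group; here it prints the single solution s=a, v=2, n=p.
    static void enumerate(List<String> vars, int i, Map<String, String> assignment,
                          Map<String, List<String>> domains,
                          List<List<Map<String, String>>> groups) {
        if (i == vars.size()) {
            boolean ok = groups.stream().allMatch(group -> group.stream()
                .anyMatch(cond -> cond.entrySet().stream()
                    .allMatch(e -> e.getValue().equals(assignment.get(e.getKey())))));
            if (ok) {
                System.out.println(assignment);
            }
            return;
        }
        for (String value : domains.get(vars.get(i))) {
            assignment.put(vars.get(i), value);
            enumerate(vars, i + 1, assignment, domains, groups);
        }
        assignment.remove(vars.get(i));
    }
}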

spark - How to reduce the shuffle size of a JavaPairRDD<Integer, Integer[]>?

I have a JavaPairRDD<Integer, Integer[]> on which I want to perform a groupByKey action.
The groupByKey action gives me a:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
which is practically an OutOfMemory error, if I am not mistaken. It only occurs with big datasets (in my case, when the "Shuffle Write" shown in the Web UI is ~96GB).
I have set:
spark.serializer org.apache.spark.serializer.KryoSerializer
in $SPARK_HOME/conf/spark-defaults.conf, but I am not sure if Kryo is used to serialize my JavaPairRDD.
Is there something else that I should do to use Kryo, apart from setting this conf parameter, to serialize my RDD? I can see in the serialization instructions that:
Spark automatically includes Kryo serializers for the many commonly-used core Scala classes covered in the AllScalaRegistrar from the Twitter chill library.
and that:
Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type.
I also noticed that when I set spark.serializer to be Kryo, the Shuffle Write in the Web UI increases from ~96GB (with default serializer) to 243GB!
EDIT: In a comment, I was asked about the logic of my program, in case groupByKey can be replaced with reduceByKey. I don't think it's possible, but here it is anyway:
Input has the form:
key: index bucket id,
value: Integer array of the entity ids in this bucket
The shuffle write operation produces pairs of the form:
key: entityId,
value: Integer array of all entity ids in the same bucket (call them neighbors)
The groupByKey operation gathers all the neighbor arrays of each entity, some possibly appearing more than once (in many buckets).
After the groupByKey operation, I keep a weight for each bucket (based on the number of negative entity ids it contains) and for each neighbor id I sum up the weights of the buckets it belongs to.
I normalize the scores of each neighbor id with another value (let's say it's given) and emit the top-3 neighbors per entity.
The number of distinct keys that I get is around 10 million (around 5 million positive entity ids and 5 million negatives).
EDIT2: I tried using Hadoop's Writables (VIntWritable and VIntArrayWritable, the latter extending ArrayWritable) instead of Integer and Integer[], respectively, but the shuffle size was still bigger than with the default JavaSerializer.
Then I increased spark.shuffle.memoryFraction from 0.2 to 0.4 (even though it is deprecated since version 2.1.0, there is no description of what should be used instead) and enabled off-heap memory, and the shuffle size was reduced by ~20GB. Even though this does what the title asks, I would prefer a more algorithmic solution, or one that includes better compression.
Short Answer: Use fastutil and maybe increase spark.shuffle.memoryFraction.
More details:
The problem with this RDD is that Java needs to store object references, which consume much more space than primitive types. In this example, I need to store Integers instead of int values. A Java Integer object takes 16 bytes, while a primitive Java int takes 4 bytes. Scala's Int, on the other hand, compiles down to a 32-bit (4-byte) primitive like Java's int in many places, which may be why people using Scala are less likely to face something similar.
Apart from increasing spark.shuffle.memoryFraction to 0.4, another nice solution was to use the fastutil library, as suggested in Spark's tuning documentation:
The first way to reduce memory consumption is to avoid the Java features that add overhead, such as pointer-based data structures and wrapper objects. There are several ways to do this: Design your data structures to prefer arrays of objects, and primitive types, instead of the standard Java or Scala collection classes (e.g. HashMap). The fastutil library provides convenient collection classes for primitive types that are compatible with the Java standard library.
This enables storing each element of the int arrays in my pair RDD as a primitive int (i.e., using 4 bytes instead of 16 per element). In my case, I used IntArrayList instead of Integer[]. This made the shuffle size drop significantly and allowed my program to run on the cluster. I also used this library in other parts of the code, where I was building some temporary Map structures. Overall, by increasing spark.shuffle.memoryFraction to 0.4 and using the fastutil library, the shuffle size dropped from 96GB to 50GB (!) using the default Java serializer (not Kryo).
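For illustration, a rough sketch of that swap (the RDD name buckets is a placeholder for the JavaPairRDD<Integer, Integer[]> from the question):

import it.unimi.dsi.fastutil.ints.IntArrayList;
import org.apache.spark.api.java.JavaPairRDD;

// Replace the boxed Integer[] values with primitive-backed IntArrayLists
// before the shuffle-heavy stage, so each id costs 4 bytes instead of 16.
JavaPairRDD<Integer, IntArrayList> compact =
    buckets.mapValues(ids -> {
        IntArrayList list = new IntArrayList(ids.length);
        for (Integer id : ids) {
            list.add(id.intValue());
        }
        return list;
    });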
Alternative: I have also tried sorting each int array of an RDD pair and storing the deltas using Hadoop's VIntArrayWritable type (smaller numbers use less space than bigger numbers), but this also required registering VIntWritable and VIntArrayWritable with Kryo, and it didn't save any space after all. In general, I think that Kryo only makes things faster and does not decrease the space needed, but I am still not sure about that.
I am not marking this answer as accepted yet, because someone else might have a better idea, and because I didn't use Kryo after all, as my original question was asking. I hope reading it will help someone else with the same issue. I will update this answer if I manage to further reduce the shuffle size.
Still not really sure what you want to do. However, the fact that you use groupByKey and say that there is no way to do it with reduceByKey makes me more confused.
I think you have rdd = (Integer, Integer[]) and you want something like (Integer, Iterable[Integer[]]); that's why you are using groupByKey.
Anyway, I am not really familiar with Java in Spark, but in Scala I would use reduceByKey to reduce the shuffle, with
rdd.mapValues(Iterable(_)).reduceByKey(_ ++ _). Basically, you want to convert each value to a list of arrays and then combine the lists together.
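A rough Java equivalent of that idea (just a sketch; it still shuffles, but reduceByKey combines map-side, and rdd here stands for the JavaPairRDD<Integer, Integer[]> from the question):

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;

// Wrap each Integer[] in a single-element list, then concatenate the lists per key.
JavaPairRDD<Integer, List<Integer[]>> grouped =
    rdd.mapValues(arr -> {
            List<Integer[]> one = new ArrayList<Integer[]>();
            one.add(arr);
            return one;
        })
       .reduceByKey((a, b) -> {
            List<Integer[]> merged = new ArrayList<>(a); // copy to avoid mutating inputs
            merged.addAll(b);
            return merged;
        });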
I think the best approach that can be recommended here in general (without more specific knowledge of the input data) is to use the persist API on your input RDD.
As step one, I'd try to call .persist(MEMORY_ONLY_SER) on the input RDD to lower memory usage (albeit with a certain CPU overhead, which shouldn't be much of a problem for ints in your case).
If that is not sufficient, you can try .persist(MEMORY_AND_DISK_SER); and if your shuffle still takes so much memory that the input dataset needs to be made easier on memory, .persist(DISK_ONLY) may be an option, but one that will strongly degrade performance.
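In Java that would look roughly like this (a sketch; buckets stands in for the input JavaPairRDD<Integer, Integer[]>):

import org.apache.spark.storage.StorageLevel;

// Serialized caching trades CPU time for memory.
buckets.persist(StorageLevel.MEMORY_ONLY_SER());

// If memory is still tight, spill serialized partitions to disk instead:
// buckets.persist(StorageLevel.MEMORY_AND_DISK_SER());
// As a last resort (much slower): buckets.persist(StorageLevel.DISK_ONLY());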

FilterOperator NOT IN

The com.google.appengine.api.datastore.Query.FilterOperator enum does not have a NOT_IN value. All other operations are possible (equal, not equal, and all the inequalities). Is it possible to create a FilterPredicate with that behaviour (e.g., "id", notIn(), new int[] { 3, 4, 7 }, where notIn() is something that makes the query return all entities except those whose ids are in the given list)? If not, how can I query the datastore like that? Something like negating the FilterPredicate, for example.
There isn't (as far as I know) server-side support for that type of query. Your best bet for simulating it client-side is to merge the results of three queries: one for elements below the minimum of the set, one for elements above the maximum of the set, and one for the [min..max] range, where you perform the NOT IN filtering in code on the client side.
(Added) You can run all three queries in parallel to save wall-clock time. A challenge will emerge if any of the queries returns a large enough number of entities to either blow memory or exceed time limits.
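A minimal sketch of that three-query approach with the low-level Datastore API (the kind MyKind, the property name id, and the Long-typed ids are placeholders):

import java.util.*;
import com.google.appengine.api.datastore.*;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;

DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Set<Long> excluded = new HashSet<>(Arrays.asList(3L, 4L, 7L));
long min = Collections.min(excluded), max = Collections.max(excluded);

List<Entity> results = new ArrayList<>();
// 1) everything strictly below the smallest excluded id
results.addAll(ds.prepare(new Query("MyKind")
        .setFilter(new FilterPredicate("id", FilterOperator.LESS_THAN, min)))
    .asList(FetchOptions.Builder.withDefaults()));
// 2) everything strictly above the largest excluded id
results.addAll(ds.prepare(new Query("MyKind")
        .setFilter(new FilterPredicate("id", FilterOperator.GREATER_THAN, max)))
    .asList(FetchOptions.Builder.withDefaults()));
// 3) the [min..max] range, dropping the excluded ids in client code
for (Entity e : ds.prepare(new Query("MyKind")
        .setFilter(Query.CompositeFilterOperator.and(
            new FilterPredicate("id", FilterOperator.GREATER_THAN_OR_EQUAL, min),
            new FilterPredicate("id", FilterOperator.LESS_THAN_OR_EQUAL, max))))
        .asIterable()) {
    if (!excluded.contains((Long) e.getProperty("id"))) {
        results.add(e);
    }
}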

Sorting large data using MapReduce/Hadoop

I am reading about MapReduce and the following thing is confusing me.
Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood to go about it is as follows:
Write a mapper function that sorts integers. The framework will divide the input file into multiple chunks and give them to different mappers. Each mapper will sort its chunk of data independently of the others. Once all the mappers are done, we will pass each of their results to the Reducer, which will combine the results and give me the final output.
My doubt is: if we have one reducer, how does this leverage the distributed framework if, eventually, we have to combine the result in one place? The problem boils down to merging 1 million entries in one place. Is that so, or am I missing something?
Thanks,
Chander
Check out merge-sort.
It turns out that merging already-sorted lists is much more efficient, in terms of operations and memory consumption, than sorting the complete list from scratch.
If the reducer gets 4 sorted lists, it only needs to look at the smallest element among the heads of the 4 lists and pick that one. If the number of lists is constant, this reducing is an O(N) operation.
Also, typically the reducers are "distributed" in something like a tree, so the work can be parallelized too.
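To make the reducer's job concrete, here is a minimal k-way merge sketch of what a single reducer conceptually does with the mappers' already-sorted outputs (illustrative only, not Hadoop code):

import java.util.*;

static List<Integer> mergeSorted(List<List<Integer>> sortedLists) {
    // Each heap entry is {value, listIndex, positionInList}, ordered by value.
    PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
    for (int i = 0; i < sortedLists.size(); i++) {
        if (!sortedLists.get(i).isEmpty()) {
            heap.add(new int[] {sortedLists.get(i).get(0), i, 0});
        }
    }
    List<Integer> merged = new ArrayList<>();
    while (!heap.isEmpty()) {
        int[] smallest = heap.poll();              // smallest head among all lists
        merged.add(smallest[0]);
        int list = smallest[1], next = smallest[2] + 1;
        if (next < sortedLists.get(list).size()) { // advance within that list
            heap.add(new int[] {sortedLists.get(list).get(next), list, next});
        }
    }
    return merged;                                 // O(N log k) for k input lists
}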
As others have mentioned, merging is much simpler than sorting, so there's a big win there.
However, doing an O(N) serial operation on a giant dataset can be prohibitive, too. As you correctly point out, it's better to find a way to do the merge in parallel, as well.
One way to do this is to replace the default partitioning function (normally a hash-based one) with something a bit smarter. What Pig does for this, for example, is sample your dataset to come up with a rough approximation of the distribution of your values, and then assign ranges of values to different reducers. Reducer 0 gets all elements < 1000, reducer 1 gets all elements >= 1000 and < 5000, and so on. Then you can do the merge in parallel, and the end result is sorted, because you know the order of the reducer tasks' outputs.
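A minimal sketch of that range idea as a Hadoop partitioner (the split points 1000 and 5000 echo the example above and would normally come from sampling the input; this is illustrative, not Pig's or TeraSort's actual code):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Keys below 1000 go to reducer 0, [1000, 5000) to reducer 1, the rest to reducer 2.
class RangePartitioner extends Partitioner<IntWritable, Text> {
    private static final int[] SPLIT_POINTS = {1000, 5000}; // from a sampling pass

    @Override
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        int i = 0;
        while (i < SPLIT_POINTS.length && key.get() >= SPLIT_POINTS[i]) {
            i++;
        }
        return Math.min(i, numPartitions - 1); // guard if fewer reducers are configured
    }
}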
So the simplest way to sort using map-reduce (though not the most efficient one) is to do the following:
During the Map Phase:
for every (Input_Key, Input_Value), emit (Input_Value, Input_Key)
The Reducer is an identity reducer.
So, for example, if our data is a (student, age) database, then your mapper input would be
('A', 1) ('B', 2) ('C', 10) ... and the output would be
(1, A) (2, B) (10, C)
Haven't tried this logic out, but it is a step in a homework problem I am working on. Will post an update with the source code / a link to the logic.
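A minimal Hadoop sketch of that swap-and-identity idea (two classes, shown together; the "name,age" line format is an assumption):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Swap (name, age) into (age, name); the shuffle then sorts by age.
class SwapMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split(",");
        ctx.write(new IntWritable(Integer.parseInt(parts[1].trim())),
                  new Text(parts[0].trim()));
    }
}

// Identity reducer: re-emit the already sorted keys with their values.
class IdentityAgeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable age, Iterable<Text> names, Context ctx)
            throws IOException, InterruptedException {
        for (Text name : names) {
            ctx.write(age, name);
        }
    }
}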
Sorry for being late but for future readers, yes, Chander, you are missing something.
The logic is that a reducer can only handle the shuffled and sorted data of its own partition. I mean, a reducer running on one node can't look at another reducer's data; it applies the reduce algorithm to its own data only. So the merging step of merge sort can't be applied across reducers.
So for big data we use TeraSort, which is nothing but an identity mapper and reducer with a custom partitioner. You can read more about it in Hadoop's implementation of TeraSort. It states:
"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1."
I think combining multiple sorted items is more efficient than combining multiple unsorted items. So the mappers do the task of sorting chunks, and the reducer merges them. Had the mappers not done the sorting, the reducer would have a tough time doing it.
Sorting can be efficiently implemented using MapReduce, but you seem to be thinking about implementing merge sort using MapReduce to achieve this purpose, and it may not be the ideal candidate.
Like you alluded to, merge sort (with map-reduce) would involve the following steps:
Partition the elements into small groups and assign each group to the mappers in a round-robin manner
Each mapper sorts its subset and returns {K, {subset}}, where K is the same for all mappers
Since the same K is used across all mappers, there is only one reduce key and hence only one reducer. The reducer can merge the data and return the sorted result
The problem here is that, like you mentioned, there can be only one reducer, which precludes parallelism during the reduce phase. As mentioned in other replies, MapReduce-specific implementations like TeraSort can be considered for this purpose.
Found the explanation at http://www.chinacloud.cn/upload/2014-01/14010410467139.pdf
Coming back to merge sort, this would be feasible if the Hadoop (or equivalent) tool provided a hierarchy of reducers, where the output of one level of reducers goes to the next level of reducers, or is looped back to the same set of reducers.
