Apache Beam count of unique elements - java

I have an Apache Beam job which ingests data from Pub/Sub and then loads it into BigQuery.
I transform the Pub/Sub message into a POJO with the fields
id,
name, count
Count means how many times that (non-unique) element appeared in a single ingest.
If I load 3 elements from Pub/Sub, two of which are the same, then I need to load 2 elements into BigQuery, and one of them will have count 2.
I wonder how to do this easily in Apache Beam.
I tried to do it via DoFn or MapElements, but there I can only process a single element at a time.
I also tried to convert the elements to KV and then count, but I get a non-deterministic coder.
In a regular Java app I could simply use equals or a Map, but in Apache Beam everything is different.

The simple and correct approach would be to use Count.<T>perElement(), like this:
Pipeline p = ...;
PCollection<T> elements = p.apply(...); // read elements
PCollection<KV<T, Long>> elementsCounts =
elements.apply(Count.<T>perElement());
PCollection<TableRow> results = elementsCounts.apply(ParDo.of(
new FormatOutputFn()));
That said, you do need a deterministic element coder for this. So if that's not the case (as I understand from what you said above), you need to add a step before Count that transforms each element into a different representation for which a deterministic coder is possible (AvroCoder, for example).
If that's not possible for some reason, another workaround is to calculate a unique hash for every element (the hash value must be deterministic as well), create a KV for every element with the hash as the key and the element itself as the value, and use GroupByKey downstream to obtain groups of identical values.
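For illustration, a rough sketch of that hash-key workaround; Message, its getters and the concatenation-based hash are placeholders for your element type and whatever deterministic hash you choose:
PCollection<KV<String, Message>> keyed = elements.apply(
        WithKeys.of((Message m) -> m.getId() + "|" + m.getName())   // key function must be deterministic
                .withKeyType(TypeDescriptors.strings()));
PCollection<KV<String, Iterable<Message>>> grouped =
        keyed.apply(GroupByKey.<String, Message>create());
// The value still needs some coder (e.g. AvroCoder), but only the key coder has to be
// deterministic; the size of each Iterable is the per-key count within the current window.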
Also, please note that since Pub/Sub is an unbounded source, you need to window your input with some windowing strategy (other than the global one), since all group/combine operations have to be done inside a window. Take a look at WindowedWordCount as an example solution to a similar problem.
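Putting the pieces together, a rough end-to-end sketch could look like the following. Message, ParseMessageFn, the field accessors, the subscription placeholder and the one-minute fixed window are all assumptions for the sake of the example, not part of the original answer:
PCollection<Message> messages = p
        .apply(PubsubIO.readMessages().fromSubscription("projects/<project>/subscriptions/<sub>"))
        .apply(ParDo.of(new ParseMessageFn()))   // hypothetical PubsubMessage -> Message parser
        .apply(Window.<Message>into(FixedWindows.of(Duration.standardMinutes(1))));

PCollection<TableRow> rows = messages
        .apply(Count.<Message>perElement())
        .apply(ParDo.of(new DoFn<KV<Message, Long>, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                Message m = c.element().getKey();
                c.output(new TableRow()
                        .set("id", m.getId())
                        .set("name", m.getName())
                        .set("count", c.element().getValue()));
            }
        }));
The rows collection can then be written with BigQueryIO as usual.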

Related

What is the easiest way to implement offset and limit in AWS DynamoDB?

I am currently working with DynamoDB and I see there is no built-in way to apply an offset and limit like in SQL queries. The only approach I have seen is to take the lastEvaluatedKey and pass it as the exclusiveStartKey. But the problem is that in our scenario we can't obtain and propagate the lastEvaluatedKey, because we sort the data and process it with a parallel stream.
So this creates issues and pain points. I just need to know the clean way, or the best way, to pass an offset and limit and get the data without iterating over all of it. Right now, even though the offset and limit are applied via a bounded iterator, that happens after the query, which still iterates over all the data in DynamoDB and consumes a lot of read capacity even though we don't need the other items.
Map<String, AttributeValue> valueMap = new HashMap<>();
valueMap.put(":v1", new AttributeValue().withS(id));
Map<String, String> nameMap = new HashMap<>();
nameMap.put("#PK", "PK");
DynamoDBQueryExpression<TestClass> expression = new DynamoDBQueryExpression<TestClass>()
.withKeyConditionExpression("#PK = :v1")
.withExpressionAttributeNames(nameMap)
.withExpressionAttributeValues(valueMap)
.withConsistentRead(false);
PaginatedQueryList<TestClass> testList = dynamoDBMapper.query(TestClass.class, expression);
Iterator<TestClass> iterator = IteratorUtils.boundedIterator(testList.iterator(), offset, limit);
return IteratorUtils.toList(iterator);
What will be the best way to handle this issue?
A Query request has a Limit option just like you wanted. As for the offset, you have the ExclusiveStartKey option, which says at which sort key inside the long partition you want to start. Although usually one pages through a long partition by setting ExclusiveStartKey to the LastEvaluatedKey of the previous page, you don't strictly need to do this, and can actually pass any existing or even non-existing item, and the query will start after that key (that's the meaning of the word exclusive, i.e., it excludes).
But when you said "offset", you probably meant a numeric offset - e.g., start at the 1000th item in the partition. Unfortunately, this is not supported by DynamoDB. You can approximate it if your sort key (or LSI key) is the numeric position of the item (only practical if you only ever append to the partition...) or by using some additional data structures, but it's not supported by DynamoDB itself.
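For reference, a minimal sketch of paging with the SDK v1 DynamoDBMapper used in the question, combining Limit with ExclusiveStartKey via queryPage; pageSize and lastEvaluatedKey are placeholders you would carry between calls:
DynamoDBQueryExpression<TestClass> expression = new DynamoDBQueryExpression<TestClass>()
        .withKeyConditionExpression("#PK = :v1")
        .withExpressionAttributeNames(nameMap)
        .withExpressionAttributeValues(valueMap)
        .withConsistentRead(false)
        .withLimit(pageSize)                       // caps how many items this page reads
        .withExclusiveStartKey(lastEvaluatedKey);  // null for the first page

QueryResultPage<TestClass> page = dynamoDBMapper.queryPage(TestClass.class, expression);
List<TestClass> items = page.getResults();
Map<String, AttributeValue> nextStartKey = page.getLastEvaluatedKey(); // feed into the next request
This keeps each request bounded, but as the answer above explains, it still cannot jump to an arbitrary numeric offset.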

How to use array of doubles as feature vector in Mallet

From what I've seen in the documentation and various examples, the typical workflow with data in Mallet requires you to work with a feature list that you usually obtain by passing your data through "pipes" while iterating over it with some sort of iterator. The data is usually stored in some CSV file.
I am trying to obtain a feature list from two arrays of doubles.
One array stores the actual features and is of size n x m (where n is the number of features and m is the number of feature vectors), and the other is of size 1 x m and contains binary labels. How should I convert these into a feature list so that I can use them in classifiers?
I ended up writing a custom Iterator similar to the one present in Mallet called "ArrayDataAndTargetIterator". I also had to use a pipe defined like this:
new SerialPipes(Arrays.asList(new Target2Label(), new Array2FeatureVector()));
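To make that concrete, here is a rough, untested sketch of feeding the two arrays into an InstanceList with that pipe; the column-wise layout, the label strings and the instance names are assumptions, so check them against your Mallet version:
Pipe pipe = new SerialPipes(Arrays.asList(new Target2Label(), new Array2FeatureVector()));
InstanceList instances = new InstanceList(pipe);
for (int j = 0; j < m; j++) {
    // copy column j of the n x m feature matrix into one feature vector
    double[] featureVector = new double[n];
    for (int i = 0; i < n; i++) {
        featureVector[i] = features[i][j];
    }
    String label = labels[0][j] == 1.0 ? "positive" : "negative";  // binary target as a label string
    instances.addThruPipe(new Instance(featureVector, label, "instance-" + j, null));
}
// instances can now be handed to a Mallet ClassifierTrainer.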

get the highest key from a javaPairRDD

I have a JavaPairRDD called "rdd", whose tuples are defined as:
<Integer, String[]>
I want to extract the highest key using the max() function, but it requires a Comparator as an argument. Could you give me an example of how to do that, please?
Example:
rdd = {(22, [ff, dd]), (8, [hh, jj]), (6, [rr, tt]), ...}
After applying rdd.max(....), it should give me:
int max_key = 22;
In Java, please.
Your approach isn't working because tuples don't have an inherent ordering.
What you're trying to do is get the maximum of the keys. The easiest way to do this would be to extract the keys and then get the max like so
keyRdd = rdd.keys()
max_key = keyRdd.max()
Note: Not a javaSpark user, so the syntax may be a bit off.
Even though #David's answer is logical, it didn't work for me: max() always requires a Comparator, and when I used one I got a "not serializable" exception. I then tried Ordering, but this time the max key came out as 1 (which is in fact the min). So in the end I used the easiest way: I sorted my pairRDD in descending order and then extracted the first() tuple.
int maxKey = rdd.sortByKey(false).first()._1();
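For readers hitting the same serialization issue, here is a sketch of David's suggestion in Java; the comparator just has to implement Serializable, which anonymous comparators capturing outer state often don't:
// somewhere accessible to the driver code
public static class IntComparator implements Comparator<Integer>, Serializable {
    @Override
    public int compare(Integer a, Integer b) {
        return a.compareTo(b);
    }
}

JavaRDD<Integer> keys = rdd.keys();
Integer maxKey = keys.max(new IntComparator());
This avoids the full sortByKey, which shuffles the whole RDD just to read one element.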

Aligning number of elements in partition in Java Apache Spark

I have two JavaRDD<Double> called rdd1 and rdd2 over which I'd like to evaluate some correlation, e.g. with Statistics.corr(). The two RDDs are generated with many transformations and actions, but at the end of the process, they both have the same number of elements. I know that two conditions must be respected in order to evaluate the correlation, that are related (as far as I understood) to the zip method used in the correlation function. Conditions are:
The RDDs must be split over the same number of partitions
Every partition must have the same number of elements
Moreover, according to the Spark documentation, I'm using methods on the RDDs which preserve ordering, so that the final correlation will be correct (although this wouldn't raise any exception). Now, the problem is that even if I'm able to keep the number of partitions consistent, for example with the code
JavaRDD<Double> rdd1Repartitioned = rdd1.repartition(rdd2.getNumPartitions());
what I don't know how to do (and what is giving me exceptions) is to control the number of entries in every partition. I found a workaround that, for now, is working, which is re-initializing the two RDDs I want to correlate:
List<Double> rdd1Array = rdd1.collect();
List<Double> rdd2Array = rdd2.collect();
JavaRDD<Double> newRdd1 = sc.parallelize(rdd1Array);
JavaRDD<Double> newRdd2 = sc.parallelize(rdd2Array);
but I'm not sure this guarantees anything about consistency. Also, it might be computationally very expensive in some situations. Is there a way to control the number of elements in each partition, or, more generally, to realign the partitions of two or more RDDs (I know more or less how the partitioning system works, and I understand that this might be complicated from the distribution point of view)?
Ok, this worked for me:
Statistics.corr(rdd1.repartition(8), rdd2.repartition(8))
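If repartition alone ever proves unreliable, a possible alternative sketch (assuming both RDDs have the same length and a stable order) is to pair the elements explicitly by index instead of relying on zip's partition requirements:
JavaPairRDD<Long, Double> indexed1 =
        rdd1.zipWithIndex().mapToPair(t -> new Tuple2<>(t._2(), t._1()));
JavaPairRDD<Long, Double> indexed2 =
        rdd2.zipWithIndex().mapToPair(t -> new Tuple2<>(t._2(), t._1()));

// Join on the index: the values are (x, y) pairs that stay aligned no matter how
// either RDD happens to be partitioned.
JavaRDD<Tuple2<Double, Double>> aligned = indexed1.join(indexed2).values();
After caching aligned, the two components can be mapped back out (or the correlation computed directly from the pairs) without worrying about partition sizes.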

Sorting large data using MapReduce/Hadoop

I am reading about MapReduce and the following thing is confusing me.
Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood it, the approach is as follows:
Write a mapper function that sorts integers. The framework will divide the input file into multiple chunks and give them to different mappers. Each mapper will sort its chunk of data independently of the others. Once all the mappers are done, we will pass each of their results to the Reducer, which will combine them and give me the final output.
My doubt is: if we have one reducer, how does it leverage the distributed framework if, eventually, we have to combine the result in one place? The problem boils down to merging 1 million entries in one place. Is that so, or am I missing something?
Thanks,
Chander
Check out merge-sort.
It turns out that sorting partially sorted lists is much more efficient in terms of operations and memory consumption than sorting the complete list.
If the reducer gets 4 sorted lists it only needs to look for the smallest element of the 4 lists and pick that one. If the number of lists is constant this reducing is an O(N) operation.
Also, typically the reducers are themselves "distributed" in something like a tree, so the work can be parallelized too.
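To make the "pick the smallest head of k sorted lists" step concrete, here is a small, self-contained k-way merge sketch (plain Java, not tied to any Hadoop API) using a priority queue:
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Merge k already-sorted lists by repeatedly taking the smallest current head.
// With a heap this is O(N log k); for a constant k it is effectively O(N).
public class KWayMerge {
    public static List<Integer> merge(List<List<Integer>> sortedLists) {
        // heap entries: {value, index of the source list}
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        List<Iterator<Integer>> iterators = new ArrayList<>();
        for (List<Integer> list : sortedLists) {
            Iterator<Integer> it = list.iterator();
            iterators.add(it);
            if (it.hasNext()) {
                heap.add(new int[] {it.next(), iterators.size() - 1});
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] smallest = heap.poll();
            result.add(smallest[0]);
            Iterator<Integer> source = iterators.get(smallest[1]);
            if (source.hasNext()) {
                heap.add(new int[] {source.next(), smallest[1]});
            }
        }
        return result;
    }
}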
As others have mentioned, merging is much simpler than sorting, so there's a big win there.
However, doing an O(N) serial operation on a giant dataset can be prohibitive, too. As you correctly point out, it's better to find a way to do the merge in parallel, as well.
One way to do this is to replace the default partitioning function (which spreads keys out more or less randomly) with something a bit smarter. What Pig does for this, for example, is sample your dataset to come up with a rough approximation of the distribution of your values, and then assign ranges of values to different reducers. Reducer 0 gets all elements < 1000, reducer 1 gets all elements >= 1000 and < 5000, and so on. Then you can do the merge in parallel, and the end result is sorted, because you know which key range each numbered reducer task handled.
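As an illustration of that kind of range partitioning (not Pig's actual implementation), a custom Hadoop partitioner could look roughly like this, with the cut points hard-coded here instead of sampled (a real job would sample, e.g. with Hadoop's InputSampler/TotalOrderPartitioner):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends each key to the reducer responsible for its value range.
public class RangePartitioner extends Partitioner<IntWritable, Text> {
    // reducer 0: < 1000, reducer 1: [1000, 5000), reducer 2: >= 5000
    private static final int[] CUT_POINTS = {1000, 5000};

    @Override
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        int k = key.get();
        int partition = 0;
        while (partition < CUT_POINTS.length && k >= CUT_POINTS[partition]) {
            partition++;
        }
        return Math.min(partition, numPartitions - 1);
    }
}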
So the simplest way to sort using map-reduce (though not the most efficient one) is to do the following:
During the map phase:
for each (Input_Key, Input_Value) pair, emit (Input_Value, Input_Key)
The reducer is an identity reducer.
So, for example, if our data is a (student, age) database, then your mapper input would be
('A', 1) ('B',2) ('C', 10) ... and the output would be
(1, A) (2, B) (10, C)
I haven't tried this logic out yet, but it is a step in a homework problem I am working on. I will post an update with a source code/logic link.
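As a hedged sketch of that approach (not the poster's actual homework code), the map side just swaps key and value, and Hadoop's stock Reducer class already behaves as an identity reducer; the (Text, IntWritable) input types are an assumption about how the (student, age) records arrive:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (age, student) so the shuffle sorts records by age before they reach the reducer.
public class SwapMapper extends Mapper<Text, IntWritable, IntWritable, Text> {
    @Override
    protected void map(Text student, IntWritable age, Context context)
            throws IOException, InterruptedException {
        context.write(age, student); // ('A', 1) -> (1, 'A')
    }
}
On the reduce side, configuring the plain org.apache.hadoop.mapreduce.Reducer class gives the identity behaviour described above.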
Sorry for being late, but for future readers: yes, Chander, you are missing something.
The logic is that a reducer can handle only the shuffled and sorted data of the node it is running on. I mean that a reducer running on one node can't look at another node's data; it applies the reduce algorithm to its own data only. So the merging procedure of merge sort can't be applied directly.
So for big data we use TeraSort, which is nothing but an identity mapper and reducer with a custom partitioner. You can read more about it here: Hadoop's implementation for TeraSort. It states:
"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1."
I think combining multiple sorted lists is more efficient than combining multiple unsorted lists. So the mappers do the task of sorting chunks, and the reducer merges them. Had the mappers not done the sorting, the reducer would have a tough time doing the sort itself.
Sorting can be efficiently implemented using MapReduce. But you seem to be thinking about implementing merge-sort using mapreduce to achieve this purpose. It may not be the ideal candidate.
As you alluded to, merge sort (with map-reduce) would involve the following steps:
Partition the elements into small groups and assign each group to the mappers in a round-robin manner
Each mapper sorts its subset and returns {K, {subset}}, where K is the same for all the mappers
Since the same K is used across all mappers, there is only one reduce key and hence only one reducer. The reducer can merge the data and return the sorted result
The problem here is that, as you mentioned, there can be only one reducer, which precludes parallelism during the reduction phase. As mentioned in other replies, mapreduce-specific implementations like TeraSort can be considered for this purpose.
Found the explanation at http://www.chinacloud.cn/upload/2014-01/14010410467139.pdf
Coming back to merge sort, this would be feasible if the Hadoop (or equivalent) tool provided a hierarchy of reducers, where the output of one level of reducers goes to the next level of reducers or loops back to the same set of reducers.
