Best practice to perform two times groupbykey in Spark?

I've a lot of tuples with this kind of format:
(1,200,a)
(2,300,a)
(1,300,b)
(2,400,a)
(2,500,b)
(3,200,a)
(3,400,b)
(1,500,a)
(2,400,b)
(3,500,a)
(1,200,b)
My job is to sort the tuples by their first element (the integer), and then, within each group, compute the average of the second element for each distinct value of the third element.
So, the result should be this:
(1,350,a),
(1,250,b),
(2,350,a),
(2,450,b),
(3,350,a),
(3,400,b).
What best practice do you recommend in this case?
I've tried mapToPair followed by groupByKey on the first element of the tuple, then another mapToPair and groupByKey on the third element, and finally reduceByKey, but it doesn't work and I don't know why. I don't think I've used the best approach for this kind of job.
This is a sketch of my solution

Just use the Dataset API. It is shown here in Scala, but the Java version will be almost identical:
val rdd = sc.parallelize(Seq(
(1,200,"a"), (2,300,"a"), (1,300,"b"), (2,400,"a"), (2,500,"b"),
(3,200,"a"), (3,400,"b"), (1,500,"a"), (2,400,"b"), (3,500,"a"),
(1,200,"b")
))
val df = rdd.toDF("k1", "v", "k2")
df.groupBy("k1", "k2").mean("v").orderBy("k1", "k2").show
+---+---+------+
| k1| k2|avg(v)|
+---+---+------+
| 1| a| 350.0|
| 1| b| 250.0|
| 2| a| 350.0|
| 2| b| 450.0|
| 3| a| 350.0|
| 3| b| 400.0|
+---+---+------+
With an RDD, map first to build a composite key, then reduce to a (sum, count) pair per key and divide:
rdd
.map(x => ((x._1, x._3), (x._2, 1.0)))
.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
.mapValues(x => x._1 / x._2)
.take(6).foreach(println)
((2,a),350.0)
((3,b),400.0)
((1,b),250.0)
((1,a),350.0)
((3,a),350.0)
((2,b),450.0)
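The composite-key idea above can also be sketched in plain Java, without Spark, to show what the reduceByKey and mapValues steps compute. This is only an illustration under stated assumptions: the class CompositeKeyAverage, the holder Rec and the helper sample() are hypothetical names, and the code uses java.util collections rather than the Spark Java API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CompositeKeyAverage {
    /** Hypothetical holder for one (k1, v, k2) tuple from the question. */
    static final class Rec {
        final int k1;
        final int v;
        final String k2;
        Rec(int k1, int v, String k2) { this.k1 = k1; this.v = v; this.k2 = k2; }
    }

    /** The eleven sample tuples from the question. */
    static List<Rec> sample() {
        return Arrays.asList(
            new Rec(1, 200, "a"), new Rec(2, 300, "a"), new Rec(1, 300, "b"),
            new Rec(2, 400, "a"), new Rec(2, 500, "b"), new Rec(3, 200, "a"),
            new Rec(3, 400, "b"), new Rec(1, 500, "a"), new Rec(2, 400, "b"),
            new Rec(3, 500, "a"), new Rec(1, 200, "b"));
    }

    static Map<List<Object>, Double> averages(List<Rec> recs) {
        // Like reduceByKey: keep a running (sum, count) per composite key (k1, k2).
        Map<List<Object>, long[]> acc = new HashMap<>();
        for (Rec r : recs) {
            acc.merge(Arrays.<Object>asList(r.k1, r.k2), new long[] {r.v, 1},
                      (x, y) -> new long[] {x[0] + y[0], x[1] + y[1]});
        }
        // Like mapValues: turn each (sum, count) pair into an average.
        Map<List<Object>, Double> out = new HashMap<>();
        acc.forEach((key, sc) -> out.put(key, (double) sc[0] / sc[1]));
        return out;
    }

    public static void main(String[] args) {
        // HashMap iteration order is unspecified, just like the RDD output above.
        averages(sample()).forEach((key, avg) -> System.out.println(key + " -> " + avg));
    }
}
```

Carrying (sum, count) instead of grouping all values first is the reason a single reduce is enough: the pairs can be merged in any order, which is what lets Spark combine them on each partition before shuffling.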

Related

Compare two strings in lambda comparator function placing one in front of the other depending on if the strings contain a word [closed]

I'm just coming to grips with lambda functions in Java
I have an array list of class objects.
cars.add(new Car( "BMW", "1 Series", 39345));
cars.add(new Car( "Nissan", "micra", 16895 ));
cars.add(new Car( "Volkswagon", "Golf", 23950));
cars.add(new Car( "Skoda", "Superb", 32080));
cars.add(new Car( "Kia", "Sportage", 36450));
I want to sort the cars based on model so for example I want all Skoda cars placed at the beginning of the array list.
I know for example how to sort the cars by price because it's simply comparing two prices.
Comparator<Car> byCost = (Car obj1, Car obj2) -> obj1.getPrice() - obj2.getPrice();
Collections.sort(cars, byCost);
I don't know how to use the Comparator function to sort the cars by name. Since I'm comparing two boolean values by using the .contains method, I cannot use the Comparator interface method like I have above. So this is what I've tried.
Comparator<Car> bySkoda = (Car obj1, Car obj2) -> {
if(obj1.getModel().contains("Skoda"))
return 1;
else
return -1;
};
Collections.sort(cars, bySkoda);
This is of course not how to do it. I would like a pointer as to how I can achieve this using a lambda Comparator interface?

Using the ternary operator lets you get rid of the if statement inside the lambda:
final String c = "Skoda";
Comparator<Car> bySkoda = (car1, car2) ->
c.equals(car1.getModel()) ^ c.equals(car2.getModel())
? c.equals(car1.getModel()) ? -1 : 1 // exactly one of car1, car2 is "Skoda"
: car1.getModel().compareTo(car2.getModel()); // both or neither of car1, car2 is "Skoda"
Notes:
the type of arguments in lambdas may be skipped if it can be inferred
it seems to be fine to use equals instead of contains
to place Skoda models first, we use XOR to check whether exactly one of car1 and car2 is "Skoda"
|--------------------|-------------------|-----------|--------------|
| "Skoda" | "Skoda" | XOR | Result |
|.equals(car1.model) |.equals(car2.model)| | |
|--------------------|-------------------|-----------|--------------|
| true | true | false | 0 |
| true | false | true | -1 |
| false | true | true | 1 |
| | | | car1.model |
| false | false | false | .compareTo |
| | | | (car2.model) |
|--------------------|-------------------|-----------|--------------|
it seems reasonable to keep the other models sorted in alphabetical order
This may be updated to provide a sorted list with a preferable model first:
public static List<Car> sortWithPreferableModelFirst(List<Car> cars, String model) {
return cars.stream()
.sorted((car1, car2) ->
model.equalsIgnoreCase(car1.getModel()) ^ model.equalsIgnoreCase(car2.getModel())
? model.equalsIgnoreCase(car1.getModel()) ? -1 : 1
: car1.getModel().compareTo(car2.getModel()))
.collect(Collectors.toList());
}
cars = sortWithPreferableModelFirst(cars, "Skoda");
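The same ordering (preferred model first, the rest alphabetical) can also be written with the Comparator.comparing and thenComparing combinators instead of nested ternaries. The sketch below is one possible formulation, not the answer's method: the Car class is a minimal hypothetical stand-in because the question does not show its definition, and preferredFirst and sortedModels are illustrative names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PreferredModelSort {
    /** Minimal hypothetical stand-in for the question's Car class. */
    static final class Car {
        private final String model;
        private final int price;
        Car(String model, int price) { this.model = model; this.price = price; }
        String getModel() { return model; }
        int getPrice() { return price; }
    }

    /** Preferred model first, all other models in alphabetical order. */
    static Comparator<Car> preferredFirst(String preferred) {
        // Boolean.compareTo orders false before true, so the key
        // "is NOT the preferred model" puts the preferred cars first.
        return Comparator.comparing((Car c) -> !preferred.equals(c.getModel()))
                         .thenComparing(Car::getModel);
    }

    /** Helper for trying it out: sorts plain model names via the comparator. */
    static List<String> sortedModels(List<String> models, String preferred) {
        List<Car> cars = new ArrayList<>();
        for (String m : models) {
            cars.add(new Car(m, 0)); // price is irrelevant for this ordering
        }
        cars.sort(preferredFirst(preferred));
        List<String> out = new ArrayList<>();
        for (Car c : cars) {
            out.add(c.getModel());
        }
        return out;
    }
}
```

For the price comparator in the question, Comparator.comparingInt(Car::getPrice) is a similar alternative that avoids the integer overflow a subtraction-based comparator can hit with extreme values.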

Stream Filter List based on Combination of values from another List

Need: To filter out the rows of list 1 that are present in list 2, matching on multiple criteria, i.e. the combination of Date & Order Number.
Issue: I am able to filter on one criterion, but when I add a second filter condition it is treated as two separate conditions and not as a combination. I am unable to figure out how to make it a combination.
I hope the issue is clear.
Research: I referred to my earlier query on similar need - Link1 . Also checked - Link2
List 1: (All Orders)
[Date | OrderNumber | Time | Company | Rate ]
[2014-10-01 | 12345 | 10:00:01 | CompA | 1000]
[2015-03-01 | 23456 | 08:00:01 | CompA | 2200]
[2016-08-01 | 34567 | 09:00:01 | CompA | 3300]
[2017-09-01 | 12345 | 11:00:01 | CompA | 4400]
[2017-09-01 | 98765 | 12:00:01 | CompA | 7400]
List 2: (Completed Orders)
[Date | OrderNumber | Time]
[2014-10-01 | 12345 | 10:00:01]
[2015-03-01 | 23456 | 08:00:01]
[2016-08-01 | 34567 | 09:00:01]
[2017-09-01 | 98765 | 12:00:01]
Expected O/p after filter :
[Date | OrderNumber | Time | Company | Rate]
[2017-09-01 | 12345 | 11:00:01 | CompA | 4400]
Code:
// Data extracted from MySQL database
// List 1: All Orders
List<ModelAllOrders> listOrders = getDataFromDatabase.getTable1();
// List 2: Completed Orders
List<ModelCompletedOrders> listCompletedOrders = getDataFromDatabase.getTable2();
// Filtering on one criterion works
Set<Integer> setOrderNumbers = listCompletedOrders.stream().map(ModelCompletedOrders::getOrderNumber).collect(Collectors.toSet());
listOrders = listOrders.stream().filter(p -> !setOrderNumbers.contains(p.getOrderNumber())).collect(Collectors.toList());
// The combined filter below does not work as expected
Set<LocalDate> setDates = listCompletedOrders.stream().map(ModelCompletedOrders::getDate).collect(Collectors.toSet());
listOrders = listOrders.stream().filter(p -> !setDates.contains(p.getDate()) && !setOrderNumbers.contains(p.getOrderNumber()))
.collect(Collectors.toList());

You've asked for logic that will do this:
The combination of Date & Order Number is unique. I need to check if that unique combination is present in List-2, if yes then filter out, if not then output should contain that row.
Stream::filter() will return a subset of the stream where the filter predicate returns true (i.e. it filters out those objects in the stream where the predicate is false).
listOrders = listOrders.stream().filter(p -> !setDates.contains(p.getDate()) && !setOrderNumbers.contains(p.getOrderNumber()))
.collect(Collectors.toList());
Your code expression here says "show me orders where the order's date does not appear in the list of prior orders AND where the order's order number does not appear in the list of prior orders". Your logical expression is wrong (you're getting confused between what in electronics would be called positive vs negative logic).
You want either of the following, which are equivalent by De Morgan's law:
listOrders = listOrders.stream().filter(p -> !(setDates.contains(p.getDate()) && setOrderNumbers.contains(p.getOrderNumber())))
.collect(Collectors.toList());
"show me orders unless both the order's date and the order's number are present in the list of prior orders"
or:
listOrders = listOrders.stream().filter(p -> !setDates.contains(p.getDate()) || !setOrderNumbers.contains(p.getOrderNumber()))
.collect(Collectors.toList());
"show me orders where either the order's date has not been seen before OR the order's number has not been seen before"
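One caveat: setDates and setOrderNumbers are two independent sets, so a date taken from one completed order and an order number taken from another can still "match" together. With the sample data, the row 2017-09-01 | 12345 has a date that appears in setDates (via order 98765) and a number that appears in setOrderNumbers (via the 2014-10-01 order), so it would be filtered out even though that combination is not in list 2. A minimal sketch that keys on the pair itself is shown below. The Order class and the names key, notCompleted and demo are hypothetical stand-ins, since the real ModelAllOrders and ModelCompletedOrders classes are not shown in the question.

```java
import java.time.LocalDate;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class CompositeKeyFilter {
    /** Hypothetical stand-in for both model classes; only the two key fields matter here. */
    static final class Order {
        private final LocalDate date;
        private final int orderNumber;
        Order(String isoDate, int orderNumber) {
            this.date = LocalDate.parse(isoDate);
            this.orderNumber = orderNumber;
        }
        LocalDate getDate() { return date; }
        int getOrderNumber() { return orderNumber; }
    }

    /** Composite key: the entry's equals/hashCode compare both date and number. */
    static Map.Entry<LocalDate, Integer> key(Order o) {
        return new SimpleImmutableEntry<>(o.getDate(), o.getOrderNumber());
    }

    /** Keeps only the orders whose (date, number) pair is absent from completed. */
    static List<Order> notCompleted(List<Order> all, List<Order> completed) {
        Set<Map.Entry<LocalDate, Integer>> done =
            completed.stream().map(CompositeKeyFilter::key).collect(Collectors.toSet());
        return all.stream()
                  .filter(o -> !done.contains(key(o)))
                  .collect(Collectors.toList());
    }

    /** Runs the filter on the sample data from the question. */
    static List<String> demo() {
        List<Order> all = Arrays.asList(
            new Order("2014-10-01", 12345), new Order("2015-03-01", 23456),
            new Order("2016-08-01", 34567), new Order("2017-09-01", 12345),
            new Order("2017-09-01", 98765));
        List<Order> completed = Arrays.asList(
            new Order("2014-10-01", 12345), new Order("2015-03-01", 23456),
            new Order("2016-08-01", 34567), new Order("2017-09-01", 98765));
        return notCompleted(all, completed).stream()
            .map(o -> o.getDate() + "|" + o.getOrderNumber())
            .collect(Collectors.toList());
    }
}
```

Any type with value-based equals and hashCode would work as the key; the map entry is used here only because it ships with the JDK.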

Spark: Print the first ten rows of every year

I have a dataset in Spark with two columns: a string column (its first 4 characters are the year and the remaining characters are a word) and an integer column. Example of a dataset row: "2004-dog" 45. I don't know how to print the first ten rows of every year. This is how far I got:
JavaRDD<String> mentions =
tweets.flatMap(s -> Arrays.asList(s.split(":")).iterator());
JavaPairRDD<String, Integer> counts =
mentions.mapToPair(mention -> new Tuple2<>(mention, 1))
.reduceByKey((x, y) -> x + y);

Here is just an example:
Input Data:
+-------+---+
| year|cnt|
+-------+---+
|2015:04| 50|
|2015:04| 40|
|2015:04| 50|
|2017:04| 55|
|2017:04| 20|
|2017:04| 20|
+-------+---+
Assuming you have some criterion for picking the top 10 (here, the highest cnt), create a window specification partitioned by year and number the rows within each partition:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val w = Window.partitionBy("year").orderBy(desc("cnt"))
df.withColumn("year", split('year,":")(0))
.withColumn("rank", row_number.over(w)) //use rank/dense_rank as you need
.filter('rank <= 2) //replace 10 here
//.drop("rank") //you can drop rank if you want
.show()
Result:
+----+---+----+
|year|cnt|rank|
+----+---+----+
|2017| 55| 1|
|2017| 20| 2|
|2015| 50| 1|
|2015| 50| 2|
+----+---+----+
Hope this helps!
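To make the logic of the window explicit, here is a plain-Java sketch of what partitionBy, orderBy(desc) and the row_number filter compute together: group by year, sort each group by cnt descending, and keep the first n rows. It is an illustration only and does not use the Spark API; the names TopPerYear, Row, topN and demo are hypothetical, and the "year:month" string format follows the example input above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class TopPerYear {
    /** Hypothetical row: the year is taken from the text before the first ':'. */
    static final class Row {
        final String year;
        final int cnt;
        Row(String yearMonth, int cnt) {
            this.year = yearMonth.split(":")[0];
            this.cnt = cnt;
        }
    }

    /** Keeps the n rows with the highest cnt in each year (ties in encounter order). */
    static List<Row> topN(List<Row> rows, int n) {
        // "partitionBy(year)": one list of rows per year, years in ascending order.
        Map<String, List<Row>> byYear = rows.stream()
            .collect(Collectors.groupingBy((Row r) -> r.year, TreeMap::new, Collectors.toList()));
        List<Row> out = new ArrayList<>();
        for (List<Row> group : byYear.values()) {
            // "orderBy(desc(cnt))" plus "row_number <= n".
            group.stream()
                 .sorted(Comparator.comparingInt((Row r) -> r.cnt).reversed())
                 .limit(n)
                 .forEach(out::add);
        }
        return out;
    }

    /** Runs topN with n = 2 on the example input from the answer. */
    static List<String> demo() {
        List<Row> rows = Arrays.asList(
            new Row("2015:04", 50), new Row("2015:04", 40), new Row("2015:04", 50),
            new Row("2017:04", 55), new Row("2017:04", 20), new Row("2017:04", 20));
        return topN(rows, 2).stream()
            .map(r -> r.year + " " + r.cnt)
            .collect(Collectors.toList());
    }
}
```

Unlike the Spark output above, this sketch returns the years in ascending order because it collects into a TreeMap.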

Create dataframe from rdd objectfile

How can I create a DataFrame from an RDD that was saved as an object file? I want to load the RDD, but I don't have a Java object for the rows, only a StructType that I want to use as the schema for the DataFrame.
I tried reading it back as Row:
val myrdd = sc.objectFile[org.apache.spark.sql.Row]("/home/bipin/"+name)
But I get
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to
org.apache.spark.sql.Row
Is there a way to do this?
Edit
From what I understand, I have to read the RDD as an array of objects and convert each array to a Row. If anyone can give a method for this, it would be acceptable.

If you have an Array of Object, you only have to use the Row apply method, which accepts any number of Any values. In code it will be something like this:
val myrdd = sc.objectFile[Array[Object]]("/home/bipin/"+name).map(x => Row(x))
EDIT
You are right #user568109, this will create a DataFrame with only one field, which will be an Array. To spread the whole array across the fields of the Row you have to do this:
val myrdd = sc.objectFile[Array[Object]]("/home/bipin/"+name).map(x => Row.fromSeq(x.toSeq))
As #user568109 said there are other ways to do this:
val myrdd = sc.objectFile[Array[Object]]("/home/bipin/"+name).map(x => Row(x:_*))
It doesn't matter which one you use, because both are wrappers for the same code:
/**
* This method can be used to construct a [[Row]] with the given values.
*/
def apply(values: Any*): Row = new GenericRow(values.toArray)
/**
* This method can be used to construct a [[Row]] from a [[Seq]] of values.
*/
def fromSeq(values: Seq[Any]): Row = new GenericRow(values.toArray)

Let me add some explanation.
Suppose you have a MySQL table grocery with the columns (grocery_id, item, category, price) and the contents below:
+------------+---------+----------+-------+
| grocery_id | item | category | price |
+------------+---------+----------+-------+
| 1 | tomato | veg | 2.40 |
| 2 | raddish | veg | 4.30 |
| 3 | banana | fruit | 1.20 |
| 4 | carrot | veg | 2.50 |
| 5 | apple | fruit | 8.10 |
+------------+---------+----------+-------+
5 rows in set (0.00 sec)
Now, to read it within Spark, your code will be something like the following:
val groceryRDD = new JdbcRDD(sc, ()=> DriverManager.getConnection(url,uname,passwd), "select item,price from grocery limit ?,?",1,10,2,r => r.getString("item")+"|"+r.getString("price"))
Note:
In the above statement I converted the ResultSet into a String with r => r.getString("item")+"|"+r.getString("price")
So my JdbcRDD will be:
groceryRDD: org.apache.spark.rdd.JdbcRDD[String] = JdbcRDD[29] at JdbcRDD at <console>:21
Now you save it:
groceryRDD.saveAsObjectFile("/user/cloudera/jdbcobject")
Answer to your question:
When reading the object file back, you need to write it as below:
val newJdbObjectFile = sc.objectFile[String]("/user/cloudera/jdbcobject")
In short, just reuse the type parameter of the RDD you saved.
In my case, groceryRDD has the type parameter String, hence I have used the same.
UPDATE:
In your case, as mentioned by jlopezmat, you need to use Array[Object].
Each row of the RDD is an Object, but since you converted the ResultSet with resultSetToObjectArray, each row is saved with its contents as an Array.
That is, in my case, if I save the above RDD as below,
val groceryRDD = new JdbcRDD(sc, ()=> DriverManager.getConnection(url,uname,passwd), "select item,price from grocery limit ?,?",1,10,2,r => JdbcRDD.resultSetToObjectArray(r))
and then read it back and collect the data,
val newJdbcObjectArrayRDD = sc.objectFile[Array[Object]]("...")
val result = newJdbcObjectArrayRDD.collect
the result will be of type Array[Array[Object]]:
result: Array[Array[Object]] = Array(Array(raddish, 4.3), Array(banana, 1.2), Array(carrot, 2.5), Array(apple, 8.1))
You can parse the above based on your column definitions.
Please let me know if this answers your question.

SQLite : Adding a row only if results in unique rows

My SQLite table looks like this.
---------------------------
|_id|str A |str B |str C |
|---------------------------|
|1 |cat |blahty |lio |
|---------------------------|
|2 |dog |blahty |timmy |
|---------------------------|
|3 |cow |blahty |lio |
|---------------------------|
|4 |bat |blah |timmy |
|---------------------------|
|5 |tuna |blahty |timmy |
|---------------------------|
|6 |cat |bla |lio |
|---------------------------|
|7 |dog |blahty |timmy |
|---------------------------|
|8 |cow |bla |lion |
|---------------------------|
|9 |bat |blahty |timmy |
|---------------------------|
|10 |tuna |blahty |lio |
---------------------------
And I have an array {new str A, new str B, new str C} whose values I want to insert into the respective columns. I want to do this only if new str A doesn't match any entry in column str A. I also want to make str A unique by removing multiple occurrences of the same str A. How do I accomplish this via SQLite?

You should be able to use INSERT OR IGNORE (http://www.sqlite.org/lang_conflict.html) when inserting into the table.
I think you would also have to put a UNIQUE constraint on column str A for this to work.

First get all the values from the str A column, then compare them with the value from your array before inserting.

One way to keep the entries in str A unique is to make str A the primary key instead of _id, and then use UPDATE to change the related values. This also prevents multiple occurrences of the same str A, as an error will be raised when you try to INSERT an entry with a value of str A that is already present in the table. Hope this helps you out.
