Spark: remove duplicated rows with different values using groupBy - Java

I have a dataset ds like this:
ds.show():
id1 | id2 | id3 | value
----+-----+-----+------
  1 |   1 |   2 | tom
  1 |   1 |   2 | tim
  1 |   3 |   2 | tom
  1 |   3 |   2 | tom
  2 |   1 |   2 | mary
I want to remove rows that are duplicated on the keys (id1, id2, id3) but have different values (i.e. rows 1 and 2), while keeping only one row for duplicates that share the same value (i.e. rows 3 and 4). The expected output is:
id1 | id2 | id3 | value
----+-----+-----+------
  1 |   3 |   2 | tom
  2 |   1 |   2 | mary
Here rows 1 and 2 should be removed because there are two different values for that key group, but only one of rows 3 and 4 should be kept because the value is the same (instead of removing both).
I tried to achieve this using:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(
  (1, 1, 2, "tom"),
  (1, 1, 2, "tim"),
  (1, 3, 2, "tom"),
  (1, 3, 2, "tom"),
  (2, 1, 2, "mary")
).toDF("id1", "id2", "id3", "value")

val window = Window.partitionBy("id1", "id2", "id3")

df.distinct()
  .withColumn("count", count("value").over(window))
  .filter($"count" < 2)
  .drop("count")
  .show(false)
This returns the result I want. However, I would like to achieve the same thing using groupBy() instead of a Window, but I don't know how to do the count while doing the groupBy() at the same time.

Here is how you can do it with groupBy and the count and first functions:
df.distinct()
  .groupBy("id1", "id2", "id3")
  .agg(count("value").as("count"), first("value").as("value"))
  .filter($"count" < 2)
  .drop("count")
  .show(false)
Output:
+---+---+---+-----+
|id1|id2|id3|value|
+---+---+---+-----+
|1 |3 |2 |tom |
|2 |1 |2 |mary |
+---+---+---+-----+
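
Since the title asks for Java, a rough sketch of the same logic with the Java Dataset API might look like the following (a sketch only, assuming df is a Dataset<Row> with the columns above):

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// assuming df is a Dataset<Row> with columns id1, id2, id3, value
Dataset<Row> result = df.distinct()
        .groupBy("id1", "id2", "id3")
        // count the distinct values per key group and keep one of them
        .agg(count("value").as("count"), first("value").as("value"))
        // keep only the key groups that have a single value
        .filter(col("count").lt(2))
        .drop("count");

result.show(false);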

Related

Spark: remove duplicated rows with different values but keep only one row for distinct rows

I have a dataset ds like this:
ds.show():
id1 | id2 | id3 | value
----+-----+-----+------
  1 |   1 |   2 | tom
  1 |   1 |   2 | tim
  1 |   3 |   2 | tom
  1 |   3 |   2 | tom
  2 |   1 |   2 | mary
I want to remove rows that are duplicated on the keys (id1, id2, id3) but have different values (i.e. rows 1 and 2), while keeping only one row for duplicates that share the same value (i.e. rows 3 and 4). The expected output is:
id1 | id2 | id3 | value
----+-----+-----+------
  1 |   3 |   2 | tom
  2 |   1 |   2 | mary
Here rows 1 and 2 should be removed because there are two different values for that key group, but only one of rows 3 and 4 should be kept because the value is the same (instead of removing both).
I tried to achieve this using:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(
  (1, 1, 2, "tom"),
  (1, 1, 2, "tim"),
  (1, 3, 2, "tom"),
  (1, 3, 2, "tom"),
  (2, 1, 2, "mary")
).toDF("id1", "id2", "id3", "value")

val window = Window.partitionBy("id1", "id2", "id3")

df.withColumn("count", count("value").over(window))
  .filter($"count" < 2)
  .drop("count")
  .show(false)
Here is the related question:
Spark: remove all duplicated lines
But it is not working as expected here, because it removes all of the duplicated rows.
The reason I want to do this is to join with another dataset, without adding information from this dataset when there are multiple names for the same key group.
You can drop duplicates before grouping, which leaves you with a single record per duplicate group, as below:
df.dropDuplicates()
  .withColumn("count", count("value").over(window))
  .filter($"count" < 2)
  .drop("count")
  .show(false)
You can also specify the fields to be checked for duplicates:
df.dropDuplicates("id1", "id2", "id3", "value")
  .withColumn("count", count("value").over(window))
  .filter($"count" < 2)
  .drop("count")
  .show(false)
Output:
+---+---+---+-----+
|id1|id2|id3|value|
+---+---+---+-----+
|1 |3 |2 |tom |
|2 |1 |2 |mary |
+---+---+---+-----+
You can use distinct to keep only one row where rows are duplicated:
df.distinct
  .withColumn("count", count("value").over(window))
  .filter($"count" < 2)
  .drop("count")
  .show(false)
+---+---+---+-----+
|id1|id2|id3|value|
+---+---+---+-----+
|1 |3 |2 |tom |
|2 |1 |2 |mary |
+---+---+---+-----+
You can also use the groupBy method.
df.groupBy("id1", "id2", "id3", "value")
  .agg(first("col1").as("col1"), ...)
  .withColumn("count", count("value").over(window))
  .filter($"count" < 2)
  .drop("count")
  .show(false)

Spark Dataset - NullPointerException while doing a filter on dataset

I have 2 datasets as shown below. I'm trying to find out how many products are associated with each game, i.e. to keep a count of the number of associated products.
scala> df1.show()
gameid | games      | users          | cnt_assoc_prod
-------+------------+----------------+---------------
     1 | cricket    | [111, 121]     |
     2 | basketball | [211]          |
     3 | skating    | [101, 100, 98] |
scala> df2.show()
user | products
-----+----------
  98 | "shampoo"
 100 | "soap"
 101 | "shampoo"
 111 | "shoes"
 121 | "honey"
 211 | "shoes"
I'm trying to iterate through each entry of df1's users array and find the corresponding row in df2 by filtering on the matching user column:
df1.map { x => {
  var assoc_products = Set[String]()
  x.users.foreach(y =>
    assoc_products += df2.filter(z => z.user == y).first().products
  )
  x.cnt_assoc_prod = assoc_products.size
}}
While applying the filter I get the following exception:
java.lang.NullPointerException
at org.apache.spark.sql.Dataset.logicalPlan(Dataset.scala:784)
at org.apache.spark.sql.Dataset.mapPartitions(Dataset.scala:344)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:307)
I'm using Spark version 1.6.1.
The NullPointerException occurs because df2 is referenced inside a transformation on df1, and a Dataset cannot be used inside another Dataset's transformations (the reference is null on the executors). Instead, you can explode the users column in df1, join with df2 on the user column, then do the groupBy count:
(df1.withColumn("user", explode(col("users")))
.join(df2, Seq("user"))
.groupBy("gameid", "games")
.agg(count($"products").alias("cnt_assoc_prod"))
).show
+------+----------+--------------+
|gameid| games|cnt_assoc_prod|
+------+----------+--------------+
| 3| skating| 3|
| 2|basketball| 1|
| 1| cricket| 2|
+------+----------+--------------+

With Apache Spark, flatten the first 2 rows of each group with Java

Giving the following input table:
+----+------------+----------+
| id | shop | purchases|
+----+------------+----------+
| 1 | 01 | 20 |
| 1 | 02 | 31 |
| 2 | 03 | 5 |
| 1 | 03 | 3 |
+----+------------+----------+
I would like, grouping by id and based on the purchases, obtain the first 2 top shops as follow:
+----+-------+------+
| id | top_1 | top_2|
+----+-------+------+
| 1 | 02 | 01 |
| 2 | 03 | |
+----+-------+------+
I'm using Apache Spark 2.0.1, and the first table is the result of other queries and joins on a Dataset. I could perhaps do this by iterating over the Dataset with plain Java, but I hope there is another way using the Dataset functionality.
My first attempt was the following:
// dataset is already ordered by id, purchases desc
...
Dataset<Row> ds = dataset.repartition(new Column("id"));
ds.foreachPartition(new ForeachPartitionFunction<Row>() {
    @Override
    public void call(Iterator<Row> itrtr) throws Exception {
        int counter = 0;
        while (itrtr.hasNext()) {
            Row row = itrtr.next();
            if (counter < 2) {
                // save it into another Dataset
                counter++;
            }
        }
    }
});
But then I was lost as to how to save it into another Dataset. My goal is, at the end, to save the result into a MySQL table.
Using window functions and pivot, you can define a window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, row_number}
val w = Window.partitionBy(col("id")).orderBy(col("purchases").desc)
add row_number and filter top two rows:
val dataset = Seq(
  (1, "01", 20), (1, "02", 31), (2, "03", 5), (1, "03", 3)
).toDF("id", "shop", "purchases")

val topTwo = dataset.withColumn("top", row_number.over(w)).where(col("top") <= 2)
and pivot:
topTwo.groupBy(col("id")).pivot("top", Seq(1, 2)).agg(first("shop"))
with result being:
+---+---+----+
| id| 1| 2|
+---+---+----+
| 1| 02| 01|
| 2| 03|null|
+---+---+----+
I'll leave converting the syntax to Java as an exercise for the poster (excluding the static import for functions, the rest should be close to identical).
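
For reference, a rough Java sketch of the same approach might look like this (a sketch only, assuming dataset is a Dataset<Row> with the columns id, shop and purchases):

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// window over each id, highest purchases first
WindowSpec w = Window.partitionBy(col("id")).orderBy(col("purchases").desc());

// rank the shops within each id and keep the top two
Dataset<Row> topTwo = dataset
        .withColumn("top", row_number().over(w))
        .where(col("top").leq(2));

// pivot the rank into columns 1 and 2, taking the shop for each
Dataset<Row> result = topTwo
        .groupBy(col("id"))
        .pivot("top", Arrays.<Object>asList(1, 2))
        .agg(first("shop"));

result.show();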

Update multiple rows using Hibernate ORM

Table Name : Country
----+--------------+--------------------+------------------
 id | country_name | country_short_name | country_full_name
----+--------------+--------------------+------------------
 1  | Bagladesh    | BD                 | Bagladesh
 2  | Bagladesh    | BCDD               | sdriij
 3  | India        | IND                | India
----+--------------+--------------------+------------------
In Laravel I update multiple rows using:
Country::where('country_name', '=', "Bagladesh")
    ->update(array(
        'country_short_name' => "BD",
        'country_full_name'  => "Bangladesh",
    ));
I want to do the same using Hibernate.
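
A common way to do a bulk update in Hibernate is an HQL update query. A minimal sketch, assuming a mapped Country entity with fields countryName, countryShortName and countryFullName and an already configured SessionFactory (all hypothetical names), could look like:

import org.hibernate.Session;
import org.hibernate.Transaction;

// sessionFactory is assumed to be configured elsewhere;
// the Country entity and its field names are hypothetical
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();

// HQL bulk update: updates every row matching the where clause
int updatedRows = session.createQuery(
        "update Country set countryShortName = :shortName, "
      + "countryFullName = :fullName "
      + "where countryName = :name")
    .setParameter("shortName", "BD")
    .setParameter("fullName", "Bangladesh")
    .setParameter("name", "Bagladesh")
    .executeUpdate();

tx.commit();
session.close();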

JTable: getting the values of a specific column

How can I get the values under a specific column in a JTable?
For example:
+----------+----------+
| Column 1 | Column 2 |
+----------+----------+
|    1     |    a     |
|    2     |    b     |
|    3     |    c     |
+----------+----------+
How can I get the values under Column 1, i.e. [1, 2, 3], in the form of some data structure (preferably an array)?
You can do something like this:
import java.util.ArrayList;
import java.util.List;

List<Object> list = new ArrayList<>();
for (int i = 0; i < table.getModel().getRowCount(); i++) {
    // collect the value of every row at column index 0
    list.add(table.getModel().getValueAt(i, 0));
}
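
If an array is preferred, as the question mentions, the same loop can be wrapped in a small helper (a hypothetical columnValues method, not part of the original answer) that converts the list at the end:

import java.util.ArrayList;
import java.util.List;

import javax.swing.JTable;
import javax.swing.table.TableModel;

// hypothetical helper: collects every value in the given column index
// of the table's model and returns them as an array
static Object[] columnValues(JTable table, int columnIndex) {
    TableModel model = table.getModel();
    List<Object> values = new ArrayList<>();
    for (int row = 0; row < model.getRowCount(); row++) {
        values.add(model.getValueAt(row, columnIndex));
    }
    return values.toArray();
}

Usage: Object[] firstColumn = columnValues(table, 0);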
