I have a dataset ds like this:
ds.show():
id1 | id2 | id3 | value
----+-----+-----+------
  1 |   1 |   2 | tom
  1 |   1 |   2 | tim
  1 |   3 |   2 | tom
  1 |   3 |   2 | tom
  2 |   1 |   2 | mary
I want to remove every row that is duplicated on the keys (id1, id2, id3) with different values (i.e. rows 1 and 2), but at the same time keep only one row when the duplicated rows share the same value (i.e. rows 3 and 4). The expected output is:
id1 | id2 | id3 | value
----+-----+-----+------
  1 |   3 |   2 | tom
  2 |   1 |   2 | mary
Here rows 1 and 2 should be removed because the key group has two different values, while rows 3 and 4 collapse into a single row because the value is the same (instead of removing both of them).
I tried to achieve this using:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

val df = Seq(
  (1, 1, 2, "tom"),
  (1, 1, 2, "tim"),
  (1, 3, 2, "tom"),
  (1, 3, 2, "tom"),
  (2, 1, 2, "mary")
).toDF("id1", "id2", "id3", "value")

val window = Window.partitionBy("id1", "id2", "id3")

df.distinct()
  .withColumn("count", count("value").over(window))
  .filter($"count" < 2)
  .drop("count")
  .show(false)
This returns the correct result. However, I want to use groupBy() to achieve the same thing (instead of a Window), and I don't know how to keep the count while doing the groupBy().
Here is how you can do it with groupBy and the count and first functions:
import org.apache.spark.sql.functions.{count, first}

df.distinct()
  .groupBy("id1", "id2", "id3")
  .agg(count("value").as("count"), first("value").as("value"))
  .filter($"count" < 2)
  .drop("count")
  .show(false)
Output:
+---+---+---+-----+
|id1|id2|id3|value|
+---+---+---+-----+
|1 |3 |2 |tom |
|2 |1 |2 |mary |
+---+---+---+-----+
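If you'd rather skip the distinct() pass, a variant sketch using countDistinct should give the same result here: it counts the distinct values per key group, so only groups with exactly one distinct value survive.

import org.apache.spark.sql.functions.{countDistinct, first}

// keep key groups that have exactly one distinct value
df.groupBy("id1", "id2", "id3")
  .agg(countDistinct("value").as("distinct_values"), first("value").as("value"))
  .filter($"distinct_values" === 1)
  .drop("distinct_values")
  .show(false)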
I have a dataset ds like this:
ds.show():
id1 | id2 | id3 | value
----+-----+-----+------
  1 |   1 |   2 | tom
  1 |   1 |   2 | tim
  1 |   3 |   2 | tom
  1 |   3 |   2 | tom
  2 |   1 |   2 | mary
I want to remove every row that is duplicated on the keys (id1, id2, id3) with different values (i.e. rows 1 and 2), but at the same time keep only one row when the duplicated rows share the same value (i.e. rows 3 and 4). The expected output is:
id1 | id2 | id3 | value
----+-----+-----+------
  1 |   3 |   2 | tom
  2 |   1 |   2 | mary
Here rows 1 and 2 should be removed because the key group has two different values, while rows 3 and 4 collapse into a single row because the value is the same (instead of removing both of them).
I tried to achieve this using:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

val df = Seq(
  (1, 1, 2, "tom"),
  (1, 1, 2, "tim"),
  (1, 3, 2, "tom"),
  (1, 3, 2, "tom"),
  (2, 1, 2, "mary")
).toDF("id1", "id2", "id3", "value")

val window = Window.partitionBy("id1", "id2", "id3")

df.withColumn("count", count("value").over(window))
  .filter($"count" < 2)
  .drop("count")
  .show(false)
Here is the related question:
Spark: remove all duplicated lines
But this is not working as expected, because it removes all the duplicated rows, including rows 3 and 4, which should be kept once.
The reason I want to do this is to join with another dataset, and to avoid adding information from this dataset when there are multiple names for the same key group.
You can drop the duplicates before counting over the window, which leaves a single record for rows with the same value, as below:
df.dropDuplicates()
  .withColumn("count", count("value").over(window))
  .filter($"count" < 2)
  .drop("count")
  .show(false)
You can also specify which fields to check for duplicates:
df.dropDuplicates("id1", "id2", "id3", "value")
.withColumn("count", count("value").over(window))
.filter($"count" < 2)
.drop("count")
.show(false)
Output:
+---+---+---+-----+
|id1|id2|id3|value|
+---+---+---+-----+
|1 |3 |2 |tom |
|2 |1 |2 |mary |
+---+---+---+-----+
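One caveat: the subset passed to dropDuplicates must include value. As a sketch of what not to do, deduplicating on the keys alone keeps one arbitrary row per key group, so the window count is always 1 and nothing gets filtered out:

// WRONG for this problem: keeps an arbitrary value per (id1, id2, id3)
// group, so every group survives the count < 2 filter
df.dropDuplicates("id1", "id2", "id3")
  .withColumn("count", count("value").over(window))
  .filter($"count" < 2)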
You can use distinct to keep only one row when it is duplicated:
df.distinct
  .withColumn("count", count("value").over(window))
  .filter($"count" < 2)
  .drop("count")
  .show(false)
+---+---+---+-----+
|id1|id2|id3|value|
+---+---+---+-----+
|1 |3 |2 |tom |
|2 |1 |2 |mary |
+---+---+---+-----+
You can also use the groupBy method.
df.groupBy("id1", "id2", "id3", "value")
.agg(first("col1").as("col1"), ...)
.withColumn("count", count("value").over(window))
.filter($"count" < 2)
.drop("count")
.show(false)
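Since the sample data has no columns besides the keys and value, a concrete sketch of that groupBy route (same idea, no placeholder columns) can collapse the exact duplicates with count() and reuse the same window:

// collapse exact duplicates, then count rows per key group over the window
df.groupBy("id1", "id2", "id3", "value").count()
  .withColumn("keys_count", count("value").over(window))
  .filter($"keys_count" < 2)
  .select("id1", "id2", "id3", "value")
  .show(false)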
I have 2 datasets, shown below. I'm trying to find out how many products are associated with each game, i.e. to fill cnt_assoc_prod with a count of the associated products.
scala> df1.show()
gameid | games      | users          | cnt_assoc_prod
-------+------------+----------------+---------------
     1 | cricket    | [111, 121]     |
     2 | basketball | [211]          |
     3 | skating    | [101, 100, 98] |
scala> df2.show()
user | products
-----+----------
  98 | "shampoo"
 100 | "soap"
 101 | "shampoo"
 111 | "shoes"
 121 | "honey"
 211 | "shoes"
I'm trying to iterate through each user in df1's users array and find the corresponding row in df2 by filtering on the matching user column:
df1.map { x => {
  var assoc_products = new Set()
  x.users.foreach(y =>
    assoc_products + df2.filter(z => z.user == y).first().products)
  x.cnt_assoc_prod = assoc_products.size
}}
While applying the filter I get the following exception:
java.lang.NullPointerException
at org.apache.spark.sql.Dataset.logicalPlan(Dataset.scala:784)
at org.apache.spark.sql.Dataset.mapPartitions(Dataset.scala:344)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:307)
I'm using Spark version 1.6.1.
You can't reference df2 inside a function that runs on df1's executors; the Dataset reference is null there, which is what produces the NullPointerException. Instead, explode the users column in df1, join with df2 on the user column, then do the groupBy count:
import org.apache.spark.sql.functions.{col, count, explode}

(df1.withColumn("user", explode(col("users")))
  .join(df2, Seq("user"))
  .groupBy("gameid", "games")
  .agg(count($"products").alias("cnt_assoc_prod"))
).show
+------+----------+--------------+
|gameid| games|cnt_assoc_prod|
+------+----------+--------------+
| 3| skating| 3|
| 2|basketball| 1|
| 1| cricket| 2|
+------+----------+--------------+
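If games whose users have no matching products should still appear with a count of 0, one possible variant (an untested sketch) is a left outer join; count(col) ignores nulls, so unmatched users contribute nothing:

// note: explode still drops rows whose users array is empty
df1.withColumn("user", explode(col("users")))
  .join(df2, Seq("user"), "left_outer")
  .groupBy("gameid", "games")
  .agg(count(col("products")).alias("cnt_assoc_prod"))
  .show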
Given the following input table:
+----+------+-----------+
| id | shop | purchases |
+----+------+-----------+
|  1 |  01  |    20     |
|  1 |  02  |    31     |
|  2 |  03  |     5     |
|  1 |  03  |     3     |
+----+------+-----------+
I would like, grouping by id and ordering by purchases, to obtain the top 2 shops, as follows:
+----+-------+-------+
| id | top_1 | top_2 |
+----+-------+-------+
|  1 |  02   |  01   |
|  2 |  03   |       |
+----+-------+-------+
I'm using Apache Spark 2.0.1, and the first table is the result of other queries and joins on a Dataset. I could probably do this by iterating over the Dataset in plain Java, but I hope there is another way using the Dataset functionality.
My first attempt was the following:
//dataset is already ordered by id, purchases desc
...
Dataset<Row> ds = dataset.repartition(new Column("id"));
ds.foreachPartition(new ForeachPartitionFunction<Row>() {
    @Override
    public void call(Iterator<Row> itrtr) throws Exception {
        int counter = 0;
        while (itrtr.hasNext()) {
            Row row = itrtr.next();
            if (counter < 2) {
                // save it into another Dataset
                counter++;
            }
        }
    }
});
But then I was lost on how to save it into another Dataset. My goal is, at the end, to save the result into a MySQL table.
You can use window functions and pivot. First define a window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, row_number}
val w = Window.partitionBy(col("id")).orderBy(col("purchases").desc)
then add row_number and filter for the top two rows:
val dataset = Seq(
  (1, "01", 20), (1, "02", 31), (2, "03", 5), (1, "03", 3)
).toDF("id", "shop", "purchases")
val topTwo = dataset.withColumn("top", row_number.over(w)).where(col("top") <= 2)
and pivot:
topTwo.groupBy(col("id")).pivot("top", Seq(1, 2)).agg(first("shop"))
with result being:
+---+---+----+
| id| 1| 2|
+---+---+----+
| 1| 02| 01|
| 2| 03|null|
+---+---+----+
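If you want the exact top_1/top_2 headers from the expected output, one small optional extra step is to rename the pivot columns:

topTwo.groupBy(col("id")).pivot("top", Seq(1, 2)).agg(first("shop"))
  .withColumnRenamed("1", "top_1")
  .withColumnRenamed("2", "top_2")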
I'll leave converting the syntax to Java as an exercise for the poster (apart from the static imports for the functions, the rest should be close to identical).
Table Name : Country
-----+--------------+--------------------+-------------------
 id  | country_name | country_short_name | country_full_name
-----+--------------+--------------------+-------------------
 1   | Bagladesh    | BD                 | Bagladesh
 2   | Bagladesh    | BCDD               | sdriij
 3   | India        | IND                | India
In Laravel I update multiple rows using:
Country::where('country_name', '=', "Bagladesh")
    ->update(array(
        'country_short_name' => "BD",
        'country_full_name' => "Bangladesh",
    ));
I want to do the same using Hibernate.
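A minimal sketch of the equivalent bulk update as an HQL query, assuming a mapped Country entity with countryName, countryShortName and countryFullName properties and an open Hibernate Session (the entity and property names here are assumptions; adjust them to your mapping):

// HQL bulk update; entity and property names are assumed from the table above
val updated = session.createQuery(
    "update Country set countryShortName = :shortName, countryFullName = :fullName " +
    "where countryName = :name")
  .setParameter("shortName", "BD")
  .setParameter("fullName", "Bangladesh")
  .setParameter("name", "Bagladesh")
  .executeUpdate() // returns the number of rows updated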
How can I get the values under a specific column in a JTable?
Example:
+----------+----------+
| Column 1 | Column 2 |
+----------+----------+
|    1     |    a     |
|    2     |    b     |
|    3     |    c     |
+----------+----------+
How can I get the values under Column 1, i.e. [1, 2, 3], in the form of some data structure (preferably an array)?
You can do something like this:
TableModel model = table.getModel();
List<Object> list = new ArrayList<>();
for (int i = 0; i < model.getRowCount(); i++) {
    list.add(model.getValueAt(i, 0)); // read each row's value at (model) column index 0
}
// list.toArray() converts the result to an array if needed