I have a dataset ds like this:
id1 | id2 | id3 | value |
1 | 1 | 2 | tom |
1 | 1 | 2 | tim |
1 | 3 | 2 | tom |
1 | 3 | 2 | tom |
2 | 1 | 2 | mary |
I want to drop every key group (id1, id2, id3) that contains more than one distinct value (i.e. rows 1 and 2), but collapse a group whose duplicated rows all share the same value into a single row (i.e. rows 3 and 4). The expected output is:
id1 | id2 | id3 | value |
1 | 3 | 2 | tom |
2 | 1 | 2 | mary |
Here rows 1 and 2 should be removed because their key group has two different values. Rows 3 and 4 should be collapsed into one row, instead of being removed, because their value is the same.
I tried to achieve this using:
val df = Seq(
(1, 1, 2, "tom"),
(1, 1, 2, "tim"),
(1, 3, 2, "tom"),
(1, 3, 2, "tom"),
(2, 1, 2, "mary")
).toDF("id1", "id2", "id3", "value")
val window = Window.partitionBy("id1", "id2", "id3")
df.distinct().withColumn("count", count("value").over(window))
.filter($"count" < 2)
This returns the result I want. However, I would like to use groupBy() instead of Window to achieve the same thing, and I don't know how to count while doing the groupBy().
Here is how you can do it with the groupBy, count and first functions:
df.distinct() // deduplicate full rows first, as in the question's Window version
.groupBy("id1", "id2", "id3")
.agg(count("value").as("count"), first("value").as("value"))
.filter($"count" < 2)
|1 |3 |2 |tom |
|2 |1 |2 |mary |
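To sanity-check what the distinct + groupBy + count pipeline above is meant to return, the same logic can be modelled on a plain in-memory list. This is only a sketch in plain Java, not Spark code. The class name KeepUnambiguousKeys and the row layout (a list of four strings) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: an in-memory model of distinct + groupBy + count.
class KeepUnambiguousKeys {

    // Each row is {id1, id2, id3, value}. Returns one row per key whose
    // group holds exactly one distinct value, in first-seen key order.
    static List<List<String>> filter(List<List<String>> rows) {
        // Group by (id1, id2, id3); the Set plays the role of distinct().
        Map<List<String>, Set<String>> valuesByKey = new LinkedHashMap<>();
        for (List<String> row : rows) {
            List<String> key = new ArrayList<>(row.subList(0, 3));
            valuesByKey.computeIfAbsent(key, k -> new LinkedHashSet<>()).add(row.get(3));
        }
        // Keep only keys with count < 2, i.e. a single distinct value.
        List<List<String>> result = new ArrayList<>();
        for (Map.Entry<List<String>, Set<String>> e : valuesByKey.entrySet()) {
            if (e.getValue().size() < 2) {
                List<String> out = new ArrayList<>(e.getKey());
                out.add(e.getValue().iterator().next());
                result.add(out);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("1", "1", "2", "tom"),
            Arrays.asList("1", "1", "2", "tim"),
            Arrays.asList("1", "3", "2", "tom"),
            Arrays.asList("1", "3", "2", "tom"),
            Arrays.asList("2", "1", "2", "mary"));
        System.out.println(filter(rows)); // [[1, 3, 2, tom], [2, 1, 2, mary]]
    }
}
```

If the values were collected into a list instead of a set, the (1, 3, 2) group would have a count of 2 and be dropped, which is why the duplicates have to be removed before counting.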
I have a dataset ds like this:
id1 | id2 | id3 | value |
1 | 1 | 2 | tom |
1 | 1 | 2 | tim |
1 | 3 | 2 | tom |
1 | 3 | 2 | tom |
2 | 1 | 2 | mary |
I want to drop every key group (id1, id2, id3) that contains more than one distinct value (i.e. rows 1 and 2), but collapse a group whose duplicated rows all share the same value into a single row (i.e. rows 3 and 4). The expected output is:
id1 | id2 | id3 | value |
1 | 3 | 2 | tom |
2 | 1 | 2 | mary |
Here rows 1 and 2 should be removed because their key group has two different values. Rows 3 and 4 should be collapsed into one row, instead of being removed, because their value is the same.
I tried to achieve this using:
val df = Seq(
(1, 1, 2, "tom"),
(1, 1, 2, "tim"),
(1, 3, 2, "tom"),
(1, 3, 2, "tom"),
(2, 1, 2, "mary")
).toDF("id1", "id2", "id3", "value")
val window = Window.partitionBy("id1", "id2", "id3")
df.withColumn("count", count("value").over(window))
.filter($"count" < 2)
Here is the related question:
Spark: remove all duplicated lines
But it does not work as expected, because it removes all the duplicated rows, including rows 3 and 4.
The reason I want to do this is to join with another dataset, without adding information from this dataset when a key group has multiple names.
You can drop duplicates before counting, which leaves a single record for each fully identical row:
df.dropDuplicates() // with no arguments, all columns are compared
.withColumn("count", count("value").over(window))
.filter($"count" < 2)
You can also specify the fields to be checked for duplicates:
df.dropDuplicates("id1", "id2", "id3", "value")
.withColumn("count", count("value").over(window))
.filter($"count" < 2)
|1 |3 |2 |tom |
|2 |1 |2 |mary |
You can use distinct to keep only one row when a row is duplicated:
df.distinct()
.withColumn("count", count("value").over(window))
.filter($"count" < 2)
|1 |3 |2 |tom |
|2 |1 |2 |mary |
You can also use the groupBy method.
df.groupBy("id1", "id2", "id3", "value")
.agg(first("col1").as("col1"), ...)
.withColumn("count", count("value").over(window))
.filter($"count" < 2)
I have 2 datasets with me as shown below. I'm trying to find out how many products are associated with each game. Basically, I'm trying to keep a count of the number of products associated.
scala> df1.show()
gameid | games | users | cnt_assoc_prod
1 | cricket |[111, 121] |
2 | basketball|[211] |
3 | skating |[101, 100, 98] |
scala> df2.show()
user | products
98 | "shampoo"
100 | "soap"
101 | "shampoo"
111 | "shoes"
121 | "honey"
211 | "shoes"
I'm trying to iterate through each of df1's users from the array and find the corresponding row in df2 by applying the filter on column matching the user.
df1.map{x => {
var assoc_products = new Set()
x.users.foreach(y => assoc_products + df2.filter(z => z.user == y).first().
x.cnt_assoc_prod = assoc_products.size
While applying the filter I get the following exception:
at org.apache.spark.sql.Dataset.logicalPlan(Dataset.scala:784)
at org.apache.spark.sql.Dataset.mapPartitions(Dataset.scala:344)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:307)
I'm using spark version 1.6.1.
You can explode the users column in df1, join with df2 on the user column, then do the groupBy count:
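import org.apache.spark.sql.functions.{col, count, explode}
df1.withColumn("user", explode(col("users"))) // one row per (game, user)
.join(df2, Seq("user"))
.groupBy("gameid", "games")
.agg(count("products").as("cnt_assoc_prod")) // rows per game after the join; matches the counts shown below
.show()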
|gameid| games|cnt_assoc_prod|
| 3| skating| 3|
| 2|basketball| 1|
| 1| cricket| 2|
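The explode, join and count steps can also be traced on local collections. The following is a plain-Java sketch, not Spark code. The class name ProductsPerGame is hypothetical, and modelling df2 as a map assumes each user has exactly one product, as in the sample data.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: an in-memory model of explode + inner join + groupBy count.
class ProductsPerGame {

    // usersByGame models df1 (game -> users array);
    // productByUser models df2 (user -> product, assumed one product per user).
    static Map<String, Integer> count(Map<String, List<Integer>> usersByGame,
                                      Map<Integer, String> productByUser) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> game : usersByGame.entrySet()) {
            int matched = 0;
            for (Integer user : game.getValue()) {      // "explode": one step per user
                if (productByUser.containsKey(user)) {  // inner join on user
                    matched++;                          // count every joined row
                }
            }
            if (matched > 0) {                          // an inner join drops games with no match
                counts.put(game.getKey(), matched);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> usersByGame = new LinkedHashMap<>();
        usersByGame.put("cricket", Arrays.asList(111, 121));
        usersByGame.put("basketball", Arrays.asList(211));
        usersByGame.put("skating", Arrays.asList(101, 100, 98));

        Map<Integer, String> productByUser = new HashMap<>();
        productByUser.put(98, "shampoo");
        productByUser.put(100, "soap");
        productByUser.put(101, "shampoo");
        productByUser.put(111, "shoes");
        productByUser.put(121, "honey");
        productByUser.put(211, "shoes");

        System.out.println(count(usersByGame, productByUser));
        // {cricket=2, basketball=1, skating=3}
    }
}
```

Note that this counts joined rows, so skating gets 3 even though two of its users share "shampoo". Counting distinct products instead, as the Set in the question suggests, would give 2 for skating.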
Given the following input table:
| id | shop | purchases|
| 1 | 01 | 20 |
| 1 | 02 | 31 |
| 2 | 03 | 5 |
| 1 | 03 | 3 |
Grouping by id and ranking by purchases, I would like to obtain the top 2 shops for each id, as follows:
| id | top_1 | top_2|
| 1 | 02 | 01 |
| 2 | 03 | |
I'm using Apache Spark 2.0.1, and the first table is the result of other queries and joins performed on a Dataset. I could maybe do this by iterating over the Dataset in plain Java, but I hope there is another way that uses the Dataset functionality.
My first attempt was the following:
//dataset is already ordered by id, purchases desc
Dataset<Row> ds = dataset.repartition(new Column("id"));
ds.foreachPartition(new ForeachPartitionFunction<Row>() {
    public void call(Iterator<Row> itrtr) throws Exception {
        int counter = 0;
        while (itrtr.hasNext()) {
            Row row = itrtr.next();
            if (counter < 2) {
                //save it into another Dataset
            }
            counter++;
        }
    }
});
But then I was lost on how to save it into another Dataset. My goal, in the end, is to save the result into a MySQL table.
Using window functions and pivot you can define a window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, row_number}
val w = Window.partitionBy(col("id")).orderBy(col("purchases").desc)
add row_number and keep only the top two rows of each id:
val dataset = Seq(
(1, "01", 20), (1, "02", 31), (2, "03", 5), (1, "03", 3)
).toDF("id", "shop", "purchases")
val topTwo = dataset.withColumn("top", row_number.over(w)).where(col("top") <= 2)
and pivot:
topTwo.groupBy(col("id")).pivot("top", Seq(1, 2)).agg(first("shop"))
with result being:
| id| 1| 2|
| 1| 02| 01|
| 2| 03|null|
I'll leave converting syntax to Java as an exercise for the poster (excluding import static for functions the rest should be close to identical).
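For checking the expected result independently of Spark, the rank-then-pivot logic can be modelled on a local list. This is a plain-Java sketch, not the Java translation of the Spark code above. The class TopTwoShops and its nested Purchase type are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: an in-memory model of row_number over a window + pivot.
class TopTwoShops {

    static final class Purchase {
        final int id;
        final String shop;
        final int purchases;

        Purchase(int id, String shop, int purchases) {
            this.id = id;
            this.shop = shop;
            this.purchases = purchases;
        }
    }

    // Returns id -> [top_1, top_2]; a missing rank is null, like the pivot output.
    static Map<Integer, List<String>> topTwo(List<Purchase> rows) {
        // Order by purchases descending, mirroring orderBy(col("purchases").desc).
        List<Purchase> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingInt((Purchase p) -> p.purchases).reversed());

        Map<Integer, List<String>> ranked = new TreeMap<>();
        for (Purchase p : sorted) {
            List<String> shops = ranked.computeIfAbsent(p.id, k -> new ArrayList<>());
            if (shops.size() < 2) {   // same effect as row_number <= 2
                shops.add(p.shop);
            }
        }
        // Pad to two columns so every id has a top_1 and a top_2 slot.
        for (List<String> shops : ranked.values()) {
            while (shops.size() < 2) {
                shops.add(null);
            }
        }
        return ranked;
    }

    public static void main(String[] args) {
        List<Purchase> rows = Arrays.asList(
            new Purchase(1, "01", 20), new Purchase(1, "02", 31),
            new Purchase(2, "03", 5), new Purchase(1, "03", 3));
        System.out.println(topTwo(rows)); // {1=[02, 01], 2=[03, null]}
    }
}
```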
Table Name : Country
id | country_name | country_short_name | country_full_name
1 | Bagladesh | BD |Bagladesh
2 | Bagladesh | BCDD |sdriij
3 | India | IND |India
In Laravel I update multiple rows using:
Country::where('country_name', '=', "Bagladesh")
->update(array('country_short_name' => "BD",
'country_full_name' => "Bangladesh",
));
I want to do the same using Hibernate.
How can I get the values under a specific column in a JTable?
Example:
| Column 1 | Column 2 |
| 1 | a |
| 2 | b |
| 3 | c |
How can I get the values under Column 1, that is [1, 2, 3], in the form of some data structure (preferably an array)?
You can do something like this:
List<Object> list = new ArrayList<>(); // needs java.util.List and java.util.ArrayList
for (int i = 0; i < table.getModel().getRowCount(); i++) {
    list.add(table.getModel().getValueAt(i, 0)); // value of each row at column index 0
}
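Since the question prefers an array, here is a small self-contained sketch of the same loop working on the TableModel directly and converting the result with toArray(). The class name ColumnValues is hypothetical. With a real JTable you would pass table.getModel() instead of the DefaultTableModel built here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.swing.table.DefaultTableModel;
import javax.swing.table.TableModel;

// Hypothetical helper: collects every value of one column from a TableModel.
class ColumnValues {

    static List<Object> column(TableModel model, int columnIndex) {
        List<Object> values = new ArrayList<>();
        for (int row = 0; row < model.getRowCount(); row++) {
            values.add(model.getValueAt(row, columnIndex));
        }
        return values;
    }

    public static void main(String[] args) {
        // Same data as the example table in the question.
        TableModel model = new DefaultTableModel(
            new Object[][] {{1, "a"}, {2, "b"}, {3, "c"}},
            new Object[] {"Column 1", "Column 2"});

        Object[] firstColumn = column(model, 0).toArray();
        System.out.println(Arrays.toString(firstColumn)); // [1, 2, 3]
    }
}
```

Note that the indices are model indices. If the table is sorted or its columns have been reordered by the user, the view order can differ from the model order.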