I have a dataset dsCustomer that has the customer details, with columns
|customerID|idpt | totalAmount|
|customer1 | H1 | 250 |
|customer2 | H2 | 175 |
|customer3 | H3 | 4000 |
|customer4 | H3 | 9000 |
I have another dataset dsCategory that contains the category based on the sales amount
|categoryID|idpt | borne_min|borne_max|
|A | H2 | 0 |1000 |
|B | H2 | 1000 |5000 |
|C | H2 | 5000 |7000 |
|D | H2 | 7000 |10000 |
|F | H3 | 0 |1000 |
|G | H3 | 1000 |5000 |
|H | H3 | 5000 |7000 |
|I | H3 | 7000 |1000000 |
I would like a result that takes each customer's totalAmount and finds the matching category.
|customerID|idpt |totalAmount|category|
|customer1 | H1 | 250 | null |
|customer2 | H2 | 175 | A |
|customer3 | H3 | 4000 | G |
|customer4 | H3 | 9000 | I |
//udf
public static Column getCategoryAmount(Dataset<Row> ds, Column amountColumn) {
return ds.filter(amountColumn.geq(col("borne_min"))
.and(amountColumn.lt(col("borne_max")))).first().getAs("categoryID");
}
//code to add column to my dataset
dsCustomer.withColumn("category", getCategoryAmount(dsCategory , dsCustomer.col("totalAmount")));
How can I pass the value of a column from my customer dataset to my UDF function? The error says that totalAmount is not contained in the category dataset.
Question: How can I use a map so that, for each row in dsCustomer, I look up its value in dsCategory?
I have tried to join the 2 tables, but it is not working, because dsCustomer should keep the same records and only gain the calculated column picked from dsCategory.
caused by: org.apache.spark.sql.AnalysisException: cannot resolve '`totalAmount`' given input columns: [categoryID,borne_min,borne_max];;
'Filter (('totalAmount>= borne_min#220) && ('totalAmount < borne_max#221))
You have to join the two datasets. withColumn only allows modifications of the same Dataset.
UPDATE
I did not have time earlier to explain in detail what I meant, so here it is. You can join the two dataframes. In your case you need a left join to preserve rows that don't have a matching category.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
cust = [
('customer1', 'H1', 250),
('customer2', 'H2', 175),
('customer3', 'H3', 4000),
('customer4', 'H3', 9000)
]
cust_df = spark.createDataFrame(cust, ['customerID', 'idpt', 'totalAmount'])
cust_df.show()
cat = [
('A', 'H2', 0 , 1000),
('B', 'H2', 1000, 5000),
('C', 'H2', 5000, 7000),
('D', 'H2', 7000, 10000),
('F', 'H3', 0 , 1000),
('G', 'H3', 1000, 5000),
('H', 'H3', 5000, 7000),
('I', 'H3', 7000, 1000000)
]
cat_df = spark.createDataFrame(cat, ['categoryID', 'idpt', 'borne_min', 'borne_max'])
cat_df.show()
cust_df.join(cat_df,
(cust_df.idpt == cat_df.idpt) &
(cust_df.totalAmount >= cat_df.borne_min) &
(cust_df.totalAmount <= cat_df.borne_max)
, how='left') \
.select(cust_df.customerID, cust_df.idpt, cust_df.totalAmount, cat_df.categoryID) \
.show()
Output
+----------+----+-----------+
|customerID|idpt|totalAmount|
+----------+----+-----------+
| customer1| H1| 250|
| customer2| H2| 175|
| customer3| H3| 4000|
| customer4| H3| 9000|
+----------+----+-----------+
+----------+----+---------+---------+
|categoryID|idpt|borne_min|borne_max|
+----------+----+---------+---------+
| A| H2| 0| 1000|
| B| H2| 1000| 5000|
| C| H2| 5000| 7000|
| D| H2| 7000| 10000|
| F| H3| 0| 1000|
| G| H3| 1000| 5000|
| H| H3| 5000| 7000|
| I| H3| 7000| 1000000|
+----------+----+---------+---------+
+----------+----+-----------+----------+
|customerID|idpt|totalAmount|categoryID|
+----------+----+-----------+----------+
| customer1| H1| 250| null|
| customer3| H3| 4000| G|
| customer4| H3| 9000| I|
| customer2| H2| 175| A|
+----------+----+-----------+----------+
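Since the question uses the Java/Scala API, here is a minimal Scala sketch of the same left join (my own adaptation, not part of the answer above), assuming dsCustomer and dsCategory hold the columns shown earlier:
import org.apache.spark.sql.functions.col

// Left join keeps every customer, even those without a matching category (e.g. idpt H1).
val result = dsCustomer.as("c")
  .join(dsCategory.as("cat"),
    col("c.idpt") === col("cat.idpt") &&
      col("c.totalAmount") >= col("cat.borne_min") &&
      col("c.totalAmount") < col("cat.borne_max"),
    "left")
  .select(col("c.customerID"), col("c.idpt"), col("c.totalAmount"),
    col("cat.categoryID").as("category"))
result.show()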
Related
I have the below dataframe in Spark:
+---------+--------------+-------+------------+--------+
|pid | tid |account|date |depid |
+---------+--------------+-------+------------+--------+
| 153|4512 | 30095|11272020 | 0|
| 153|4512 | 30096|11272020 | 30|
| 145|4513 | 40095|11272020 | 0|
| 135|4512 | 30096|11272020 | 0|
| 153|4512 | 30097|11272020 | 0|
| 145|4513 | 30094|11272020 | 0|
+---------+--------------+-------+------------+--------+
I need to group the records by pid, tid and date, so after grouping the dataframe looks like
+---------+--------------+-------+------------+---------+
|pid | tid |account|date |depid |
+---------+--------------+-------+------------+---------+
| 153|4512 | 30095|11272020 | 0 |
| 153|4512 | 30096|11272020 | 30|
| 153|4512 | 30097|11272020 | 0|
| 145|4513 | 40095|11272020 | 0|
| 145|4513 | 30094|11272020 | 0|
| 135|4512 | 30096|11272020 | 0|
+---------+--------------+-------+------------+---------+
After grouping, I need to check whether any record in a group has an account of 30095 or 40095; if so, the depid of every record in that group whose depid is 0 must be replaced with the first 4 digits of that account. The expected outcome is
+---------+--------------+-------+------------+---------+
|pid | tid |account|date |depid |
+---------+--------------+-------+------------+---------+
| 153|4512 | 30095|11272020 | 3009 |
| 153|4512 | 30096|11272020 | 30|
| 153|4512 | 30097|11272020 | 3009|
| 145|4513 | 40095|11272020 | 4009|
| 145|4513 | 30094|11272020 | 4009|
| 135|4512 | 30096|11272020 | 0|
+---------+--------------+-------+------------+---------+
I tried the code below, but it is not working for me
WindowSpec windowSpec = Window.partitionBy("pid","tid","date").orderBy("account");
Column roworder = rank().over(windowSpec).as("rank");
Dataset<Row> df1 = df.select(df.col("*"), roworder);
Dataset<Row> df2 = df1.withColumn("depid1",
        when(df1.col("account").equalTo("40095").and(df1.col("depid").equalTo("0")), 4009)
        .when(df1.col("rank").gt(1).and(df1.col("depid").equalTo("0")), 4009)
        .when(df1.col("account").equalTo("30095").and(df1.col("depid").equalTo("0")), 3009)
        .when(df1.col("rank").gt(1).and(df1.col("depid").equalTo("0")), 3009)
        .otherwise(df1.col("depid"))
    ).orderBy(col("pid").desc()).drop("depid").withColumnRenamed("depid1", "depid");
but it is producing the output below:
+---------+--------------+-------+------------+---------+
|pid | tid |account|date |depid |
+---------+--------------+-------+------------+---------+
| 153|4512 | 30095|11272020 | 3009 |
| 153|4512 | 30096|11272020 | 30|
| 153|4512 | 30097|11272020 | 4009|
| 145|4513 | 40095|11272020 | 4009|
| 145|4513 | 30094|11272020 | 4009|
| 135|4512 | 30096|11272020 | 0|
+---------+--------------+-------+------------+---------+
I am not sure what I am doing incorrectly here.
You will need to convert this to Java. I suggest you use the Scala API; it makes life far easier. Also, you may have different data types.
Here is my alternative, which I see more as a data analysis task. I added some extra records to demonstrate the point and to make it more generic and robust. I do not think your approach is sound enough. Anyway, we can all learn.
So, here goes:
import org.apache.spark.sql.functions._
///...
// More a data analysis problem.
// 1. Gen sample data.
val df = Seq( ( 153, 4512, "30095", "11272020", 0 ),
( 153, 4512, "30096", "11272020", 30 ),
( 153, 4512, "30096", "11272020", 30 ), // extra record
( 145, 4513, "40095", "11272020", 0 ),
( 145, 4513, "40095", "11272020", 0 ), // extra record
( 145, 4513, "40095", "11272020", 200 ), // extra record
( 135, 4512, "30096", "11272020", 0 ),
( 153, 4512, "30097", "11272020", 0 ),
( 145, 4513, "30094", "11272020", 0 )
).toDF("pid","tid","account","date","depid")
df.show()
// 2. Get the groups with accounts of relevance. Note they may have records not needing to be processed.
val dfg = df.filter(df("account").isin("30095", "40095")).select("pid","tid","date").distinct().toDF("pidg", "tidg", "dateg")
dfg.show()
// 3. Get the data that needs to be processed. Take into account performance.
val dfp = df.as("df").join(dfg.as("dfg"), $"df.pid" === $"dfg.pidg" && $"df.tid" === $"dfg.tidg" && $"df.date" === $"dfg.dateg" && $"df.depid" === 0, "inner")
.drop("pidg").drop("tidg").drop("dateg")
dfp.show()
// 4. Get records that need not be processed for later UNION operation.
val res1 = df.exceptAll(dfp)
res1.show()
// 5. Process those records needed.
val res2 = dfp.withColumn("depid2", substring(col("account"), 0, 4).cast("int")).drop("depid").toDF("pid","tid","account","date","depid")
res2.show()
// 6. Final result.
val res = res1.union(res2)
res.show()
The final result, computed in a performant way:
+---+----+-------+--------+-----+
|pid| tid|account| date|depid|
+---+----+-------+--------+-----+
|153|4512| 30096|11272020| 30|
|153|4512| 30096|11272020| 30|
|145|4513| 40095|11272020| 200|
|135|4512| 30096|11272020| 0|
|153|4512| 30095|11272020| 3009|
|145|4513| 40095|11272020| 4009|
|145|4513| 40095|11272020| 4009|
|153|4512| 30097|11272020| 3009|
|145|4513| 30094|11272020| 3009|
+---+----+-------+--------+-----+
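If you want exactly the expected output from the question (propagate the first four digits of the group's 30095/40095 account to every depid = 0 row in that group), a window-based sketch could look like this. This is my own addition, not part of the answer above, and it assumes the same df built in step 1:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// For each (pid, tid, date) group, find the 30095/40095 account, if any,
// and use its first four digits for every row whose depid is 0.
val grp = Window.partitionBy("pid", "tid", "date")
val keyAccount = first(when(col("account").isin("30095", "40095"), col("account")), ignoreNulls = true).over(grp)

val resWindow = df
  .withColumn("keyAccount", keyAccount)
  .withColumn("depid",
    when(col("depid") === 0 && col("keyAccount").isNotNull,
      substring(col("keyAccount"), 1, 4).cast("int"))
      .otherwise(col("depid")))
  .drop("keyAccount")
resWindow.show()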
I have two datasets, as follows:
smoothieDs.show()
|smoothie_id | smoothie | price |
|1 | Tropical | 10 |
|2 | Green veggie | 20 |
and:
ingredientDs.show()
|smoothie | ingredient |
|Tropical | Mango |
|Tropical | Passion fruit |
|Green veggie | Cucumber |
|Green veggie | Kiwi |
I want to join the two datasets so that I get the ingredient information for each smoothie whose price is lower than $15, but keep the smoothies whose price is higher as well, filling in the ingredient field with the string To be communicated.
I tried smoothieDs.join(ingredientDs).filter(col("price").lt(15)) and it gives:
|smoothie_id |price | smoothie | ingredient |
|1 |10 | Tropical | Mango |
|1 |10 | Tropical | Passion fruit |
But my expected result should be:
|smoothie_id |price | smoothie | ingredient |
|1 |10 | Tropical | Mango |
|1 |10 | Tropical | Passion fruit |
|2 |20 | Green veggie | To be communicated |
Is it possible to achieve this using a join directly? If not, what is the best way to achieve it?
You can replace the ingredient based on the price after the join:
import org.apache.spark.sql.functions._
smoothieDs.join(ingredientDs, "smoothie")
.withColumn("ingredient", when('price.lt(15), 'ingredient).otherwise("To be communicated"))
.distinct()
.show()
Output:
+------------+-----------+-----+------------------+
| smoothie|smoothie_id|price| ingredient|
+------------+-----------+-----+------------------+
|Green veggie| 2| 20|To be communicated|
| Tropical| 1| 10| Mango|
| Tropical| 1| 10| Passion fruit|
+------------+-----------+-----+------------------+
Edit: another option would be to filter the ingredient dataset first and then do the join. This avoids using distinct but comes at the price of a second join. Depending on the data, this may or may not be faster.
smoothieDs.join(
ingredientDs.join(smoothieDs.filter('price.lt(15)), Seq("smoothie"), "left_semi"),
Seq("smoothie"), "left_outer")
.na.fill("To be communicated", Seq("ingredient"))
.show()
This is what my dataset looks like:
+---------+------------+-----------------+
| name |request_type| request_group_id|
+---------+------------+-----------------+
|Michael | X | 1020 |
|Michael | X | 1018 |
|Joe | Y | 1018 |
|Sam | X | 1018 |
|Michael | Y | 1021 |
|Sam | X | 1030 |
|Elizabeth| Y | 1035 |
+---------+------------+-----------------+
I want to count the request_types per person (per type) and count the unique request_group_ids.
The result should be the following:
+---------+--------------------+---------------------+--------------------------------+
| name |cnt(request_type(X))| cnt(request_type(Y))| cnt(distinct(request_group_id))|
+---------+--------------------+---------------------+--------------------------------+
|Michael | 2 | 1 | 3 |
|Joe | 0 | 1 | 1 |
|Sam | 2 | 0 | 2 |
|John | 1 | 0 | 1 |
|Elizabeth| 0 | 1 | 1 |
+---------+--------------------+---------------------+--------------------------------+
What I've done so far (this helps derive the first two columns):
msgDataFrame.select(NAME, REQUEST_TYPE)
.groupBy(NAME)
.pivot(REQUEST_TYPE, Lists.newArrayList(X, Y))
.agg(functions.count(REQUEST_TYPE))
.show();
How can I count distinct request_group_ids in this select? Is it possible to do it within the same query?
I think it's possible only by joining two datasets (my current result plus a separate aggregation over distinct request_group_ids).
Example with "countDistinct" ("countDistinct" is not worked over window, replaced with "size","collect_set"):
val groupIdWindow = Window.partitionBy("name")
df.select($"name", $"request_type",
size(collect_set("request_group_id").over(groupIdWindow)).alias("countDistinct"))
.groupBy("name", "countDistinct")
.pivot($"request_type", Seq("X", "Y"))
.agg(count("request_type"))
.show(false)
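A simpler alternative (my own sketch, not part of the answer above) is a single groupBy with conditional counts, assuming the same column names:
import org.apache.spark.sql.functions._

// Count X and Y requests per name and the distinct group ids in one aggregation.
df.groupBy("name")
  .agg(
    count(when($"request_type" === "X", true)).alias("cnt_request_type_X"),
    count(when($"request_type" === "Y", true)).alias("cnt_request_type_Y"),
    countDistinct("request_group_id").alias("cnt_distinct_request_group_id"))
  .show(false)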
I'm building up a series of distribution analyses using the Java Spark library. This is the actual code I'm using to fetch the data from a JSON file and save the output.
Dataset<Row> dataset = spark.read().json("local/foods.json");
dataset.createOrReplaceTempView("cs_food");
List<GenericAnalyticsEntry> menu_distribution= spark
.sql(" ****REQUESTED QUERY ****")
.toJavaRDD()
.map(row -> Triple.of( row.getString(0), BigDecimal.valueOf(row.getLong(1)), BigDecimal.valueOf(row.getLong(2))))
.map(GenericAnalyticsEntry::of)
.collect();
writeObjectAsJsonToHDFS(fs, "/local/output/menu_distribution_new.json", menu_distribution);
The query I'm looking for is based on this structure:
+------------+-------------+------------+------------+
| FIRST_FOOD | SECOND_FOOD | DATE | IS_SPECIAL |
+------------+-------------+------------+------------+
| Pizza | Spaghetti | 11/02/2017 | TRUE |
+------------+-------------+------------+------------+
| Lasagna | Pizza | 12/02/2017 | TRUE |
+------------+-------------+------------+------------+
| Spaghetti | Spaghetti | 13/02/2017 | FALSE |
+------------+-------------+------------+------------+
| Pizza | Spaghetti | 14/02/2017 | TRUE |
+------------+-------------+------------+------------+
| Spaghetti | Lasagna | 15/02/2017 | FALSE |
+------------+-------------+------------+------------+
| Pork | Mozzarella | 16/02/2017 | FALSE |
+------------+-------------+------------+------------+
| Lasagna | Mozzarella | 17/02/2017 | FALSE |
+------------+-------------+------------+------------+
How can I achieve the output below from the code written above?
+------------+--------------------+----------------------+
| FOODS | occurrences(First) | occurrences (Second) |
+------------+--------------------+----------------------+
| Pizza | 2 | 1 |
+------------+--------------------+----------------------+
| Lasagna | 2 | 1 |
+------------+--------------------+----------------------+
| Spaghetti | 2 | 3 |
+------------+--------------------+----------------------+
| Mozzarella | 0 | 2 |
+------------+--------------------+----------------------+
| Pork | 1 | 0 |
+------------+--------------------+----------------------+
I've of course tried to figure out a solution by myself, but had no luck with my attempts. I may be wrong, but I need something like this:
"SELECT (first_food + second_food) as menu, COUNT(first_food), COUNT(second_food) from cs_food GROUP BY menu"
From the example data, this looks like it will produce the output you want:
select
  food as foods,
  coalesce(first_count, 0) as first_count,
  coalesce(second_count, 0) as second_count
from
  (select first_food as food from menus
   union select second_food from menus) as f
left join (
  select first_food, count(*) as first_count from menus
  group by first_food
) as ff on ff.first_food = f.food
left join (
  select second_food, count(*) as second_count from menus
  group by second_food
) as sf on sf.second_food = f.food
;
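Equivalently, for those staying in the DataFrame API, here is a sketch of my own (not part of the answer above), assuming a DataFrame named menus with columns FIRST_FOOD and SECOND_FOOD:
import org.apache.spark.sql.functions._

// Count how often each food appears in first and in second position,
// then outer-join the two counts and fill the gaps with 0.
val firstCounts  = menus.groupBy(col("FIRST_FOOD").as("food")).agg(count(lit(1)).as("occurrences_first"))
val secondCounts = menus.groupBy(col("SECOND_FOOD").as("food")).agg(count(lit(1)).as("occurrences_second"))

firstCounts.join(secondCounts, Seq("food"), "full_outer")
  .na.fill(0L, Seq("occurrences_first", "occurrences_second"))
  .show()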
A simple combination of flatMap and groupBy should do the job, something like this (sorry, I can't check right now whether it is 100% correct):
import spark.sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.sum
val df = Seq(("Pizza", "Pasta"), ("Pizza", "Soup")).toDF("first", "second")
df.flatMap { case Row(first: String, second: String) => Seq((first, 1, 0), (second, 0, 1)) }
  .groupBy("_1")
  .agg(sum("_2").as("occurrences_first"), sum("_3").as("occurrences_second"))
  .show()
I want to randomly select a subset of my data and then limit it to 200 entries. But after using the sample() function, I'm getting duplicate rows, and I don't know why. Let me show you:
DataFrame df= sqlContext.sql("SELECT * " +
" FROM temptable" +
" WHERE conditions");
DataFrame df1 = df.select(df.col("col1"))
.where(df.col("col1").isNotNull())
.distinct()
.orderBy(df.col("col1"));
df1.show();
System.out.println(df1.count());
Up until now, everything is OK. I get the output:
+-----------+
|col1 |
+-----------+
| 10016|
| 10022|
| 100281|
| 10032|
| 100427|
| 100445|
| 10049|
| 10070|
| 10076|
| 10079|
| 10081|
| 10082|
| 100884|
| 10092|
| 10099|
| 10102|
| 10103|
| 101039|
| 101134|
| 101187|
+-----------+
only showing top 20 rows
10512
with 10512 records without duplicates. AND THEN!
df1 = df1.sample(true, 0.5).limit(200);
df1.show();
System.out.println(df1.count());
This returns 200 rows full of duplicates:
+-----------+
|col1 |
+-----------+
| 10022|
| 100445|
| 100445|
| 10049|
| 10079|
| 10079|
| 10081|
| 10081|
| 10082|
| 10092|
| 10102|
| 10102|
| 101039|
| 101134|
| 101134|
| 101134|
| 101345|
| 101345|
| 10140|
| 10141|
+-----------+
only showing top 20 rows
200
Can anyone tell me why? This is driving me crazy. Thank you!
You explicitly ask for a sample with replacement, so there is nothing unexpected about getting duplicates:
public Dataset<T> sample(boolean withReplacement, double fraction)
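If the duplicates are unwanted, sample without replacement instead. A small sketch of my own (assuming a Scala Dataset analogous to df1 above):
import org.apache.spark.sql.functions.rand

// Sample without replacement: each source row can appear at most once.
val sampled = df1.sample(false, 0.5).limit(200)

// Or pick exactly 200 random distinct rows:
val random200 = df1.orderBy(rand()).limit(200)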