I have two datasets as follows:
smoothieDs.show()
|smoothie_id | smoothie | price |
|1 | Tropical | 10 |
|2 | Green veggie | 20 |
and:
ingredientDs.show()
|smoothie | ingredient |
|Tropical | Mango |
|Tropical | Passion fruit |
|Green veggie | Cucumber |
|Green veggie | Kiwi |
I want to join the two datasets so that I get the ingredient information for each smoothie whose price is lower than 15$, but also keep the smoothies whose price is higher and fill their ingredient field with the string To be communicated.
I tried smoothieDs.join(ingredientDs).filter(col("price").lt(15)) and it gives:
|smoothie_id |price | smoothie | ingredient |
|1 |10 | Tropical | Mango |
|1 |10 | Tropical | Passion fruit |
But my expected result should be:
|smoothie_id |price | smoothie | ingredient |
|1 |10 | Tropical | Mango |
|1 |10 | Tropical | Passion fruit |
|2 |20 | Green veggie | To be communicated |
Is it possible to achieve this with a join directly? If not, what is the best way to achieve it?
You can replace the ingredient based on the price after the join:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the 'price symbol syntax outside the shell

smoothieDs.join(ingredientDs, "smoothie")
  .withColumn("ingredient", when('price.lt(15), 'ingredient).otherwise("To be communicated"))
  .distinct()
  .show()
Output:
+------------+-----------+-----+------------------+
| smoothie|smoothie_id|price| ingredient|
+------------+-----------+-----+------------------+
|Green veggie| 2| 20|To be communicated|
| Tropical| 1| 10| Mango|
| Tropical| 1| 10| Passion fruit|
+------------+-----------+-----+------------------+
Edit: another option would be to filter the ingredient dataset first and then do the join. This avoids the distinct but comes at the price of a second join; depending on the data, it may or may not be faster.
smoothieDs.join(
    ingredientDs.join(smoothieDs.filter('price.lt(15)), Seq("smoothie"), "left_semi"),
    Seq("smoothie"), "left_outer")
  .na.fill("To be communicated", Seq("ingredient"))
  .show()
This is what my dataset looks like:
+---------+------------+-----------------+
| name |request_type| request_group_id|
+---------+------------+-----------------+
|Michael | X | 1020 |
|Michael | X | 1018 |
|Joe | Y | 1018 |
|Sam | X | 1018 |
|Michael | Y | 1021 |
|Sam | X | 1030 |
|Elizabeth| Y | 1035 |
+---------+------------+-----------------+
I want to count the number of each request_type per person and also count the unique request_group_ids.
The result should be the following:
+---------+--------------------+---------------------+--------------------------------+
| name |cnt(request_type(X))| cnt(request_type(Y))| cnt(distinct(request_group_id))|
+---------+--------------------+---------------------+--------------------------------+
|Michael | 2 | 1 | 3 |
|Joe | 0 | 1 | 1 |
|Sam | 2 | 0 | 2 |
|John | 1 | 0 | 1 |
|Elizabeth| 0 | 1 | 1 |
+---------+--------------------+---------------------+--------------------------------+
What I've done so far (this derives the first two columns):
msgDataFrame.select(NAME, REQUEST_TYPE)
.groupBy(NAME)
.pivot(REQUEST_TYPE, Lists.newArrayList(X, Y))
.agg(functions.count(REQUEST_TYPE))
.show();
How can I count distinct request_group_ids within this select? Is it possible to do it in the same statement?
I think it's only possible by joining two datasets (my current result plus a separate aggregation of distinct request_group_ids).
Example with "countDistinct" ("countDistinct" is not worked over window, replaced with "size","collect_set"):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_set, count, size}
import spark.implicits._ // for the $"col" syntax

val groupIdWindow = Window.partitionBy("name")
df.select($"name", $"request_type",
    size(collect_set("request_group_id").over(groupIdWindow)).alias("countDistinct"))
  .groupBy("name", "countDistinct")
  .pivot($"request_type", Seq("X", "Y"))
  .agg(count("request_type"))
  .show(false)
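For comparison, the two-step approach mentioned in the question (the pivot result joined with a separate distinct-count aggregation) could look roughly like this in the Java API; just a sketch, assuming a Dataset<Row> df with the columns name, request_type and request_group_id:
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.countDistinct;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// df is assumed to hold the columns name, request_type and request_group_id
// Pivot counts of X and Y per name; fill missing combinations with 0
Dataset<Row> pivoted = df.groupBy("name")
        .pivot("request_type", Arrays.<Object>asList("X", "Y"))
        .agg(count("request_type"))
        .na().fill(0);

// Separate aggregation for the distinct request_group_ids per name
Dataset<Row> distinctGroups = df.groupBy("name")
        .agg(countDistinct("request_group_id").alias("countDistinct"));

pivoted.join(distinctGroups, "name").show(false);
This costs an extra join compared to the window version above, but avoids the window entirely.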
I have a dataset dsCustomer that has the customer details, with columns:
|customerID|idpt | totalAmount|
|customer1 | H1 | 250 |
|customer2 | H2 | 175 |
|customer3 | H3 | 4000 |
|customer4 | H3 | 9000 |
I have another dataset dsCategory that contains the categories based on the sales amount:
|categoryID|idpt | borne_min|borne_max|
|A | H2 | 0 |1000 |
|B | H2 | 1000 |5000 |
|C | H2 | 5000 |7000 |
|D | H2 | 7000 |10000 |
|F | H3 | 0 |1000 |
|G | H3 | 1000 |5000 |
|H | H3 | 5000 |7000 |
|I | H3 | 7000 |1000000 |
I would like a result that takes each customer's totalAmount and finds its category:
|customerID|idpt |totalAmount|category|
|customer1 | H1 | 250 | null |
|customer2 | H2 | 175 | A |
|customer3 | H3 | 4000 | G |
|customer4 | H3 | 9000 | I |
//udf
public static Column getCategoryAmount(Dataset<Row> ds, Column amountColumn) {
return ds.filter(amountColumn.geq(col("borne_min"))
.and(amountColumn.lt(col("borne_max")))).first().getAs("categoryID");
}
//code to add column to my dataset
dsCustomer.withColumn("category", getCategoryAmount(dsCategory , dsCustomer.col("totalAmount")));
How can I pass the value of a column from my customer dataset to my UDF function? The error says that totalAmount is not present in the category dataset.
Question: how can I map over each row in dsCustomer and look up its value in dsCategory?
I have tried to join the 2 tables, but it is not working, because dsCustomer should keep exactly the same records, with only the calculated column from dsCategory added.
caused by: org.apache.spark.sql.AnalysisException: cannot resolve '`totalAmount`' given input columns: [categoryID,borne_min,borne_max];;
'Filter (('totalAmount>= borne_min#220) && ('totalAmount < borne_max#221))
You have to join the two datasets. withColumn only allows modifications of the same Dataset.
UPDATE
I did not have time earlier to explain in detail what I meant, so here it is: you can join the two DataFrames. In your case you need a left join to preserve the rows which don't have a matching category.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
cust = [
('customer1', 'H1', 250),
('customer2', 'H2', 175),
('customer3', 'H3', 4000),
('customer4', 'H3', 9000)
]
cust_df = spark.createDataFrame(cust, ['customerID', 'idpt', 'totalAmount'])
cust_df.show()
cat = [
('A', 'H2', 0 , 1000),
('B', 'H2', 1000, 5000),
('C', 'H2', 5000, 7000),
('D', 'H2', 7000, 10000),
('F', 'H3', 0 , 1000),
('G', 'H3', 1000, 5000),
('H', 'H3', 5000, 7000),
('I', 'H3', 7000, 1000000)
]
cat_df = spark.createDataFrame(cat, ['categoryID', 'idpt', 'borne_min', 'borne_max'])
cat_df.show()
# Left join on idpt plus the amount range; customers without a matching
# category keep a null categoryID
cust_df.join(cat_df,
             (cust_df.idpt == cat_df.idpt) &
             (cust_df.totalAmount >= cat_df.borne_min) &
             (cust_df.totalAmount <= cat_df.borne_max),
             how='left') \
    .select(cust_df.customerID, cust_df.idpt, cust_df.totalAmount, cat_df.categoryID) \
    .show()
Output
+----------+----+-----------+
|customerID|idpt|totalAmount|
+----------+----+-----------+
| customer1| H1| 250|
| customer2| H2| 175|
| customer3| H3| 4000|
| customer4| H3| 9000|
+----------+----+-----------+
+----------+----+---------+---------+
|categoryID|idpt|borne_min|borne_max|
+----------+----+---------+---------+
| A| H2| 0| 1000|
| B| H2| 1000| 5000|
| C| H2| 5000| 7000|
| D| H2| 7000| 10000|
| F| H3| 0| 1000|
| G| H3| 1000| 5000|
| H| H3| 5000| 7000|
| I| H3| 7000| 1000000|
+----------+----+---------+---------+
+----------+----+-----------+----------+
|customerID|idpt|totalAmount|categoryID|
+----------+----+-----------+----------+
| customer1| H1| 250| null|
| customer3| H3| 4000| G|
| customer4| H3| 9000| I|
| customer2| H2| 175| A|
+----------+----+-----------+----------+
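Since the question itself uses the Java Dataset API, the same left join could be sketched there as well; only a sketch, assuming dsCustomer and dsCategory are the Dataset<Row> objects shown in the question:
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Left join on idpt plus the amount range; customers without a matching
// category keep a null category
Dataset<Row> result = dsCustomer.alias("c")
        .join(dsCategory.alias("cat"),
                col("c.idpt").equalTo(col("cat.idpt"))
                        .and(col("c.totalAmount").geq(col("cat.borne_min")))
                        .and(col("c.totalAmount").leq(col("cat.borne_max"))),
                "left")
        .select(col("c.customerID"), col("c.idpt"), col("c.totalAmount"),
                col("cat.categoryID").alias("category"));

result.show();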
I'm building a series of distribution analyses using the Java Spark library. This is the actual code I'm using to fetch the data from a JSON file and save the output.
Dataset<Row> dataset = spark.read().json("local/foods.json");
dataset.createOrReplaceTempView("cs_food");
List<GenericAnalyticsEntry> menu_distribution = spark
.sql(" ****REQUESTED QUERY ****")
.toJavaRDD()
.map(row -> Triple.of( row.getString(0), BigDecimal.valueOf(row.getLong(1)), BigDecimal.valueOf(row.getLong(2))))
.map(GenericAnalyticsEntry::of)
.collect();
writeObjectAsJsonToHDFS(fs, "/local/output/menu_distribution_new.json", menu_distribution);
The query I'm looking for is based on this structure:
+------------+-------------+------------+------------+
| FIRST_FOOD | SECOND_FOOD | DATE | IS_SPECIAL |
+------------+-------------+------------+------------+
| Pizza | Spaghetti | 11/02/2017 | TRUE |
+------------+-------------+------------+------------+
| Lasagna | Pizza | 12/02/2017 | TRUE |
+------------+-------------+------------+------------+
| Spaghetti | Spaghetti | 13/02/2017 | FALSE |
+------------+-------------+------------+------------+
| Pizza | Spaghetti | 14/02/2017 | TRUE |
+------------+-------------+------------+------------+
| Spaghetti | Lasagna | 15/02/2017 | FALSE |
+------------+-------------+------------+------------+
| Pork | Mozzarella | 16/02/2017 | FALSE |
+------------+-------------+------------+------------+
| Lasagna | Mozzarella | 17/02/2017 | FALSE |
+------------+-------------+------------+------------+
How can I achieve the output below from the code above?
+------------+--------------------+----------------------+
| FOODS | occurrences(First) | occurrences (Second) |
+------------+--------------------+----------------------+
| Pizza | 2 | 1 |
+------------+--------------------+----------------------+
| Lasagna | 2 | 1 |
+------------+--------------------+----------------------+
| Spaghetti | 2 | 3 |
+------------+--------------------+----------------------+
| Mozzarella | 0 | 2 |
+------------+--------------------+----------------------+
| Pork | 1 | 0 |
+------------+--------------------+----------------------+
I've of course tried to figure out a solution by myself, but had no luck. I may be wrong, but I think I need something like this:
"SELECT (first_food + second_food) as menu, COUNT(first_food), COUNT(second_food) from cs_food GROUP BY menu"
From the example data, this looks like it will produce the output you want:
select
    f.food as foods,
    coalesce(ff.first_count, 0) as first_count,
    coalesce(sf.second_count, 0) as second_count
from
    (select first_food as food from menus
     union select second_food from menus) as f
left join (
    select first_food, count(*) as first_count from menus
    group by first_food
) as ff on ff.first_food = f.food
left join (
    select second_food, count(*) as second_count from menus
    group by second_food
) as sf on sf.second_food = f.food
;
A simple combination of flatMap and groupBy should do the job, like this (sorry, can't check that it's 100% correct right now):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.sum
import spark.sqlContext.implicits._
val df = Seq(("Pizza", "Pasta"), ("Pizza", "Soup")).toDF("first", "second")
df.flatMap { case Row(first: String, second: String) => Seq((first, 1, 0), (second, 0, 1)) }
  .groupBy("_1")
  .agg(sum("_2").as("occurrences_first"), sum("_3").as("occurrences_second"))
  .show()
I want to randomly select a subset of my data and then limit it to 200 entries. But after using the sample() function, I'm getting duplicate rows, and I don't know why. Let me show you:
DataFrame df= sqlContext.sql("SELECT * " +
" FROM temptable" +
" WHERE conditions");
DataFrame df1 = df.select(df.col("col1"))
.where(df.col("col1").isNotNull())
.distinct()
.orderBy(df.col("col1"));
df1.show();
System.out.println(df1.count());
Up until now, everything is OK. I get the output:
+-----------+
|col1 |
+-----------+
| 10016|
| 10022|
| 100281|
| 10032|
| 100427|
| 100445|
| 10049|
| 10070|
| 10076|
| 10079|
| 10081|
| 10082|
| 100884|
| 10092|
| 10099|
| 10102|
| 10103|
| 101039|
| 101134|
| 101187|
+-----------+
only showing top 20 rows
10512
with 10512 records without duplicates. AND THEN!
df1 = df1.sample(true, 0.5).limit(200);
df1.show();
System.out.println(df1.count());
This returns 200 rows full of duplicates:
+-----------+
|col1 |
+-----------+
| 10022|
| 100445|
| 100445|
| 10049|
| 10079|
| 10079|
| 10081|
| 10081|
| 10082|
| 10092|
| 10102|
| 10102|
| 101039|
| 101134|
| 101134|
| 101134|
| 101345|
| 101345|
| 10140|
| 10141|
+-----------+
only showing top 20 rows
200
Can anyone tell me why? This is driving me crazy. Thank you!
You explicitly ask for a sample with replacement, so there is nothing unexpected about getting duplicates:
public Dataset<T> sample(boolean withReplacement, double fraction)
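If duplicates are not wanted, sample without replacement instead. A minimal sketch based on the snippet in the question (note that the fraction is only approximate, so limit(200) can return fewer rows if the sample comes up short):
// Sample without replacement, then cap the result at 200 rows
df1 = df1.sample(false, 0.5).limit(200);
df1.show();
System.out.println(df1.count());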
Transactions_Table:
+---------+--------+-------------+--------------+-----+
| DocType | SFCode | Productname | WarrantyCode | QTY |
+---------+--------+-------------+--------------+-----+
| FP | 12 | Item | 1111-01 | 100 | -100
| FP | 12 | Item | 2222-22 | 200 |
| FP | 12 | Item | 3333-33 | 350 | -350
| LP | 12 | Item | 4444-44 | 10 |
| LP | 12 | Item | 5555-55 | 20 |
| LP | 12 | Item | 6666-66 | 35 | -35
| CAS | 12 | Item | 1111-01 | 50 | -(50 Left, show)
| CRS | 12 | Item | 3333-33 | 120 | -(230 Left, show)
| CRS | 12 | Item | 6666-66 | 35 | -(0 Left, no show)
| FPR | 12 | Item | 1111-01 | 10 | -(40 Left, show)
| LPR | 12 | Item | 5555-55 | 20 | -(0 Left, no show)
| CSR | 12 | Item | 1111-01 | 5 | -(50+5 Left, show)
| CRR | 12 | Item | 6666-66 | 5 | -(Got back 5, show)
+---------+--------+-------------+--------------+-----+
KEY:
FP: Foreign Purchase
LP: Local Purchase
CAS: Cash Sale
CRS: Credit Sale
FPR: Foreign Purchase Return
LPR: Local Purchase Return
CSR: Cash Sale Return
CRR: Credit Sale Return
There are many products, but for now I am focusing on a single SFCode, "12".
QTY is the physical stock PRESENT in the store, and the DocTypes are the transactions.
There are 2 things I need to do with this table:
Get the current stock, which is (FP+LP+CSR+CRR) - (FPR+LPR+CAS+CRS). Note: there may be no transaction of a particular DocType.
Get the warranty code(s) of a product that has not been sold out for a particular warranty code. Go from top to bottom through the table's last (unnamed) column and you will get the idea.
Please suggest Java-MySql statement(s) that will help me achieve this result. Any help is appreciated.
Try something like this for #1:
SELECT SFCode, SUM(FP+LP+CSR+CRR-FPR-LPR-CAS-CRS) AS Total FROM
(SELECT SFCode,
  SUM(IF(DocType = "FP",  QTY, 0)) AS FP,
  SUM(IF(DocType = "LP",  QTY, 0)) AS LP,
  SUM(IF(DocType = "CSR", QTY, 0)) AS CSR,
  SUM(IF(DocType = "CRR", QTY, 0)) AS CRR,
  SUM(IF(DocType = "FPR", QTY, 0)) AS FPR,
  SUM(IF(DocType = "LPR", QTY, 0)) AS LPR,
  SUM(IF(DocType = "CAS", QTY, 0)) AS CAS,
  SUM(IF(DocType = "CRS", QTY, 0)) AS CRS
FROM Transactions_Table
WHERE SFCode = "12"
GROUP BY DocType) AS t;
This is my shot at #2: (This assumes SFCode isn't an integer)
SELECT a.SFCode, a.WarrantyCode, (a.QTY - COALESCE(b.QTY, 0)) AS Stock FROM
(SELECT SFCode, WarrantyCode, SUM(QTY) AS QTY
 FROM Transactions_Table
 WHERE SFCode = "12"
 AND DocType IN ('FP','LP','CSR','CRR')
 GROUP BY SFCode, WarrantyCode) AS a
LEFT JOIN
(SELECT SFCode, WarrantyCode, SUM(QTY) AS QTY
 FROM Transactions_Table
 WHERE SFCode = "12"
 AND DocType IN ('FPR','LPR','CAS','CRS')
 GROUP BY SFCode, WarrantyCode) AS b
ON a.SFCode = b.SFCode AND a.WarrantyCode = b.WarrantyCode;
Can't really test this myself right now but this should at least give you an idea.