Grouping & data wrangling on certain conditions in a Spark dataframe - java

I have the below dataframe in Spark:
+---+----+-------+--------+-----+
|pid| tid|account|    date|depid|
+---+----+-------+--------+-----+
|153|4512|  30095|11272020|    0|
|153|4512|  30096|11272020|   30|
|145|4513|  40095|11272020|    0|
|135|4512|  30096|11272020|    0|
|153|4512|  30097|11272020|    0|
|145|4513|  30094|11272020|    0|
+---+----+-------+--------+-----+
I need to group the records by pid, tid and date, so that after grouping the dataframe looks like:
+---+----+-------+--------+-----+
|pid| tid|account|    date|depid|
+---+----+-------+--------+-----+
|153|4512|  30095|11272020|    0|
|153|4512|  30096|11272020|   30|
|153|4512|  30097|11272020|    0|
|145|4513|  40095|11272020|    0|
|145|4513|  30094|11272020|    0|
|135|4512|  30096|11272020|    0|
+---+----+-------+--------+-----+
After grouping, I need to check whether any record in a group has an account of 30095 or 40095; if so, I need to replace the depid of every record in that group whose depid is 0 with the first 4 digits of that account. The expected outcome is:
+---+----+-------+--------+-----+
|pid| tid|account|    date|depid|
+---+----+-------+--------+-----+
|153|4512|  30095|11272020| 3009|
|153|4512|  30096|11272020|   30|
|153|4512|  30097|11272020| 3009|
|145|4513|  40095|11272020| 4009|
|145|4513|  30094|11272020| 4009|
|135|4512|  30096|11272020|    0|
+---+----+-------+--------+-----+
I tried the below code but it is not working for me:
WindowSpec windowSpec = Window.partitionBy("pid", "tid", "date").orderBy("account");
Column roworder = rank().over(windowSpec).as("rank");
Dataset<Row> df1 = df.select(df.col("*"), roworder);
Dataset<Row> df2 = df1.withColumn("depid1",
        when(df1.col("account").equalTo("40095").and(df1.col("depid").equalTo("0")), 4009)
        .when(df1.col("rank").gt(1).and(df1.col("depid").equalTo("0")), 4009)
        .when(df1.col("account").equalTo("30095").and(df1.col("depid").equalTo("0")), 3009)
        .when(df1.col("rank").gt(1).and(df1.col("depid").equalTo("0")), 3009)
        .otherwise(df1.col("depid")))
    .orderBy(col("pid").desc()).drop("depid").withColumnRenamed("depid1", "depid");
but it is producing the below output:
+---+----+-------+--------+-----+
|pid| tid|account|    date|depid|
+---+----+-------+--------+-----+
|153|4512|  30095|11272020| 3009|
|153|4512|  30096|11272020|   30|
|153|4512|  30097|11272020| 4009|
|145|4513|  40095|11272020| 4009|
|145|4513|  30094|11272020| 4009|
|135|4512|  30096|11272020|    0|
+---+----+-------+--------+-----+
I am not sure what I am doing incorrectly here.

You will need to convert this to Java (a rough Java sketch is included after the output below). I suggest you use the Scala API, as it makes life far easier. Also, you may have different data types.
Here is my alternative, which I see more as a data analysis task. I added some extra records to demonstrate the point and to make it more generic and robust. I do not think your approach is sound enough: the when conditions only look at the current row and its rank, so they cannot tell which qualifying account (30095 or 40095) is actually present elsewhere in the group. Anyway, we can all learn.
So, here goes:
import org.apache.spark.sql.functions._
///...
// More of a data analysis problem.
// 1. Gen sample data.
val df = Seq( ( 153, 4512, "30095", "11272020", 0 ),
( 153, 4512, "30096", "11272020", 30 ),
( 153, 4512, "30096", "11272020", 30 ), // extra record
( 145, 4513, "40095", "11272020", 0 ),
( 145, 4513, "40095", "11272020", 0 ), // extra record
( 145, 4513, "40095", "11272020", 200 ), // extra record
( 135, 4512, "30096", "11272020", 0 ),
( 153, 4512, "30097", "11272020", 0 ),
( 145, 4513, "30094", "11272020", 0 )
).toDF("pid","tid","account","date","depid")
df.show()
// 2. Get the groups with accounts of relevance. Note they may have records not needing to be processed.
val dfg = df.filter(df("account").isin("30095", "40095")).select("pid","tid","date").distinct().toDF("pidg", "tidg", "dateg")
dfg.show()
// 3. Get the data that needs to be processed. Take into account performance.
val dfp = df.as("df").join(dfg.as("dfg"), $"df.pid" === $"dfg.pidg" && $"df.tid" === $"dfg.tidg" && $"df.date" === $"dfg.dateg" && $"df.depid" === 0, "inner")
.drop("pidg").drop("tidg").drop("dateg")
dfp.show()
// 4. Get records that need not be processed for later UNION operation.
val res1 = df.exceptAll(dfp)
res1.show()
// 5. Process those records needed.
val res2 = dfp.withColumn("depid2", substring(col("account"), 0, 4).cast("int")).drop("depid").toDF("pid","tid","account","date","depid")
res2.show()
// 6. Final result.
val res = res1.union(res2)
res.show()
This finally results in the following, computed in a performant way:
+---+----+-------+--------+-----+
|pid| tid|account| date|depid|
+---+----+-------+--------+-----+
|153|4512| 30096|11272020| 30|
|153|4512| 30096|11272020| 30|
|145|4513| 40095|11272020| 200|
|135|4512| 30096|11272020| 0|
|153|4512| 30095|11272020| 3009|
|145|4513| 40095|11272020| 4009|
|145|4513| 40095|11272020| 4009|
|153|4512| 30097|11272020| 3009|
|145|4513| 30094|11272020| 3009|
+---+----+-------+--------+-----+
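For the Java API, a rough, untested sketch of the same steps 2 to 6 could look like the one below. It assumes df is a Dataset<Row>, that account is a string column as in the sample data above, and Spark 2.4+ for exceptAll; adjust the casts and column types to your actual schema.

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// 2. Groups that contain at least one relevant account.
Dataset<Row> dfg = df.filter(col("account").isin("30095", "40095"))
        .select("pid", "tid", "date").distinct()
        .withColumnRenamed("pid", "pidg")
        .withColumnRenamed("tid", "tidg")
        .withColumnRenamed("date", "dateg");

// 3. Rows in those groups that actually need processing (depid == 0).
Dataset<Row> dfp = df.join(dfg,
        df.col("pid").equalTo(dfg.col("pidg"))
                .and(df.col("tid").equalTo(dfg.col("tidg")))
                .and(df.col("date").equalTo(dfg.col("dateg")))
                .and(df.col("depid").equalTo(0)),
        "inner")
        .drop("pidg").drop("tidg").drop("dateg");

// 4. Rows that need no processing, kept for the later union.
Dataset<Row> res1 = df.exceptAll(dfp);

// 5. Replace depid with the first 4 digits of account (column order is preserved).
Dataset<Row> res2 = dfp.withColumn("depid", substring(col("account"), 0, 4).cast("int"));

// 6. Final result.
Dataset<Row> res = res1.union(res2);
res.show();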

Related

Count distinct while aggregating others?

This is what my dataset looks like:
+---------+------------+-----------------+
| name |request_type| request_group_id|
+---------+------------+-----------------+
|Michael | X | 1020 |
|Michael | X | 1018 |
|Joe | Y | 1018 |
|Sam | X | 1018 |
|Michael | Y | 1021 |
|Sam | X | 1030 |
|Elizabeth| Y | 1035 |
+---------+------------+-----------------+
I want to count the request_types per person and count the unique request_group_ids.
The result should be the following:
+---------+--------------------+---------------------+--------------------------------+
| name |cnt(request_type(X))| cnt(request_type(Y))| cnt(distinct(request_group_id))|
+---------+--------------------+---------------------+--------------------------------+
|Michael | 2 | 1 | 3 |
|Joe | 0 | 1 | 1 |
|Sam | 2 | 0 | 2 |
|John | 1 | 0 | 1 |
|Elizabeth| 0 | 1 | 1 |
+---------+--------------------+---------------------+--------------------------------+
What I've done so far (this helps to derive the first two columns):
msgDataFrame.select(NAME, REQUEST_TYPE)
    .groupBy(NAME)
    .pivot(REQUEST_TYPE, Lists.newArrayList(X, Y))
    .agg(functions.count(REQUEST_TYPE))
    .show();
How do I count distinct request_group_ids in this select? Is it possible to do within it?
I think it's possible only via a join of two datasets (my current result + a separate aggregation of distinct request_group_ids).
Example with countDistinct (countDistinct does not work over a window, so it is replaced with size and collect_set):
val groupIdWindow = Window.partitionBy("name")
df.select($"name", $"request_type",
    size(collect_set("request_group_id").over(groupIdWindow)).alias("countDistinct"))
  .groupBy("name", "countDistinct")
  .pivot($"request_type", Seq("X", "Y"))
  .agg(count("request_type"))
  .show(false)
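A rough, untested Java equivalent of the same idea (it assumes df is a Dataset<Row> with the columns name, request_type and request_group_id):

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import java.util.Arrays;

// Distinct request_group_ids per name via collect_set over a window,
// then a pivot on request_type for the per-type counts.
WindowSpec groupIdWindow = Window.partitionBy("name");
df.select(col("name"), col("request_type"),
        size(collect_set("request_group_id").over(groupIdWindow)).alias("countDistinct"))
    .groupBy("name", "countDistinct")
    .pivot("request_type", Arrays.<Object>asList("X", "Y"))
    .agg(count(col("request_type")))
    .show(false);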

Add column to a Dataset based on the value from Another Dataset

I have a dataset dsCustomer that has the customer details with these columns:
|customerID|idpt | totalAmount|
|customer1 | H1 | 250 |
|customer2 | H2 | 175 |
|customer3 | H3 | 4000 |
|customer4 | H3 | 9000 |
I have another dataset dsCategory that contains the category based on the sales amount:
|categoryID|idpt | borne_min|borne_max|
|A | H2 | 0 |1000 |
|B | H2 | 1000 |5000 |
|C | H2 | 5000 |7000 |
|D | H2 | 7000 |10000 |
|F | H3 | 0 |1000 |
|G | H3 | 1000 |5000 |
|H | H3 | 5000 |7000 |
|I | H3 | 7000 |1000000 |
I would like to have a result that takes the customer's totalAmount and finds the category:
|customerID|idpt |totalAmount|category|
|customer1 | H1 | 250 | null |
|customer2 | H2 | 175 | A |
|customer3 | H3 | 4000 | G |
|customer4 | H3 | 9000 | I |
//udf
public static Column getCategoryAmount(Dataset<Row> ds, Column amountColumn) {
return ds.filter(amountColumn.geq(col("borne_min"))
.and(amountColumn.lt(col("borne_max")))).first().getAs("categoryID");
}
//code to add column to my dataset
dsCustomer.withColumn("category", getCategoryAmount(dsCategory , dsCustomer.col("totalAmount")));
How can I pass the value of a column from my customer dataset to my UDF function?
The error shows that totalAmount is not contained in the category dataset.
Question: how can I use map so that for each row in dsCustomer I go and check its value in dsCategory?
I have tried to join the 2 tables, but it is not working, because dsCustomer should keep the same records and just add the calculated column picked from dsCategory.
caused by: org.apache.spark.sql.AnalysisException: cannot resolve '`totalAmount`' given input columns: [categoryID,borne_min,borne_max];;
'Filter (('totalAmount>= borne_min#220) && ('totalAmount < borne_max#221))
You have to join the two datasets. withColumn only allows modifications of the same Dataset.
UPDATE
I did not have time before to explain in detail what I mean. This is what I was trying to explain. You can join two dataframes. In your case you need a left join to preserve rows which don't have a matching category.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
cust = [
('customer1', 'H1', 250),
('customer2', 'H2', 175),
('customer3', 'H3', 4000),
('customer4', 'H3', 9000)
]
cust_df = spark.createDataFrame(cust, ['customerID', 'idpt', 'totalAmount'])
cust_df.show()
cat = [
('A', 'H2', 0 , 1000),
('B', 'H2', 1000, 5000),
('C', 'H2', 5000, 7000),
('D', 'H2', 7000, 10000),
('F', 'H3', 0 , 1000),
('G', 'H3', 1000, 5000),
('H', 'H3', 5000, 7000),
('I', 'H3', 7000, 1000000)
]
cat_df = spark.createDataFrame(cat, ['categoryID', 'idpt', 'borne_min', 'borne_max'])
cat_df.show()
cust_df.join(cat_df,
             (cust_df.idpt == cat_df.idpt) &
             (cust_df.totalAmount >= cat_df.borne_min) &
             (cust_df.totalAmount <= cat_df.borne_max),
             how='left') \
    .select(cust_df.customerID, cust_df.idpt, cust_df.totalAmount, cat_df.categoryID) \
    .show()
Output
+----------+----+-----------+
|customerID|idpt|totalAmount|
+----------+----+-----------+
| customer1| H1| 250|
| customer2| H2| 175|
| customer3| H3| 4000|
| customer4| H3| 9000|
+----------+----+-----------+
+----------+----+---------+---------+
|categoryID|idpt|borne_min|borne_max|
+----------+----+---------+---------+
| A| H2| 0| 1000|
| B| H2| 1000| 5000|
| C| H2| 5000| 7000|
| D| H2| 7000| 10000|
| F| H3| 0| 1000|
| G| H3| 1000| 5000|
| H| H3| 5000| 7000|
| I| H3| 7000| 1000000|
+----------+----+---------+---------+
+----------+----+-----------+----------+
|customerID|idpt|totalAmount|categoryID|
+----------+----+-----------+----------+
| customer1| H1| 250| null|
| customer3| H3| 4000| G|
| customer4| H3| 9000| I|
| customer2| H2| 175| A|
+----------+----+-----------+----------+
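Back in the Java API, the same left join might look roughly like the sketch below (untested; it joins the dsCustomer and dsCategory datasets from the question and aliases categoryID to category as in the expected result):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Left join keeps customers without a matching category (category stays null).
Dataset<Row> result = dsCustomer.join(dsCategory,
        dsCustomer.col("idpt").equalTo(dsCategory.col("idpt"))
                .and(dsCustomer.col("totalAmount").geq(dsCategory.col("borne_min")))
                .and(dsCustomer.col("totalAmount").leq(dsCategory.col("borne_max"))),
        "left")
    .select(dsCustomer.col("customerID"),
            dsCustomer.col("idpt"),
            dsCustomer.col("totalAmount"),
            dsCategory.col("categoryID").alias("category"));
result.show();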

Finding Similar columns in two dataframes using Spark

I have two DataFrames which have some data like this:
+-------+--------+------------------+---------+
|ADDRESS|CUSTOMER| CUSTOMERTIME| POL |
+-------+--------+------------------+---------+
| There| cust0|3069.4768999023245|578596829|
| There| cust0|3069.4768999023245|43831451 |
| Here| cust1| 15.29206776391711|578596829|
| There| cust0|3069.4768999023245|43831451 |
| Here| cust1| 15.29206776391711|578596829|
| Here| cust4| 32.04741866436953|43831451 |
+-------+--------+------------------+---------+
and
+---------+------------------+------------------+-----+-----+
| POLICY| POLICYENDTIME| POLICYSTARTTIME|PVAR0|PVAR1|
+---------+------------------+------------------+-----+-----+
|578596829|3599.3427299724353|13.433243831334922| 2| 0|
|578596829|3599.3427299724353|13.433243831334922| 2| 0|
| 43831451|3712.2672901111655|1744.9884452423225| 0| 6|
|578596829|3599.3427299724353|13.433243831334922| 2| 0|
| 43831451|3712.2672901111655|1744.9884452423225| 0| 6|
| 43831451|3979.2754016079016|3712.2672901111655| 0| 5|
+---------+------------------+------------------+-----+-----+
Now I want to compare these two DataFrames to find the matching columns so that I can join the DataFrames in the next step (in this case it would be POLICY and POL). Are there any algorithms or other ways that I can predict this?
Given df1 and df2, you can find the common columns through:
df1 = sc.parallelize([('1',),('2',)]).toDF(['a'])
df2 = sc.parallelize([('1','2'),('2','3')]).toDF(['a','b'])
>>>set(df1.columns).intersection(set(df2.columns))
set(['a'])
>>>list(set(df1.columns).intersection(set(df2.columns)))
['a']
This should get the difference
>>> list(set(df1.columns).symmetric_difference(set(df2.columns)))
['b']
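In Java, the same column-name comparison could be sketched like this (assuming df1 and df2 are Dataset<Row> variables):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

Set<String> cols1 = new HashSet<>(Arrays.asList(df1.columns()));
Set<String> cols2 = new HashSet<>(Arrays.asList(df2.columns()));

// Column names present in both DataFrames.
Set<String> common = new HashSet<>(cols1);
common.retainAll(cols2);

// Symmetric difference: column names present in exactly one of the two.
Set<String> diff = new HashSet<>(cols1);
diff.addAll(cols2);
diff.removeAll(common);

System.out.println(common);
System.out.println(diff);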

How to perform a query using a field that is a merge of 2 columns?

I'm building up a series of distribution analyses using the Java Spark library. This is the actual code I'm using to fetch the data from a JSON file and save the output:
Dataset<Row> dataset = spark.read().json("local/foods.json");
dataset.createOrReplaceTempView("cs_food");
List<GenericAnalyticsEntry> menu_distribution = spark
    .sql(" ****REQUESTED QUERY ****")
    .toJavaRDD()
    .map(row -> Triple.of(row.getString(0), BigDecimal.valueOf(row.getLong(1)), BigDecimal.valueOf(row.getLong(2))))
    .map(GenericAnalyticsEntry::of)
    .collect();
writeObjectAsJsonToHDFS(fs, "/local/output/menu_distribution_new.json", menu_distribution);
The query I'm looking for is based on this structure:
+------------+-------------+------------+------------+
| FIRST_FOOD | SECOND_FOOD | DATE | IS_SPECIAL |
+------------+-------------+------------+------------+
| Pizza | Spaghetti | 11/02/2017 | TRUE |
+------------+-------------+------------+------------+
| Lasagna | Pizza | 12/02/2017 | TRUE |
+------------+-------------+------------+------------+
| Spaghetti | Spaghetti | 13/02/2017 | FALSE |
+------------+-------------+------------+------------+
| Pizza | Spaghetti | 14/02/2017 | TRUE |
+------------+-------------+------------+------------+
| Spaghetti | Lasagna | 15/02/2017 | FALSE |
+------------+-------------+------------+------------+
| Pork | Mozzarella | 16/02/2017 | FALSE |
+------------+-------------+------------+------------+
| Lasagna | Mozzarella | 17/02/2017 | FALSE |
+------------+-------------+------------+------------+
How can I achieve this (written below) output from the code written above?
+------------+--------------------+----------------------+
| FOODS | occurrences(First) | occurrences (Second) |
+------------+--------------------+----------------------+
| Pizza | 2 | 1 |
+------------+--------------------+----------------------+
| Lasagna | 2 | 1 |
+------------+--------------------+----------------------+
| Spaghetti | 2 | 3 |
+------------+--------------------+----------------------+
| Mozzarella | 0 | 2 |
+------------+--------------------+----------------------+
| Pork | 1 | 0 |
+------------+--------------------+----------------------+
I've of course tried to figure out a solution by myself but had no luck with my tries. I may be wrong, but I need something like this:
"SELECT (first_food + second_food) as menu, COUNT(first_food), COUNT(second_food) from cs_food GROUP BY menu"
From the example data, this looks like it will produce the output you want:
select
  f.foods,
  first_count,
  second_count
from
  (select first_food as foods from menus
   union select second_food from menus) as f
left join (
  select first_food, count(*) as first_count from menus
  group by first_food
) as ff on ff.first_food = f.foods
left join (
  select second_food, count(*) as second_count from menus
  group by second_food
) as sf on sf.second_food = f.foods
;
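Plugged into the Java pipeline from the question (in place of the ****REQUESTED QUERY **** placeholder, with the cs_food temp view as the table name), it could look roughly like the sketch below; coalesce is an addition here so that missing counts come back as 0 rather than null, matching the expected output:

Dataset<Row> menuDistribution = spark.sql(
    "select f.foods, " +
    "       coalesce(ff.first_count, 0) as first_count, " +
    "       coalesce(sf.second_count, 0) as second_count " +
    "from (select first_food as foods from cs_food " +
    "      union select second_food from cs_food) as f " +
    "left join (select first_food, count(*) as first_count from cs_food " +
    "           group by first_food) as ff on ff.first_food = f.foods " +
    "left join (select second_food, count(*) as second_count from cs_food " +
    "           group by second_food) as sf on sf.second_food = f.foods");
menuDistribution.show();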
A simple combination of flatMap and groupBy should do the job, like this (sorry, can't check if it's 100% correct right now):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.sum
import spark.sqlContext.implicits._
val df = Seq(("Pizza", "Pasta"), ("Pizza", "Soup")).toDF("first", "second")
df.flatMap { case Row(first: String, second: String) => Seq((first, 1, 0), (second, 0, 1)) }
  .groupBy("_1")
  .agg(sum("_2").as("first_count"), sum("_3").as("second_count"))

Spark - sample() function duplicating data?

I want to randomly select a subset of my data and then limit it to 200 entries. But after using the sample() function, I'm getting duplicate rows, and I don't know why. Let me show you:
DataFrame df = sqlContext.sql("SELECT * " +
        " FROM temptable" +
        " WHERE conditions");
DataFrame df1 = df.select(df.col("col1"))
        .where(df.col("col1").isNotNull())
        .distinct()
        .orderBy(df.col("col1"));
df1.show();
System.out.println(df1.count());
Up until now, everything is OK. I get the output:
+-----------+
|col1 |
+-----------+
| 10016|
| 10022|
| 100281|
| 10032|
| 100427|
| 100445|
| 10049|
| 10070|
| 10076|
| 10079|
| 10081|
| 10082|
| 100884|
| 10092|
| 10099|
| 10102|
| 10103|
| 101039|
| 101134|
| 101187|
+-----------+
only showing top 20 rows
10512
with 10512 records without duplicates. AND THEN!
df1 = df1.sample(true, 0.5).limit(200);
df1.show();
System.out.println(df1.count());
This returns 200 rows full of duplicates:
+-----------+
|col1 |
+-----------+
| 10022|
| 100445|
| 100445|
| 10049|
| 10079|
| 10079|
| 10081|
| 10081|
| 10082|
| 10092|
| 10102|
| 10102|
| 101039|
| 101134|
| 101134|
| 101134|
| 101345|
| 101345|
| 10140|
| 10141|
+-----------+
only showing top 20 rows
200
Can anyone tell me why? This is driving me crazy. Thank you!
You explicitly ask for a sample with replacement so there is nothing unexpected about getting duplicates:
public Dataset<T> sample(boolean withReplacement, double fraction)
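If you don't want duplicates, pass false for withReplacement. A minimal sketch (note that the fraction is only approximate, so sample a bit more than you need before applying limit):

// Sample WITHOUT replacement, then cap the result at 200 rows.
DataFrame sampled = df1.sample(false, 0.5).limit(200);
sampled.show();
System.out.println(sampled.count());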
