I have the below DataFrame in Spark:
+---------+--------------+-------+------------+--------+
|pid | tid |account|date |depid |
+---------+--------------+-------+------------+--------+
| 153|4512 | 30095|11272020 | 0|
| 153|4512 | 30096|11272020 | 30|
| 145|4513 | 40095|11272020 | 0|
| 135|4512 | 30096|11272020 | 0|
| 153|4512 | 30097|11272020 | 0|
| 145|4513 | 30094|11272020 | 0|
+---------+--------------+-------+------------+--------+
I need to group the records by pid, tid and date, so after grouping the DataFrame looks like:
+---------+--------------+-------+------------+---------+
|pid | tid |account|date |depid |
+---------+--------------+-------+------------+---------+
| 153|4512 | 30095|11272020 | 0 |
| 153|4512 | 30096|11272020 | 30|
| 153|4512 | 30097|11272020 | 0|
| 145|4513 | 40095|11272020 | 0|
| 145|4513 | 30094|11272020 | 0|
| 135|4512 | 30096|11272020 | 0|
+---------+--------------+-------+------------+---------+
After grouping, I need to check whether any record in a group has an account of 30095 or 40095; if so, I need to replace the depid of all records in that group whose depid is 0 with the first 4 digits of that account. The expected outcome is:
+---------+--------------+-------+------------+---------+
|pid | tid |account|date |depid |
+---------+--------------+-------+------------+---------+
| 153|4512 | 30095|11272020 | 3009 |
| 153|4512 | 30096|11272020 | 30|
| 153|4512 | 30097|11272020 | 3009|
| 145|4513 | 40095|11272020 | 4009|
| 145|4513 | 30094|11272020 | 4009|
| 135|4512 | 30096|11272020 | 0|
+---------+--------------+-------+------------+---------+
I tried the below code, but it is not working for me:
WindowSpec windowSpec = Window.partitionBy("pid","tid","date").orderBy("account");
Column roworder = rank().over(windowSpec).as("rank");
Dataset<Row> df1 = df.select(df.col("*"),roworder);
Dataset<Row> df2 = df1.withColumn("depid1",
        when(df1.col("account").equalTo("40095").and(df1.col("depid").equalTo("0")), 4009)
        .when(df1.col("rank").gt(1).and(df1.col("depid").equalTo("0")), 4009)
        .when(df1.col("account").equalTo("30095").and(df1.col("depid").equalTo("0")), 3009)
        .when(df1.col("rank").gt(1).and(df1.col("depid").equalTo("0")), 3009)
        .otherwise(df1.col("depid"))
    ).orderBy(col("pid").desc()).drop("depid").withColumnRenamed("depid1", "depid");
but it is producing the below output:
+---------+--------------+-------+------------+---------+
|pid | tid |account|date |depid |
+---------+--------------+-------+------------+---------+
| 153|4512 | 30095|11272020 | 3009 |
| 153|4512 | 30096|11272020 | 30|
| 153|4512 | 30097|11272020 | 4009|
| 145|4513 | 40095|11272020 | 4009|
| 145|4513 | 30094|11272020 | 4009|
| 135|4512 | 30096|11272020 | 0|
+---------+--------------+-------+------------+---------+
I am not sure what I am doing incorrectly here.
You will need to convert this to Java; I suggest you use the Scala API, as it makes life far easier. Also, you may have different data types.
Here is my alternative, which I see more as a data analysis task. I added some extra records to demonstrate the point and to make it more generic and robust. I do not think your approach is sound enough. Anyway, we can all learn.
So, here goes:
import org.apache.spark.sql.functions._
///...
// More a data analysis problem.
// 1. Gen sample data.
val df = Seq( ( 153, 4512, "30095", "11272020", 0 ),
( 153, 4512, "30096", "11272020", 30 ),
( 153, 4512, "30096", "11272020", 30 ), // extra record
( 145, 4513, "40095", "11272020", 0 ),
( 145, 4513, "40095", "11272020", 0 ), // extra record
( 145, 4513, "40095", "11272020", 200 ), // extra record
( 135, 4512, "30096", "11272020", 0 ),
( 153, 4512, "30097", "11272020", 0 ),
( 145, 4513, "30094", "11272020", 0 )
).toDF("pid","tid","account","date","depid")
df.show()
// 2. Get the groups with accounts of relevance. Note they may have records not needing to be processed.
val dfg = df.filter(df("account").isin("30095", "40095")).select("pid","tid","date").distinct().toDF("pidg", "tidg", "dateg")
dfg.show()
// 3. Get the data that needs to be processed. Take into account performance.
val dfp = df.as("df").join(dfg.as("dfg"), $"df.pid" === $"dfg.pidg" && $"df.tid" === $"dfg.tidg" && $"df.date" === $"dfg.dateg" && $"df.depid" === 0, "inner")
.drop("pidg").drop("tidg").drop("dateg")
dfp.show()
// 4. Get records that need not be processed for later UNION operation.
val res1 = df.exceptAll(dfp)
res1.show()
// 5. Process those records needed.
val res2 = dfp.withColumn("depid2", substring(col("account"), 0, 4).cast("int")).drop("depid").toDF("pid","tid","account","date","depid")
res2.show()
// 6. Final result.
val res = res1.union(res2)
res.show()
This finally results in, in a performant way:
+---+----+-------+--------+-----+
|pid| tid|account| date|depid|
+---+----+-------+--------+-----+
|153|4512| 30096|11272020| 30|
|153|4512| 30096|11272020| 30|
|145|4513| 40095|11272020| 200|
|135|4512| 30096|11272020| 0|
|153|4512| 30095|11272020| 3009|
|145|4513| 40095|11272020| 4009|
|145|4513| 40095|11272020| 4009|
|153|4512| 30097|11272020| 3009|
|145|4513| 30094|11272020| 3009|
+---+----+-------+--------+-----+
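If you do need to stay in the Java API, here is a hedged, untested sketch of the same idea using a single window flag instead of the split and union. It assumes df is the original Dataset<Row>, that account is a string, and that depid is numeric:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

WindowSpec group = Window.partitionBy("pid", "tid", "date");

// Account value of a qualifying row, null for all other rows.
Column qualifying = when(col("account").isin("30095", "40095"), col("account"));

Dataset<Row> result = df
    // max() ignores nulls, so this is null for groups without a 30095/40095
    // (and picks the larger account if a group ever contains both).
    .withColumn("qualAccount", max(qualifying).over(group))
    .withColumn("depid",
        when(col("qualAccount").isNotNull().and(col("depid").equalTo(0)),
             substring(col("qualAccount"), 1, 4).cast("int"))
        .otherwise(col("depid")))
    .drop("qualAccount");

result.show();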
I have two datasets, as follows:
smoothieDs.show()
|smoothie_id | smoothie | price |
|1 | Tropical | 10 |
|2 | Green veggie | 20 |
and:
ingredientDs.show()
|smoothie | ingredient |
|Tropical | Mango |
|Tropical | Passion fruit |
|Green veggie | Cucumber |
|Green veggie | Kiwi |
I want to join the two datasets so that I get the ingredient information for each smoothie whose price is lower than $15, but keep smoothies even if the price is higher, filling in the ingredient field with the string To be communicated.
I tried smoothieDs.join(ingredientDs).filter(col("price").lt(15)) and it gives:
|smoothie_id |price | smoothie | ingredient |
|1 |10 | Tropical | Mango |
|1 |10 | Tropical | Passion fruit |
But my expected result should be:
|smoothie_id |price | smoothie | ingredient |
|1 |10 | Tropical | Mango |
|1 |10 | Tropical | Passion fruit |
|2 |20 | Green veggie | To be communicated |
Is it possible to achieve this using a join directly? If not, what is the best way to achieve it?
You can replace the ingredient based on the price after the join:
import org.apache.spark.sql.functions._
smoothieDs.join(ingredientDs, "smoothie")
.withColumn("ingredient", when('price.lt(15), 'ingredient).otherwise("To be communicated"))
.distinct()
.show()
Output:
+------------+-----------+-----+------------------+
| smoothie|smoothie_id|price| ingredient|
+------------+-----------+-----+------------------+
|Green veggie| 2| 20|To be communicated|
| Tropical| 1| 10| Mango|
| Tropical| 1| 10| Passion fruit|
+------------+-----------+-----+------------------+
Edit: another option would be to filter the ingredient dataset first and then do the join. This avoids the distinct but comes at the price of a second join. Depending on the data, this may or may not be faster.
smoothieDs.join(
ingredientDs.join(smoothieDs.filter('price.lt(15)), Seq("smoothie"), "left_semi"),
Seq("smoothie"), "left_outer")
.na.fill("To be communicated", Seq("ingredient"))
.show()
Please do not mark this question as a duplicate. I have checked the question below, and it gives a solution for Python or Scala; for Java the method is different.
How to replace null values with a specific value in Dataframe using spark in Java?
I have a Dataset<Row> ds which I created by reading a Parquet file, so all column values are strings. Some of the values are null. I am using .na().fill("") to replace the null values with an empty string:
Dataset<Row> ds1 = ds.na().fill("");
But it is not replacing the null values, and I am unable to understand the reason. Here is the relevant part of the schema:
|-- stopPrice: double (nullable = true)
|-- tradingCurrency: string (nullable = true)
From what I see, your column has a numeric type. In Spark you cannot replace a null value with a value of an incompatible type, so in your case you cannot use a string such as "". Here is an example that illustrates this:
Dataset<Row> df = spark.range(10)
.select(col("id"),
when(col("id").mod(2).equalTo(lit(0)), null )
.otherwise(col("id").cast("string")).as("string_col"),
when(col("id").mod(2).equalTo(lit(0)), null )
.otherwise(col("id")).as("int_col"));
df.na().fill("").show();
And here is the result
+---+----------+-------+
| id|string_col|int_col|
+---+----------+-------+
| 0| | null|
| 1| 1| 1|
| 2| | null|
| 3| 3| 3|
| 4| | null|
| 5| 5| 5|
| 6| | null|
| 7| 7| 7|
| 8| | null|
| 9| 9| 9|
+---+----------+-------+
It works for the string, but not for the integer. Note that I used the cast function to turn an int into a string and make the code work. It could be a nice workaround in your situation.
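To make that concrete, here is a minimal Java sketch of the two options, assuming ds is your Dataset<Row> and stopPrice is the numeric column in question:

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Option 1: keep the column numeric and fill its nulls with a numeric default.
Dataset<Row> filledNumeric = ds.na().fill(0.0, new String[] {"stopPrice"});

// Option 2: the cast workaround -- turn the column into a string first,
// so the empty-string fill applies to it as well.
Dataset<Row> filledAsString = ds
        .withColumn("stopPrice", col("stopPrice").cast("string"))
        .na().fill("");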
I have two DataFrames which have some data like this:
+-------+--------+------------------+---------+
|ADDRESS|CUSTOMER| CUSTOMERTIME| POL |
+-------+--------+------------------+---------+
| There| cust0|3069.4768999023245|578596829|
| There| cust0|3069.4768999023245|43831451 |
| Here| cust1| 15.29206776391711|578596829|
| There| cust0|3069.4768999023245|43831451 |
| Here| cust1| 15.29206776391711|578596829|
| Here| cust4| 32.04741866436953|43831451 |
+-------+--------+------------------+---------+
and
+---------+------------------+------------------+-----+-----+
| POLICY| POLICYENDTIME| POLICYSTARTTIME|PVAR0|PVAR1|
+---------+------------------+------------------+-----+-----+
|578596829|3599.3427299724353|13.433243831334922| 2| 0|
|578596829|3599.3427299724353|13.433243831334922| 2| 0|
| 43831451|3712.2672901111655|1744.9884452423225| 0| 6|
|578596829|3599.3427299724353|13.433243831334922| 2| 0|
| 43831451|3712.2672901111655|1744.9884452423225| 0| 6|
| 43831451|3979.2754016079016|3712.2672901111655| 0| 5|
+---------+------------------+------------------+-----+-----+
Now I want to compare these two DataFrames to find the matching columns so that I can join these DataFrames in the next step (in this case it would be POLICY and POL). Are there any algorithms or other ways that I can predict this?
Given df1 and df2, you can find the common columns through:
df1 = sc.parallelize([('1',),('2',)]).toDF(['a'])
df2 = sc.parallelize([('1','2'),('2','3')]).toDF(['a','b'])
>>>set(df1.columns).intersection(set(df2.columns))
set(['a'])
>>>list(set(df1.columns).intersection(set(df2.columns)))
['a']
This should get the difference
>>> list(set(df1.columns).symmetric_difference(set(df2.columns)))
['b']
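For reference, the same idea in Java is just set operations on Dataset.columns(), which returns a String[] (a sketch assuming df1 and df2 are Dataset<Row>):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Columns present in both DataFrames.
List<String> common = new ArrayList<>(Arrays.asList(df1.columns()));
common.retainAll(Arrays.asList(df2.columns()));

// Columns present only in df1.
List<String> onlyInDf1 = new ArrayList<>(Arrays.asList(df1.columns()));
onlyInDf1.removeAll(Arrays.asList(df2.columns()));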
I have a table represented by a Spark Dataset<Row>:
origin.show();
+------+
|Origin|
+------+
| USA|
| Japan|
| USA|
| USA|
| Japan|
|Europe|
+------+
I want to build an additional "countByValue" column to get a table like:
+------+-----+
|Origin|Count|
+------+-----+
|Europe| 1|
| USA| 3|
| USA| 3|
| USA| 3|
| Japan| 2|
| Japan| 2|
+------+-----+
I found a solution, but it seems very inefficient: I group the origin dataset and use the count function.
Dataset<Row> grouped = origin.groupBy(originCol).agg(functions.count(originCol));
grouped.show();
+------+-----+
|Origin|Count|
+------+-----+
|Europe| 1|
| USA| 3|
| Japan| 2|
+------+-----+
Then I just join the result table with the origin dataset:
Dataset<Row> finalDs = origin.join(grouped, originCol);
Is there any other, more efficient way to perform such an operation?
You can write the query with a Window:
origin.withColumn("cnt", count('Origin).over(Window.partitionBy('Origin)))
Remember to import org.apache.spark.sql.functions._ and org.apache.spark.sql.expressions.Window
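Since your snippet is in Java, a rough Java equivalent of the window version would look like this (a sketch assuming origin is your Dataset<Row> and the column is literally named Origin):

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;

// Per-value count as a window aggregate, so no join back to the original rows is needed.
Dataset<Row> withCount = origin.withColumn(
        "Count", count(col("Origin")).over(Window.partitionBy(col("Origin"))));
withCount.show();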
This is what you need to do:
import org.apache.spark.sql.functions._
val df = Seq(
("USA"),
("Japan"),
("USA"),
("USA"),
("Japan"),
("Europe")
).toDF("origin")
val result = df.groupBy("origin")
  .agg(collect_list($"origin").alias("origin1"), count("origin").alias("count"))
  .withColumn("origin", explode($"origin1")) // restore one row per original record
  .drop("origin1")