Finding similar columns in two DataFrames using Spark - Java

I have two DataFrames with some data like this:
+-------+--------+------------------+---------+
|ADDRESS|CUSTOMER| CUSTOMERTIME| POL |
+-------+--------+------------------+---------+
| There| cust0|3069.4768999023245|578596829|
| There| cust0|3069.4768999023245|43831451 |
| Here| cust1| 15.29206776391711|578596829|
| There| cust0|3069.4768999023245|43831451 |
| Here| cust1| 15.29206776391711|578596829|
| Here| cust4| 32.04741866436953|43831451 |
+-------+--------+------------------+---------+
and
+---------+------------------+------------------+-----+-----+
| POLICY| POLICYENDTIME| POLICYSTARTTIME|PVAR0|PVAR1|
+---------+------------------+------------------+-----+-----+
|578596829|3599.3427299724353|13.433243831334922| 2| 0|
|578596829|3599.3427299724353|13.433243831334922| 2| 0|
| 43831451|3712.2672901111655|1744.9884452423225| 0| 6|
|578596829|3599.3427299724353|13.433243831334922| 2| 0|
| 43831451|3712.2672901111655|1744.9884452423225| 0| 6|
| 43831451|3979.2754016079016|3712.2672901111655| 0| 5|
+---------+------------------+------------------+-----+-----+
Now I want to compare these two DataFrames to find the matching columns so that I can join them in the next step (in this case it would be POLICY and POL). Is there an algorithm or another way to predict this?

Given df1 and df2, you can find the common column names through:
df1 = sc.parallelize([('1',),('2',)]).toDF(['a'])
df2 = sc.parallelize([('1','2'),('2','3')]).toDF(['a','b'])
>>>set(df1.columns).intersection(set(df2.columns))
set(['a'])
>>>list(set(df1.columns).intersection(set(df2.columns)))
['a']
This should get the difference
>>> list(set(df1.columns).symmetric_difference(set(df2.columns)))
['b']
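If you are working from Java rather than PySpark, a minimal sketch of the same idea is below, assuming df1 and df2 are the Dataset<Row> instances from the question. Note that this only finds columns whose names match exactly, so it would not pair POL with POLICY; matching columns with different names would require comparing their values as well.
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Intersect the column-name arrays of the two DataFrames.
Set<String> common = new HashSet<>(Arrays.asList(df1.columns()));
common.retainAll(Arrays.asList(df2.columns()));
System.out.println(common); // column names present in both

// Column names present in only one of the two (symmetric difference).
Set<String> onlyInOne = new HashSet<>(Arrays.asList(df1.columns()));
onlyInOne.addAll(Arrays.asList(df2.columns()));
onlyInOne.removeAll(common);
System.out.println(onlyInOne);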

Related

Spark SQL: keep a non-key row after join

I have two datasets as follows:
smoothieDs.show()
|smoothie_id | smoothie | price |
|1 | Tropical | 10 |
|2 | Green vegie | 20 |
and:
ingredientDs.show()
|smoothie | ingredient |
|Tropical | Mango |
|Tropical | Passion fruit |
|Green veggie | Cucumber |
|Green veggie | Kiwi |
I want to join the two datasets so that I get the ingredient information for each smoothie whose price is lower than $15, but keep the others even if the price is higher, and fill the ingredient field with the string To be communicated.
I tried smoothieDs.join(ingredientDs).filter(col("price").lt(15)) and it gives:
|smoothie_id |price | smoothie | ingredient |
|1 |10 | Tropical | Mango |
|1 |10 | Tropical | Passion fruit |
But my expected result should be:
|smoothie_id |price | smoothie | ingredient |
|1 |10 | Tropical | Mango |
|1 |10 | Tropical | Passion fruit |
|2 |20 | Green veggie | To be communicated |
Is it possible to achieve this using a join directly? If not, what is the best way to achieve it?
You can replace the ingredient based on the price after the join:
import org.apache.spark.sql.functions._
import spark.implicits._ // for the 'price / 'ingredient column syntax
smoothieDs.join(ingredientDs, "smoothie")
.withColumn("ingredient", when('price.lt(15), 'ingredient).otherwise("To be communicated"))
.distinct()
.show()
Output:
+------------+-----------+-----+------------------+
| smoothie|smoothie_id|price| ingredient|
+------------+-----------+-----+------------------+
|Green veggie| 2| 20|To be communicated|
| Tropical| 1| 10| Mango|
| Tropical| 1| 10| Passion fruit|
+------------+-----------+-----+------------------+
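For reference, a rough Java equivalent of the same replace-after-join idea, assuming smoothieDs and ingredientDs are Dataset<Row> with the columns shown above, could look like this:
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Join on "smoothie", then overwrite the ingredient for smoothies priced at 15 or more.
Dataset<Row> result = smoothieDs.join(ingredientDs, "smoothie")
    .withColumn("ingredient",
        when(col("price").lt(15), col("ingredient"))
            .otherwise("To be communicated"))
    .distinct();
result.show();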
Edit: another option would be to filter the ingredient dataset first and then do the join. This avoids the distinct but comes at the price of a second join. Depending on the data, this may or may not be faster.
smoothieDs.join(
ingredientDs.join(smoothieDs.filter('price.lt(15)), Seq("smoothie"), "left_semi"),
Seq("smoothie"), "left_outer")
.na.fill("To be communicated", Seq("ingredient"))
.show()

How to split a column into a list and save it into a new .csv file

I have a data frame with two columns: a student ID and their courses. The course column has multiple values separated by ";". How can I split the courses into a list and save every pair (studentID, course1), (studentID, course2) into a new CSV file?
You could try split and explode:
import org.apache.spark.sql.functions.{explode, split}
import spark.implicits._

val df = Seq((1, "a;b;c")).toDF("id", "values")
df.show()
// One output row per ";"-separated value.
val df2 = df.select($"id", explode(split($"values", ";")).as("value"))
df2.show()
df2.write.option("header", "true").csv("/path/to/csv")
+---+------+
| id|values|
+---+------+
| 1| a;b;c|
+---+------+
+---+-----+
| id|value|
+---+-----+
| 1| a|
| 1| b|
| 1| c|
+---+-----+
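Since the rest of this page leans on the Java API, a hedged Java sketch of the same split-and-explode step, assuming df is a Dataset<Row> with the columns "id" and "values", would be:
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// One output row per ";"-separated value, then write as CSV with a header.
Dataset<Row> exploded = df.select(col("id"),
    explode(split(col("values"), ";")).as("value"));
exploded.write().option("header", "true").csv("/path/to/csv");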

Fill null values with empty string in Dataset<Row> using Apache-Spark in java

Please do not mark this question as a duplicate. I have checked the question below, and it gives a solution for Python or Scala; for Java the method is different.
How to replace null values with a specific value in Dataframe using spark in Java?
I have a Dataset<Row> ds which I created by reading a Parquet file, so all column values are strings. Some of the values are null. I am using .na().fill("") to replace the null values with an empty string:
Dataset<Row> ds1 = ds.na().fill("");
But it is not replacing the null values, and I am unable to understand the reason. The schema contains columns like:
|-- stopPrice: double (nullable = true)
|-- tradingCurrency: string (nullable = true)
From what I see, your column has a numeric type, and in Spark you cannot replace a null value with a value of an incompatible type. Therefore, in your case you cannot use a string (""). Here is an example that illustrates this:
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.range(10)
    .select(col("id"),
        // even ids become null, odd ids keep their value (cast to string in the first column)
        when(col("id").mod(2).equalTo(lit(0)), null)
            .otherwise(col("id").cast("string")).as("string_col"),
        when(col("id").mod(2).equalTo(lit(0)), null)
            .otherwise(col("id")).as("int_col"));
df.na().fill("").show();
And here is the result
+---+----------+-------+
| id|string_col|int_col|
+---+----------+-------+
| 0| | null|
| 1| 1| 1|
| 2| | null|
| 3| 3| 3|
| 4| | null|
| 5| 5| 5|
| 6| | null|
| 7| 7| 7|
| 8| | null|
| 9| 9| 9|
+---+----------+-------+
It works for the string column, but not for the integer column. Note that I used the cast function to turn the int into a string so that the fill applies. That could be a nice workaround in your situation.
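Applied to the question, a minimal sketch of that workaround could look like the following. The column name stopPrice is taken from the schema shown above; adjust it to whichever numeric columns you actually need to fill.
import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Cast the numeric column to string first, so that na().fill("") can apply to it.
Dataset<Row> casted = ds.withColumn("stopPrice", col("stopPrice").cast("string"));
Dataset<Row> filled = casted.na().fill("");
filled.show();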

How to perform count by value operation on spark's Dataset without grouping values?

I have a table represented by a Spark Dataset<Row>:
origin.show();
+------+
|Origin|
+------+
| USA|
| Japan|
| USA|
| USA|
| Japan|
|Europe|
+------+
I want to add a "countByValue" column to get a table like:
+------+-----+
|Origin|Count|
+------+-----+
|Europe| 1|
| USA| 3|
| USA| 3|
| USA| 3|
| Japan| 2|
| Japan| 2|
+------+-----+
I found a solution, but it seems very inefficient: I group the origin dataset and use the count function.
Dataset<Row> grouped = origin.groupBy(originCol).agg(functions.count(originCol));
grouped.show();
+------+-----+
|Origin|Count|
+------+-----+
|Europe| 1|
| USA| 3|
| Japan| 2|
+------+-----+
Then I just join the result table with the origin dataset.
Dataset<Row> finalDs = origin.join(grouped, originCol);
Is there any more efficient way to perform such an operation?
You can write the query with a window function:
origin.withColumn("cnt", count('Origin).over(Window.partitionBy('Origin)))
Remember to import org.apache.spark.sql.functions._, org.apache.spark.sql.expressions.Window, and spark.implicits._ (for the 'Origin column syntax).
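Since the question uses the Java API, a hedged Java sketch of the same window aggregation, assuming the input Dataset<Row> is named origin and the column is "Origin", could be:
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;

// Count the rows per Origin value without collapsing the rows.
Dataset<Row> withCount = origin.withColumn("Count",
    count(col("Origin")).over(Window.partitionBy(col("Origin"))));
withCount.show();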
Another way to do it is with collect_list and explode:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  "USA",
  "Japan",
  "USA",
  "USA",
  "Japan",
  "Europe"
).toDF("origin")

// Group, keep the per-group count, then explode the collected list
// back into one row per original value and drop the helper column.
val result = df.groupBy("origin")
  .agg(collect_list($"origin").alias("origin1"),
       count("origin").alias("count"))
  .withColumn("origin", explode($"origin1"))
  .drop("origin1")

How to design a UDAF on a nested array in a Spark DataFrame

Input data:
+---+----+----+
|idx| v1| v2|
+---+----+----+
| a| 1| 3|
| a|null| 2|
| a| 4| 5|
| b| 6| 1|
| b| 7|null|
+---+----+----+
And what I want:
+---+-------------------------------------------+
|idx|total |
+---+-------------------------------------------+
|b |[WrappedArray(6, 7), WrappedArray(1)] |
|a |[WrappedArray(1, 4), WrappedArray(3, 2, 5)]|
+---+-------------------------------------------+
I know I can get this through:
df.groupBy("idx").agg(array(collect_list(col("v1")), collect_list(col("v2"))));
But I want to achieve the result through a UDAF in Java.
