Dataset Manipulation in Spark Java API

Dataset Manipulation in Spark Java API - java

I have a Dataset DS1 below. I want to build DS2 using Spark Java API.
DS1:
+---------+------------+------------+
| account| amount | type |
+---------+------------+------------+
| c1 | 100 | D |
| c1 | 200 | C |
| c2 | 500 | C |
DS2:
amount1 is DS1 amount where type = D and amount2 is DS1 amount where type = C
+---------+------------+------------+
| account| amount1 | amount2 |
+---------+------------+------------+
| c1 | 100 | 200 |
| c2 | 0 | 500 |
Can someone help me please?

For transforming ds1 to ds2 in the expected format, you can use following code-
val ds2 = ds1
.withColumn("amount1", when($"type" === "D", $"amount").otherwise(0))
.withColumn("amount2", when($"type" === "C", $"amount").otherwise(0))
.select("account", "amount1", "amount2")
.groupBy($"account")
.agg(Map("amount1" -> "sum", "amount2" -> "sum"))
I hope it helps!

Related

How to effectively generate Spark dataset filled with random values?

I need a Dataset<Double> of arbitrary size filled with random or generated values.
It seems it can be done by implementing RDD, and generating values inside compute method.
Is there a better solution?

You can try Random Data Generation
sql functions to generate columns filled with random values
Two supported distributions: uniform and normal
Useful for randomized algorithms, prototyping and performance testing
import org.apache.spark.sql.functions.{rand, randn}
val dfr = sqlContext.range(0,10) // range can be what you want
val randomValues = dfr.select("id")
.withColumn("uniform", rand(10L))
.withColumn("normal", randn(10L))
randomValues.show(truncate = false)
output
+---+-------------------+--------------------+
|id |uniform |normal |
+---+-------------------+--------------------+
|0 |0.41371264720975787|-0.5877482396744728 |
|1 |0.7311719281896606 |1.5746327759749246 |
|2 |0.1982919638208397 |-0.256535324205377 |
|3 |0.12714181165849525|-0.31703264334668824|
|4 |0.7604318153406678 |0.4977629425313746 |
|5 |0.12030715258495939|-0.506853671746243 |
|6 |0.12131363910425985|1.4250903895905769 |
|7 |0.44292918521277047|-0.1413699193557902 |
|8 |0.8898784253886249 |0.9657665088756656 |
|9 |0.03650707717266999|-0.5021009082343131 |
+---+-------------------+--------------------+

Not sure if it helps you, but have a look -
val end = 100 // change this as required
val ds = spark.sql(s"select value from values (sequence(0, $end)) T(value)")
.selectExpr("explode(value) as value").selectExpr("(value * rand()) value")
.as(Encoders.DOUBLE)
ds.show(false)
ds.printSchema()
/**
* +-------------------+
* |value |
* +-------------------+
* |0.0 |
* |0.6598598027815629 |
* |0.34305452447822704|
* |0.2421654251914631 |
* |3.1937041196518896 |
* |0.9120972627613766 |
* |3.307431250924596 |
*
* root
* |-- value: double (nullable = false)
*/

Another way of doing it,
scala> val ds = spark.range(100)
ds: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> val randDS = ds.withColumn("randomDouble", rand(100)).drop("id").as[Double]
randDS: org.apache.spark.sql.Dataset[Double] = [randomDouble: double]
scala> randDS.show
+--------------------+
| randomDouble|
+--------------------+
| 0.6841403791584381|
| 0.21180593775249568|
|0.020396922902442105|
| 0.3372830927732784|
| 0.967636350481069|
| 0.6420539234134518|
| 0.33027994655769854|
| 0.8027165538297113|
| 0.9938809031700999|
| 0.8346083871437393|
| 0.13512419677124388|
|0.061866246009553594|
| 0.5243597971107068|
| 0.38257478262291045|
| 0.6753627729921755|
| 0.9631590027671125|
| 0.14234112716353464|
| 0.38649575105988976|
| 0.7687994020915501|
| 0.8436272154312096|
+--------------------+

Spark 2.3 with Java8 transform a row to columns

I am new to Spark 2.4 with Java 8. I need help. Here is example of instances:
Source DataFrame
+--------------+
| key | Value |
+--------------+
| A | John |
| B | Nick |
| A | Mary |
| B | Kathy |
| C | Sabrina|
| B | George |
+--------------+
Meta DataFrame
+-----+
| key |
+-----+
| A |
| B |
| C |
| D |
| E |
| F |
+-----+
I would like to transform it to the following: Column names from Meta Dataframe and Rows will be transformed based on Source Dataframe
+-----------------------------------------------+
| A | B | C | D | E | F |
+-----------------------------------------------+
| John | Nick | Sabrina | null | null | null |
| Mary | Kathy | null | null | null | null |
| null | George | null | null | null | null |
+-----------------------------------------------+
Need to write a code Spark 2.3 with Java8. Appreciated your help.

To make things clearer (and easily reproducible) let's define dataframes:
val df1 = Seq("A" -> "John", "B" -> "Nick", "A" -> "Mary",
"B" -> "Kathy", "C" -> "Sabrina", "B" -> "George")
.toDF("key", "value")
val df2 = Seq("A", "B", "C", "D", "E", "F").toDF("key")
From what I see, you are trying to create one column by value in the key column of df2. These columns should contain all the values of the value column that are associated to the key naming the column. If we take an example, column A's first value should be the value of the first occurrence of A (if it exists, null otherwise): "John". Its second value should be the value of the second occurrence of A: "Mary". There is no third value so the third value of the column should be null.
I detailed it to show that we need a notion of rank of the values for each key (windowing function), and group by that notion of rank. It would go as follows:
import org.apache.spark.sql.expressions.Window
val df1_win = df1
.withColumn("id", monotonically_increasing_id)
.withColumn("rank", rank() over Window.partitionBy("key").orderBy("id"))
// the id is just here to maintain the original order.
// getting the keys in df2. Add distinct if there are duplicates.
val keys = df2.collect.map(_.getAs[String](0)).sorted
// then it's just about pivoting
df1_win
.groupBy("rank")
.pivot("key", keys)
.agg(first('value))
.orderBy("rank")
//.drop("rank") // I keep here it for clarity
.show()
+----+----+------+-------+----+----+----+
|rank| A| B| C| D| E| F|
+----+----+------+-------+----+----+----+
| 1|John| Nick|Sabrina|null|null|null|
| 2|Mary| Kathy| null|null|null|null|
| 3|null|George| null|null|null|null|
+----+----+------+-------+----+----+----+
Here is the very same code in Java
Dataset<Row> df1_win = df1
.withColumn("id", functions.monotonically_increasing_id())
.withColumn("rank", functions.rank().over(Window.partitionBy("key").orderBy("id")));
// the id is just here to maintain the original order.
// getting the keys in df2. Add distinct if there are duplicates.
// Note that it is a list of objects, to match the (strange) signature of pivot
List<Object> keys = df2.collectAsList().stream()
.map(x -> x.getString(0))
.sorted().collect(Collectors.toList());
// then it's just about pivoting
df1_win
.groupBy("rank")
.pivot("key", keys)
.agg(functions.first(functions.col("value")))
.orderBy("rank")
// .drop("rank") // I keep here it for clarity
.show();

Count distinct while aggregating others?

This is how my dataset looks like:
+---------+------------+-----------------+
| name |request_type| request_group_id|
+---------+------------+-----------------+
|Michael | X | 1020 |
|Michael | X | 1018 |
|Joe | Y | 1018 |
|Sam | X | 1018 |
|Michael | Y | 1021 |
|Sam | X | 1030 |
|Elizabeth| Y | 1035 |
+---------+------------+-----------------+
I want to calculate the amount of request_type's per person and count unique request_group_id's
Result should be following:
+---------+--------------------+---------------------+--------------------------------+
| name |cnt(request_type(X))| cnt(request_type(Y))| cnt(distinct(request_group_id))|
+---------+--------------------+---------------------+--------------------------------+
|Michael | 2 | 1 | 3 |
|Joe | 0 | 1 | 1 |
|Sam | 2 | 0 | 2 |
|John | 1 | 0 | 1 |
|Elizabeth| 0 | 1 | 1 |
+---------+--------------------+---------------------+--------------------------------+
What I've done so far: (helps to derive first two columns)
msgDataFrame.select(NAME, REQUEST_TYPE)
.groupBy(NAME)
.pivot(REQUEST_TYPE, Lists.newArrayList(X, Y))
.agg(functions.count(REQUEST_TYPE))
.show();
How to count distinct request_group_id's in this select? Is it possible to do within it?
I think it's possible only via two datasets join (my current result + separate aggregation by distinct request_group_id)

Example with "countDistinct" ("countDistinct" is not worked over window, replaced with "size","collect_set"):
val groupIdWindow = Window.partitionBy("name")
df.select($"name", $"request_type",
size(collect_set("request_group_id").over(groupIdWindow)).alias("countDistinct"))
.groupBy("name", "countDistinct")
.pivot($"request_type", Seq("X", "Y"))
.agg(count("request_type"))
.show(false)

Add column to a Dataset based on the value from Another Dataset

I have a dataset dsCustomer that have the customer details with columns
|customerID|idpt | totalAmount|
|customer1 | H1 | 250 |
|customer2 | H2 | 175 |
|customer3 | H3 | 4000 |
|customer4 | H3 | 9000 |
I have another dataset dsCategory that contains the category based on the amount sales
|categoryID|idpt | borne_min|borne_max|
|A | H2 | 0 |1000 |
|B | H2 | 1000 |5000 |
|C | H2 | 5000 |7000 |
|D | H2 | 7000 |10000 |
|F | H3 | 0 |1000 |
|G | H3 | 1000 |5000 |
|H | H3 | 5000 |7000 |
|I | H3 | 7000 |1000000 |
I would like to have a result that is taking the totalAmount of Customer and find the category.
|customerID|idpt |totalAmount|category|
|customer1 | H1 | 250 | null |
|customer2 | H2 | 175 | A |
|customer3 | H3 | 4000 | G |
|customer4 | H3 | 9000 | I |
//udf
public static Column getCategoryAmount(Dataset<Row> ds, Column amountColumn) {
return ds.filter(amountColumn.geq(col("borne_min"))
.and(amountColumn.lt(col("borne_max")))).first().getAs("categoryID");
}
//code to add column to my dataset
dsCustomer.withColumn("category", getCategoryAmount(dsCategory , dsCustomer.col("totalAmount")));
How can i pass the value of column from my dataset of customer to my UDF function
Because the error is showing that totalAmount is not contains in the category dataset
Question: How can i use Map to for each row in the dsCustomer i should go and check they value in dsCategory.
I have tried to join the 2 tables but it is working because the dsCustomer should maintain the same records just added the calulated column picked from dsCategory.
caused by: org.apache.spark.sql.AnalysisException: cannot resolve '`totalAmount`' given input columns: [categoryID,borne_min,borne_max];;
'Filter (('totalAmount>= borne_min#220) && ('totalAmount < borne_max#221))

You have to join the two datasets. withColumn only allows modifications of the same Dataset.
UPDATE
I did not have time before to explain in detail what I mean. This is what I was trying to explain. You can join two dataframes. In your case you need a left join to preserve rows which don't have a matching category.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
cust = [
('customer1', 'H1', 250),
('customer2', 'H2', 175),
('customer3', 'H3', 4000),
('customer4', 'H3', 9000)
]
cust_df = spark.createDataFrame(cust, ['customerID', 'idpt', 'totalAmount'])
cust_df.show()
cat = [
('A', 'H2', 0 , 1000),
('B', 'H2', 1000, 5000),
('C', 'H2', 5000, 7000),
('D', 'H2', 7000, 10000),
('F', 'H3', 0 , 1000),
('G', 'H3', 1000, 5000),
('H', 'H3', 5000, 7000),
('I', 'H3', 7000, 1000000)
]
cat_df = spark.createDataFrame(cat, ['categoryID', 'idpt', 'borne_min', 'borne_max'])
cat_df.show()
cust_df.join(cat_df,
(cust_df.idpt == cat_df.idpt) &
(cust_df.totalAmount >= cat_df.borne_min) &
(cust_df.totalAmount <= cat_df.borne_max)
, how='left') \
.select(cust_df.customerID, cust_df.idpt, cust_df.totalAmount, cat_df.categoryID) \
.show()
Output
+----------+----+-----------+
|customerID|idpt|totalAmount|
+----------+----+-----------+
| customer1| H1| 250|
| customer2| H2| 175|
| customer3| H3| 4000|
| customer4| H3| 9000|
+----------+----+-----------+
+----------+----+---------+---------+
|categoryID|idpt|borne_min|borne_max|
+----------+----+---------+---------+
| A| H2| 0| 1000|
| B| H2| 1000| 5000|
| C| H2| 5000| 7000|
| D| H2| 7000| 10000|
| F| H3| 0| 1000|
| G| H3| 1000| 5000|
| H| H3| 5000| 7000|
| I| H3| 7000| 1000000|
+----------+----+---------+---------+
+----------+----+-----------+----------+
|customerID|idpt|totalAmount|categoryID|
+----------+----+-----------+----------+
| customer1| H1| 250| null|
| customer3| H3| 4000| G|
| customer4| H3| 9000| I|
| customer2| H2| 175| A|
+----------+----+-----------+----------+

Spark - sample() function duplicating data?

I want to randomly select a subset of my data and then limit it to 200 entries. But after using the sample() function, I'm getting duplicate rows, and I don't know why. Let me show you:
DataFrame df= sqlContext.sql("SELECT * " +
" FROM temptable" +
" WHERE conditions");
DataFrame df1 = df.select(df.col("col1"))
.where(df.col("col1").isNotNull())
.distinct()
.orderBy(df.col("col1"));
df.show();
System.out.println(df.count());
Up until now, everything is OK. I get the output:
+-----------+
|col1 |
+-----------+
| 10016|
| 10022|
| 100281|
| 10032|
| 100427|
| 100445|
| 10049|
| 10070|
| 10076|
| 10079|
| 10081|
| 10082|
| 100884|
| 10092|
| 10099|
| 10102|
| 10103|
| 101039|
| 101134|
| 101187|
+-----------+
only showing top 20 rows
10512
with 10512 records without duplicates. AND THEN!
df = df.sample(true, 0.5).limit(200);
df.show();
System.out.println(users.count());
This returns 200 rows full of duplicates:
+-----------+
|col1 |
+-----------+
| 10022|
| 100445|
| 100445|
| 10049|
| 10079|
| 10079|
| 10081|
| 10081|
| 10082|
| 10092|
| 10102|
| 10102|
| 101039|
| 101134|
| 101134|
| 101134|
| 101345|
| 101345|
| 10140|
| 10141|
+-----------+
only showing top 20 rows
200
Can anyone tell me why? This is driving me crazy. Thank you!

You explicitly ask for a sample with replacement so there is nothing unexpected about getting duplicates:
public Dataset<T> sample(boolean withReplacement, double fraction)

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Dataset Manipulation in Spark Java API - java

Related

How to effectively generate Spark dataset filled with random values?

Spark 2.3 with Java8 transform a row to columns

Count distinct while aggregating others?

Add column to a Dataset based on the value from Another Dataset

Spark - sample() function duplicating data?

Categories

Resources