I am new to Spark 2.4 with Java 8 and I need help. Here is an example of my input:
Source DataFrame
+-----+---------+
| key | Value   |
+-----+---------+
| A   | John    |
| B   | Nick    |
| A   | Mary    |
| B   | Kathy   |
| C   | Sabrina |
| B   | George  |
+-----+---------+
Meta DataFrame
+-----+
| key |
+-----+
| A |
| B |
| C |
| D |
| E |
| F |
+-----+
I would like to transform it into the following, where the column names come from the Meta DataFrame and the rows are built from the Source DataFrame:
+------+--------+---------+------+------+------+
| A    | B      | C       | D    | E    | F    |
+------+--------+---------+------+------+------+
| John | Nick   | Sabrina | null | null | null |
| Mary | Kathy  | null    | null | null | null |
| null | George | null    | null | null | null |
+------+--------+---------+------+------+------+
I need to write this in Spark 2.3 with Java 8. Your help is appreciated.
To make things clearer (and easily reproducible), let's define the dataframes:
import spark.implicits._

val df1 = Seq("A" -> "John", "B" -> "Nick", "A" -> "Mary",
  "B" -> "Kathy", "C" -> "Sabrina", "B" -> "George")
  .toDF("key", "value")
val df2 = Seq("A", "B", "C", "D", "E", "F").toDF("key")
From what I see, you are trying to create one column per value in the key column of df2. Each of these columns should contain all the values of the value column that are associated with the key naming the column. To take an example, column A's first value should be the value of the first occurrence of A (if it exists, null otherwise): "John". Its second value should be the value of the second occurrence of A: "Mary". There is no third occurrence, so the third value of the column should be null.
I spelled this out to show that we need a notion of rank of the values for each key (a window function), and then to group by that rank. It would go as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df1_win = df1
  .withColumn("id", monotonically_increasing_id)
  .withColumn("rank", rank() over Window.partitionBy("key").orderBy("id"))
// the id is just here to maintain the original order.

// getting the keys in df2. Add distinct if there are duplicates.
val keys = df2.collect.map(_.getAs[String](0)).sorted

// then it's just about pivoting
df1_win
  .groupBy("rank")
  .pivot("key", keys)
  .agg(first('value))
  .orderBy("rank")
  //.drop("rank") // I keep it here for clarity
  .show()
+----+----+------+-------+----+----+----+
|rank| A| B| C| D| E| F|
+----+----+------+-------+----+----+----+
| 1|John| Nick|Sabrina|null|null|null|
| 2|Mary| Kathy| null|null|null|null|
| 3|null|George| null|null|null|null|
+----+----+------+-------+----+----+----+
Here is the very same code in Java:
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.expressions.Window;

Dataset<Row> df1_win = df1
    .withColumn("id", functions.monotonically_increasing_id())
    .withColumn("rank", functions.rank().over(Window.partitionBy("key").orderBy("id")));
// the id is just here to maintain the original order.

// getting the keys in df2. Add distinct if there are duplicates.
// Note that it is a list of objects, to match the (strange) signature of pivot
List<Object> keys = df2.collectAsList().stream()
    .map(x -> x.getString(0))
    .sorted().collect(Collectors.toList());

// then it's just about pivoting
df1_win
    .groupBy("rank")
    .pivot("key", keys)
    .agg(functions.first(functions.col("value")))
    .orderBy("rank")
    // .drop("rank") // I keep it here for clarity
    .show();
Related
I need a Dataset<Double> of arbitrary size filled with random or generated values.
It seems it can be done by implementing RDD, and generating values inside compute method.
Is there a better solution?
You can try Random Data Generation: Spark has SQL functions to generate columns filled with random values. Two distributions are supported, uniform and normal. This is useful for randomized algorithms, prototyping and performance testing.
import org.apache.spark.sql.functions.{rand, randn}
val dfr = sqlContext.range(0,10) // range can be what you want
val randomValues = dfr.select("id")
.withColumn("uniform", rand(10L))
.withColumn("normal", randn(10L))
randomValues.show(truncate = false)
Output:
+---+-------------------+--------------------+
|id |uniform |normal |
+---+-------------------+--------------------+
|0 |0.41371264720975787|-0.5877482396744728 |
|1 |0.7311719281896606 |1.5746327759749246 |
|2 |0.1982919638208397 |-0.256535324205377 |
|3 |0.12714181165849525|-0.31703264334668824|
|4 |0.7604318153406678 |0.4977629425313746 |
|5 |0.12030715258495939|-0.506853671746243 |
|6 |0.12131363910425985|1.4250903895905769 |
|7 |0.44292918521277047|-0.1413699193557902 |
|8 |0.8898784253886249 |0.9657665088756656 |
|9 |0.03650707717266999|-0.5021009082343131 |
+---+-------------------+--------------------+
Not sure if it helps you, but have a look:
import org.apache.spark.sql.Encoders

val end = 100 // change this as required
val ds = spark.sql(s"select value from values (sequence(0, $end)) T(value)")
  .selectExpr("explode(value) as value").selectExpr("(value * rand()) value")
  .as(Encoders.DOUBLE)
ds.show(false)
ds.printSchema()
/**
* +-------------------+
* |value |
* +-------------------+
* |0.0 |
* |0.6598598027815629 |
* |0.34305452447822704|
* |0.2421654251914631 |
* |3.1937041196518896 |
* |0.9120972627613766 |
* |3.307431250924596 |
*
* root
* |-- value: double (nullable = false)
*/
Another way of doing it:
scala> import org.apache.spark.sql.functions.rand
import org.apache.spark.sql.functions.rand

scala> val ds = spark.range(100)
ds: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val randDS = ds.withColumn("randomDouble", rand(100)).drop("id").as[Double]
randDS: org.apache.spark.sql.Dataset[Double] = [randomDouble: double]
scala> randDS.show
+--------------------+
| randomDouble|
+--------------------+
| 0.6841403791584381|
| 0.21180593775249568|
|0.020396922902442105|
| 0.3372830927732784|
| 0.967636350481069|
| 0.6420539234134518|
| 0.33027994655769854|
| 0.8027165538297113|
| 0.9938809031700999|
| 0.8346083871437393|
| 0.13512419677124388|
|0.061866246009553594|
| 0.5243597971107068|
| 0.38257478262291045|
| 0.6753627729921755|
| 0.9631590027671125|
| 0.14234112716353464|
| 0.38649575105988976|
| 0.7687994020915501|
| 0.8436272154312096|
+--------------------+
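If you need this from the Java API (as in the other questions here), here is a minimal sketch of the same idea. It assumes an existing SparkSession named spark; the variable names are mine.
import static org.apache.spark.sql.functions.rand;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// Generate an arbitrary-size Dataset<Double> of uniform random values in [0, 1).
Dataset<Double> randomDs = spark.range(100)          // change the size as required
        .withColumn("value", rand(100L))             // seed for reproducibility
        .select("value")
        .as(Encoders.DOUBLE());
randomDs.show(false);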
I have a Dataset like below
Dataset<Row> dataset = ...
dataset.show()
+------+----------+
| NAME | DOB      |
+------+----------+
| John | 19801012 |
| Mark | 19760502 |
| Mick | 19911208 |
+------+----------+
I want to convert it to the following (formatted DOB):
+------+------------+
| NAME | DOB        |
+------+------------+
| John | 1980-10-12 |
| Mark | 1976-05-02 |
| Mick | 1991-12-08 |
+------+------------+
How can I do this? Basically, I am trying to figure out how to manipulate existing column string values in a generic way.
I tried using dataset.withColumn but couldn't quite figure out how to achieve this.
Appreciate any help.
With "substring" and "concat" functions:
df.withColumn("DOB_FORMATTED",
  concat(substring($"DOB", 0, 4), lit("-"), substring($"DOB", 5, 2), lit("-"), substring($"DOB", 7, 2)))
Load the data into a dataframe (deltaData) and just use the following line:
deltaData.withColumn("DOB", date_format(to_date($"DOB", "yyyyMMdd"), "yyyy-MM-dd")).show()
Assuming DOB is a String, you could write a UDF:
def formatDate(s: String): String = {
  // date formatting code
}
val formatDateUdf = udf(formatDate(_: String))
ds.select($"NAME", formatDateUdf($"DOB").as("DOB"))
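Since the question uses the Java API, here is a hedged Java sketch of the to_date / date_format approach shown above, assuming DOB is a string column in yyyyMMdd format (the variable name formatted is mine):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.date_format;
import static org.apache.spark.sql.functions.to_date;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Parse the yyyyMMdd string into a date, then render it back as yyyy-MM-dd.
Dataset<Row> formatted = dataset.withColumn(
        "DOB",
        date_format(to_date(col("DOB"), "yyyyMMdd"), "yyyy-MM-dd"));
formatted.show();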
I have a Dataset DS1 below. I want to build DS2 using the Spark Java API.
DS1:
+---------+--------+------+
| account | amount | type |
+---------+--------+------+
| c1      | 100    | D    |
| c1      | 200    | C    |
| c2      | 500    | C    |
+---------+--------+------+
DS2:
amount1 is the DS1 amount where type = D, and amount2 is the DS1 amount where type = C.
+---------+---------+---------+
| account | amount1 | amount2 |
+---------+---------+---------+
| c1      | 100     | 200     |
| c2      | 0       | 500     |
+---------+---------+---------+
Can someone help me please?
To transform ds1 into ds2 in the expected format, you can use the following code:
import org.apache.spark.sql.functions.{sum, when}
import spark.implicits._

val ds2 = ds1
  .withColumn("amount1", when($"type" === "D", $"amount").otherwise(0))
  .withColumn("amount2", when($"type" === "C", $"amount").otherwise(0))
  .groupBy($"account")
  .agg(sum("amount1").as("amount1"), sum("amount2").as("amount2"))
I hope it helps!
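Since the question asks for the Spark Java API, here is a hedged Java sketch of the same approach, assuming ds1 is a Dataset<Row> with the columns shown above:
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;
import static org.apache.spark.sql.functions.when;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Split amount into amount1 (type D) and amount2 (type C), then sum both per account.
Dataset<Row> ds2 = ds1
        .withColumn("amount1", when(col("type").equalTo("D"), col("amount")).otherwise(0))
        .withColumn("amount2", when(col("type").equalTo("C"), col("amount")).otherwise(0))
        .groupBy("account")
        .agg(sum("amount1").as("amount1"), sum("amount2").as("amount2"));
ds2.show();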
I'm building a series of distribution analyses using the Spark Java library. This is the code I'm currently using to fetch the data from a JSON file and save the output.
Dataset<Row> dataset = spark.read().json("local/foods.json");
dataset.createOrReplaceTempView("cs_food");
List<GenericAnalyticsEntry> menu_distribution= spark
.sql(" ****REQUESTED QUERY ****")
.toJavaRDD()
.map(row -> Triple.of( row.getString(0), BigDecimal.valueOf(row.getLong(1)), BigDecimal.valueOf(row.getLong(2))))
.map(GenericAnalyticsEntry::of)
.collect();
writeObjectAsJsonToHDFS(fs, "/local/output/menu_distribution_new.json", menu_distribution);
The query I'm looking for is based on this structure:
+------------+-------------+------------+------------+
| FIRST_FOOD | SECOND_FOOD | DATE | IS_SPECIAL |
+------------+-------------+------------+------------+
| Pizza | Spaghetti | 11/02/2017 | TRUE |
+------------+-------------+------------+------------+
| Lasagna | Pizza | 12/02/2017 | TRUE |
+------------+-------------+------------+------------+
| Spaghetti | Spaghetti | 13/02/2017 | FALSE |
+------------+-------------+------------+------------+
| Pizza | Spaghetti | 14/02/2017 | TRUE |
+------------+-------------+------------+------------+
| Spaghetti | Lasagna | 15/02/2017 | FALSE |
+------------+-------------+------------+------------+
| Pork | Mozzarella | 16/02/2017 | FALSE |
+------------+-------------+------------+------------+
| Lasagna | Mozzarella | 17/02/2017 | FALSE |
+------------+-------------+------------+------------+
How can I achieve the output below from the code written above?
+------------+--------------------+----------------------+
| FOODS | occurrences(First) | occurrences (Second) |
+------------+--------------------+----------------------+
| Pizza | 2 | 1 |
+------------+--------------------+----------------------+
| Lasagna | 2 | 1 |
+------------+--------------------+----------------------+
| Spaghetti | 2 | 3 |
+------------+--------------------+----------------------+
| Mozzarella | 0 | 2 |
+------------+--------------------+----------------------+
| Pork | 1 | 0 |
+------------+--------------------+----------------------+
I have of course tried to figure out a solution by myself, but had no luck with my attempts. I may be wrong, but I need something like this:
"SELECT (first_food + second_food) as menu, COUNT(first_food), COUNT(second_food) from cs_food GROUP BY menu"
From the example data, this looks like it will produce the output you want:
select
    f.food as foods,
    coalesce(ff.first_count, 0) as occurrences_first,
    coalesce(sf.second_count, 0) as occurrences_second
from
    (select first_food as food from cs_food
     union
     select second_food from cs_food) as f
left join (
    select first_food, count(*) as first_count from cs_food
    group by first_food
) as ff on ff.first_food = f.food
left join (
    select second_food, count(*) as second_count from cs_food
    group by second_food
) as sf on sf.second_food = f.food;
A simple combination of flatMap and groupBy should do the job, something like this (sorry, I can't check right now that it's 100% correct):
import spark.implicits._
import org.apache.spark.sql.{Row, functions => F}
val df = Seq(("Pizza", "Pasta"), ("Pizza", "Soup")).toDF("first", "second")
df.flatMap { case Row(first: String, second: String) => Seq((first, 1, 0), (second, 0, 1)) }
  .groupBy("_1")
  .agg(F.sum("_2").as("occurrences_first"), F.sum("_3").as("occurrences_second"))
  .show()
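Since the original code is Java, here is a hedged sketch of an equivalent formulation with the DataFrame API in Java, using the dataset variable from the question: count each food's occurrences as FIRST_FOOD and as SECOND_FOOD separately, then combine the two with a full outer join.
import static org.apache.spark.sql.functions.coalesce;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Count occurrences of each food in FIRST_FOOD and SECOND_FOOD separately.
Dataset<Row> firstCounts = dataset
        .groupBy(col("FIRST_FOOD").as("FOODS"))
        .agg(count(lit(1)).as("occurrences_first"));
Dataset<Row> secondCounts = dataset
        .groupBy(col("SECOND_FOOD").as("FOODS"))
        .agg(count(lit(1)).as("occurrences_second"));

// Full outer join keeps foods that appear in only one of the columns; missing counts become 0.
Dataset<Row> result = firstCounts.as("f")
        .join(secondCounts.as("s"), col("f.FOODS").equalTo(col("s.FOODS")), "full_outer")
        .select(
                coalesce(col("f.FOODS"), col("s.FOODS")).as("FOODS"),
                coalesce(col("f.occurrences_first"), lit(0L)).as("occurrences_first"),
                coalesce(col("s.occurrences_second"), lit(0L)).as("occurrences_second"));
result.show();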
I'd like to store a result set coming from a query execution in a hash table.
The hash table is something like this
Map<List<String>,List<Object>>
where
List<String>, the hash table key, is a subset of the extracted fields;
Object is a Java object corresponding to a database tuple (all fields).
So the data first has to be grouped in order to create each key and to collect all the items sharing that key.
The pseudo-code related to my current approach is:
while (iterate) {
    while (rs.next()) {
        if (key is empty)
            // build REFERENCE KEY and delete rs entry
        else
            // build key for i-th rs entry and compare it with the REFERENCE key;
            // if they match, collect the data and delete the rs entry
    }
    rs.beforeFirst()
}
In other words, the result set is iterated many times; on each pass a new key is created and the remaining result set entries are compared with it. Each processed entry is deleted so that the outer loop eventually terminates.
Since the result set is very large (and so is each List<Object>), performance is poor (a very long loading time per key).
Appending an ORDER BY clause to the query (in order to pre-group the data) doesn't alleviate the problem.
Is there a more efficient approach?
Thanks everyone.
EDIT
Input ResultSet
---------------------------------------------------------------
| Field1 | Field2 | Field3 | Field4 | Field5 | Field6 | Field7 |
---------------------------------------------------------------
| X | A | val1_3 | val1_4 | val1_5 | val1_6 | val1_7 |
| X | A | val2_3 | val2_4 | val2_5 | val2_6 | val2_7 |
| Y | B | val3_3 | val3_4 | val3_5 | val3_6 | val3_7 |
| Z | C | val4_3 | val4_4 | val4_5 | val4_6 | val4_7 |
| Y | D | val5_3 | val5_4 | val5_5 | val5_6 | val5_7 |
----------------------------------------------------------------
Key_Fields : [Field1, Field2]
Output Map
-----------------------------------
| KEY | VALUE |
-----------------------------------
| [X,A] | [Object1, Object2] |
| [Y,B] | [Object3] |
| [Z,C] | [Object4] |
| [Y,D] | [Object5] |
-----------------------------------
I'm using List<String> for the key because another ResultSet can have Key_Fields of a different length.
Here is my current, time-consuming Java code:
while(itera){
key = new ArrayList<String>();
values = new ArrayList<AbstractClass>();
while(rs.next()){
if(key.isEmpty()){
// build REFERENCE KEY
// add first OBJECT to List<AbstractClass>
// delete this data from ResultSet
}
else{
// Build KEY_TO_BE_COMPARED
List<String> row_to_be_compared = new ArrayList<String>();
// If this key equals to REFERENCE KEY
if(row_to_be_compared.equals(key)){
AbstractClass value_object = new AbstractClass();
...
rs.deleteRow();
}
// ORDERBY clause in query ensures that, if keys don't match, then all objects related to REFERENCE KEY have been collected
else{
break;
}
}
}
rs.beforeFirst();
map.put(key, values);
if(!rs.next() || items_loaded==max_hash_size)
itera = false;
else
rs.beforeFirst();
}
}
Instead of using a List as the key, use a class that has the List as an instance variable, and override equals (and hashCode) very carefully.
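A minimal sketch of that idea, combined with a single pass over the ResultSet instead of re-scanning it once per key. RowKey and buildValueObject are hypothetical names, and AbstractClass stands in for your value class:
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical wrapper around the key fields; equals/hashCode make it a safe Map key.
final class RowKey {
    private final List<String> fields;

    RowKey(List<String> fields) {
        this.fields = new ArrayList<>(fields);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof RowKey && fields.equals(((RowKey) o).fields);
    }

    @Override
    public int hashCode() {
        return fields.hashCode();
    }
}

// Single pass: build the key for the current row and append the row's object to that key's list.
static Map<RowKey, List<AbstractClass>> groupByKey(ResultSet rs, List<String> keyFields)
        throws SQLException {
    Map<RowKey, List<AbstractClass>> map = new HashMap<>();
    while (rs.next()) {
        List<String> keyValues = new ArrayList<>();
        for (String field : keyFields) {
            keyValues.add(rs.getString(field));
        }
        AbstractClass value = buildValueObject(rs); // hypothetical: map the full row to your object
        map.computeIfAbsent(new RowKey(keyValues), k -> new ArrayList<>()).add(value);
    }
    return map;
}
This avoids the repeated rs.beforeFirst() scans and the rs.deleteRow() calls entirely.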
Why don't you simplify your key and make it a String that contains all the key fields, concatenated with a special character (say ".")?