I have a parquet file that looks like:
------------
name | age |
------------
Tom  | 12  |
Mary | 15  |
------------
Now I added a "timestamp" column to it using:
final DataFrame dfWithNewColumn = df.withColumn("timestamp", createTimestamp());
and it now looks like:
---------------------------
name | age | timestamp     |
---------------------------
Tom  | 12  | 1569312845998 |
Mary | 15  | 1569312845998 |
---------------------------
And I write it into a parquet file:
dfWithNewColumn.write()
        .partitionBy(new String[]{"name", "timestamp"})
        .mode(SaveMode.Append)
        .parquet(parquetPath);
When I look at it using spark-shell, it is in the right format:
---------------------------
name | age | timestamp     |
---------------------------
Tom  | 12  | 1569312845998 |
Mary | 15  | 1569312845998 |
---------------------------
But the problem is, when I read the parquet back:
public static StructType createSchema() {
    final StructType schema = DataTypes.createStructType(Arrays.asList(
            DataTypes.createStructField("age", DataTypes.StringType, false),
            DataTypes.createStructField("name", DataTypes.StringType, false),
            DataTypes.createStructField("timestamp", DataTypes.LongType, false)
    ));
    return schema;
}

DataFrame df = sqlContext.read()
        .schema(createSchema())
        .parquet(parquetPath);
When I show the rows with df.show(), it becomes:
---------------------------
age | name | timestamp     |
---------------------------
12  | Tom  | 171798691853  |
15  | Mary | 171798691853  |
---------------------------
How is that possible? The parquet file itself looks fine, so I assume the problem is in the reading code.
Edit:
I found the cause: this happens after I set spark.sql.sources.partitionColumnTypeInference.enabled=false. How can I deal with it?
Use Spark's built-in function current_timestamp(), which returns the current timestamp as a timestamp column. When reading the data back, declare the column with the org.apache.spark.sql.types.TimestampType data type.
// Write
final DataFrame dfWithNewColumn = df.withColumn("timestamp", current_timestamp());

// Read
public static StructType createSchema() {
    final StructType schema = DataTypes.createStructType(Arrays.asList(
            DataTypes.createStructField("age", DataTypes.StringType, false),
            DataTypes.createStructField("name", DataTypes.StringType, false),
            DataTypes.createStructField("timestamp", DataTypes.TimestampType, false)
    ));
    return schema;
}
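For reference, a minimal sketch of the full round trip in Java; it reuses the question's df, sqlContext, parquetPath and the createSchema() above, so the surrounding code is assumed rather than given here:
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;
import static org.apache.spark.sql.functions.current_timestamp;

// Write: stamp each row with the current timestamp and partition by it
final DataFrame dfWithNewColumn = df.withColumn("timestamp", current_timestamp());
dfWithNewColumn.write()
        .partitionBy("name", "timestamp")
        .mode(SaveMode.Append)
        .parquet(parquetPath);

// Read: the schema declares the partition column as TimestampType,
// so no partition column type inference is needed
DataFrame readBack = sqlContext.read()
        .schema(createSchema())
        .parquet(parquetPath);
readBack.show();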
Related
I am new to Spark 2.4 with Java 8 and I need help. Here is an example of the input:
Source DataFrame
+-----+---------+
| key | Value   |
+-----+---------+
| A   | John    |
| B   | Nick    |
| A   | Mary    |
| B   | Kathy   |
| C   | Sabrina |
| B   | George  |
+-----+---------+
Meta DataFrame
+-----+
| key |
+-----+
| A |
| B |
| C |
| D |
| E |
| F |
+-----+
I would like to transform it into the following: the column names come from the Meta DataFrame, and the rows are filled based on the Source DataFrame.
+-----------------------------------------------+
| A | B | C | D | E | F |
+-----------------------------------------------+
| John | Nick | Sabrina | null | null | null |
| Mary | Kathy | null | null | null | null |
| null | George | null | null | null | null |
+-----------------------------------------------+
I need to write this in Spark 2.3 with Java 8. Your help is appreciated.
To make things clearer (and easily reproducible) let's define dataframes:
val df1 = Seq("A" -> "John", "B" -> "Nick", "A" -> "Mary",
              "B" -> "Kathy", "C" -> "Sabrina", "B" -> "George")
  .toDF("key", "value")
val df2 = Seq("A", "B", "C", "D", "E", "F").toDF("key")
From what I see, you are trying to create one column by value in the key column of df2. These columns should contain all the values of the value column that are associated to the key naming the column. If we take an example, column A's first value should be the value of the first occurrence of A (if it exists, null otherwise): "John". Its second value should be the value of the second occurrence of A: "Mary". There is no third value so the third value of the column should be null.
I spelled this out to show that we need a notion of rank of the values for each key (a window function), and then to group by that rank. It would go as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df1_win = df1
  .withColumn("id", monotonically_increasing_id)
  .withColumn("rank", rank() over Window.partitionBy("key").orderBy("id"))
// the id is just here to maintain the original order.

// getting the keys in df2. Add distinct if there are duplicates.
val keys = df2.collect.map(_.getAs[String](0)).sorted

// then it's just about pivoting
df1_win
  .groupBy("rank")
  .pivot("key", keys)
  .agg(first('value))
  .orderBy("rank")
  //.drop("rank") // I keep it here for clarity
  .show()
+----+----+------+-------+----+----+----+
|rank| A| B| C| D| E| F|
+----+----+------+-------+----+----+----+
| 1|John| Nick|Sabrina|null|null|null|
| 2|Mary| Kathy| null|null|null|null|
| 3|null|George| null|null|null|null|
+----+----+------+-------+----+----+----+
Here is the very same code in Java:
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.functions;

Dataset<Row> df1_win = df1
        .withColumn("id", functions.monotonically_increasing_id())
        .withColumn("rank", functions.rank().over(Window.partitionBy("key").orderBy("id")));
// the id is just here to maintain the original order.

// getting the keys in df2. Add distinct if there are duplicates.
// Note that it is a list of objects, to match the (strange) signature of pivot
List<Object> keys = df2.collectAsList().stream()
        .map(x -> x.getString(0))
        .sorted()
        .collect(Collectors.<Object>toList());

// then it's just about pivoting
df1_win
        .groupBy("rank")
        .pivot("key", keys)
        .agg(functions.first(functions.col("value")))
        .orderBy("rank")
        // .drop("rank") // I keep it here for clarity
        .show();
I have a Dataset<Row> in Java. I need to read the value of one column, which is a JSON string, parse it, and set the values of a few other columns based on the parsed JSON.
My dataset looks like this:
| json                    | name | age  |
=========================================
| "{'a':'john', 'b': 23}" | null | null |
| "{'a':'joe', 'b': 25}"  | null | null |
| "{'a':'zack'}"          | null | null |
-----------------------------------------
And I need to make it like this:
| json                    | name   | age  |
===========================================
| "{'a':'john', 'b': 23}" | 'john' | 23   |
| "{'a':'joe', 'b': 25}"  | 'joe'  | 25   |
| "{'a':'zack'}"          | 'zack' | null |
-------------------------------------------
I am unable to figure out a way to do it. Please help with the code.
There is a function get_json_object in Spark.
Assuming you have a dataframe named df, you can solve your problem this way:
df.selectExpr("get_json_object(json, '$.a') as name", "get_json_object(json, '$.b') as age" )
But first and foremost, be sure that your json attribute has double quotes instead of single ones.
Note: there is a full list of Spark SQL functions in the documentation. I use it heavily; consider bookmarking it and referring to it from time to time.
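If you need this from the Java API (the question uses Dataset<Row>), a rough sketch with the same function could look like the one below; df is assumed to be your dataset, and the cast of age to int is my assumption about the desired type:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.get_json_object;

// get_json_object returns strings, so cast age explicitly
Dataset<Row> parsed = df
        .withColumn("name", get_json_object(col("json"), "$.a"))
        .withColumn("age", get_json_object(col("json"), "$.b").cast("int"));
parsed.show();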
You could use UDFs:
import org.apache.spark.sql.functions.udf

def parseName(json: String): String = ??? // parse json
val parseNameUDF = udf[String, String](parseName)

def parseAge(json: String): Int = ??? // parse json
val parseAgeUDF = udf[Int, String](parseAge)

dataFrame
  .withColumn("name", parseNameUDF(dataFrame("json")))
  .withColumn("age", parseAgeUDF(dataFrame("json")))
I have a Dataset like the one below:
Dataset<Row> dataset = ...
dataset.show()
+------+----------+
| NAME | DOB      |
+------+----------+
| John | 19801012 |
| Mark | 19760502 |
| Mick | 19911208 |
+------+----------+
I want to convert it to the following (DOB formatted):
+------+------------+
| NAME | DOB        |
+------+------------+
| John | 1980-10-12 |
| Mark | 1976-05-02 |
| Mick | 1991-12-08 |
+------+------------+
How can I do this? Basically, I am trying to figure out how to manipulate existing column string values in a generic way.
I tried using dataset.withColumn but couldn't quite figure out how to achieve this.
Appreciate any help.
With "substring" and "concat" functions:
df.withColumn("DOB_FORMATTED",
  concat(substring($"DOB", 0, 4), lit("-"), substring($"DOB", 5, 2), lit("-"), substring($"DOB", 7, 2)))
Load the data into a dataframe (deltaData) and just use the following line:
deltaData.withColumn("DOB", date_format(to_date($"DOB", "yyyyMMdd"), "yyyy-MM-dd")).show()
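For the Java API, an equivalent sketch of the same approach; it assumes Spark 2.2+ (where to_date accepts a format string) and that dataset is the Dataset<Row> from the question:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.date_format;
import static org.apache.spark.sql.functions.to_date;

// Parse the yyyyMMdd string into a date, then render it back as yyyy-MM-dd
Dataset<Row> formatted = dataset.withColumn(
        "DOB",
        date_format(to_date(col("DOB"), "yyyyMMdd"), "yyyy-MM-dd"));
formatted.show();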
Assuming DOB is a String, you could write a UDF:
import org.apache.spark.sql.functions.udf

def formatDate(s: String): String = {
  ??? // date formatting code
}
val formatDateUdf = udf(formatDate(_: String))

ds.select($"NAME", formatDateUdf($"DOB").as("DOB"))
I have two datasets, shown below. I'm trying to find out how many products are associated with each game, i.e. to keep a count of the number of associated products.
scala> df1.show()
gameid | games      | users          | cnt_assoc_prod
------------------------------------------------------
1      | cricket    | [111, 121]     |
2      | basketball | [211]          |
3      | skating    | [101, 100, 98] |
scala> df2.show()
user | products
---------------
98   | "shampoo"
100  | "soap"
101  | "shampoo"
111  | "shoes"
121  | "honey"
211  | "shoes"
I'm trying to iterate through each user in df1's users array and find the corresponding row in df2 by applying a filter on the user column.
df1.map { x => {
  var assoc_products = new Set()
  x.users.foreach(y =>
    assoc_products + df2.filter(z => z.user == y).first().products)
  x.cnt_assoc_prod = assoc_products.size
}}
While applying the filter I get the following exception:
java.lang.NullPointerException
    at org.apache.spark.sql.Dataset.logicalPlan(Dataset.scala:784)
    at org.apache.spark.sql.Dataset.mapPartitions(Dataset.scala:344)
    at org.apache.spark.sql.Dataset.filter(Dataset.scala:307)
I'm using spark version 1.6.1.
You can explode the users column in df1, join with df2 on the user column, then do the groupBy count:
(df1.withColumn("user", explode(col("users")))
    .join(df2, Seq("user"))
    .groupBy("gameid", "games")
    .agg(count($"products").alias("cnt_assoc_prod"))
).show
+------+----------+--------------+
|gameid| games|cnt_assoc_prod|
+------+----------+--------------+
| 3| skating| 3|
| 2|basketball| 1|
| 1| cricket| 2|
+------+----------+--------------+
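In case the Java API is what you need (several related questions here use it), a hedged sketch of the same explode / join / groupBy approach follows; df1 and df2 are assumed to be the DataFrames shown above, on the Spark 1.6 DataFrame API:
import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.explode;

DataFrame result = df1
        .withColumn("user", explode(col("users")))   // one row per (game, user)
        .join(df2, "user")                            // attach each user's product
        .groupBy("gameid", "games")
        .agg(count(col("products")).alias("cnt_assoc_prod"));
result.show();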
Given the following input table:
+----+------------+----------+
| id | shop | purchases|
+----+------------+----------+
| 1 | 01 | 20 |
| 1 | 02 | 31 |
| 2 | 03 | 5 |
| 1 | 03 | 3 |
+----+------------+----------+
I would like, grouping by id and ordering by purchases, to obtain the top 2 shops as follows:
+----+-------+------+
| id | top_1 | top_2|
+----+-------+------+
| 1 | 02 | 01 |
| 2 | 03 | |
+----+-------+------+
I'm using Apache Spark 2.0.1, and the first table is the result of other queries and joins performed on a Dataset. I could maybe do this with traditional Java iteration over the Dataset, but I hope there is another way using the Dataset functionality.
My first attempt was the following:
// dataset is already ordered by id, purchases desc
...
Dataset<Row> ds = dataset.repartition(new Column("id"));
ds.foreachPartition(new ForeachPartitionFunction<Row>() {
    @Override
    public void call(Iterator<Row> itrtr) throws Exception {
        int counter = 0;
        while (itrtr.hasNext()) {
            Row row = itrtr.next();
            if (counter < 2) {
                // save it into another Dataset
            }
            counter++;
        }
    }
});
But then I was lost on how to save it into another Dataset. My goal is, in the end, to save the result into a MySQL table.
Using window functions and pivot, you can first define a window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, row_number}
val w = Window.partitionBy(col("id")).orderBy(col("purchases").desc)
add row_number and filter top two rows:
val dataset = Seq(
(1, "01", 20), (1, "02", 31), (2, "03", 5), (1, "03", 3)
).toDF("id", "shop", "purchases")
val topTwo = dataset.withColumn("top", row_number.over(w)).where(col("top") <= 2)
and pivot:
topTwo.groupBy(col("id")).pivot("top", Seq(1, 2)).agg(first("shop"))
with result being:
+---+---+----+
| id| 1| 2|
+---+---+----+
| 1| 02| 01|
| 2| 03|null|
+---+---+----+
I'll leave converting the syntax to Java as an exercise for the poster (apart from static imports for the functions, the rest should be close to identical).
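For what it's worth, a sketch of that Java translation, assuming Spark 2.0+ and that dataset is the input Dataset<Row>; the renamed top_1 / top_2 columns are my addition to match the desired output:
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.first;
import static org.apache.spark.sql.functions.row_number;

// Window over each id, highest purchases first
WindowSpec w = Window.partitionBy(col("id")).orderBy(col("purchases").desc());

// Rank the shops per id and keep the top two
Dataset<Row> topTwo = dataset
        .withColumn("top", row_number().over(w))
        .where(col("top").leq(2));

// Pivot the rank into columns and take the shop for each rank
Dataset<Row> result = topTwo
        .groupBy(col("id"))
        .pivot("top", Arrays.<Object>asList(1, 2))
        .agg(first(col("shop")))
        .withColumnRenamed("1", "top_1")
        .withColumnRenamed("2", "top_2");
result.show();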