I have a DataFrame that I want to convert into a JSON array. Please find the example below.
DataFrame:
+------------+--------------------+----------+----------------+------------------+
|        Name|                  id|request_id|create_timestamp|deadline_timestamp|
+------------+--------------------+----------+----------------+------------------+
|    Freeform|    59bbe3ad-f487-44| htvjiwmfe|   1589155200000|     1591272659556|
|         D23|    59bbe3ad-f487-44| htvjiwmfe|   1589155200000|     1591272659556|
|      Stores|    59bbe3ad-f487-44| htvjiwmfe|   1589155200000|     1591272659556|
|VacationClub|    59bbe3ad-f487-44| htvjiwmfe|   1589155200000|     1591272659556|
+------------+--------------------+----------+----------------+------------------+
The JSON I want looks like this:
[
  {
    "testname":"xyz",
    "systemResponse":[
      {
        "name":"FGH",
        "id":"59bbe3ad-f487-44",
        "request_id":1590791280,
        "create_timestamp":1590799280
      },
      {
        "name":"FGH",
        "id":"59bbe3ad-f487-44",
        "request_id":1590791280,
        "create_timestamp":1590799280
      }
    ]
  }
]
You can define two beans (case classes):
Create an Array of the inner bean from the first DataFrame.
Define a parent bean with testname and the request-details Array as its fields.
See also the inline comments in the code.
object DataToJsonArray {
  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess
    import spark.implicits._

    // Load your DataFrame
    val requestDetailArray = List(
      ("Freeform", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000", "1591272659556"),
      ("D23", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000", "1591272659556"),
      ("Stores", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000", "1591272659556"),
      ("VacationClub", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000", "1591272659556")
    ).toDF
      // Map your DataFrame to the RequestDetails bean
      .map(row => RequestDetails(row.getString(0), row.getString(1), row.getString(2), row.getString(3), row.getString(4)))
      // Collect it as an Array
      .collect()

    // Create another DataFrame from List[BaseClass], setting (testname, Array[RequestDetails])
    List(BaseClass("xyz", requestDetailArray)).toDF()
      .write
      // Output your DataFrame as JSON
      .json("/json/output/path")
  }
}

case class RequestDetails(Name: String, id: String, request_id: String, create_timestamp: String, deadline_timestamp: String)
case class BaseClass(testname: String = "xyz", systemResponse: Array[RequestDetails])
Check below code.
import org.apache.spark.sql.functions._
df.withColumn("systemResponse",
array(
struct("id","request_id","create_timestamp","deadline_timestamp").as("data")
)
)
.select("systemResponse")
.toJSON
.select(col("value").as("json_data"))
.show(false)
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|json_data |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|{"systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
+-----------------------------------------------------------------------------------------------------------------------------------------------+
Updated
scala> :paste
// Entering paste mode (ctrl-D to finish)
df.withColumn("systemResponse",
array(
struct("id","request_id","create_timestamp","deadline_timestamp").as("data")
)
)
.withColumn("testname",lit("xyz"))
.select("testname","systemResponse")
.toJSON
.select(col("value").as("json_data"))
.show(false)
// Exiting paste mode, now interpreting.
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|json_data |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"testname":"xyz","systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"testname":"xyz","systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"testname":"xyz","systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"testname":"xyz","systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
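Note that the output above has one JSON object per row, each with a one-element systemResponse array, whereas the question asks for a single parent object holding all rows. As a rough illustration of that target shape only, here is a plain-Java sketch without Spark. The class and method names are hypothetical, the values are assumed to need no JSON escaping, and a real job would use a JSON library or Spark's own writer instead of string concatenation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

class JsonArraySketch {
    // Render one row (name, id, request_id, create_timestamp) as a JSON object.
    // Assumption: the values contain no characters that need JSON escaping.
    static String rowToJson(String[] r) {
        return String.format(
            "{\"name\":\"%s\",\"id\":\"%s\",\"request_id\":\"%s\",\"create_timestamp\":%s}",
            r[0], r[1], r[2], r[3]);
    }

    // Nest every row under ONE parent object, wrapped in a top-level array.
    static String toJsonArray(String testname, List<String[]> rows) {
        String inner = rows.stream()
            .map(JsonArraySketch::rowToJson)
            .collect(Collectors.joining(","));
        return "[{\"testname\":\"" + testname + "\",\"systemResponse\":[" + inner + "]}]";
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Freeform", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000"});
        rows.add(new String[] {"D23", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000"});
        System.out.println(toJsonArray("xyz", rows));
    }
}
```

The point of the sketch is the nesting: all rows end up inside one systemResponse array instead of each row producing its own document.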
I want to parameterize whether the file has a header, and which separator it uses, when I read a CSV with Spark. I've written this:
DataFrameReader dataFrameReader = spark.read();
dataFrameReader = "csv".equalsIgnoreCase(params.getReadFileType()) ?
dataFrameReader
.option("sep",params.getDelimiter())
.option("header",params.isHeader())
:dataFrameReader;
I'm new to Groovy and can't get dataFrameReader.option mocked correctly.
DataFrameReader dfReaderLoader = Mock(DataFrameReader)
DataFrameReader dfReaderOptionString = Mock(DataFrameReader)
DataFrameReader dfReaderOptionBoolean = Mock(DataFrameReader)
SparkSession sparkSession = Mock(SparkSession)
sparkSession.read() >> dfReaderLoader
dfReaderLoader.option(_ as String, _ as String) >> dfReaderOptionString
dfReaderOptionString.option(_ as String, _ as Boolean) >> dfReaderOptionBoolean
And it gives me a null pointer exception.
java.lang.NullPointerException: Cannot invoke
"org.apache.spark.sql.DataFrameReader.option(String, boolean)" because
the return value of
"org.apache.spark.sql.DataFrameReader.option(String, String)" is null
I do not know what your problem is, but my guess is that you create the mocks, but then do not inject them into your class under test. If you do that, both your own version as well as Leonard's suggested improved version with a default response work:
Class under test + helper class:
class UnderTest {
SparkSession spark
Parameters params
DataFrameReader produce() {
DataFrameReader dataFrameReader = spark.read()
dataFrameReader = "csv".equalsIgnoreCase(params.getReadFileType()) ?
dataFrameReader
.option("sep", params.getDelimiter())
.option("header", params.isHeader())
: dataFrameReader
}
}
class Parameters {
String readFileType
String delimiter
boolean header
}
Spock specification:
package de.scrum_master.stackoverflow.q74923254
import org.apache.spark.sql.DataFrameReader
import org.apache.spark.sql.SparkSession
import org.spockframework.mock.MockUtil
import spock.lang.Specification
class DataFrameReaderTest extends Specification {
def 'read #readFileType data'() {
given:
DataFrameReader dfReaderLoader = Mock(DataFrameReader)
DataFrameReader dfReaderOptionString = Mock(DataFrameReader)
DataFrameReader dfReaderOptionBoolean = Mock(DataFrameReader)
SparkSession sparkSession = Mock(SparkSession)
sparkSession.read() >> dfReaderLoader
dfReaderLoader.option(_ as String, _ as String) >> dfReaderOptionString
dfReaderOptionString.option(_ as String, _ as Boolean) >> dfReaderOptionBoolean
def underTest = new UnderTest(spark: sparkSession, params: parameters)
expect:
underTest.produce().toString().contains(returnedMockName)
where:
readFileType | parameters | returnedMockName
'CSV' | new Parameters(readFileType: readFileType, delimiter: ';', header: true) | 'dfReaderOptionBoolean'
'XLS' | new Parameters(readFileType: readFileType) | 'dfReaderLoader'
}
def 'read #readFileType data (improved)'() {
given:
SparkSession sparkSession = Mock() {
read() >> Mock(DataFrameReader) {
_ >> _
}
}
def parameters = new Parameters(readFileType: readFileType, delimiter: ';', header: true)
def underTest = new UnderTest(spark: sparkSession, params: parameters)
expect:
new MockUtil().isMock(underTest.produce())
where:
readFileType << ['CSV', 'XLS']
}
}
Try it in the Groovy Web Console.
The result should look similar to this in your IDE:
DataFrameReaderTest ✔
├─ read #readFileType data ✔
│ ├─ read CSV data ✔
│ └─ read XLS data ✔
└─ read #readFileType data (improved) ✔
├─ read CSV data (improved) ✔
└─ read XLS data (improved) ✔
If you don't really care about the intermediate invocations of a builder pattern, i.e. an object that returns itself, I'd suggest using a Stub, which will return itself if the method's return type matches its type. Alternatively, you can use the declaration _ >> _ to achieve the same for Mocks.
given:
ThingBuilder builder = Mock() {
_ >> _
}
when:
Thing thing = builder
.id("id-42")
.name("spock")
.weight(100)
.build()
then:
1 * builder.build() >> new Thing(id: 'id-1337') // <-- only assert the last call you actually care about
thing.id == 'id-1337'
Try it in the Groovy Web Console.
That being said, the error would probably go away if you just remove the as String cast of the second argument of option, or fix it to be as Boolean as the error suggests.
The error was in the params. I was not sending the delimiter or the header, so it gave the error.
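To see why a fluent chain fails when an intermediate call returns null, here is a small plain-Java sketch without Spock or Spark. Reader, NullReturning and SelfReturning are hypothetical stand-ins: the first plays the role of a fluent reader such as DataFrameReader, the second behaves like a call with no matching stubbed response, and the third behaves like the self-returning stub described in the answer.

```java
class FluentStubSketch {
    // Hypothetical fluent interface, loosely modelled on a reader with option(...) overloads.
    interface Reader {
        Reader option(String key, String value);
        Reader option(String key, boolean value);
    }

    // Every call returns null, like an interaction with no matching response.
    static class NullReturning implements Reader {
        public Reader option(String key, String value) { return null; }
        public Reader option(String key, boolean value) { return null; }
    }

    // Every call returns the same object, so chains of any length keep working.
    static class SelfReturning implements Reader {
        public Reader option(String key, String value) { return this; }
        public Reader option(String key, boolean value) { return this; }
    }

    // The chained call under test: the second option(...) is invoked on
    // whatever the first one returned.
    static Reader configure(Reader reader) {
        return reader.option("sep", ";").option("header", true);
    }

    public static void main(String[] args) {
        Reader self = new SelfReturning();
        System.out.println(configure(self) == self); // true
        try {
            configure(new NullReturning());
        } catch (NullPointerException e) {
            System.out.println("second option(...) was called on null");
        }
    }
}
```

The same reasoning applies to the mocks above: if the arguments actually passed do not match a stubbed interaction, the first call yields null and the next link in the chain throws.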
I have a custom function that depends on the order of the data. I want to apply this function to each group in Spark, with the groups processed in parallel. How can I do that?
For example,
public ArrayList<Integer> my_logic(ArrayList<Integer> glist) {
    boolean b = true;
    ArrayList<Integer> result = new ArrayList<>();
    for (int i = 1; i < glist.size(); i++) { // Size is around 30000
        if (b && glist.get(i - 1) > glist.get(i)) {
            // some logic then set b to false
            result.add(glist.get(i));
        } else {
            // some logic then set b to true
        }
    }
    return result;
}
My data,
Col1 Col2
a 1
b 2
a 3
c 4
c 3
…. ….
I want something similar to below
df.group_by(col("Col1")).apply(my_logic(col("Col2")));
// output
a [1,3,5…]
b [2,5,8…]
…. ….
In Spark, you can use window aggregate functions directly; I will show that here in Scala.
Here is your input data (my preparation):
import scala.collection.JavaConversions._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val schema = StructType(
StructField("Col1", StringType, false) ::
StructField("Col2", IntegerType, false) :: Nil
)
val row = Seq(Row("a", 1),Row("b", 8),Row("b", 2),Row("a", 5),Row("b", 5),Row("a", 3))
val df = spark.createDataFrame(row, schema)
df.show(false)
//input:
// +----+----+
// |Col1|Col2|
// +----+----+
// |a |1 |
// |b |8 |
// |b |2 |
// |a |5 |
// |b |5 |
// |a |3 |
// +----+----+
Here is the code to obtain desired logic :
import org.apache.spark.sql.expressions.Window
df
// NEWCOLUMN: EVALUATE/CREATE LIST OF VALUES FOR EACH RECORD OVER THE WINDOW AS FRAME MOVES
.withColumn(
"collected_list",
collect_list(col("Col2")) over Window
.partitionBy(col("Col1"))
.orderBy(col("Col2"))
)
// NEWCOLUMN: MAX SIZE OF COLLECTED LIST IN EACH WINDOW
.withColumn(
"max_size",
max(size(col("collected_list"))) over Window.partitionBy(col("Col1"))
)
// FILTER TO GET ONLY HIGHEST SIZED ARRAY ROW
.where(col("max_size") - size(col("collected_list")) === 0)
.orderBy(col("Col1"))
.drop("Col2", "max_size")
.show(false)
// output:
// +----+--------------+
// |Col1|collected_list|
// +----+--------------+
// |a |[1, 3, 5] |
// |b |[2, 5, 8] |
// +----+--------------+
Note:
You can use the collect_list() aggregate function with groupBy directly, but then the order of the collected list is not guaranteed.
You can explore the collect_set() aggregate function if you want to eliminate duplicates (with some changes to the above query).
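For comparison, the same idea (bucket by Col1, fix the order within each bucket, then run an order-dependent function per bucket) can be sketched in plain Java without Spark. This is only an illustration of the data flow, not a replacement for the window query: GroupLogicSketch, row and applyPerGroup are hypothetical names, and the per-group function is passed in as a parameter because the body of my_logic is not given in the question.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class GroupLogicSketch {
    // Helper to build one (Col1, Col2) row.
    static Map.Entry<String, Integer> row(String col1, Integer col2) {
        return new AbstractMap.SimpleEntry<>(col1, col2);
    }

    static Map<String, List<Integer>> applyPerGroup(
            List<Map.Entry<String, Integer>> rows,
            Function<List<Integer>, List<Integer>> logic) {
        // Bucket the Col2 values by their Col1 key.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> r : rows) {
            groups.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        // Groups are independent of each other, so they can be processed in parallel.
        Map<String, List<Integer>> result = new ConcurrentHashMap<>();
        groups.entrySet().parallelStream().forEach(e -> {
            List<Integer> ordered = new ArrayList<>(e.getValue());
            Collections.sort(ordered); // fix the order before the order-dependent logic runs
            result.put(e.getKey(), logic.apply(ordered));
        });
        return new TreeMap<>(result); // sorted by key for stable printing
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> rows = Arrays.asList(
            row("a", 1), row("b", 8), row("b", 2), row("a", 5), row("b", 5), row("a", 3));
        // With the identity function this mirrors the collected lists shown above.
        System.out.println(applyPerGroup(rows, Function.identity())); // {a=[1, 3, 5], b=[2, 5, 8]}
    }
}
```

Sorting by value is an assumption taken from the orderBy(col("Col2")) in the query above; if the real ordering comes from another column, that column would have to be carried along and used as the sort key.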
EDIT 2: You can write your own collect_list() as a UDAF (UserDefinedAggregateFunction) in Scala Spark for DataFrames, like this.
Online docs: for Spark 2.3.0 and for the latest version.
The code below targets Spark 2.3.0.
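import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import scala.collection.mutable

object Your_Collect_Array extends UserDefinedAggregateFunction {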
override def inputSchema: StructType = StructType(
StructField("yourInputToAggFunction", LongType, false) :: Nil
)
override def dataType: ArrayType = ArrayType(LongType, false)
override def deterministic: Boolean = true
override def bufferSchema: StructType = {
StructType(
StructField("yourCollectedArray", ArrayType(LongType, false), false) :: Nil
)
}
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = new Array[Long](0)
}
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
buffer.update(
0,
buffer.getAs[mutable.WrappedArray[Long]](0) :+ input.getLong(0)
)
}
override def merge(
buffer1: MutableAggregationBuffer,
buffer2: Row
): Unit = {
buffer1.update(
0,
buffer1.getAs[mutable.WrappedArray[Long]](0) ++ buffer2
.getAs[mutable.WrappedArray[Long]](0)
)
}
override def evaluate(buffer: Row): Any =
buffer.getAs[mutable.WrappedArray[Long]](0)
}
//Below is the query with just one line change i.e., calling above written custom udf
df
// NEWCOLUMN : USING OUR CUSTOM UDF
.withColumn(
"your_collected_list",
Your_Collect_Array(col("Col2")) over Window
.partitionBy(col("Col1"))
.orderBy(col("Col2"))
)
// NEWCOLUMN: MAX SIZE OF COLLECTED LIST IN EACH WINDOW
.withColumn(
"max_size",
max(size(col("your_collected_list"))) over Window.partitionBy(col("Col1"))
)
// FILTER TO GET ONLY HIGHEST SIZED ARRAY ROW
.where(col("max_size") - size(col("your_collected_list")) === 0)
.orderBy(col("Col1"))
.drop("Col2", "max_size")
.show(false)
//Output:
// +----+-------------------+
// |Col1|your_collected_list|
// +----+-------------------+
// |a |[1, 3, 5] |
// |b |[2, 5, 8] |
// +----+-------------------+
Note:
UDFs are not that efficient in Spark, so use them only when you absolutely need them. They are mainly intended for data analytics.
I have to pivot the data in a file and then store it in another file. I am having some difficulty pivoting the data.
I have multiple files, that contain data which looks somewhat like I show below. The columns are variable lengths. I am trying to merge the files, first. But for some reason, the output is not correct. I haven't even tried the pivot method, but am not sure how to use it either.
How can this be achieved?
File 1:
0,26,27,30,120
201008,100,1000,10,400
201009,200,2000,20,500
201010,300,3000,30,600
File 2:
0,26,27,30,120,145
201008,100,1000,10,400,200
201009,200,2000,20,500,100
201010,300,3000,30,600,150
File 3:
0,26,27,120,145
201008,100,10,400,200
201009,200,20,500,100
201010,300,30,600,150
Output:
201008,26,100
201008,27,1000
201008,30,10
201008,120,400
201008,145,200
201009,26,200
201009,27,2000
201009,30,20
201009,120,500
201009,145,100
.....
I am not quite familiar with Spark, but am trying to use flatMap and flatMapValues. I am not sure how I can use it for now, but would appreciate some guidance.
import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.SparkSession;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class ExecutionTest {
public static void main(String[] args) {
Logger.getLogger("org.apache").setLevel(Level.WARN);
Logger.getLogger("org.spark_project").setLevel(Level.WARN);
Logger.getLogger("io.netty").setLevel(Level.WARN);
log.info("Starting...");
// Step 1: Create a SparkContext.
boolean isRunLocally = Boolean.valueOf(args[0]);
String filePath = args[1];
SparkConf conf = new SparkConf().setAppName("Variable File").set("serializer",
"org.apache.spark.serializer.KryoSerializer");
if (isRunLocally) {
log.info("System is running in local mode");
conf.setMaster("local[*]").set("spark.executor.memory", "2g");
}
SparkSession session = SparkSession.builder().config(conf).getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());
jsc.textFile(filePath, 2)
.map(new Function<String, String[]>() {
private static final long serialVersionUID = 1L;
@Override
public String[] call(String v1) throws Exception {
return StringUtils.split(v1, ",");
}
})
.foreach(new VoidFunction<String[]>() {
private static final long serialVersionUID = 1L;
@Override
public void call(String[] t) throws Exception {
for (String string : t) {
log.info(string);
}
}
});
}
}
This solution is in Scala as I am not a Java person; you should be able to adapt it and add sorting, caching, etc.
The data is as follows: 3 files, with a duplicate entry evident; get rid of it if you do not want it.
0, 5,10, 15, 20
202008, 5,10, 15, 20
202009,10,20,100,200
8 rows generated above.
0,888,999
202008, 5, 10
202009, 10, 20
4 rows generated above.
0, 5
202009,10
1 row, which is a duplicate.
// Bit lazy with columns names, but anyway.
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val inputPath: String = "/FileStore/tables/g*.txt"
val rdd = spark.read.text(inputPath)
.select(input_file_name, $"value")
.as[(String, String)]
.rdd
val rdd2 = rdd.zipWithIndex
val rdd3 = rdd2.map(x => (x._1._1, x._2, x._1._2.split(",").toList.map(_.trim.toInt)))
val rdd4 = rdd3.map { case (pfx, pfx2, list) => (pfx,pfx2,list.zipWithIndex) }
val df = rdd4.toDF()
df.show(false)
df.printSchema()
val df2 = df.withColumn("rankF", row_number().over(Window.partitionBy($"_1").orderBy($"_2".asc)))
df2.show(false)
df2.printSchema()
val df3 = df2.withColumn("elements", explode($"_3"))
df3.show(false)
df3.printSchema()
val df4 = df3.select($"_1", $"rankF", $"elements".getField("_1"), $"elements".getField("_2")).toDF("fn", "line_num", "val", "col_pos")
df4.show(false)
df4.printSchema()
df4.createOrReplaceTempView("df4temp")
val df51 = spark.sql("""SELECT hdr.fn, hdr.line_num, hdr.val AS pfx, hdr.col_pos
FROM df4temp hdr
WHERE hdr.line_num <> 1
AND hdr.col_pos = 0
""")
df51.show(100,false)
val df52 = spark.sql("""SELECT t1.fn, t1.val AS val1, t1.col_pos, t2.line_num, t2.val AS val2
FROM df4temp t1, df4temp t2
WHERE t1.col_pos <> 0
AND t1.col_pos = t2.col_pos
AND t1.line_num <> t2.line_num
AND t1.line_num = 1
AND t1.fn = t2.fn
""")
df52.show(100,false)
df51.createOrReplaceTempView("df51temp")
df52.createOrReplaceTempView("df52temp")
val df53 = spark.sql("""SELECT DISTINCT t1.pfx, t2.val1, t2.val2
FROM df51temp t1, df52temp t2
WHERE t1.fn = t2.fn
AND t1.line_num = t2.line_num
""")
df53.show(false)
returns:
+------+----+----+
|pfx |val1|val2|
+------+----+----+
|202008|888 |5 |
|202009|999 |20 |
|202009|20 |200 |
|202008|5 |5 |
|202008|10 |10 |
|202009|888 |10 |
|202008|15 |15 |
|202009|5 |10 |
|202009|10 |20 |
|202009|15 |100 |
|202008|20 |20 |
|202008|999 |10 |
+------+----+----+
What we see here is data wrangling: the data has to be massaged into temp views and then JOINed appropriately with SQL.
The key is knowing how to massage the data to make things easy. Note that there is no groupBy etc. Processing is per file; because the rows vary in length, I did not attempt the JOIN on RDDs, which are too inflexible for this. rankF gives the line number within each file, so line 1 is the header line that starts with 0.
This is what we call data wrangling. It is also what we call hard work for a few points on SO. This is one of my best efforts, and also one of the last of such efforts.
A weakness of this solution is the amount of work needed to identify the first record of each file; there are alternatives. Preprocessing (see https://www.cyberciti.biz/faq/unix-linux-display-first-line-of-file/) is what I would realistically consider.
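For a single file that fits in memory, the reshaping itself is small. As a sketch only, in plain Java without Spark, assuming the first line is the header (a leading 0 followed by the column ids) and every later line starts with the period key; UnpivotSketch and unpivot are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UnpivotSketch {
    // lines.get(0) is the header: its first cell is a placeholder and the
    // remaining cells are column ids. Each later line is a period key
    // followed by one value per column id.
    static List<String> unpivot(List<String> lines) {
        List<String> out = new ArrayList<>();
        if (lines.isEmpty()) {
            return out;
        }
        String[] header = lines.get(0).split(",");
        for (String line : lines.subList(1, lines.size())) {
            String[] cells = line.split(",");
            // Guard against rows that are shorter or longer than the header.
            int n = Math.min(header.length, cells.length);
            for (int i = 1; i < n; i++) {
                out.add(cells[0].trim() + "," + header[i].trim() + "," + cells[i].trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList(
            "0,26,27,30,120",
            "201008,100,1000,10,400",
            "201009,200,2000,20,500");
        unpivot(file1).forEach(System.out::println);
    }
}
```

Because the header travels with its own file, files with different column sets can each be unpivoted on their own and the results concatenated, with duplicates removed afterwards if they are unwanted.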
I have a YAML file whose contents I want to read in Scala, so I parse it to JSON using io.circe.yaml:
var js = yaml.parser.parse(ymlText)
var json=js.valueOr(null)
var jsonstring=json.toString
val json2 = parse(jsonstring)
The YAML text looks like this:
ALL:
  Category1:
    Subcategory11 : 1.5
    Subcategory12 : 0
    Subcategory13 : 0
    Subcategory14 : 0.5
  Category2:
    Subcategory21 : 1.5
    Subcategory22 : 0.3
    Subcategory23 : 0
    Subcategory24 : 0
What I want is to filter out the subcategories that have zero values. I've used this code:
val elements = (json2 \\"ALL" ).children.map(x=>(x.values))
var subCategories=elements.map{case(a,b)=>(b)}
var cats=elements.map{case(a,b)=>(b.asInstanceOf[Map[String,Double]])}
cats.map(x=>x.filter{case(a,b)=>b>0.0})
But the last line gives me this error:
scala.math.BigInt cannot be cast to java.lang.Double
I'm not sure why you do toString + parse and which parse is used but you probably don't need it. Also you didn't describe your expected result so here are a few guesses of what you might need:
import java.io._
import io.circe._
import io.circe.yaml._
import io.circe.parser._
def test(): Unit = {
// test data instead of a file
val ymlText =
"""
|ALL:
| Category1:
| Subcategory11 : 1.5
| Subcategory12 : 0
| Subcategory13 : 0
| Subcategory14 : 0.5
| Category2:
| Subcategory21 : 1.5
| Subcategory22 : 0.3
| Subcategory23 : 0
| Subcategory24 : 0
""".stripMargin
var js = yaml.parser.parse(new StringReader(ymlText))
var json: Json = js.right.get
val categories = (json \\ "ALL").flatMap(j => j.asObject.get.values.toList)
val subs = categories.flatMap(j => j.asObject.get.toList)
val elements: List[(String, Double)] = subs.map { case (k, v) => (k, v.asNumber.get.toDouble) }
.filter {
case (k, v) => v > 0.0
}
println(s"elements: $elements")
val allCategories = (json \\ "ALL").flatMap(j => j.asObject.get.toList).toMap
val filteredTree: Map[String, Map[String, Double]] = allCategories
.mapValues(catJson => catJson.asObject.get.toList.map { case (subName, subJson) => (subName, subJson.asNumber.get.toDouble) }
.filter { case (subName, subValue) => subValue > 0.0 }
.toMap)
println(s"filteredTree : $filteredTree")
}
And the output for that is:
elements: List((Subcategory11,1.5), (Subcategory14,0.5), (Subcategory21,1.5), (Subcategory22,0.3))
filteredTree : Map(Category1 -> Map(Subcategory11 -> 1.5, Subcategory14 -> 0.5), Category2 -> Map(Subcategory21 -> 1.5, Subcategory22 -> 0.3))
Hope one of those versions is what you needed.
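The cast error in the question comes from values of mixed numeric types (the whole-number 0 entries are not Doubles). The filtering step can be illustrated with a plain-Java sketch that converts each value with Number.doubleValue() instead of casting. This does not use circe; FilterPositiveSketch and keepPositive are hypothetical names, and the input map is assumed to be already parsed.

```java
import java.math.BigInteger;
import java.util.LinkedHashMap;
import java.util.Map;

class FilterPositiveSketch {
    // Keep only the subcategories whose value is greater than zero.
    static Map<String, Map<String, Double>> keepPositive(Map<String, Map<String, Number>> tree) {
        Map<String, Map<String, Double>> out = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Number>> category : tree.entrySet()) {
            Map<String, Double> kept = new LinkedHashMap<>();
            for (Map.Entry<String, Number> sub : category.getValue().entrySet()) {
                // Convert rather than cast: works for Double, Integer, BigInteger, ...
                double value = sub.getValue().doubleValue();
                if (value > 0.0) {
                    kept.put(sub.getKey(), value);
                }
            }
            out.put(category.getKey(), kept);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Number> category1 = new LinkedHashMap<>();
        category1.put("Subcategory11", 1.5);
        category1.put("Subcategory12", BigInteger.ZERO); // whole numbers may arrive as big integers
        category1.put("Subcategory13", 0);
        category1.put("Subcategory14", 0.5);
        Map<String, Map<String, Number>> tree = new LinkedHashMap<>();
        tree.put("Category1", category1);
        System.out.println(keepPositive(tree)); // {Category1={Subcategory11=1.5, Subcategory14=0.5}}
    }
}
```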
I have two fields of type java.sql.Timestamp in my DataFrame and I want to find the number of days between these two columns.
My data has this format: 2016-12-23 23:56:02.0 (yyyy-MM-dd HH:mm:ss.S)
I have tried lots of methods but did not find a solution. Can anyone help?
org.apache.spark.sql.functions is a treasure trove. For example, there is the datediff method that does exactly what you want: here is the ScalaDoc.
An example:
val spark: SparkSession = ??? // your spark session
val sc: SparkContext = ??? // your spark context
import spark.implicits._ // to better work with spark sql
import java.sql.Timestamp
final case class Data(id: Int, from: Timestamp, to: Timestamp)
val ds =
spark.createDataset(sc.parallelize(Seq(
Data(1, Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-11 00:00:00")),
Data(2, Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-21 00:00:00")),
Data(3, Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-23 00:00:00")),
Data(4, Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-07 00:00:00"))
)))
import org.apache.spark.sql.functions._
ds.select($"id", datediff($"from", $"to")).show()
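Running this snippet gives the following output. The values are negative because datediff(end, start) returns end minus start and from is passed as end here; use datediff($"to", $"from") to get positive day counts: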
+---+------------------+
| id|datediff(from, to)|
+---+------------------+
| 1| -10|
| 2| -20|
| 3| -22|
| 4| -6|
+---+------------------+
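The argument order is easy to get backwards, so here is a small plain-Java sketch of the same calculation with java.time. It is only meant to mirror the (end, start) order of datediff on calendar dates; DateDiffSketch and dateDiff are hypothetical names and no Spark is involved.

```java
import java.sql.Timestamp;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

class DateDiffSketch {
    // Same argument order as datediff(end, start): the number of calendar
    // days from start to end, ignoring the time-of-day part.
    static long dateDiff(Timestamp end, Timestamp start) {
        LocalDate endDate = end.toLocalDateTime().toLocalDate();
        LocalDate startDate = start.toLocalDateTime().toLocalDate();
        return ChronoUnit.DAYS.between(startDate, endDate);
    }

    public static void main(String[] args) {
        Timestamp from = Timestamp.valueOf("2017-01-01 00:00:00");
        Timestamp to = Timestamp.valueOf("2017-01-11 00:00:00");
        System.out.println(dateDiff(from, to)); // from passed as end: -10
        System.out.println(dateDiff(to, from)); // to passed as end: 10
    }
}
```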