I have a column in a Spark DataFrame called
time_span
whose values are ISO 8601 durations,
e.g. P0Y0M0DT0H5M35S. I want to convert these values into seconds. Is there a function in Spark or Scala that will help me do that? I have been looking for a way and was unsuccessful.
I tried with java.time.Duration:
import java.time.Duration
java.time.Duration.parse("P0Y0M0DT0H5M35S")
This gives me an error:
java.time.format.DateTimeParseException: Text cannot be parsed to a Duration
Am I doing anything wrong in how I pass the value to the function? I found this documentation:
https://docs.oracle.com/javase/8/docs/api/java/time/Duration.html
If I can get this working, I will then have to apply additional logic to do it across the whole DataFrame column.
Hope the approach below helps you.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val isoToSecondsUDF = udf((value: String) =>
  // Keep only the time part after "T", re-prefix it with "PT", and let java.time.Duration parse it
  java.time.Duration.parse("PT".concat(value.split("T")(1)))
    .get(java.time.temporal.ChronoUnit.SECONDS))
val df=Seq(("P0Y0M0DT0H5M35S")).toDF("value")
df.withColumn("seconds",isoToSecondsUDF($"value")).show()
/*
+---------------+-------+
| value|seconds|
+---------------+-------+
|P0Y0M0DT0H5M35S| 335|
+---------------+-------+
*/
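For reference, the reason the direct java.time.Duration.parse call in the question fails is that Duration only accepts day-and-time-based strings (PnDTnHnMn.nS); year and month designators are rejected, which is why the UDF above keeps only the part after "T". A minimal sketch:
import java.time.Duration

// Day-and-time-based strings parse fine
Duration.parse("P0DT0H5M35S").getSeconds    // 335

// but a string with year/month designators throws DateTimeParseException
// Duration.parse("P0Y0M0DT0H5M35S")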
Updated solution to cover the case where months and days are present,
e.g. P0Y0M2DT23H59M56S and P0Y1M2DT23H59M56S.
We will need to use the Time4J library: https://github.com/MenoData/Time4J
Here is the code:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import net.time4j.Duration
def getSeconds(value: String): String = {
  // Number of months and days in the ISO 8601 period, extracted via Time4J
  val months = Duration.parsePeriod(value).toTemporalAmount().get(java.time.temporal.ChronoUnit.MONTHS)
  val days = Duration.parsePeriod(value).toTemporalAmount().get(java.time.temporal.ChronoUnit.DAYS)
  // Approximate each month as 30 days and fold it into the day count
  val totalDays = ((months * 30) + days).toString()
  // Rebuild a day-and-time-based string that java.time.Duration can parse
  val timePart = if (value.contains("T")) value.split("T")(1) else value.split("D")(1)
  java.time.Duration.parse("P".concat(totalDays).concat("DT").concat(timePart))
    .get(java.time.temporal.ChronoUnit.SECONDS)
    .toString()
}
val isoToSecondsUDF = udf( (value: String) => getSeconds(value))
spark.udf.register("isoToSecondsUDF", isoToSecondsUDF)
val df=Seq(("P0Y0M2DT23H59M56S")).toDF("value")
df.withColumn("seconds",isoToSecondsUDF($"value")).show()
First get the number of months, convert it to days (at 30 days per month) and add it to the existing number of days, then pass that to the parse method. For example, P0Y1M2DT23H59M56S becomes 1 * 30 + 2 = 32 days, and 32 * 86400 + 86396 = 2851196 seconds, matching the output below.
Output:
+-----------------+-------+
| value|seconds|
+-----------------+-------+
|P0Y0M2DT23H59M56S| 259196|
+-----------------+-------+
+-----------------+-------+
| value|seconds|
+-----------------+-------+
|P0Y1M2DT23H59M56S|2851196|
+-----------------+-------+
I have a DataFrame with one column that I convert to a string through the function date_format.
lrPredictions.filter("label > 0").selectExpr("item_id",
"horizon_minutes",
"date_format(date_time, '1970-01-01 HH:mm:ss')" + " AS datetime_from",
"abs(prediction - label) AS error_abs_sum", // these are all the error_abs_sum, error_squ_sum and so on...
"power(prediction - label,2) AS error_squ_sum",
"100 * abs(prediction - label) / label AS error_per_sum",
"abs(last_value - label) AS delta_sum")
However, I want to take that same column back to TimestampType, as I need to dump the DataFrame into a DB with a date column.
How can I do that? I have not found any function or example in Java.
Depending on the format the date is available in, you can use the snippet below, specifying the format in which you are providing the string, to get a timestamp.
Note that I have imported the types from Spark. This code is in Scala, but the Java code should be similar.
import org.apache.spark.sql.types._
val df = sc.parallelize(List("2018-08-11 11:44:50", "2019-09-11 11:20:00")).toDF
import org.apache.spark.sql.functions._
val df2 = df.select(unix_timestamp(col("value")).cast(TimestampType))
If you look at the schema of df2, it will be of timestamp type:
root
|-- CAST(unix_timestamp(value, yyyy-MM-dd HH:mm:ss) AS TIMESTAMP): timestamp (nullable = true)
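The example above relies on the default yyyy-MM-dd HH:mm:ss pattern; if your strings come in a different pattern, you can pass the format to unix_timestamp explicitly. A minimal sketch, assuming hypothetical day-first strings (not from the question):
// Specify the pattern of the incoming strings explicitly
val df3 = sc.parallelize(List("11/08/2018 11:44:50")).toDF
val df4 = df3.select(unix_timestamp(col("value"), "dd/MM/yyyy HH:mm:ss").cast(TimestampType))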
You can try something like this (the code is in Scala but it shouldn't matter in this case):
tmp.createTempView("temp_3")
tmp.show
+-------------------+---+---+
| ts| b| c|
+-------------------+---+---+
|1970-01-01 12:00:00|0.3|0.4|
|2014-01-01 12:00:00|0.1|0.4|
|2019-01-03 15:30:05|0.2|0.5|
+-------------------+---+---+
spark.sql("SELECT unix_timestamp(ts) as ts FROM temp_3").show
+----------+
| ts|
+----------+
| 43200|
|1388577600|
|1546529405|
+----------+
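Note that unix_timestamp returns epoch seconds as a bigint rather than a timestamp; if the target DB column needs an actual timestamp type, you can cast the result back. A small follow-up sketch on the same temp view:
// Cast the epoch seconds back to a proper timestamp column
spark.sql("SELECT CAST(unix_timestamp(ts) AS TIMESTAMP) AS ts FROM temp_3").show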
I have a DataFrame containing below:
TradeId|Source
ABC|"USD,333.123,20170605|USD,-789.444,20170605|GBP,1234.567,20150602"
I want to pivot this data so it turns into the following:
TradeId|CCY|PV
ABC|USD|333.123
ABC|USD|-789.444
ABC|GBP|1234.567
The number of CCY|PV|Date triplets in the column "Source" is not fixed. I could do it with an ArrayList, but that requires loading the data into the JVM and defeats the whole point of Spark.
Let's say my DataFrame looks as below:
DataFrame tradesSnap = this.loadTradesSnap(reportRequest);
String tempTable = getTempTableName();
tradesSnap.registerTempTable(tempTable);
tradesSnap = tradesSnap.sqlContext().sql("SELECT TradeId, Source FROM " + tempTable);
If you read the Databricks article on pivot, it says: "A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns." And this is not what you desire, I guess.
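Just for illustration (and not what you need here), a pivot would collapse the rows into one column per currency instead of producing one row per triplet. A minimal sketch on a hypothetical, already-split sample, assuming an active SparkSession named spark:
import org.apache.spark.sql.functions._
import spark.implicits._

// A pivot turns the distinct CCY values into columns and aggregates PV per TradeId,
// i.e. it collapses rows instead of expanding them
val sample = Seq(("ABC", "USD", 333.123), ("ABC", "USD", -789.444), ("ABC", "GBP", 1234.567))
  .toDF("TradeId", "CCY", "PV")
sample.groupBy("TradeId").pivot("CCY").agg(sum("PV")).show()
// one row for ABC, with a GBP column and a USD column holding the summed PVs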
I would suggest you use withColumn and built-in functions to get the final output you desire. You can do the following, considering dataframe is what you have:
+-------+----------------------------------------------------------------+
|TradeId|Source |
+-------+----------------------------------------------------------------+
|ABC |USD,333.123,20170605|USD,-789.444,20170605|GBP,1234.567,20150602|
+-------+----------------------------------------------------------------+
You can do the following, using explode, split and withColumn, to get the desired output:
val explodedDF = dataframe.withColumn("Source", explode(split(col("Source"), "\\|")))
val finalDF = explodedDF.withColumn("CCY", split($"Source", ",")(0))
.withColumn("PV", split($"Source", ",")(1))
.withColumn("Date", split($"Source", ",")(2))
.drop("Source")
finalDF.show(false)
The final output is
+-------+---+--------+--------+
|TradeId|CCY|PV |Date |
+-------+---+--------+--------+
|ABC |USD|333.123 |20170605|
|ABC |USD|-789.444|20170605|
|ABC |GBP|1234.567|20150602|
+-------+---+--------+--------+
I hope this solves your issue
Rather than pivoting, what you are trying to achieve looks more like flatMap.
To put it simply, by using flatMap on a Dataset you apply to each row a function (the "map" part) that itself produces a sequence of rows. Each of these sequences is then concatenated into a single flat sequence (the "flat" part).
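A toy sketch of the semantics first (assuming an active SparkSession named spark), before applying it to the trade data:
import spark.implicits._

// Each input string expands into several output rows, which are then flattened
// into a single Dataset[String]
Seq("a,b", "c").toDS().flatMap(_.split(",")).show()
// +-----+
// |value|
// +-----+
// |    a|
// |    b|
// |    c|
// +-----+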
The following program shows the idea:
import org.apache.spark.sql.SparkSession
case class Input(TradeId: String, Source: String)
case class Output(TradeId: String, CCY: String, PV: String, Date: String)
object FlatMapExample {

  // This function will produce more rows of output for each line of input
  def splitSource(in: Input): Seq[Output] =
    in.Source.split("\\|", -1).map { source =>
      println(source)
      val Array(ccy, pv, date) = source.split(",", -1)
      Output(in.TradeId, ccy, pv, date)
    }

  def main(args: Array[String]): Unit = {
    // Initialization and loading
    val spark = SparkSession.builder().master("local").appName("pivoting-example").getOrCreate()
    import spark.implicits._

    val input = spark.read.options(Map("sep" -> "|", "header" -> "true")).csv(args(0)).as[Input]

    // For each line in the input, split the source and then
    // concatenate each "sub-sequence" in a single `Dataset`
    input.flatMap(splitSource).show
  }
}
Given your input, this would be the output:
+-------+---+--------+--------+
|TradeId|CCY| PV| Date|
+-------+---+--------+--------+
| ABC|USD| 333.123|20170605|
| ABC|USD|-789.444|20170605|
| ABC|GBP|1234.567|20150602|
+-------+---+--------+--------+
You can now take the result and save it to a CSV, if you want.
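For example, something along these lines inside main (the output path is just a placeholder):
// Write the flattened rows back out as CSV, with a header row
input.flatMap(splitSource)
  .write
  .option("header", "true")
  .csv("/tmp/trades-flat")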
After parsing JSON UTC date-time data from a server, I was presented with
2017-03-27 16:27:45.567
... is there any way to format this, without a tedious amount of String manipulation, so that the seconds part is rounded up to 46 prior to passing it in with a DateTimeFormat pattern of, say, "yyyy-MM-dd HH:mm:ss"?
You can round the second up like this:
DateTime dateTime = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss.SSS")
.withZoneUTC()
.parseDateTime("2017-03-27 16:27:45.567")
.secondOfMinute()
.roundCeilingCopy();
System.out.println(dateTime);
// 2017-03-27T16:27:46.000Z
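If you then need it back as text in the "yyyy-MM-dd HH:mm:ss" pattern from the question, Joda-Time's DateTime.toString(pattern) does that directly. Continuing from the dateTime above (shown here as a Scala one-liner; the Java call is the same):
// Format the rounded DateTime with the pattern from the question
val formatted = dateTime.toString("yyyy-MM-dd HH:mm:ss")
// formatted: "2017-03-27 16:27:46"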
Have you looked at (and could you use) the MomentJS library? I had issues with reading various date formats from the server and making sense of them in JavaScript code (which led me here). Since then, I've used MomentJS and working with dates/times in JavaScript has been much easier.
Here is an example:
<script>
try
{
var myDateString = "2017-03-27 16:27:45.567";
var d = moment(myDateString);
var result = d.format('YYYY/MM/DD HH:mm:ss');
alert("Simple Format: " + result);
// If we have milliseconds, increment to the next second so that
// we can then get its 'floor' by using the startOf() function.
if(d.millisecond() > 0)
d = d.add(1, 'second');
result = d.startOf('second').format('YYYY/MM/DD HH:mm:ss');
alert("Rounded Format: " + result);
}
catch(er)
{
console.log(er);
}
</script>
But of course, you'll probably want to wrap this logic into a function.
There is an input_file_name function in Apache Spark, which I use to add a new column to a Dataset with the name of the file currently being processed.
The problem is that I'd like to somehow customize this function to return only the file name, omitting its full path on S3.
For now, I am replacing the path in a second step using a map function:
val initialDs = spark.sqlContext.read
.option("dateFormat", conf.dateFormat)
.schema(conf.schema)
.csv(conf.path).withColumn("input_file_name", input_file_name)
...
...
def fromFile(fileName: String): String = {
val baseName: String = FilenameUtils.getBaseName(fileName)
val tmpFileName: String = baseName.substring(0, baseName.length - 8) //here is magic conversion ;)
this.valueOf(tmpFileName)
}
But I'd like to use something like
val initialDs = spark.sqlContext.read
.option("dateFormat", conf.dateFormat)
.schema(conf.schema)
.csv(conf.path).withColumn("input_file_name", **customized_input_file_name_function**)
In Scala:
import org.apache.spark.sql.functions.input_file_name

// register the udf (keep the returned UserDefinedFunction so it can be used in the DataFrame API)
val get_only_file_name = spark.udf
  .register("get_only_file_name", (fullPath: String) => fullPath.split("/").last)

// use the udf to get the last token (file name) in the full path
val initialDs = spark.read
  .option("dateFormat", conf.dateFormat)
  .schema(conf.schema)
  .csv(conf.path)
  .withColumn("input_file_name", get_only_file_name(input_file_name))
Edit: in Java, as per the comment:
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.input_file_name;
import org.apache.spark.sql.api.java.UDF1;

// register udf
spark.udf()
    .register("get_only_file_name", (UDF1<String, String>) fullPath -> {
        // keep everything after the last "/" (the file name itself)
        int lastIndex = fullPath.lastIndexOf("/");
        return fullPath.substring(lastIndex + 1);
    }, DataTypes.StringType);

// use the registered udf to get the last token (file name) in the full path
Dataset<Row> initialDs = spark.read()
    .option("dateFormat", conf.dateFormat)
    .schema(conf.schema)
    .csv(conf.path)
    .withColumn("input_file_name", callUDF("get_only_file_name", input_file_name()));
Borrowing from a related question here, the following method is more portable and does not require a custom UDF.
Spark SQL Code Snippet: reverse(split(path, '/'))[0]
Spark SQL Sample:
WITH sample_data as (
SELECT 'path/to/my/filename.txt' AS full_path
)
SELECT
full_path
, reverse(split(full_path, '/'))[0] as basename
FROM sample_data
Explanation:
The split() function breaks the path into its chunks and reverse() puts the final item (the file name) at the front of the array, so that [0] can extract just the filename.
Full code example:
spark.sql(
"""
|WITH sample_data as (
| SELECT 'path/to/my/filename.txt' AS full_path
| )
| SELECT
| full_path
| , reverse(split(full_path, '/'))[0] as basename
| FROM sample_data
|""".stripMargin).show(false)
Result :
+-----------------------+------------+
|full_path |basename |
+-----------------------+------------+
|path/to/my/filename.txt|filename.txt|
+-----------------------+------------+
Commons IO is a natural/easy import in Spark, meaning there is no need to add an additional dependency:
import org.apache.commons.io.FilenameUtils
getBaseName(String fileName)
Gets the base name, minus the full path and extension, from a full fileName.
val baseNameOfFile = udf((longFilePath: String) => FilenameUtils.getBaseName(longFilePath))
Usage looks like this:
yourdataframe.withColumn("shortpath" ,baseNameOfFile(yourdataframe("input_file_name")))
.show(1000,false)
I have a case class which I want to serialize first. After that, I want to deserialize it for storage in MongoDB, but Java 8's LocalDateTime is creating a problem. I took help from this link:
how to deserialize DateTime in Lift
but with no luck. I am unable to make it work for the Java 8 date-time type.
Can anyone please help me with this date-time issue? Here is my code:
import java.time.LocalDateTime
import net.liftweb.json.{NoTypeHints, Serialization}
import net.liftweb.json.Serialization.{read, write}
implicit val formats = Serialization.formats(NoTypeHints)
case class Child(var str: String, var Num: Int, var abc: Option[String], MyList: List[Int], val dateTime: LocalDateTime = LocalDateTime.now())
val ser = write(Child("Mary", 5, None, List(1, 2)))
println("Child class converted to string" + ser)
var obj = read[Child](ser)
println("object of Child is " + obj)
And here is the error message printed on the console:
(run-main-0) java.lang.ArrayIndexOutOfBoundsException: 49938
java.lang.ArrayIndexOutOfBoundsException: 49938
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.<init>(BytecodeReadingParanamer.java:451)
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.<init>(BytecodeReadingParanamer.java:431)
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.<init>(BytecodeReadingParanamer.java:492)
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.<init>(BytecodeReadingParanamer.java:337)
at com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:100)
at com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:75)
at com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:68)
at net.liftweb.json.Meta$ParanamerReader$.lookupParameterNames(Meta.scala:89)
at net.liftweb.json.Meta$Reflection$.argsInfo$1(Meta.scala:237)
at net.liftweb.json.Meta$Reflection$.constructorArgs(Meta.scala:253)
at net.liftweb.json.Meta$Reflection$.net$liftweb$json$Meta$Reflection$$findMostComprehensive$1(Meta.scala:266)
at net.liftweb.json.Meta$Reflection$$anonfun$primaryConstructorArgs$1.apply(Meta.scala:269)
at net.liftweb.json.Meta$Reflection$$anonfun$primaryConstructorArgs$1.apply(Meta.scala:269)
at net.liftweb.json.Meta$Memo.memoize(Meta.scala:199)
at net.liftweb.json.Meta$Reflection$.primaryConstructorArgs(Meta.scala:269)
at net.liftweb.json.Extraction$.decompose(Extraction.scala:88)
at net.liftweb.json.Extraction$$anonfun$1.applyOrElse(Extraction.scala:91)
at net.liftweb.json.Extraction$$anonfun$1.applyOrElse(Extraction.scala:89)
at scala.collection.immutable.List.collect(List.scala:305)
at net.liftweb.json.Extraction$.decompose(Extraction.scala:89)
at net.liftweb.json.Serialization$.write(Serialization.scala:38)
at TestActor$.delayedEndpoint$TestActor$1(TestActor.scala:437)
at TestActor$delayedInit$body.apply(TestActor.scala:54)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:383)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at TestActor$.main(TestActor.scala:54)
at TestActor.main(TestActor.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
If I remove the dateTime parameter from the case class, it works fine. It seems like the problem is in dateTime.
I ran your code in my IntelliJ IDEA and got the same error. I tried to debug the cause, but the invocation stack is so deep that I finally gave up.
I guess it is because Lift doesn't provide a default Format for LocalDateTime, just as the post you mentioned said: "it is the DateParser format that Lift uses by default."
Here is a compromise for your reference. Lift-JSON provides a default Date format for us:
// net.liftweb.json.Serialization Line 72
def formats(hints: TypeHints) = new Formats {
val dateFormat = DefaultFormats.lossless.dateFormat
override val typeHints = hints
}
So instead of going all the way to write a customized serializer, we may as well change our data type to fit the default Date format. Plus, net.liftweb.mongodb.DateSerializer (line 79) provides support for Date serialization.
Then, we can provide a method to easily get a LocalDateTime. The following is how I figured it out.
package jacoffee.scalabasic
import java.time.{ ZoneId, LocalDateTime }
import java.util.Date
// The package object provides implicit conversions so the Scala compiler can convert
// between LocalDateTime and Date for the case-class parameter `date`
package object stackoverflow {
implicit def toDate(ldt: LocalDateTime): Date =
Date.from(ldt.atZone(ZoneId.systemDefault()).toInstant())
implicit def toLDT(date: Date): LocalDateTime =
LocalDateTime.ofInstant(date.toInstant(), ZoneId.systemDefault())
}
package jacoffee.scalabasic.stackoverflow
import java.time.LocalDateTime
import java.util.Date
import net.liftweb.json.{ NoTypeHints, Serialization }
import net.liftweb.json.Serialization.{ read, write }
case class Child(var str: String, var Num: Int, var abc: Option[String],
myList: List[Int], val date : Date = LocalDateTime.now()) {
def getLDT: LocalDateTime = date
}
object DateTimeSerialization extends App {
implicit val formats = Serialization.formats(NoTypeHints)
val ser = write(Child("Mary", 5, None, List(1, 2)))
// Child class converted to string {"str":"Mary","Num":5,"myList":[1,2],"date":"2015-07-21T03:07:05.699Z"}
println(" Child class converted to string " + ser)
var obj=read[Child](ser)
// Object of Child is Child(Mary,5,None,List(1, 2),Tue Jul 21 11:48:22 CST 2015)
println(" Object of Child is "+ obj)
// LocalDateTime of Child is 2015-07-21T11:48:22.075
println(" LocalDateTime of Child is "+ obj.getLDT)
}
Anyway, hope it helps.