Convert DDL String to Spark structType? - java

I have Hive/Redshift tables and I want to create a Spark DataFrame with precisely the DDL of the original tables, written in Java. Is there an option to achieve that?
I think it may be better to convert the DDL string to a Spark schema JSON, and from that create a DataFrame StructType. I started to investigate the Spark parser API:
String ddlString = "CREATE TABLE data.baab (" +
"id STRING, " +
"test STRING, " +
"test2 STRING, " +
"audit STRUCT<createdDate: TIMESTAMP, createdBy: STRING, lastModifiedDate: TIMESTAMP, lastModifiedBy: STRING>) " +
"USING parquet " +
"LOCATION 's3://test.com' " +
"TBLPROPERTIES ('transient_lastDdlTime' = '1676593278')";
SparkSqlParser parser = new SparkSqlParser();
and I can't see anything related to a DDL parser:
override def parseDataType(sqlText : _root_.scala.Predef.String) : org.apache.spark.sql.types.DataType = { /* compiled code */ }
override def parseExpression(sqlText : _root_.scala.Predef.String) : org.apache.spark.sql.catalyst.expressions.Expression = { /* compiled code */ }
override def parseTableIdentifier(sqlText : _root_.scala.Predef.String) : org.apache.spark.sql.catalyst.TableIdentifier = { /* compiled code */ }
override def parseFunctionIdentifier(sqlText : _root_.scala.Predef.String) : org.apache.spark.sql.catalyst.FunctionIdentifier = { /* compiled code */ }
override def parseMultipartIdentifier(sqlText : _root_.scala.Predef.String) : scala.Seq[_root_.scala.Predef.String] = { /* compiled code */ }
override def parseTableSchema(sqlText : _root_.scala.Predef.String) : org.apache.spark.sql.types.StructType = { /* compiled code */ }
override def parsePlan(sqlText : _root_.scala.Predef.String) : org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = { /* compiled code */ }
protected def astBuilder : org.apache.spark.sql.catalyst.parser.AstBuilder
protected def parse[T](command : _root_.scala.Predef.String)(toResult : scala.Function1[org.apache.spark.sql.catalyst.parser.SqlBaseParser, T]) : T = { /* compiled code */ }
This is what I tried:
StructType struct = null;
// Grab the text between the outermost parentheses (the column list);
// note [^()]* does not handle nested parentheses
Pattern pattern = Pattern.compile("\\(([^()]*)\\)");
Matcher matcher = pattern.matcher(ddlString);
if (matcher.find()) {
    String result = matcher.group(1);
    struct = StructType.fromDDL(result);
}
return struct;
This works, but I'm afraid this solution will not cover all the cases.
Any suggestions?
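One way to avoid the regex for the schema step: the parseTableSchema method in the listing above, and its public wrapper StructType.fromDDL, both accept the comma-separated column definitions directly, including nested STRUCTs. A minimal sketch, assuming Spark 3.x on the classpath:

```java
import org.apache.spark.sql.types.StructType;

public class DdlSchema {
    public static void main(String[] args) {
        // Column definitions only, without the CREATE TABLE wrapper
        String columns = "id STRING, test STRING, test2 STRING, "
                + "audit STRUCT<createdDate: TIMESTAMP, createdBy: STRING, "
                + "lastModifiedDate: TIMESTAMP, lastModifiedBy: STRING>";

        // fromDDL delegates to the Catalyst parser, so nested STRUCTs are handled
        StructType schema = StructType.fromDDL(columns);
        schema.printTreeString();
    }
}
```

Note this only replaces the fromDDL step; pulling the column list out of a full CREATE TABLE statement still needs your regex or a walk over the plan returned by parsePlan.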

Related

functionally combine list of same object

I have a set of alerts that I need to combine and output. I'm struggling to see how I can do this functionally. I have everything I need; I just want to combine, format a little, and output.
orderedStatuses contains a set of alerts
data class Alert(
val status: String,
val recordId: String
)
This is what I'm currently returning
Alerts:
Status1 :
000000000000
Status1 :
111111111111
Status2 :
222222222222
Status2 :
333333333333
Status3 :
444444444444
Status3 :
555555555555
this is what I want:
Alerts:
status1 :
('00000', '111111')
status2 :
('222222', '333333')
status3 :
('444444', '55555')
current code:
val alert = if (orderedStatuses.isEmpty()) {
"No alerts found for status"
} else {
"Records:\n" + orderedStatuses.joinToString("\n") { it ->
"\t${it.status} : \n" + it.recordId
}
}
data class Alert(
val status: String,
val recordId: String
)
val alerts = listOf(
Alert("Status1", "00000"),
Alert("Status1", "111111"),
Alert("Status2", "222222"),
Alert("Status2", "333333"),
Alert("Status3", "444444"),
Alert("Status3", "55555")
)
alerts
.groupBy { it.status }
.map { map -> map.key + " : \n('" + map.value.joinToString("', '") { it.recordId } + "')\n" }
.forEach { print(it) }
This will print:
Status1 :
('00000', '111111')
Status2 :
('222222', '333333')
Status3 :
('444444', '55555')
This might be more readable:
alerts
.groupBy(Alert::status)
.map { (key, value) ->
key + " : \n('" + value.joinToString("', '", transform = Alert::recordId) + "')\n"
}
.forEach(::print)
Detailed example on Kotlin Playground

Scala udf UnsupportedOperationException

I have a DataFrame a2 written in Scala:
val a3 = a2.select(printme.apply(col("PlayerReference")))
The column PlayerReference contains a string.
This calls a UDF:
val printme = udf({
st: String =>
val x = new JustPrint(st)
x.printMe();
})
this udf function calls a java class :
public class JustPrint {
private String ss = null;
public JustPrint(String ss) {
this.ss = ss;
}
public void printMe() {
System.out.println("Value : " + this.ss);
}
}
but I get this error for the UDF:
java.lang.UnsupportedOperationException: Schema for type Unit is not supported
The goal of this exercise is to validate the chain of calls.
What should I do to solve this problem ?
The reason you're getting this error is that your UDF doesn't return anything, which, in Scala terms, is called Unit.
What you should do depends on what you actually want but, assuming you just want to track values coming through your UDF, you should change either printMe so that it returns a String, or the UDF itself.
Like this:
public String printMe() {
System.out.println("Value : " + this.ss);
return this.ss;
}
or like this:
val printme = udf({
  st: String =>
    val x = new JustPrint(st)
    x.printMe()
    st // return the String; returning the JustPrint object would hit the same schema error
})

Extract specific token out of ANTLR Parse Tree

I'm trying to extract data from the ANTLR parse tree, but I'm not fully grasping how this should be done correctly.
Let's say I have the following two SQL queries:
// language=SQL
val sql3 = """
CREATE TABLE session(
id uuid not null
constraint account_pk
primary key,
created timestamp default now() not null
)
""".trimIndent()
// language=SQL
val sql4 = """
CREATE TABLE IF NOT EXISTS blah(
id uuid not null
constraint account_pk
primary key,
created timestamp default now() not null
)
""".trimIndent()
Now I parse both of them:
val visitor = Visitor()
listOf(sql3, sql4).forEach { sql ->
val lexer = SQLLexer(CharStreams.fromString(sql))
val parser = SQLParser(CommonTokenStream(lexer))
visitor.visit(parser.sql())
println(visitor.tableName)
}
In my visitor, if I visit the tableCreateStatement I get the parse tree, but obviously just grabbing child 1 will work for sql3 and not for sql4, since child 1 in sql4 is IF NOT EXISTS:
class Visitor : SQLParserBaseVisitor<Unit>() {
var tableName = ""
override fun visitCreate_table_statement(ctx: SQLParser.Create_table_statementContext?) {
tableName = ctx?.getChild(1)?.text ?: ""
super.visitCreate_table_statement(ctx)
}
}
Is there a way to find a specific token in the parse tree?
I'm assuming the payload has something to do with it, but since it's of type Any, I'm not sure what to check it against:
override fun visitCreate_table_statement(ctx: SQLParser.Create_table_statementContext?) {
ctx?.children?.forEach {
if (it.payload.javaClass == SQLParser::Schema_qualified_nameContext) {
tableName = it.text
}
}
super.visitCreate_table_statement(ctx)
}
EDIT: the .g4 files are from
https://github.com/pgcodekeeper/pgcodekeeper/tree/master/apgdiff/antlr-src
this seems to work
override fun visitCreate_table_statement(ctx: SQLParser.Create_table_statementContext?) {
ctx?.children?.forEach {
if (it.payload.javaClass == Schema_qualified_nameContext::class.java) {
tableName = it.text
}
}
super.visitCreate_table_statement(ctx)
}
For branching trees
fun walkLeaves(
childTree: ParseTree = internalTree,
leave: (childTree: ParseTree) -> Unit) {
if (childTree.childCount == 0) {
if (!childTree.text?.trim().isNullOrBlank()) {
leave(childTree)
}
} else {
for (i in 0 until childTree.childCount) {
walkLeaves(childTree = childTree.getChild(i), leave = leave)
}
}
}
fun extractSQL(
childTree: ParseTree,
tokens: MutableList<String> = mutableListOf()
): String {
walkLeaves(childTree = childTree) { leave ->
tokens.add(leave.text)
}
...
}

Junit5 TestReporter

I was trying to understand TestReporter in JUnit 5:
@BeforeEach
void beforeEach(TestInfo testInfo) {
}
@ParameterizedTest
@ValueSource(strings = "foo")
void testWithRegularParameterResolver(String argument, TestReporter testReporter) {
    testReporter.publishEntry("argument", argument);
}
@AfterEach
void afterEach(TestInfo testInfo) {
    // ...
}
What is the use of publishEntry in TestReporter? Can someone explain? Thanks in advance.
TestReporter, in conjunction with TestInfo, gives you information about the current test instance, which you can then publish; in this example it is used as a kind of logger.
StringBuffer is used here for its mutable and synchronized characteristics.
public class TestReporterTest {
    StringBuffer sbtags = new StringBuffer();
    StringBuffer displayName = new StringBuffer();
    StringBuffer className = new StringBuffer();
    StringBuffer methodName = new StringBuffer();

    @BeforeEach
    void init(TestInfo testInfo) {
        className.delete(0, className.length());
        className.append(testInfo.getTestClass().get().getName());
        displayName.delete(0, displayName.length());
        displayName.append(testInfo.getDisplayName());
        methodName.delete(0, methodName.length());
        methodName.append(testInfo.getTestMethod().get().getName());
    }

    @Test
    @DisplayName("testing on reportSingleValue")
    void reportSingleValue(TestReporter testReporter) {
        testReporter.publishEntry("className : " + className);
        testReporter.publishEntry("displayName: " + displayName);
        testReporter.publishEntry("methodName : " + methodName);
        testReporter.publishEntry("algun mensaje de estatus");
    }

    @Test
    void reportKeyValuePair(TestReporter testReporter) {
        testReporter.publishEntry("className : " + className);
        testReporter.publishEntry("displayName: " + displayName);
        testReporter.publishEntry("methodName : " + methodName);
        testReporter.publishEntry("una Key", "un Value");
    }

    @Test
    void reportMultiKeyValuePairs(TestReporter testReporter) {
        Map<String, String> map = new HashMap<>();
        map.put("Fast and Furious 8", "2018");
        map.put("Matrix", "1999");
        testReporter.publishEntry("className : " + className);
        testReporter.publishEntry("displayName: " + displayName);
        testReporter.publishEntry("methodName : " + methodName);
        testReporter.publishEntry(map);
    }
}
Running the Test
timestamp = 2019-11-22T12:02:45.898, value = className : TestReporterTest
timestamp = 2019-11-22T12:02:45.904, value = displayName: testing on reportSingleValue
timestamp = 2019-11-22T12:02:45.904, value = methodName : reportSingleValue
timestamp = 2019-11-22T12:02:45.904, value = algun mensaje de estatus
timestamp = 2019-11-22T12:02:45.919, value = className : TestReporterTest
timestamp = 2019-11-22T12:02:45.920, value = displayName: reportMultiKeyValuePairs(TestReporter)
timestamp = 2019-11-22T12:02:45.920, value = methodName : reportMultiKeyValuePairs
timestamp = 2019-11-22T12:02:45.921, Fast and Furious 8 = 2018, Matrix = 1999
timestamp = 2019-11-22T12:02:45.924, value = className : TestReporterTest
timestamp = 2019-11-22T12:02:45.925, value = displayName: reportKeyValuePair(TestReporter)
timestamp = 2019-11-22T12:02:45.925, value = methodName : reportKeyValuePair
timestamp = 2019-11-22T12:02:45.925, una Key = un Value
Apart from the previous answers: when writing JUnit test scripts, if we want to get some information out of the process we normally use System.out.println, which is not recommended in the corporate/enterprise world; in code reviews and peer reviews we are advised to remove all System.out.println calls from the code base. So in the JUnit world, if we want to publish something out of the scripts, we are advised to use TestReporter's publishEntry() method. In combination with TestInfo, we can read several pieces of information from within the JUnit scripts.
Hope these facts also help answer your question.
The method name suggests you are publishing a new entry to the report, which is supported by the Javadoc for 5.3.0:
https://junit.org/junit5/docs/current/api/org/junit/jupiter/api/TestReporter.html
This would allow you to add additional, useful information to the test report; perhaps you would like to add what the tests initial conditions are to the report or some environmental information.

How to write a custom serializer for Java 8 LocalDateTime

I have a class named Child1 which I want to convert into JSON using Lift JSON. Everything was working fine when I was using Joda DateTime, but now I want to use Java 8 LocalDateTime and I am unable to write a custom serializer for it. Here is my code:
import org.joda.time.DateTime
import net.liftweb.json.Serialization.{ read, write }
import net.liftweb.json.DefaultFormats
import net.liftweb.json.Serializer
import net.liftweb.json.JsonAST._
import net.liftweb.json.Formats
import net.liftweb.json.TypeInfo
import net.liftweb.json.MappingException
class Child1Serializer extends Serializer[Child1] {
private val IntervalClass = classOf[Child1]
def deserialize(implicit format: Formats): PartialFunction[(TypeInfo, JValue), Child1] = {
case (TypeInfo(IntervalClass, _), json) => json match {
case JObject(
JField("str", JString(str)) :: JField("Num", JInt(num)) ::
JField("MyList", JArray(mylist)) :: (JField("myDate", JInt(mydate)) ::
JField("number", JInt(number)) ::Nil)
) => {
val c = Child1(
str, num.intValue(), mylist.map(_.values.toString.toInt), new DateTime(mydate.longValue)
)
c.number = number.intValue()
c
}
case x => throw new MappingException("Can't convert " + x + " to Interval")
}
}
def serialize(implicit format: Formats): PartialFunction[Any, JValue] = {
case x: Child1 =>
JObject(
JField("str", JString(x.str)) :: JField("Num", JInt(x.Num)) ::
JField("MyList", JArray(x.MyList.map(JInt(_)))) ::
JField("myDate", JInt(BigInt(x.myDate.getMillis))) ::
JField("number", JInt(x.number)) :: Nil
)
}
}
object Test extends App {
case class Child1(var str:String, var Num:Int, MyList:List[Int], myDate:DateTime) {
var number: Int=555
}
val c = Child1("Mary", 5, List(1, 2), DateTime.now())
c.number = 1
println("number" + c.number)
implicit val formats = DefaultFormats + new Child1Serializer
val ser = write(c)
println("Child class converted to string" + ser)
var obj = read[Child1](ser)
println("object of Child is "+ obj)
println("str" + obj.str)
println("Num" + obj.Num)
println("MyList" + obj.MyList)
println("myDate" + obj.myDate)
println("number" + obj.number)
}
Now I want to use Java 8 LocalDateTime like this:
case class Child1(var str: String, var Num: Int, MyList: List[Int], val myDate: LocalDateTime = LocalDateTime.now()) {
var number: Int=555
}
What modifications do I need to make in my custom serializer class Child1Serializer? I tried but was unable to do it. Please help me.
In the serializer, serialize the date like this:
def serialize(implicit format: Formats): PartialFunction[Any, JValue] = {
case x: Child1 =>
JObject(
JField("str", JString(x.str)) :: JField("Num", JInt(x.Num)) ::
JField("MyList", JArray(x.MyList.map(JInt(_)))) ::
JField("myDate", JString(x.myDate.toString)) ::
JField("number", JInt(x.number)) :: Nil
)
}
In the deserializer,
def deserialize(implicit format: Formats): PartialFunction[(TypeInfo, JValue), Child1] = {
case (TypeInfo(IntervalClass, _), json) => json match {
case JObject(
JField("str", JString(str)) :: JField("Num", JInt(num)) ::
JField("MyList", JArray(mylist)) :: (JField("myDate", JString(mydate)) ::
JField("number", JInt(number)) ::Nil)
) => {
val c = Child1(
str, num.intValue(), mylist.map(_.values.toString.toInt), LocalDateTime.parse(mydate)
)
c.number = number.intValue()
c
}
case x => throw new MappingException("Can't convert " + x + " to Interval")
}
}
The LocalDateTime object writes to an ISO format using toString and the parse factory method should be able to reconstruct the object from such a string.
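That round trip can be checked in plain Java, independent of Lift (the sample date is arbitrary):

```java
import java.time.LocalDateTime;

public class LocalDateTimeRoundTrip {
    public static void main(String[] args) {
        LocalDateTime dt = LocalDateTime.of(2023, 2, 17, 10, 30, 15);

        // toString produces an ISO-8601 string ...
        String iso = dt.toString();            // "2023-02-17T10:30:15"

        // ... and the parse factory method reconstructs an equal instance
        LocalDateTime back = LocalDateTime.parse(iso);
        System.out.println(dt.equals(back));   // prints "true"
    }
}
```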
You can define the LocalDateTime serializer like this.
class LocalDateTimeSerializer extends Serializer[LocalDateTime] {
private val LocalDateTimeClass = classOf[LocalDateTime]
def deserialize(implicit format: Formats): PartialFunction[(TypeInfo, JValue), LocalDateTime] = {
case (TypeInfo(LocalDateTimeClass, _), json) => json match {
case JString(dt) => LocalDateTime.parse(dt)
case x => throw new MappingException("Can't convert " + x + " to LocalDateTime")
}
}
def serialize(implicit format: Formats): PartialFunction[Any, JValue] = {
case x: LocalDateTime => JString(x.toString)
}
}
Also define your formats like this
implicit val formats = DefaultFormats + new LocalDateTimeSerializer + new FieldSerializer[Child1]
Please note the usage of FieldSerializer to serialize the non-constructor field, number.
