I want to parameterize whether the file has a header and which separator to use when I read a CSV with Spark. I've written this:
DataFrameReader dataFrameReader = spark.read();
dataFrameReader = "csv".equalsIgnoreCase(params.getReadFileType())
    ? dataFrameReader
        .option("sep", params.getDelimiter())
        .option("header", params.isHeader())
    : dataFrameReader;
I'm new to Groovy and I can't get dataFrameReader.option mocked correctly.
DataFrameReader dfReaderLoader = Mock(DataFrameReader)
DataFrameReader dfReaderOptionString = Mock(DataFrameReader)
DataFrameReader dfReaderOptionBoolean = Mock(DataFrameReader)
SparkSession sparkSession = Mock(SparkSession)
sparkSession.read() >> dfReaderLoader
dfReaderLoader.option(_ as String, _ as String) >> dfReaderOptionString
dfReaderOptionString.option(_ as String, _ as Boolean) >> dfReaderOptionBoolean
And it gives me a NullPointerException:
java.lang.NullPointerException: Cannot invoke
"org.apache.spark.sql.DataFrameReader.option(String, boolean)" because
the return value of
"org.apache.spark.sql.DataFrameReader.option(String, String)" is null
I do not know what your exact problem is, but my guess is that you create the mocks and then do not inject them into your class under test. If you do inject them, both your own version and Leonard's suggested improvement with a default response work:
Class under test + helper class:
class UnderTest {
  SparkSession spark
  Parameters params

  DataFrameReader produce() {
    DataFrameReader dataFrameReader = spark.read()
    dataFrameReader = "csv".equalsIgnoreCase(params.getReadFileType())
      ? dataFrameReader
          .option("sep", params.getDelimiter())
          .option("header", params.isHeader())
      : dataFrameReader
  }
}

class Parameters {
  String readFileType
  String delimiter
  boolean header
}
Spock specification:
package de.scrum_master.stackoverflow.q74923254
import org.apache.spark.sql.DataFrameReader
import org.apache.spark.sql.SparkSession
import org.spockframework.mock.MockUtil
import spock.lang.Specification
class DataFrameReaderTest extends Specification {
  def 'read #readFileType data'() {
    given:
    DataFrameReader dfReaderLoader = Mock(DataFrameReader)
    DataFrameReader dfReaderOptionString = Mock(DataFrameReader)
    DataFrameReader dfReaderOptionBoolean = Mock(DataFrameReader)
    SparkSession sparkSession = Mock(SparkSession)
    sparkSession.read() >> dfReaderLoader
    dfReaderLoader.option(_ as String, _ as String) >> dfReaderOptionString
    dfReaderOptionString.option(_ as String, _ as Boolean) >> dfReaderOptionBoolean
    def underTest = new UnderTest(spark: sparkSession, params: parameters)

    expect:
    underTest.produce().toString().contains(returnedMockName)

    where:
    readFileType | parameters                                                                | returnedMockName
    'CSV'        | new Parameters(readFileType: readFileType, delimiter: ';', header: true)  | 'dfReaderOptionBoolean'
    'XLS'        | new Parameters(readFileType: readFileType)                                | 'dfReaderLoader'
  }

  def 'read #readFileType data (improved)'() {
    given:
    SparkSession sparkSession = Mock() {
      read() >> Mock(DataFrameReader) {
        _ >> _
      }
    }
    def parameters = new Parameters(readFileType: readFileType, delimiter: ';', header: true)
    def underTest = new UnderTest(spark: sparkSession, params: parameters)

    expect:
    new MockUtil().isMock(underTest.produce())

    where:
    readFileType << ['CSV', 'XLS']
  }
}
Try it in the Groovy Web Console.
The result should look similar to this in your IDE:
DataFrameReaderTest ✔
├─ read #readFileType data ✔
│ ├─ read CSV data ✔
│ └─ read XLS data ✔
└─ read #readFileType data (improved) ✔
├─ read CSV data (improved) ✔
└─ read XLS data (improved) ✔
If you don't really care about the intermediate invocations of a builder pattern, i.e. an object that returns itself, I'd suggest using a Stub, which will return itself if the method return type matches its type. Alternatively, you can use the declaration _ >> _ to achieve the same for Mocks.
given:
ThingBuilder builder = Mock() {
  _ >> _
}

when:
Thing thing = builder
  .id("id-42")
  .name("spock")
  .weight(100)
  .build()

then:
1 * builder.build() >> new Thing(id: 'id-1337') // <-- only assert the last call you actually care about
thing.id == 'id-1337'
Try it in the Groovy Web Console.
That being said, the error would probably go away if you just removed the as String cast from the second argument of option, or changed it to as Boolean as the error message suggests.
The error was in the params: I was not sending the delimiter or the header, so it gave an error.
I am trying to compile an Actor that handles a DynamicMessage with the Scala Reflection Toolbox. The actor code looks like this:
import scala.reflect.runtime.universe
import scala.tools.reflect.ToolBox
val toolbox = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
val actorCode =
"""
|import akka.actor._
|import com.google.protobuf._
|class SimpleActor extends Actor {
|override def receive: Receive = {
| case dynamicMessage: DynamicMessage => println("Dynamic message received!")
| case _ => println("Whatever!") // the default, catch-all
| }
|}
|object SimpleActor {
|def props() : Props = Props(new SimpleActor())
|}
|
|
|return SimpleActor.props()
|""".stripMargin
val tree = toolbox.parse(actorCode)
toolbox.compile(tree)().asInstanceOf[Props]
I get the error
reflective compilation has failed:
illegal cyclic reference involving type T
scala.tools.reflect.ToolBoxError: reflective compilation has failed:
illegal cyclic reference involving type T
If I run the code outside of the Toolbox it compiles and works fine.
The error is given from the line
case dynamicMessage: DynamicMessage => println("Dynamic message received!")
Does anyone know the nature of this error and how to fix it?
In Scala, even without reflective compilation, there are bugs in the combination of Scala-Java interop and F-bounded polymorphism:
scalac reports error on valid Java class: illegal cyclic reference involving type T
and others.
And the parents of com.google.protobuf.DynamicMessage exercise such F-bounds:

DynamicMessage
  <: AbstractMessage
     <: AbstractMessageLite[_, _]   (such raw inheritance is allowed in Java but not in Scala)
        where AbstractMessageLite[M <: AbstractMessageLite[M, B],
                                  B <: AbstractMessageLite.Builder[M, B]]
          and AbstractMessageLite.Builder[M <: AbstractMessageLite[M, B],
                                          B <: AbstractMessageLite.Builder[M, B]]
                <: MessageLite.Builder
                     <: MessageLiteOrBuilder
                     <: Cloneable
          and AbstractMessageLite
                <: MessageLite
                     <: MessageLiteOrBuilder
     <: Message
          <: MessageLite...
          <: MessageOrBuilder
               <: MessageLiteOrBuilder
But without reflective compilation your code compiles, so this is a bug in the combination of reflective compilation, Scala-Java interop, and F-bounded polymorphism.
A workaround is to use the real compiler instead of the toolbox:
import akka.actor.{ActorSystem, Props}
// libraryDependencies += "com.github.os72" % "protobuf-dynamic" % "1.0.1"
import com.github.os72.protobuf.dynamic.{DynamicSchema, MessageDefinition}
import com.google.protobuf.DynamicMessage
import scala.reflect.internal.util.{AbstractFileClassLoader, BatchSourceFile}
import scala.reflect.io.{AbstractFile, VirtualDirectory}
import scala.reflect.runtime
import scala.reflect.runtime.universe
import scala.reflect.runtime.universe._
import scala.tools.nsc.{Global, Settings}
val actorCode = """
|import akka.actor._
|import com.google.protobuf._
|
|class SimpleActor extends Actor {
| override def receive: Receive = {
| case dynamicMessage: DynamicMessage => println("Dynamic message received!")
| case _ => println("Whatever!") // the default, catch-all
| }
|}
|
|object SimpleActor {
| def props() : Props = Props(new SimpleActor())
|}
|""".stripMargin
val directory = new VirtualDirectory("(memory)", None)
val runtimeMirror = createRuntimeMirror(directory, runtime.currentMirror)
compileCode(actorCode, List(), directory)
val props = runObjectMethod("SimpleActor", runtimeMirror, "props")
.asInstanceOf[Props]
val actorSystem = ActorSystem("actorSystem")
val actor = actorSystem.actorOf(props, "helloActor")
val msg = makeDynamicMessage()
actor ! "hello" // Whatever!
actor ! msg // Dynamic message received!
actorSystem.terminate()
//see (*)
def makeDynamicMessage(): DynamicMessage = {
  val schemaBuilder = DynamicSchema.newBuilder
  schemaBuilder.setName("PersonSchemaDynamic.proto")
  val msgDef = MessageDefinition.newBuilder("Person")
    .addField("required", "int32", "id", 1)
    .build
  schemaBuilder.addMessageDefinition(msgDef)
  val schema = schemaBuilder.build
  val msgBuilder = schema.newMessageBuilder("Person")
  val msgDesc = msgBuilder.getDescriptorForType
  msgBuilder
    .setField(msgDesc.findFieldByName("id"), 1)
    .build
}

def compileCode(
  code: String,
  classpathDirectories: List[AbstractFile],
  outputDirectory: AbstractFile
): Unit = {
  val settings = new Settings
  classpathDirectories.foreach(dir => settings.classpath.prepend(dir.toString))
  settings.outputDirs.setSingleOutput(outputDirectory)
  settings.usejavacp.value = true
  val global = new Global(settings)
  (new global.Run).compileSources(List(new BatchSourceFile("(inline)", code)))
}

def runObjectMethod(
  objectName: String,
  runtimeMirror: Mirror,
  methodName: String,
  arguments: Any*
): Any = {
  val objectSymbol = runtimeMirror.staticModule(objectName)
  val objectModuleMirror = runtimeMirror.reflectModule(objectSymbol)
  val objectInstance = objectModuleMirror.instance
  val objectType = objectSymbol.typeSignature
  val methodSymbol = objectType.decl(TermName(methodName)).asMethod
  val objectInstanceMirror = runtimeMirror.reflect(objectInstance)
  val methodMirror = objectInstanceMirror.reflectMethod(methodSymbol)
  methodMirror(arguments: _*)
}

def createRuntimeMirror(directory: AbstractFile, parentMirror: Mirror): Mirror = {
  val classLoader = new AbstractFileClassLoader(directory, parentMirror.classLoader)
  universe.runtimeMirror(classLoader)
}
Tensorflow in Scala reflection (a similar situation with a bug in the combination of reflective compilation, Scala-Java interop, and path-dependent types)
Dynamic compilation of multiple Scala classes at runtime
How to eval code that uses InterfaceStability annotation (that fails with "illegal cyclic reference involving class InterfaceStability")? (also "illegal cyclic reference" during reflective compilation)
Scala Presentation Compiler - Minimal Example
(*) Protocol buffer objects generated at runtime
I have a dataframe that I want to convert into a JSON array. Please find the example below.
Dataframe:
+------------+----------------+----------+----------------+------------------+
|        Name|              id|request_id|create_timestamp|deadline_timestamp|
+------------+----------------+----------+----------------+------------------+
|    Freeform|59bbe3ad-f487-44| htvjiwmfe|   1589155200000|     1591272659556|
|         D23|59bbe3ad-f487-44| htvjiwmfe|   1589155200000|     1591272659556|
|      Stores|59bbe3ad-f487-44| htvjiwmfe|   1589155200000|     1591272659556|
|VacationClub|59bbe3ad-f487-44| htvjiwmfe|   1589155200000|     1591272659556|
+------------+----------------+----------+----------------+------------------+
Wanted JSON like below:
[
  {
    "testname": "xyz",
    "systemResponse": [
      {
        "name": "FGH",
        "id": "59bbe3ad-f487-44",
        "request_id": 1590791280,
        "create_timestamp": 1590799280
      },
      {
        "name": "FGH",
        "id": "59bbe3ad-f487-44",
        "request_id": 1590791280,
        "create_timestamp": 1590799280
      }
    ]
  }
]
You can define 2 beans:
Create an array from the 1st DF as an array of inner beans.
Define a parent bean with testname and requestDetailArray as an array.
Please also find the inline comments in the code.
object DataToJsonArray {

  def main(args: Array[String]): Unit = {

    val spark = Constant.getSparkSess

    import spark.implicits._

    // Load your dataframe
    val requestDetailArray = List(
      ("Freeform", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000", "1591272659556"),
      ("D23", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000", "1591272659556"),
      ("Stores", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000", "1591272659556"),
      ("VacationClub", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000", "1591272659556")
    ).toDF
      // Map your dataframe to the RequestDetails bean
      .map(row => RequestDetails(row.getString(0), row.getString(1), row.getString(2), row.getString(3), row.getString(4)))
      // Collect it as an array
      .collect()

    // Create another dataframe from List[BaseClass] and set the (testname, Array[RequestDetails])
    List(BaseClass("xyz", requestDetailArray)).toDF()
      .write
      // Output your dataframe as JSON
      .json("/json/output/path")
  }
}

case class RequestDetails(Name: String, id: String, request_id: String, create_timestamp: String, deadline_timestamp: String)

case class BaseClass(testname: String = "xyz", systemResponse: Array[RequestDetails])
Check the code below.
import org.apache.spark.sql.functions._
df.withColumn("systemResponse",
array(
struct("id","request_id","create_timestamp","deadline_timestamp").as("data")
)
)
.select("systemResponse")
.toJSON
.select(col("value").as("json_data"))
.show(false)
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|json_data |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|{"systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
+-----------------------------------------------------------------------------------------------------------------------------------------------+
Updated
scala> :paste
// Entering paste mode (ctrl-D to finish)
df.withColumn("systemResponse",
array(
struct("id","request_id","create_timestamp","deadline_timestamp").as("data")
)
)
.withColumn("testname",lit("xyz"))
.select("testname","systemResponse")
.toJSON
.select(col("value").as("json_data"))
.show(false)
// Exiting paste mode, now interpreting.
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|json_data |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"testname":"xyz","systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"testname":"xyz","systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"testname":"xyz","systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"testname":"xyz","systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
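If you instead want all rows collected into a single systemResponse array under one testname (closer to the shape requested above), a grouped variant could look like this; it is only a sketch, assuming the same df as in the question:

import org.apache.spark.sql.functions._

df.withColumn("testname", lit("xyz"))
  .groupBy("testname")
  // collect_list(struct(...)) gathers every row of the group into one array column
  .agg(collect_list(struct("Name", "id", "request_id", "create_timestamp", "deadline_timestamp")).as("systemResponse"))
  .toJSON
  .show(false)

Note that collect_list pulls all rows of a group into a single record, which is fine for small groups like this one but not for very large ones.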
I have a small dataset with population data by country on HDFS. I have written the code to parse it and load it into a Dataset<Row>:
SparkConf conf = new SparkConf().setAppName("JavaWordCount").setMaster("local");
SparkContext context = new SparkContext(conf);
SparkSession sparkSession = new SparkSession(context);
Dataset<Row> df = sparkSession.read().format("com.databricks.spark.csv").option("header", true).option("inferSchema", true).load(args[1]);
System.out.println("========== Print Schema ============");
df.printSchema();
System.out.println("========== Print Data ==============");
df.show();
The console shows the data correctly -
+-----------------------+-------------------+-------------+---------------+----------+
|countriesAndTerritories| location| continent|population_year|population|
+-----------------------+-------------------+-------------+---------------+----------+
| Afghanistan| Afghanistan| Asia| 2020| 38928341|
| Albania| Albania| Europe| 2020| 2877800|
| Algeria| Algeria| Africa| 2020| 43851043|
| Andorra| Andorra| Europe| 2020| 77265|
However, I want to get the population of the United States into an int variable.
The query to select the population is:
Dataset<Row> xdc = df.select(col("population"))
    .where(col("location").equalTo("United States"))
    .limit(1);
But how do I get its contents into an int variable?
You can try this:
int v = Integer.parseInt(
    df.select(col("population"))
        .where(col("location").equalTo("United States"))
        .limit(1)
        .first()
        .get(0)
        .toString()
);
I have code that looks like this:
object ErrorTest {
  case class APIResults(status: String, col_1: Long, col_2: Double, ...)

  def funcA(rows: ArrayBuffer[Row])(implicit defaultFormats: DefaultFormats): ArrayBuffer[APIResults] = {
    // call some API and get results and return APIResults
    ...
  }

  // MARK: load properties
  val props = loadProperties()

  private def loadProperties(): Properties = {
    val configFile = new File("config.properties")
    val reader = new FileReader(configFile)
    val props = new Properties()
    props.load(reader)
    props
  }

  def main(args: Array[String]): Unit = {
    val prop_a = props.getProperty("prop_a")

    val session = Context.initialSparkSession();
    import session.implicits._

    val initialSet = ArrayBuffer.empty[Row]
    val addToSet = (s: ArrayBuffer[Row], v: Row) => (s += v)
    val mergePartitionSets = (p1: ArrayBuffer[Row], p2: ArrayBuffer[Row]) => (p1 ++= p2)

    val sql1 =
      s"""
      select * from tbl_a where ...
      """

    session.sql(sql1)
      .rdd.map { row => { implicit val formats = DefaultFormats; (row.getLong(6), row) } }
      .aggregateByKey(initialSet)(addToSet, mergePartitionSets)
      .repartition(40)
      .map { case (rowNumber, rows) => { implicit val formats = DefaultFormats; funcA(rows) } }
      .flatMap(x => x)
      .toDF()
      .write.mode(SaveMode.Overwrite).saveAsTable("tbl_b")
  }
}
When I run it via spark-submit, it throws the error Caused by: java.lang.NoClassDefFoundError: Could not initialize class staging_jobs.ErrorTest$. But if I move val props = loadProperties() to the first line of the main method, the error goes away. Could anyone give me an explanation of this phenomenon? Thanks a lot!
Caused by: java.lang.NoClassDefFoundError: Could not initialize class staging_jobs.ErrorTest$
at staging_jobs.ErrorTest$$anonfun$main$1.apply(ErrorTest.scala:208)
at staging_jobs.ErrorTest$$anonfun$main$1.apply(ErrorTest.scala:208)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
... 8 more
I've met the same problem as you. I defined a method convert outside the main method. When I used it with dataframe.rdd.map{x => convert(x)} in main, NoClassDefFoundError: Could not initialize class Test$ happened.
But when I used a function object convertor, with the same code as the convert method, inside the main method, no error happened.
I used Spark 2.1.0 and Scala 2.11; it seems like a bug in Spark?
I guess the problem is that val props = loadProperties() defines a member of the outer object (the one containing main). That member then has to be initialized on the executors as well, and they do not have the same environment as the driver (for example, config.properties is probably not present there), so initializing ErrorTest$ fails on them.
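A minimal sketch of that idea (ErrorTestFixed is a made-up name, and the Spark job is simplified): load the properties inside main, on the driver, and let closures capture only plain local values, so the executors never have to run an object initializer that needs driver-only files.

import java.io.FileReader
import java.util.Properties

import org.apache.spark.sql.SparkSession

object ErrorTestFixed {
  def main(args: Array[String]): Unit = {
    // Load the properties on the driver only, inside main, so there is no
    // object-level field whose initializer would also run on the executors.
    val props = new Properties()
    props.load(new FileReader("config.properties"))
    val propA: String = props.getProperty("prop_a") // a plain value, cheap to serialize

    val session = SparkSession.builder().getOrCreate()

    // Closures capture the local value propA, not a member of the enclosing
    // object, so executors never need to initialize ErrorTestFixed$.
    val tagged = session.sparkContext
      .parallelize(Seq("a", "b", "c"))
      .map(x => s"$propA-$x")

    tagged.collect().foreach(println)
    session.stop()
  }
}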
I am new to Apache Spark. I created several RDDs and DataFrames, cached them, and now I want to unpersist some of them using the command below:
rddName.unpersist()
but I can't remember their names. I used sc.getPersistentRDDs, but the output does not include the names. I also used the browser to view the cached RDDs, but again there is no name information. Am I missing something?
PySparkers: getPersistentRDDs isn't yet implemented in Python, so unpersist your RDDs by dipping into Java:
for (id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():
    rdd.unpersist()
@Dikei's answer is actually correct, but I believe what you are looking for is sc.getPersistentRDDs:
scala> val rdd1 = sc.makeRDD(1 to 100)
# rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at <console>:27
scala> val rdd2 = sc.makeRDD(10 to 1000)
# rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:27
scala> rdd2.cache.setName("rdd_2")
# res0: rdd2.type = rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27
scala> sc.getPersistentRDDs
# res1: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(1 -> rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27)
scala> rdd1.cache.setName("foo")
# res2: rdd1.type = foo ParallelCollectionRDD[0] at makeRDD at <console>:27
scala> sc.getPersistentRDDs
# res3: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(1 -> rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27, 0 -> foo ParallelCollectionRDD[0] at makeRDD at <console>:27)
Now let's add another RDD (rdd3, created with makeRDD like the others) and name it, but without caching it:
scala> rdd3.setName("bar")
# res4: rdd3.type = bar ParallelCollectionRDD[2] at makeRDD at <console>:27
scala> sc.getPersistentRDDs
# res5: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(1 -> rdd_2 ParallelCollectionRDD[1] at makeRDD at <console>:27, 0 -> foo ParallelCollectionRDD[0] at makeRDD at <console>:27)
Notice that rdd3 does not show up: setting a name alone does not persist an RDD; it still needs to be cached.
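If you cache it as well (a sketch following the same pattern as above; the exact REPL output is not reproduced here), it shows up in the map:

scala> rdd3.cache.setName("bar")

scala> sc.getPersistentRDDs
# now also contains something like: 2 -> bar ParallelCollectionRDD[2] at makeRDD at <console>:27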
A generic Scala way of doing this: loop over the Spark context, get all persistent RDDs, and unpersist them. I use this at the end of a driver.
for ((id, rdd) <- sparkSession.sparkContext.getPersistentRDDs) {
  log.info("Unexpected cached RDD " + id)
  rdd.unpersist()
}
A generic Java way of doing this, where jsc is a JavaSparkContext:
if (jsc != null) {
  Map<Integer, JavaRDD<?>> persistentRDDS = jsc.getPersistentRDDs();
  // using for-each loop for iteration over Map.entrySet()
  for (Map.Entry<Integer, JavaRDD<?>> entry : persistentRDDS.entrySet()) {
    LOG.info("Key = " + entry.getKey() +
        ", un persisting cached RDD = " + entry.getValue().unpersist());
  }
}
Another short form of unpersisting in Java, without knowing the RDD names, is:
Map<Integer, JavaRDD<?>> persistentRDDS = jsc.getPersistentRDDs();
persistentRDDS.values().forEach(JavaRDD::unpersist);
There's no special meaning to the rddName variable. It is just a reference to an RDD. For example, in the following code
val rddName: RDD[Something]
val name2 = rddName
name2 and rddName are two references that point to the same RDD. Calling name2.unpersist is the same as calling rddName.unpersist.
If you want to unpersist an RDD, you have to manually keep a reference to it.
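A minimal sketch of that bookkeeping (the label and input path are made up, and a SparkContext sc is assumed to be in scope as in the shell examples above): keep each cached RDD in a map under a label of your choice and unpersist through that map later.

import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Your own registry of cached RDDs, keyed by whatever label you like.
val cachedRdds = mutable.Map.empty[String, RDD[_]]

val users = sc.textFile("/data/users").cache() // hypothetical input path
cachedRdds("users") = users

// ... later, when the data is no longer needed:
cachedRdds.remove("users").foreach(_.unpersist())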