I am attempting to read Avro data from Kafka using Spark Streaming, but I receive the following error message:
Streaming Query Exception caught!: org.apache.spark.sql.streaming.StreamingQueryException: Job aborted.
=== Streaming Query ===
Identifier: [id = 8b54c92d-6bbc-4dbc-84d0-55b762c21ba2, runId = 4bc92b3c-343e-4886-b0bc-0777b89f9ec8]
Current Committed Offsets: {KafkaV2[Subscribe[customer-avro4]]: {"customer-avro":{"0":17}}}
Current Available Offsets: {KafkaV2[Subscribe[customer-avro4]]: {"customer-avro":{"0":20}}}
Current State: ACTIVE
Thread State: RUNNABLE
Any idea what the issue might be and how to resolve it? The code is below (inspired by the xebia-france spark-structured-streaming-blog). I think it actually ran earlier, but now there is a problem.
import com.databricks.spark.avro.SchemaConverters
import io.confluent.kafka.schemaregistry.client.{CachedSchemaRegistryClient, SchemaRegistryClient}
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryException

object AvroConsumer {
  private val topic = "customer-avro4"
  private val kafkaUrl = "http://localhost:9092"
  private val schemaRegistryUrl = "http://localhost:8081"

  private val schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 128)
  private val kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)

  private val avroSchema = schemaRegistryClient.getLatestSchemaMetadata(topic + "-value").getSchema
  private val sparkSchema = SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("ConfluentConsumer")
      .master("local[*]")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    spark.udf.register("deserialize", (bytes: Array[Byte]) =>
      DeserializerWrapper.deserializer.deserialize(bytes)
    )

    val kafkaDataFrame = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaUrl)
      .option("subscribe", topic)
      .load()

    val valueDataFrame = kafkaDataFrame.selectExpr("""deserialize(value) AS message""")

    import org.apache.spark.sql.functions._

    val formattedDataFrame = valueDataFrame.select(
      from_json(col("message"), sparkSchema.dataType).alias("parsed_value"))
      .select("parsed_value.*")

    val writer = formattedDataFrame
      .writeStream
      .format("parquet")
      .option("checkpointLocation", "hdfs://localhost:9000/data/spark/parquet/checkpoint")

    while (true) {
      val query = writer.start("hdfs://localhost:9000/data/spark/parquet/total")
      try {
        query.awaitTermination()
      }
      catch {
        case e: StreamingQueryException => println("Streaming Query Exception caught!: " + e)
      }
    }
  }

  object DeserializerWrapper {
    val deserializer: AvroDeserializer = kafkaAvroDeserializer
  }

  class AvroDeserializer extends AbstractKafkaAvroDeserializer {
    def this(client: SchemaRegistryClient) {
      this()
      this.schemaRegistry = client
    }

    override def deserialize(bytes: Array[Byte]): String = {
      val genericRecord = super.deserialize(bytes).asInstanceOf[GenericRecord]
      genericRecord.toString
    }
  }
}
Figured it out: the problem was not with the Spark-Kafka integration itself, as I had thought, but with the checkpoint information in the HDFS filesystem. Deleting and recreating the checkpoint folder in HDFS solved it for me.
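For anyone who hits the same thing, here is a minimal sketch of mine for clearing a stale checkpoint directory with the Hadoop FileSystem API (using the same paths as in the code above; the equivalent hdfs dfs -rm -r command works just as well):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: wipe the old checkpoint so the restarted query does not try to
// reconcile its state with the offsets/metadata written by the previous run.
val checkpointPath = new Path("hdfs://localhost:9000/data/spark/parquet/checkpoint")
val fs = FileSystem.get(checkpointPath.toUri, new Configuration())
if (fs.exists(checkpointPath)) {
  fs.delete(checkpointPath, true) // recursive delete
}

Keep in mind that deleting the checkpoint also discards the committed offsets, so the restarted query will reprocess data according to its startingOffsets option.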
I know I can compile individual "snippets" in Scala using the Toolbox like this:
import scala.reflect.runtime.universe
import scala.tools.reflect.ToolBox

object Compiler {
  val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()

  def main(args: Array[String]): Unit = {
    tb.eval(tb.parse("""println("hello!")"""))
  }
}
Is there any way I can compile more than just "snippets", i.e., classes that refer to each other? Like this:
import scala.reflect.runtime.universe
import scala.tools.reflect.ToolBox

object Compiler {
  private val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()

  val a: String =
    """
      |package pkg {
      |
      |class A {
      |  def compute(): Int = 42
      |}}
    """.stripMargin

  val b: String =
    """
      |import pkg._
      |
      |class B {
      |  def fun(): Unit = {
      |    new A().compute()
      |  }
      |}
    """.stripMargin

  def main(args: Array[String]): Unit = {
    val compiledA = tb.parse(a)
    val compiledB = tb.parse(b)
    tb.eval(compiledB)
  }
}
Obviously, my snippet doesn't work as I have to tell the toolbox how to resolve "A" somehow:
Exception in thread "main" scala.tools.reflect.ToolBoxError: reflective compilation has failed:
not found: type A
Try
import scala.reflect.runtime.universe._
import scala.reflect.runtime.universe
import scala.tools.reflect.ToolBox

val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()

val a = q"""
  class A {
    def compute(): Int = 42
  }"""
val symbA = tb.define(a)

val b = q"""
  class B {
    def fun(): Unit = {
      new $symbA().compute()
    }
  }"""
tb.eval(b)
https://github.com/scala/scala/blob/2.13.x/src/compiler/scala/tools/reflect/ToolBox.scala#L131-L138
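As a follow-up sketch of my own (not part of the answer above): B can be materialized the same way with tb.define, and the returned symbols can be spliced into further trees to actually run the code. I changed fun() to return Int here purely so the result is visible:

val symbB = tb.define(q"""
  class B {
    def fun(): Int = new $symbA().compute()
  }""")

println(tb.eval(q"new $symbB().fun()")) // should print 42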
In cases more complex than those the toolbox can handle, you can always run the compiler manually:
import scala.reflect.internal.util.{AbstractFileClassLoader, BatchSourceFile}
import scala.reflect.io.{AbstractFile, VirtualDirectory}
import scala.tools.nsc.{Global, Settings}
import scala.reflect.runtime
import scala.reflect.runtime.universe
import scala.reflect.runtime.universe._

val a: String =
  """
    |package pkg {
    |
    |class A {
    |  def compute(): Int = 42
    |}}
  """.stripMargin

val b: String =
  """
    |import pkg._
    |
    |class B {
    |  def fun(): Unit = {
    |    println(new A().compute())
    |  }
    |}
  """.stripMargin

val directory = new VirtualDirectory("(memory)", None)
compileCode(List(a, b), List(), directory)
val runtimeMirror = createRuntimeMirror(directory, runtime.currentMirror)
val bInstance = instantiateClass("B", runtimeMirror)
runClassMethod("B", runtimeMirror, "fun", bInstance) // 42

def compileCode(sources: List[String], classpathDirectories: List[AbstractFile], outputDirectory: AbstractFile): Unit = {
  val settings = new Settings
  classpathDirectories.foreach(dir => settings.classpath.prepend(dir.toString))
  settings.outputDirs.setSingleOutput(outputDirectory)
  settings.usejavacp.value = true
  val global = new Global(settings)
  val files = sources.zipWithIndex.map { case (code, i) => new BatchSourceFile(s"(inline-$i)", code) }
  (new global.Run).compileSources(files)
}

def instantiateClass(className: String, runtimeMirror: Mirror, arguments: Any*): Any = {
  val classSymbol = runtimeMirror.staticClass(className)
  val classType = classSymbol.typeSignature
  val constructorSymbol = classType.decl(termNames.CONSTRUCTOR).asMethod
  val classMirror = runtimeMirror.reflectClass(classSymbol)
  val constructorMirror = classMirror.reflectConstructor(constructorSymbol)
  constructorMirror(arguments: _*)
}

def runClassMethod(className: String, runtimeMirror: Mirror, methodName: String, classInstance: Any, arguments: Any*): Any = {
  val classSymbol = runtimeMirror.staticClass(className)
  val classType = classSymbol.typeSignature
  val methodSymbol = classType.decl(TermName(methodName)).asMethod
  val instanceMirror = runtimeMirror.reflect(classInstance)
  val methodMirror = instanceMirror.reflectMethod(methodSymbol)
  methodMirror(arguments: _*)
}

//def runObjectMethod(objectName: String, runtimeMirror: Mirror, methodName: String, arguments: Any*): Any = {
//  val objectSymbol = runtimeMirror.staticModule(objectName)
//  val objectModuleMirror = runtimeMirror.reflectModule(objectSymbol)
//  val objectInstance = objectModuleMirror.instance
//  val objectType = objectSymbol.typeSignature
//  val methodSymbol = objectType.decl(TermName(methodName)).asMethod
//  val objectInstanceMirror = runtimeMirror.reflect(objectInstance)
//  val methodMirror = objectInstanceMirror.reflectMethod(methodSymbol)
//  methodMirror(arguments: _*)
//}

def createRuntimeMirror(directory: AbstractFile, parentMirror: Mirror): Mirror = {
  val classLoader = new AbstractFileClassLoader(directory, parentMirror.classLoader)
  universe.runtimeMirror(classLoader)
}
I am trying to use the org.apache.hadoop.tools.DistCp class to copy some files over to an S3 bucket. However, the overwrite functionality is not working even though the overwrite flag is explicitly set to true.
Copying works fine, but existing files are not overwritten; the copy mapper simply skips them. I have explicitly set the "overwrite" option to true.
import com.typesafe.scalalogging.LazyLogging
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.tools.{DistCp, DistCpOptions}
import org.apache.hadoop.util.ToolRunner

import scala.collection.JavaConverters._

object distcptest extends App with LazyLogging {

  def copytoS3(hdfsSrcFilePathStr: String, s3DestPathStr: String) = {
    val hdfsSrcPathList = List(new Path(hdfsSrcFilePathStr))
    val s3DestPath = new Path(s3DestPathStr)

    val distcpOpt = new DistCpOptions(hdfsSrcPathList.asJava, s3DestPath)

    // Overwriting is not working in spite of explicitly setting it to true.
    distcpOpt.setOverwrite(true)

    val conf: Configuration = new Configuration()
    conf.set("fs.s3n.awsSecretAccessKey", "secret key")
    conf.set("fs.s3n.awsAccessKeyId", "access key")
    conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

    val distCp: DistCp = new DistCp(conf, distcpOpt)
    val filepaths: Array[String] = Array(hdfsSrcFilePathStr, s3DestPathStr)

    try {
      val distCp_result = ToolRunner.run(distCp, filepaths)
      if (distCp_result != 0) {
        logger.error(s"DistCP has failed with - error code = $distCp_result")
      }
    }
    catch {
      case e: Exception => {
        e.printStackTrace()
      }
    }
  }

  copytoS3("hdfs://abc/pqr", "s3n://xyz/wst")
}
I think the problem is that you called ToolRunner.run(distCp, filepaths).
If you check the source code of DistCp, its run method re-parses the command-line arguments and overwrites inputOptions, so the DistCpOptions you passed to the constructor are never used:
@Override
public int run(String[] argv) {
  ...
  try {
    inputOptions = (OptionsParser.parse(argv));
    ...
  } catch (Throwable e) {
    ...
  }
  ...
}
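A possible workaround (a sketch of my own, untested, replacing the corresponding lines in the question's copytoS3): either put the flag into the argument array so that OptionsParser.parse sees it, or skip ToolRunner and launch the copy with DistCp.execute(), which does use the options given to the constructor.

// Option 1: pass the flag on the "command line" that ToolRunner re-parses.
val filepaths: Array[String] = Array("-overwrite", hdfsSrcFilePathStr, s3DestPathStr)
val distCp_result = ToolRunner.run(distCp, filepaths)

// Option 2: bypass ToolRunner so the DistCpOptions from the constructor are honoured.
// val job = distCp.execute() // submits the copy job and returns a Hadoop Job handle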
I am trying to use Kotlin instead of Java, but I cannot find a good way to handle try-with-resources. The Java code looks like this:
import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.Tensor;
import org.tensorflow.TensorFlow;

public class HelloTensorFlow {
  public static void main(String[] args) throws Exception {
    try (Graph g = new Graph()) {
      final String value = "Hello from " + TensorFlow.version();

      // Construct the computation graph with a single operation, a constant
      // named "MyConst" with a value "value".
      try (Tensor t = Tensor.create(value.getBytes("UTF-8"))) {
        // The Java API doesn't yet include convenience functions for adding operations.
        g.opBuilder("Const", "MyConst").setAttr("dtype", t.dataType()).setAttr("value", t).build();
      }

      // Execute the "MyConst" operation in a Session.
      try (Session s = new Session(g);
           // Generally, there may be multiple output tensors,
           // all of them must be closed to prevent resource leaks.
           Tensor output = s.runner().fetch("MyConst").run().get(0)) {
        System.out.println(new String(output.bytesValue(), "UTF-8"));
      }
    }
  }
}
When I do it in Kotlin, I have to write this:
fun main(args: Array<String>) {
  val g = Graph()
  try {
    val value = "Hello from ${TensorFlow.version()}"
    val t = Tensor.create(value.toByteArray(Charsets.UTF_8))
    try {
      g.opBuilder("Const", "MyConst").setAttr("dtype", t.dataType()).setAttr("value", t).build()
    } finally {
      t.close()
    }

    var sess = Session(g)
    try {
      val output = sess.runner().fetch("MyConst").run().get(0)
      println(String(output.bytesValue(), Charsets.UTF_8))
    } finally {
      sess?.close()
    }
  } finally {
    g.close()
  }
}
I have tried to use the use function like this:
Graph().use {
it -> ....
}
I got an error like this:
Error:(16, 20) Kotlin: Unresolved reference. None of the following candidates is applicable because of receiver type mismatch:
@InlineOnly public inline fun ???.use(block: (???) -> ???): ??? defined in kotlin.io
I was just using the wrong dependency:
compile "org.jetbrains.kotlin:kotlin-stdlib"
replace it with:
compile "org.jetbrains.kotlin:kotlin-stdlib-jdk8"
I am creating a Spark jar file with the following Scala code in it:
import com.typesafe.config.ConfigFactory

object GetRequest {
  def main(args: Array[String]): Unit = {
    val api_credentials = ConfigFactory.load("application.conf")
    val username = api_credentials.getString("pi.api.username")
    val password = api_credentials.getString("pi.api.password")
  }
}
While submitting the jar, it is not able to find the application.conf file, which is located under C:\Users\abc\Desktop\ApiSparkJob\resource. How do I reference it in the spark-submit command on the CLI?
A resource file bundled inside the jar won't be available to each Spark worker, so you need to ship the file with the --files argument:
--files application.conf
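For example (a sketch; the class name, jar name, and master are placeholders for your own values), the submit command could look like this:

spark-submit --class GetRequest --master yarn --deploy-mode cluster --files C:\Users\abc\Desktop\ApiSparkJob\resource\application.conf apisparkjob.jar

On YARN, a file shipped with --files is also uploaded to the application's staging directory, which is what the next snippet relies on.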
If your resource manager is YARN, refer to the code below:
import org.apache.hadoop.fs.{FileSystem, Path}
import java.io.{BufferedReader, InputStreamReader}
import com.typesafe.config.{Config, ConfigFactory}
import org.apache.spark.sql.SparkSession

object GetRequest {
  def main(args: Array[String]): Unit = {
    val sparkSession: SparkSession = SparkSession.builder.enableHiveSupport().getOrCreate()
    val yarnStagingDir: String = System.getenv("SPARK_YARN_STAGING_DIR")
    val confFile: Path = new Path(yarnStagingDir.concat("/application.conf"))
    val fs: FileSystem = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
    val br: BufferedReader = new BufferedReader(new InputStreamReader(fs.open(confFile)))
    val api_credentials: Config = ConfigFactory.parseReader(br).resolve()
    val username: String = api_credentials.getString("pi.api.username")
    val password: String = api_credentials.getString("pi.api.password")
    br.close()
    // Don't call fs.close(): the same cached FileSystem instance is used to access
    // the Hive warehouse directory, and closing it would end your job.
  }
}
Edit: Using Kryo 1.04
I'm currently serializing a User class in Scala that contains a java.sql.Timestamp field. For some reason, Kryo can't find a zero-arg constructor for Timestamp and throws an error:
Caused by: com.esotericsoftware.kryo.SerializationException: Class cannot be created (missing no-arg constructor): java.sql.Timestamp
Serialization trace:
created (com.threetierlogic.AccountService.models.User)
at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:688)
at com.esotericsoftware.kryo.Serializer.newInstance(Serializer.java:75)
at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:200)
at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:220)
at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:200)
at com.esotericsoftware.kryo.Serializer.readObject(Serializer.java:61)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:589)
... 84 more
Caused by: java.lang.InstantiationException: java.sql.Timestamp
at java.lang.Class.newInstance0(Class.java:340)
at java.lang.Class.newInstance(Class.java:308)
at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:676)
... 90 more
This is part of a converter class that passes domain objects to Riak. Here's the converter:
/**
 * Kryo converter for passing domain objects into Riak
 */
class UserConverter(val bucket: String) extends Converter[User] {

  def fromDomain(domainObject: User, vclock: VClock): IRiakObject = {
    val key = domainObject.guid
    if (key == null) throw new NoKeySpecifedException(domainObject)

    val kryo = new Kryo()
    kryo.register(classOf[User])
    kryo.register(classOf[Timestamp])

    val ob = new ObjectBuffer(kryo)
    val value = ob.writeObject(domainObject)

    RiakObjectBuilder.newBuilder(bucket, key)
      .withValue(value)
      .withVClock(vclock)
      .withContentType(Constants.CTYPE_OCTET_STREAM)
      .build()
  }

  def toDomain(riakObject: IRiakObject): User = {
    if (riakObject == null) null
    val kryo = new Kryo()
    kryo.register(classOf[User])
    kryo.register(classOf[Timestamp])
    val ob = new ObjectBuffer(kryo)
    ob.readObject(riakObject.getValue(), classOf[User])
  }
}
Do I need to extend Timestamp and create a zero argument constructor? Or is there a better workaround?
If I need to upgrade to 2.20, what's the replacement for ObjectBuffer without writing to a file?
A quick look at the Kryo home page suggests that, in the absence of a zero-arg constructor, you can configure what Kryo calls an "instantiator strategy" to handle such a class. Look in the "Object Creation" section.
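I have not tried it against 1.04, but with Kryo 2.x the idea looks roughly like this (a sketch; it relies on the Objenesis StdInstantiatorStrategy bundled with Kryo, and depending on the Kryo version it may bypass constructors for every class, not just Timestamp):

import com.esotericsoftware.kryo.Kryo
import org.objenesis.strategy.StdInstantiatorStrategy

val kryo = new Kryo()
// Fall back to Objenesis when a class (like java.sql.Timestamp) has no zero-arg constructor.
kryo.setInstantiatorStrategy(new StdInstantiatorStrategy())
kryo.register(classOf[User])
kryo.register(classOf[java.sql.Timestamp])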
You can do something like this:
class KryoSO {
  import com.esotericsoftware.kryo.KryoSerializable
  import de.javakaffee.kryoserializers.KryoReflectionFactorySupport
  import com.esotericsoftware.kryo.Kryo
  import com.esotericsoftware.kryo.Serializer
  import java.io.{ InputStream, OutputStream }
  import com.esotericsoftware.kryo.io.{ Output, Input }
  import java.sql.Timestamp

  object TimestampSerializer extends Serializer[Timestamp] {
    override def write(kryo: Kryo, output: Output, t: Timestamp): Unit = {
      output.writeLong(t.getTime(), true)
    }
    override def read(kryo: Kryo, input: Input, t: Class[Timestamp]): Timestamp = {
      new Timestamp(input.readLong(true))
    }
    override def copy(kryo: Kryo, original: Timestamp): Timestamp = {
      new Timestamp(original.getTime())
    }
  }

  val kryo: Kryo = new KryoReflectionFactorySupport
  kryo.addDefaultSerializer(classOf[Timestamp], TimestampSerializer)

  def serialize(o: Any, os: OutputStream) = {
    val output = new Output(os)
    this.kryo.writeClassAndObject(output, o)
    output.flush()
  }

  def deserialize(is: InputStream): Any = {
    kryo.readClassAndObject(new Input(is))
  }
}
val k = new KryoSO
val b = new java.io.ByteArrayOutputStream
val timestamp = new java.sql.Timestamp(System.currentTimeMillis())
k.serialize(timestamp, b)
val result = k.deserialize(new java.io.ByteArrayInputStream(b.toByteArray()))
println(timestamp)
println(result.getClass)
println(result.isInstanceOf[java.sql.Timestamp])
println(timestamp == result)
Result:
2013-02-07 10:59:19.482
class java.sql.Timestamp
true
true