Edit: Using Kryo 1.04
I'm serializing a User class in Scala that contains a java.sql.Timestamp field. For some reason, Kryo can't find a zero-arg constructor and throws this error:
Caused by: com.esotericsoftware.kryo.SerializationException: Class cannot be created (missing no-arg constructor): java.sql.Timestamp
Serialization trace:
created (com.threetierlogic.AccountService.models.User)
at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:688)
at com.esotericsoftware.kryo.Serializer.newInstance(Serializer.java:75)
at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:200)
at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:220)
at com.esotericsoftware.kryo.serialize.FieldSerializer.readObjectData(FieldSerializer.java:200)
at com.esotericsoftware.kryo.Serializer.readObject(Serializer.java:61)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:589)
... 84 more
Caused by: java.lang.InstantiationException: java.sql.Timestamp
at java.lang.Class.newInstance0(Class.java:340)
at java.lang.Class.newInstance(Class.java:308)
at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:676)
... 90 more
This is part of a converter class that converts domain objects for Riak. Here's my converter class:
/**
 * Kryo converter for passing domain objects into Riak
 */
class UserConverter(val bucket: String) extends Converter[User] {

  def fromDomain(domainObject: User, vclock: VClock): IRiakObject = {
    val key = domainObject.guid
    if (key == null) throw new NoKeySpecifedException(domainObject)

    val kryo = new Kryo()
    kryo.register(classOf[User])
    kryo.register(classOf[Timestamp])
    val ob = new ObjectBuffer(kryo)
    val value = ob.writeObject(domainObject)

    RiakObjectBuilder.newBuilder(bucket, key)
      .withValue(value)
      .withVClock(vclock)
      .withContentType(Constants.CTYPE_OCTET_STREAM)
      .build()
  }

  def toDomain(riakObject: IRiakObject): User = {
    if (riakObject == null) null

    val kryo = new Kryo()
    kryo.register(classOf[User])
    kryo.register(classOf[Timestamp])
    val ob = new ObjectBuffer(kryo)
    ob.readObject(riakObject.getValue(), classOf[User])
  }
}
Do I need to extend Timestamp and create a zero-argument constructor? Or is there a better workaround?
If I need to upgrade to Kryo 2.20, what's the replacement for ObjectBuffer that doesn't involve writing to a file?
A quick look at the Kryo home page suggests that, in the absence of a zero-arg constructor, you can create what Kryo calls an "Instantiation Strategy" to handle that class. Look in the "Object Creation" section.
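If you do upgrade to Kryo 2.x, a minimal sketch of that approach (assuming the Objenesis StdInstantiatorStrategy is on the classpath; check the exact API against your Kryo version) looks roughly like this:

import com.esotericsoftware.kryo.Kryo
import org.objenesis.strategy.StdInstantiatorStrategy

// Sketch only: fall back to Objenesis so Kryo can create classes
// such as java.sql.Timestamp that have no zero-arg constructor.
val kryo = new Kryo()
kryo.setInstantiatorStrategy(new StdInstantiatorStrategy())
kryo.register(classOf[java.sql.Timestamp])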
You can do something like this:
class KryoSO {
  import com.esotericsoftware.kryo.KryoSerializable
  import de.javakaffee.kryoserializers.KryoReflectionFactorySupport
  import com.esotericsoftware.kryo.Kryo
  import com.esotericsoftware.kryo.Serializer
  import java.io.{ InputStream, OutputStream }
  import com.esotericsoftware.kryo.io.{ Output, Input }
  import java.sql.Timestamp

  object TimestampSerializer extends Serializer[Timestamp] {
    override def write(kryo: Kryo, output: Output, t: Timestamp): Unit = {
      output.writeLong(t.getTime(), true)
    }
    override def read(kryo: Kryo, input: Input, t: Class[Timestamp]): Timestamp = {
      new Timestamp(input.readLong(true))
    }
    override def copy(kryo: Kryo, original: Timestamp): Timestamp = {
      new Timestamp(original.getTime())
    }
  }

  val kryo: Kryo = new KryoReflectionFactorySupport
  kryo.addDefaultSerializer(classOf[Timestamp], TimestampSerializer)

  def serialize(o: Any, os: OutputStream) = {
    val output = new Output(os)
    this.kryo.writeClassAndObject(output, o)
    output.flush()
  }

  def deserialize(is: InputStream): Any = {
    kryo.readClassAndObject(new Input(is))
  }
}
val k = new KryoSO
val b = new java.io.ByteArrayOutputStream
val timestamp = new java.sql.Timestamp(System.currentTimeMillis())
k.serialize(timestamp, b)
val result = k.deserialize(new java.io.ByteArrayInputStream(b.toByteArray()))
println(timestamp)
println(result.getClass)
println(result.isInstanceOf[java.sql.Timestamp])
println(timestamp == result)
Result:
2013-02-07 10:59:19.482
class java.sql.Timestamp
true
true
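As for the second question (the ObjectBuffer replacement in 2.x): everything can stay in memory. A hedged sketch, reusing the TimestampSerializer defined in the example above, where the Output/Input pair plays the role ObjectBuffer used to:

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{ Input, Output }
import java.sql.Timestamp

val kryo = new Kryo()
// Register the custom serializer so Timestamp does not need a zero-arg constructor.
kryo.register(classOf[Timestamp], TimestampSerializer)

// Output writes into an in-memory buffer that grows as needed (no file involved).
val output = new Output(1024, -1)
kryo.writeObject(output, new Timestamp(System.currentTimeMillis()))
val bytes: Array[Byte] = output.toBytes

// Input reads straight back from the byte array.
val input = new Input(bytes)
val restored = kryo.readObject(input, classOf[Timestamp])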
Related
I am attempting to read Avro data from Kafka using Spark Streaming, but I receive the following error message:
Streaming Query Exception caught!: org.apache.spark.sql.streaming.StreamingQueryException: Job aborted.
=== Streaming Query ===
Identifier: [id = 8b54c92d-6bbc-4dbc-84d0-55b762c21ba2, runId = 4bc92b3c-343e-4886-b0bc-0777b89f9ec8]
Current Committed Offsets: {KafkaV2[Subscribe[customer-avro4]]: {"customer-avro":{"0":17}}}
Current Available Offsets: {KafkaV2[Subscribe[customer-avro4]]: {"customer-avro":{"0":20}}}
Current State: ACTIVE
Thread State: RUNNABLE
Any idea what the issue might be and how to resolve it? The code is below (inspired by the xebia-france spark-structured-streaming-blog). It actually ran earlier, but now there is a problem.
import com.databricks.spark.avro.SchemaConverters
import io.confluent.kafka.schemaregistry.client.{CachedSchemaRegistryClient, SchemaRegistryClient}
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryException

object AvroConsumer {
  private val topic = "customer-avro4"
  private val kafkaUrl = "http://localhost:9092"
  private val schemaRegistryUrl = "http://localhost:8081"

  private val schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 128)
  private val kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)

  private val avroSchema = schemaRegistryClient.getLatestSchemaMetadata(topic + "-value").getSchema
  private val sparkSchema = SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("ConfluentConsumer")
      .master("local[*]")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    spark.udf.register("deserialize", (bytes: Array[Byte]) =>
      DeserializerWrapper.deserializer.deserialize(bytes)
    )

    val kafkaDataFrame = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaUrl)
      .option("subscribe", topic)
      .load()

    val valueDataFrame = kafkaDataFrame.selectExpr("""deserialize(value) AS message""")

    import org.apache.spark.sql.functions._

    val formattedDataFrame = valueDataFrame.select(
      from_json(col("message"), sparkSchema.dataType).alias("parsed_value"))
      .select("parsed_value.*")

    val writer = formattedDataFrame
      .writeStream
      .format("parquet")
      .option("checkpointLocation", "hdfs://localhost:9000/data/spark/parquet/checkpoint")

    while (true) {
      val query = writer.start("hdfs://localhost:9000/data/spark/parquet/total")
      try {
        query.awaitTermination()
      }
      catch {
        case e: StreamingQueryException => println("Streaming Query Exception caught!: " + e)
      }
    }
  }

  object DeserializerWrapper {
    val deserializer: AvroDeserializer = kafkaAvroDeserializer
  }

  class AvroDeserializer extends AbstractKafkaAvroDeserializer {
    def this(client: SchemaRegistryClient) {
      this()
      this.schemaRegistry = client
    }

    override def deserialize(bytes: Array[Byte]): String = {
      val genericRecord = super.deserialize(bytes).asInstanceOf[GenericRecord]
      genericRecord.toString
    }
  }
}
Figured it out - the problem was not, as I had thought, with the Spark-Kafka integration directly, but with the checkpoint information inside the HDFS filesystem. Deleting and recreating the checkpoint folder in HDFS solved it for me.
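For reference, a minimal sketch of that cleanup done programmatically, assuming the checkpoint path from the question (running hdfs dfs -rm -r on the directory works just as well):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: remove the stale checkpoint directory before restarting the query.
val checkpointPath = new Path("hdfs://localhost:9000/data/spark/parquet/checkpoint")
val fs = FileSystem.get(checkpointPath.toUri, new Configuration())
if (fs.exists(checkpointPath)) {
  fs.delete(checkpointPath, true) // recursive delete
}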
I am trying to deserialize some Kafka messages that were serialized by NiFi, using the Hortonworks Schema Registry.
Processor used on the NiFi side as RecordWriter: AvroRecordSetWriter
Schema write strategy: HWX Content-Encoded Schema Reference
I am able to deserialize these messages in another NiFi Kafka consumer. Now, however, I am trying to deserialize them from my Flink application's Kafka consumer code.
I have the following inside the Kafka deserialization handler of my Flink application:
final String SCHEMA_REGISTRY_CACHE_SIZE_KEY = SchemaRegistryClient.Configuration.CLASSLOADER_CACHE_SIZE.name();
final String SCHEMA_REGISTRY_CACHE_EXPIRY_INTERVAL_SECS_KEY = SchemaRegistryClient.Configuration.CLASSLOADER_CACHE_EXPIRY_INTERVAL_SECS.name();
final String SCHEMA_REGISTRY_SCHEMA_VERSION_CACHE_SIZE_KEY = SchemaRegistryClient.Configuration.SCHEMA_VERSION_CACHE_SIZE.name();
final String SCHEMA_REGISTRY_SCHEMA_VERSION_CACHE_EXPIRY_INTERVAL_SECS_KEY = SchemaRegistryClient.Configuration.SCHEMA_VERSION_CACHE_EXPIRY_INTERVAL_SECS.name();
final String SCHEMA_REGISTRY_URL_KEY = SchemaRegistryClient.Configuration.SCHEMA_REGISTRY_URL.name();
Properties schemaRegistryProperties = new Properties();
schemaRegistryProperties.put(SCHEMA_REGISTRY_CACHE_SIZE_KEY, 10L);
schemaRegistryProperties.put(SCHEMA_REGISTRY_CACHE_EXPIRY_INTERVAL_SECS_KEY, 5000L);
schemaRegistryProperties.put(SCHEMA_REGISTRY_SCHEMA_VERSION_CACHE_SIZE_KEY, 1000L);
schemaRegistryProperties.put(SCHEMA_REGISTRY_SCHEMA_VERSION_CACHE_EXPIRY_INTERVAL_SECS_KEY, 60 * 60 * 1000L);
schemaRegistryProperties.put(SCHEMA_REGISTRY_URL_KEY, "http://schema_registry_server:7788/api/v1");
return (Map<String, Object>) HWXSchemaRegistry.getInstance(schemaRegistryProperties).deserialize(message);
And here is the HWXSchemaRegistry code that deserializes the message:
import com.hortonworks.registries.schemaregistry.avro.AvroSchemaProvider;
import com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient;
import com.hortonworks.registries.schemaregistry.errors.SchemaNotFoundException;
import com.hortonworks.registries.schemaregistry.serdes.avro.AvroSnapshotDeserializer;

public class HWXSchemaRegistry {

    private SchemaRegistryClient client;
    private Map<String,Object> config;
    private AvroSnapshotDeserializer deserializer;
    private static HWXSchemaRegistry hwxSRInstance = null;

    public static HWXSchemaRegistry getInstance(Properties schemaRegistryConfig) {
        if (hwxSRInstance == null)
            hwxSRInstance = new HWXSchemaRegistry(schemaRegistryConfig);
        return hwxSRInstance;
    }

    public Object deserialize(byte[] message) throws IOException {
        Object o = hwxSRInstance.deserializer.deserialize(new ByteArrayInputStream(message), null);
        return o;
    }

    private static Map<String,Object> properties2Map(Properties config) {
        Enumeration<Object> keys = config.keys();
        Map<String, Object> configMap = new HashMap<String,Object>();
        while (keys.hasMoreElements()) {
            Object key = (Object) keys.nextElement();
            configMap.put(key.toString(), config.get(key));
        }
        return configMap;
    }

    private HWXSchemaRegistry(Properties schemaRegistryConfig) {
        _log.debug("Init SchemaRegistry Client");
        this.config = HWXSchemaRegistry.properties2Map(schemaRegistryConfig);
        this.client = new SchemaRegistryClient(this.config);
        this.deserializer = this.client.getDefaultDeserializer(AvroSchemaProvider.TYPE);
        this.deserializer.init(this.config);
    }
}
But I am getting a 404 HTTP error (schema not found). I think this is due to incompatible "protocols" between the NiFi configuration and the HWX Schema Registry client implementation, so the schema identifier bytes the client is looking for do not exist on the server, or something like that.
Can someone help with this?
Thank you.
Caused by: javax.ws.rs.NotFoundException: HTTP 404 Not Found
at org.glassfish.jersey.client.JerseyInvocation.convertToException(JerseyInvocation.java:1069)
at org.glassfish.jersey.client.JerseyInvocation.translate(JerseyInvocation.java:866)
at org.glassfish.jersey.client.JerseyInvocation.lambda$invoke$1(JerseyInvocation.java:750)
at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
at org.glassfish.jersey.internal.Errors.process(Errors.java:205)
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:390)
at org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:748)
at org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:404)
at org.glassfish.jersey.client.JerseyInvocation$Builder.get(JerseyInvocation.java:300)
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient$14.run(SchemaRegistryClient.java:1054)
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient$14.run(SchemaRegistryClient.java:1051)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.getEntities(SchemaRegistryClient.java:1051)
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.getAllVersions(SchemaRegistryClient.java:872)
at com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient.getAllVersions(SchemaRegistryClient.java:676)
at HWXSchemaRegistry.(HWXSchemaRegistry.java:56)
at HWXSchemaRegistry.getInstance(HWXSchemaRegistry.java:26)
at SchemaService.deserialize(SchemaService.java:70)
at SchemaService.deserialize(SchemaService.java:26)
at org.apache.flink.streaming.connectors.kafka.internals.KafkaDeserializationSchemaWrapper.deserialize(KafkaDeserializationSchemaWrapper.java:45)
at org.apache.flink.streaming.connectors.kafka.internal.KafkaFetcher.runFetchLoop(KafkaFetcher.java:140)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:712)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:302)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:745)
I found a workaround, since I wasn't able to get this working directly. I take the first bytes of the byte array, use them to make a couple of calls to the schema registry, and fetch the Avro schema so I can deserialize the rest of the byte array afterwards.
The first byte (0) is the protocol version (I figured out this is a NiFi-specific byte, since I didn't need it).
The next 8 bytes are the schema ID.
The next 4 bytes are the schema version.
The rest of the bytes are the message itself:
import com.hortonworks.registries.schemaregistry.SchemaMetadataInfo;
import com.hortonworks.registries.schemaregistry.SchemaVersionInfo;
import com.hortonworks.registries.schemaregistry.SchemaVersionKey;
import com.hortonworks.registries.schemaregistry.client.SchemaRegistryClient;

try (SchemaRegistryClient client = new SchemaRegistryClient(this.schemaRegistryConfig)) {
    try {
        Long schemaId = ByteBuffer.wrap(Arrays.copyOfRange(message, 1, 9)).getLong();
        Integer schemaVersion = ByteBuffer.wrap(Arrays.copyOfRange(message, 9, 13)).getInt();

        SchemaMetadataInfo schemaInfo = client.getSchemaMetadataInfo(schemaId);
        String schemaName = schemaInfo.getSchemaMetadata().getName();

        SchemaVersionInfo schemaVersionInfo = client.getSchemaVersionInfo(
                new SchemaVersionKey(schemaName, schemaVersion));

        String avroSchema = schemaVersionInfo.getSchemaText();

        // The remaining bytes are the Avro-encoded payload itself
        // (renamed from "message" so it does not shadow the method parameter).
        byte[] payload = Arrays.copyOfRange(message, 13, message.length);
        // Deserialize [...]
    }
    catch (Exception e) {
        throw new IOException(e.getMessage());
    }
}
I also thought that maybe I had to remove the first byte before calling hwxSRInstance.deserializer.deserialize in my question's code, since this byte seems to be a NiFi-specific byte used to communicate between NiFi processors, but it didn't work.
The next step is to build a cache of schema texts to avoid calling the schema registry API repeatedly, as sketched below.
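A minimal sketch of that caching idea (written in Scala here; fetchSchemaText is a hypothetical stand-in for the SchemaRegistryClient lookups shown above):

import java.util.concurrent.ConcurrentHashMap

// Sketch only: cache schema texts keyed by (schemaId, schemaVersion) so the
// registry is only queried on a cache miss.
object SchemaTextCache {
  private val cache = new ConcurrentHashMap[(Long, Int), String]()

  def get(schemaId: Long, schemaVersion: Int)(fetchSchemaText: () => String): String = {
    val key = (schemaId, schemaVersion)
    val cached = cache.get(key)
    if (cached != null) cached
    else {
      val text = fetchSchemaText() // e.g. client.getSchemaVersionInfo(...).getSchemaText()
      cache.putIfAbsent(key, text)
      text
    }
  }
}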
New info: I am extending my answer to include the Avro deserialization part, since it took some troubleshooting and I had to inspect the NiFi Avro Reader source code to figure it out (I was getting a "not valid Avro data" exception when trying to use the basic Avro deserialization code):
import org.apache.avro.Schema;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;

private static GenericRecord deserializeMessage(byte[] message, String schemaText) throws IOException {
    InputStream in = new SeekableByteArrayInput(message);
    Schema schema = new Schema.Parser().parse(schemaText);
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(in, null);

    GenericRecord genericRecord = null;
    genericRecord = datumReader.read(genericRecord, decoder);
    in.close();

    return genericRecord;
}
If you want to convert the GenericRecord to a Map, note that string values are not String objects; you need to convert the record keys and any values typed as string:
private static Map<String, Object> avroGenericRecordToMap(GenericRecord record) {
    Map<String, Object> map = new HashMap<>();
    record.getSchema().getFields().forEach(field ->
            map.put(String.valueOf(field.name()), record.get(field.name())));

    // Strings are mapped to the Utf8 class, so they need to be converted
    // (all record keys and any values typed as string).
    if (map.get("value").getClass() == org.apache.avro.util.Utf8.class)
        map.put("value", String.valueOf(map.get("value")));

    return map;
}
I am trying to use the org.apache.hadoop.tools.DistCp class to copy some files over into an S3 bucket. However, the overwrite functionality is not working in spite of explicitly setting the overwrite flag to true.
Copying works fine, but it does not overwrite existing files; the copy mapper skips those files. I have explicitly set the "overwrite" option to true.
import com.typesafe.scalalogging.LazyLogging
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.tools.{DistCp, DistCpOptions}
import org.apache.hadoop.util.ToolRunner

import scala.collection.JavaConverters._

object distcptest extends App with LazyLogging {

  def copytoS3(hdfsSrcFilePathStr: String, s3DestPathStr: String) = {
    val hdfsSrcPathList = List(new Path(hdfsSrcFilePathStr))
    val s3DestPath = new Path(s3DestPathStr)

    val distcpOpt = new DistCpOptions(hdfsSrcPathList.asJava, s3DestPath)

    // Overwriting is not working in spite of explicitly setting it to true.
    distcpOpt.setOverwrite(true)

    val conf: Configuration = new Configuration()
    conf.set("fs.s3n.awsSecretAccessKey", "secret key")
    conf.set("fs.s3n.awsAccessKeyId", "access key")
    conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

    val distCp: DistCp = new DistCp(conf, distcpOpt)
    val filepaths: Array[String] = Array(hdfsSrcFilePathStr, s3DestPathStr)

    try {
      val distCp_result = ToolRunner.run(distCp, filepaths)
      if (distCp_result != 0) {
        logger.error(s"DistCP has failed with - error code = $distCp_result")
      }
    }
    catch {
      case e: Exception => {
        e.printStackTrace()
      }
    }
  }

  copytoS3("hdfs://abc/pqr", "s3n://xyz/wst")
}
I think the problem is that you called ToolRunner.run(distCp, filepaths).
If you check the source code of DistCp, its run method re-parses the arguments and overwrites inputOptions, so the DistCpOptions you passed to the constructor will not take effect (see the sketch after the quoted snippet).
@Override
public int run(String[] argv) {
  ...
  try {
    inputOptions = (OptionsParser.parse(argv));
    ...
  } catch (Throwable e) {
    ...
  }
  ...
}
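Two possible workarounds, sketched here against the question's Scala code (hedged; check both against your Hadoop version):

// Option 1: let OptionsParser see the flag by passing -overwrite through argv.
val filepaths: Array[String] = Array("-overwrite", hdfsSrcFilePathStr, s3DestPathStr)
val distCp_result = ToolRunner.run(distCp, filepaths)

// Option 2: skip ToolRunner.run and submit the job built from the DistCpOptions
// passed to the constructor, so setOverwrite(true) is honored.
val job = distCp.execute()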
I am trying to use Kotlin instead of Java, but I cannot find a good way to handle try-with-resources.
The Java code looks like this:
import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.Tensor;
import org.tensorflow.TensorFlow;

public class HelloTensorFlow {
    public static void main(String[] args) throws Exception {
        try (Graph g = new Graph()) {
            final String value = "Hello from " + TensorFlow.version();

            // Construct the computation graph with a single operation, a constant
            // named "MyConst" with a value "value".
            try (Tensor t = Tensor.create(value.getBytes("UTF-8"))) {
                // The Java API doesn't yet include convenience functions for adding operations.
                g.opBuilder("Const", "MyConst").setAttr("dtype", t.dataType()).setAttr("value", t).build();
            }

            // Execute the "MyConst" operation in a Session.
            try (Session s = new Session(g);
                 // Generally, there may be multiple output tensors,
                 // all of them must be closed to prevent resource leaks.
                 Tensor output = s.runner().fetch("MyConst").run().get(0)) {
                System.out.println(new String(output.bytesValue(), "UTF-8"));
            }
        }
    }
}
Doing it in Kotlin, I have to write this:
fun main(args: Array<String>) {
    val g = Graph()
    try {
        val value = "Hello from ${TensorFlow.version()}"
        val t = Tensor.create(value.toByteArray(Charsets.UTF_8))
        try {
            g.opBuilder("Const", "MyConst").setAttr("dtype", t.dataType()).setAttr("value", t).build()
        } finally {
            t.close()
        }

        var sess = Session(g)
        try {
            val output = sess.runner().fetch("MyConst").run().get(0)
            println(String(output.bytesValue(), Charsets.UTF_8))
        } finally {
            sess?.close()
        }
    } finally {
        g.close()
    }
}
I have tried to use Kotlin's use function like this:
Graph().use {
it -> ....
}
I got an error like this:
Error:(16, 20) Kotlin: Unresolved reference. None of the following candidates is applicable because of receiver type mismatch:
#InlineOnly public inline fun ???.use(block: (???) -> ???): ??? defined in kotlin.io
I was just using the wrong dependency:
compile "org.jetbrains.kotlin:kotlin-stdlib"
Replace it with:
compile "org.jetbrains.kotlin:kotlin-stdlib-jdk8"
How can I call this function from Java? Or do I need a wrapper in Scala?
package com.datastax.spark.connector

class DataFrameFunctions(dataFrame: DataFrame) extends Serializable {
  ...
  def createCassandraTable(
    keyspaceName: String,
    tableName: String,
    partitionKeyColumns: Option[Seq[String]] = None,
    clusteringKeyColumns: Option[Seq[String]] = None)(
    implicit
    connector: CassandraConnector = CassandraConnector(sparkContext.getConf)): Unit = {
  ...
I used the following code:
DataFrameFunctions frameFunctions = new DataFrameFunctions(dfTemp2);
Seq<String> argumentsSeq1 = JavaConversions.asScalaBuffer(Arrays.asList("CategoryName")).seq();
Option<Seq<String>> some1 = new Some<Seq<String>>(argumentsSeq1);
Seq<String> argumentsSeq2 = JavaConversions.asScalaBuffer(Arrays.asList("DealType")).seq();
Option<Seq<String>> some2 = new Some<Seq<String>>(argumentsSeq2);
frameFunctions.createCassandraTable("coupons", "IdealFeeds", some1, some2, connector);