Deserialising a Spark DataFrame (Dataset) in Java using protobuf - java

How can I do something similar to the code below in Java?
I basically need to deserialise streaming data using protobuf. The stream is a Spark DataFrame, which in Java is a Dataset.
sparkSession.readStream
  .format("kafka")
  .option("subscribe", topic)
  .option("kafka.bootstrap.servers", bootstrapServers)
  .load()
  .selectExpr("key", "value") // Selecting only key & value
  .as[(Array[Byte], Array[Byte])]
  .flatMap {
    case (key, value) =>
      for {
        deserializedKey <- Try {
          keyDeserializer.deserialize(topic, key)
        }.toOption
        deserializedValue <- Try {
          valueDeserializer.deserialize(topic, value)
        }.toOption
      } yield (deserializedKey, deserializedValue)
  }
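A rough Java equivalent might look like the sketch below; it is not from the original post. It assumes a SparkSession named sparkSession, the same topic and bootstrapServers variables, serializable Kafka Deserializer instances keyDeserializer and valueDeserializer (the value deserializer wrapping your generated protobuf class), and placeholder types MyKey and MyProtoMessage for whatever those deserializers return. Records that fail to deserialize are dropped, mirroring the Try(...).toOption in the Scala version.

import java.util.Collections;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import scala.Tuple2;

Dataset<Tuple2<byte[], byte[]>> raw = sparkSession.readStream()
        .format("kafka")
        .option("subscribe", topic)
        .option("kafka.bootstrap.servers", bootstrapServers)
        .load()
        .selectExpr("key", "value")                                 // selecting only key & value
        .as(Encoders.tuple(Encoders.BINARY(), Encoders.BINARY()));

Dataset<Tuple2<MyKey, MyProtoMessage>> deserialized = raw.flatMap(
        (FlatMapFunction<Tuple2<byte[], byte[]>, Tuple2<MyKey, MyProtoMessage>>) kv -> {
            try {
                // Same idea as Try { ... }.toOption: keep the record only if both
                // key and value deserialize cleanly, otherwise emit nothing.
                MyKey key = keyDeserializer.deserialize(topic, kv._1());
                MyProtoMessage value = valueDeserializer.deserialize(topic, kv._2());
                return Collections.singletonList(new Tuple2<>(key, value)).iterator();
            } catch (Exception e) {
                return Collections.<Tuple2<MyKey, MyProtoMessage>>emptyIterator();
            }
        },
        // Encoders.kryo is one option when the deserialized types are arbitrary Java classes
        Encoders.tuple(Encoders.kryo(MyKey.class), Encoders.kryo(MyProtoMessage.class)));

Note that everything referenced inside the lambda (the deserializers and topic) must be serializable, or be created inside the lambda itself, because the function is shipped to the executors.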

Related

Apache Kafka - Implementing a KTable

I am new to the Kafka Streams API and I am trying to create a KTable. I have an input topic, s-order-topic, which receives JSON messages in the format shown below.
{ "current_ts": "2019-12-24 13:16:40.316952",
"primary_keys": ["ID"],
"before": null,
"tokens": {"txid":"3.17.2493",
"csn":"64913009"},
"op_type":"I",
"after": { "CODE":"AAAA41",
"STATUS":"COMPLETED",
"ID":24},
"op_ts":"2019-12-24 13:16:40.316941",
"table":"S_ORDER"}
I read messages from this topic and I want to create a KTable whose key is the "ID" field inside "after" and whose value is all the other fields inside "after" (everything except "ID").
I have successfully created a KTable only when using the built-in aggregations, e.g. count, but I have difficulty writing my own aggregation. Below is the part of the code where I try to create the KTable.
KTable<Long, String> s_table = builder.stream("s-order-topic", Consumed.with(Serdes.Long(), Serdes.String()))
    .mapValues(value -> {
        String time;
        JSONObject json = new JSONObject(value);
        if (json.getString("op_type").equals("I")) {
            time = "after";
        } else {
            time = "before";
        }
        JSONObject json2 = new JSONObject(json.getJSONObject(time).toString());
        return json2.toString();
    })
    .groupBy((key, value) -> {
        JSONObject json = new JSONObject(value);
        return json.getLong("ID");
    }, Grouped.with(Serdes.Long(), Serdes.String()))
    .aggregate( ... );
How can I implement this KTable?
Am I approaching the problem correctly?
(mapValues -> keep only the "before"/"after" field. groupBy -> Make the ID the key of the message. Aggregate -> ? )
I figured out a solution for my case. I implemented the KTable as shown below:
KTable<String, String> s_table = builder.stream("s-order-topic", Consumed.with(Serdes.String(), Serdes.String()))
    .mapValues(value -> {
        String time;
        JSONObject json = new JSONObject(value);
        if (json.getString("op_type").equals("I")) {
            time = "after";
        } else {
            time = "before";
        }
        JSONObject json2 = new JSONObject(json.getJSONObject(time).toString());
        return json2.toString();
    })
    .groupBy((key, value) -> {
        JSONObject json = new JSONObject(value);
        return String.valueOf(json.getLong("ID"));
    }, Grouped.with(Serdes.String(), Serdes.String()))
    .reduce((prev, newval) -> newval);
The aggregate function was not a good fit for this case, so I used reduce instead.
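For completeness, a minimal sketch of what an equivalent aggregate() call could look like (this is not from the original answer; it assumes the same stream/mapValues/groupBy chain as above, kept in a KGroupedStream variable, and it simply keeps the latest value per key, which is exactly what the reduce does):

// Assuming the result of the groupBy(...) above is kept in a variable:
KGroupedStream<String, String> grouped = /* ...same stream().mapValues().groupBy() chain as above... */;

KTable<String, String> viaAggregate = grouped.aggregate(
    () -> null,                                                  // initializer: no value yet for a new key
    (String id, String newValue, String oldValue) -> newValue,   // aggregator: keep the newest record per ID
    Materialized.with(Serdes.String(), Serdes.String()));

KGroupedStream and Materialized come from org.apache.kafka.streams.kstream; for a plain "keep the latest value" table, reduce is the simpler choice.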
The output from the console consumer is shown below:
15 {"CODE":"AAAA17","STATUS":"PENDING","ID":15}
18 {"CODE":"AAAA50","STATUS":"SUBMITTED","ID":18}
4 {"CODE":"AAAA80","STATUS":"SUBMITTED","ID":4}
19 {"CODE":"AAAA83","STATUS":"SUBMITTED","ID":19}
18 {"CODE":"AAAA33","STATUS":"COMPLETED","ID":18}
5 {"CODE":"AAAA38","STATUS":"PENDING","ID":5}
10 {"CODE":"AAAA1","STATUS":"COMPLETED","ID":10}
3 {"CODE":"AAAA68","STATUS":"NOT COMPLETED","ID":3}
9 {"CODE":"AAAA89","STATUS":"PENDING","ID":9}

Backport Java 8 Lambda code - JSON byte array to Avro

I need to backport some Java 8 code from this library:
https://github.com/allegro/json-avro-converter
Could someone more experienced with Java 8 confirm whether the following code for converting JSON-encoded byte arrays to Avro is effectively identical, and that there isn't a silly edge case I've missed?
With lambdas:
private GenericData.Record readRecord(Map<String, Object> json, Schema schema, Deque<String> path) {
    GenericRecordBuilder record = new GenericRecordBuilder(schema);
    json.entrySet().forEach(entry ->
        ofNullable(schema.getField(entry.getKey()))
            .ifPresent(field -> record.set(field, read(field, field.schema(), entry.getValue(), path, false))));
    return record.build();
}
My modifications for JDK7:
private GenericData.Record readRecord(Map<String, Object> json, Schema schema, Deque<String> path) {
    GenericRecordBuilder record = new GenericRecordBuilder(schema);
    for (Map.Entry<String, Object> entry : json.entrySet()) {
        Schema.Field field = schema.getField(entry.getKey());
        if (field != null) {
            record.set(field, read(field, field.schema(), entry.getValue(), path, false));
        } else {
            // Do nothing: the field is not in the schema, same as ifPresent not firing
        }
    }
    return record.build();
}
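The core question is whether the ofNullable(...).ifPresent(...) idiom matches the explicit null check. A trivial standalone illustration of that equivalence, using a hypothetical lookup() that may return null and a hypothetical consume():

import java.util.Optional;

Object a = lookup();                                     // may return null
Optional.ofNullable(a).ifPresent(v -> consume(v));       // Java 8: runs consume only if non-null

Object b = lookup();
if (b != null) {                                         // JDK 7: same effect
    consume(b);
}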

String transformation using Spark

I'm learning Spark and trying to write a fairly simple app.
As input I have log lines which look like
INFO - {timestamp} - {path} - {json message}
INFO - 124534234534534 - test.class - {"message": "something happened"}
I want to push them to ElasticSearch, so I need to take the {timestamp} and add it as a new field of the {json message}, so that it looks like
{"timestamp": "1234343132", "message": "something happened"}
Can someone help me with this transformation using Java?
Create a Function<String, String> which takes a line of the log and returns the JSON string:
Function<String, String> f = new Function<String, String>() {
    @Override
    public String call(String s) { return ...; }
};
Read the data using SparkContext.textFile:
JavaSparkContext sc = ...;
JavaRDD<String> rdd = sc.textFile(...);
Then map the created RDD using the function defined in step 1:
rdd.map(f);
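Putting those steps together, a concrete sketch could look like the following (the input path and the " - " splitting are assumptions based on the log format in the question; org.json is used for the JSON handling):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.json.JSONObject;

SparkConf conf = new SparkConf().setAppName("log-to-json").setMaster("local[2]");
JavaSparkContext sc = new JavaSparkContext(conf);

// Hypothetical input path
JavaRDD<String> lines = sc.textFile("logs/app.log");

JavaRDD<String> jsonLines = lines.map(new Function<String, String>() {
    @Override
    public String call(String line) {
        // "INFO - {timestamp} - {path} - {json message}" -> at most 4 parts
        String[] parts = line.split(" - ", 4);
        JSONObject json = new JSONObject(parts[3]);
        json.put("timestamp", parts[1]);
        return json.toString();
    }
});
// jsonLines now holds strings like {"timestamp":"124534234534534","message":"something happened"}
// and can be written out or sent on to ElasticSearch.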

Spark - How to use SparkContext within classes?

I am building an application in Spark, and would like to use the SparkContext and/or SQLContext within methods in my classes, mostly to pull/generate data sets from files or SQL queries.
For example, I would like to create a T2P object which contains methods that gather data (and in this case need access to the SparkContext):
class T2P(mid: Int, sc: SparkContext, sqlContext: SQLContext) extends Serializable {
  def getImps(): DataFrame = {
    val imps = sc.textFile("file.txt").map(line => line.split("\t")).map(d => Data(d(0).toInt, d(1), d(2), d(3))).toDF()
    return imps
  }
  def getX(): DataFrame = {
    val x = sqlContext.sql("SELECT a,b,c FROM table")
    return x
  }
}

//creating the T2P object
class App {
  val conf = new SparkConf().setAppName("T2P App").setMaster("local[2]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  val t2p = new T2P(0, sc, sqlContext)
}
Passing the SparkContext as an argument to the T2P class doesn't work, since the SparkContext is not serializable (I get a task not serializable error when creating T2P objects). What is the best way to use the SparkContext/SQLContext inside my classes? Or is this perhaps the wrong way to design a data-pull process in Spark?
UPDATE
Realized from the comments on this post that the SparkContext was not the problem, but that I was calling a method of my class within a 'map' function, which caused Spark to try to serialize the entire class. This caused the error, since SparkContext is not serializable.
def startMetricTo(userData: ((Int, String), List[(Int, String)]), startMetric: String): T2PUser = {
  //do something
}
def buildUserRollup() = {
  this.userRollup = this.userSorted.map(line => startMetricTo(line, this.startMetric))
}
This results in a 'task not serializable' exception.
I fixed this problem (with the help of the commenters and other StackOverflow users) by creating a separate MetricCalc object to store my startMetricTo() method. Then I changed the buildUserRollup() method to use this new startMetricTo(). This allows the entire MetricCalc object to be serialized without issue.
//newly created object
object MetricCalc {
  def startMetricTo(userData: ((Int, String), List[(Int, String)]), startMetric: String): T2PUser = {
    //do something
  }
}

//using the function in T2P
def buildUserRollup(startMetric: String) = {
  this.userRollup = this.userSorted.map(line => MetricCalc.startMetricTo(line, startMetric))
}
I tried several options; this is what eventually worked for me.
object SomeName extends App {
  val conf = new SparkConf()...
  val sc = new SparkContext(conf)
  implicit val sqlC = SQLContext.getOrCreate(sc)
  getDF1(sqlC)

  def getDF1(sqlCo: SQLContext): Unit = {
    val query1 = SomeQuery here
    val df1 = sqlCo.read.format("jdbc").options(Map("url" -> dbUrl, "dbtable" -> query1)).load.cache()
    //iterate through df1 and retrieve the 2nd DataFrame based on some values in the Row of the first DataFrame
    df1.foreach(x => {
      getDF2(x.getString(0), x.getDecimal(1).toString, x.getDecimal(3).doubleValue)(sqlCo)
    })
  }

  def getDF2(a: String, b: String, c: Double)(implicit sqlCont: SQLContext): Unit = {
    val query2 = Somequery
    val sqlcc = SQLContext.getOrCreate(sc)
    //val sqlcc = sqlCont //Did not work for me. Also, omitting (implicit sqlCont: SQLContext) altogether did not work
    val df2 = sqlcc.read.format("jdbc").options(Map("url" -> dbURL, "dbtable" -> query2)).load().cache()
    .
    .
    .
  }
}
Note: In the above code, if I omitted the (implicit sqlCont: SQLContext) parameter from the getDF2 method signature, it would not work. I tried several other options for passing the sqlContext from one method to the other; they always gave me a NullPointerException or a Task not serializable exception. The good thing is that it eventually worked this way, and I could retrieve parameters from a row of DataFrame1 and use those values when loading DataFrame2.

Convert JavaDStream<String> to JavaRDD<String>

I have a JavaDStream which gets data from an external source. I'm trying to integrate Spark Streaming and Spark SQL. It's known that a JavaDStream is made up of JavaRDDs, and I can only apply the function applySchema() when I have a JavaRDD. Please help me convert it to a JavaRDD. I know there are functions for this in Scala, and it's much easier there, but help me out in Java.
You can't transform a DStream into an RDD. As you mention, a DStream contains RDDs. The way to get access to the RDDs is by applying a function to each RDD of the DStream using foreachRDD. See the docs: https://spark.apache.org/docs/1.1.0/api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#foreachRDD(org.apache.spark.api.java.function.Function2)
You first have to access the individual RDDs inside the DStream using foreachRDD, e.g. (Scala-style pseudocode):
javaDStream.foreachRDD( rdd => {
  rdd.collect.foreach({
    ...
  })
})
I hope this helps to convert a JavaDStream to a JavaRDD!
JavaDStream<String> lines = stream.map(ConsumerRecord::value);

//Create JavaRDD<Row>
lines.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    @Override
    public void call(JavaRDD<String> rdd) {
        JavaRDD<Row> rowRDD = rdd.map(new Function<String, Row>() {
            @Override
            public Row call(String msg) {
                Row row = RowFactory.create(msg);
                return row;
            }
        });
        //Create Schema
        StructType schema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("value", DataTypes.StringType, true)});
        //Get Spark 2.0 session
        SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
        Dataset<Row> msgDataFrame = spark.createDataFrame(rowRDD, schema);
        msgDataFrame.show();
    }
});
