How to get only values from Kafka sources into Spark? - Java

I'm getting logs from Kafka sources and putting them into Spark.
The format of the logs saved in my hadoop_path looks like this:
{"value":"{\"Name\":\"Amy\",\"Age\":\"22\"}"}
{"value":"{\"Name\":\"Jin\",\"Age\":\"26\"}"}
But I want them to look like this:
{\"Name\":\"Amy\",\"Age\":\"22\"}
{\"Name\":\"Jin\",\"Age\":\"26\"}
Any kind of solution would be great (using pure Java code, Spark SQL, or Kafka). My current code:
SparkSession spark = SparkSession.builder()
.master("local")
.appName("MYApp").getOrCreate();
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", Kafka_source)
.option("subscribe", Kafka_topic)
.option("startingOffsets", "earliest")
.option("failOnDataLoss",false)
.load();
Dataset<Row> dg = df.selectExpr("CAST(value AS STRING)");
StreamingQuery queryone = dg.writeStream()
.format("json")
.outputMode("append")
.option("checkpointLocation",Hadoop_path)
.option("path",Hadoop_path)
.start();

Use the following:
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", Kafka_source)
.option("subscribe", Kafka_topic)
.option("startingOffsets", "earliest")
.option("failOnDataLoss",false)
.load();
df.printSchema();
StreamingQuery queryone = df.selectExpr("CAST(value AS STRING)")
.writeStream()
.format("json")
.outputMode("append")
.option("checkpointLocation",Hadoop_path)
.option("path",Hadoop_path)
.start();
Make sure the schema contains value as a column.
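Note that the json sink wraps every row in an object keyed by column name, which is exactly what produces the {"value":"..."} lines in the question. If the goal is just the raw payload per line, a text sink over the single value column is closer to the desired output. A minimal sketch, reusing the placeholders from the question:
// Sketch only: the text sink writes the single string column verbatim, one record per line.
Dataset<Row> values = df.selectExpr("CAST(value AS STRING)");
StreamingQuery query = values.writeStream()
        .format("text")
        .outputMode("append")
        .option("checkpointLocation", Hadoop_path)
        .option("path", Hadoop_path)
        .start();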

You can get the expected results using Spark as below:
SparkSession spark = SparkSession.builder()
.master("local")
.appName("MYApp").getOrCreate();
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", Kafka_source)
.option("subscribe", Kafka_topic)
.option("startingOffsets", "earliest")
.option("failOnDataLoss",false)
.load();
Dataset<Row> dg = df.selectExpr("CAST(value AS STRING)")
.withColumn("Name", functions.json_tuple(functions.col("value"),"Name"))
.withColumn("Age", functions.json_tuple(functions.col("value"),"Age"));
StreamingQuery queryone = dg.writeStream()
.format("json")
.outputMode("append")
.option("checkpointLocation",Hadoop_path)
.option("path",Hadoop_path)
.start();
Basically, you have to create separate columns for each of the fields inside the json string in value column.
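An equivalent per-field extraction can also be sketched with get_json_object, which pulls a single field by JSON path and avoids relying on the generator semantics of json_tuple inside withColumn (field names taken from the question):
// Sketch only: extract each field of the JSON payload by path.
Dataset<Row> dg = df.selectExpr("CAST(value AS STRING)")
        .withColumn("Name", functions.get_json_object(functions.col("value"), "$.Name"))
        .withColumn("Age", functions.get_json_object(functions.col("value"), "$.Age"));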

I have done it with the from_json function!
SparkSession spark = SparkSession.builder()
.master("local")
.appName("MYApp").getOrCreate();
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", Kafka_source)
.option("subscribe", Kafka_topic)
.option("startingOffsets", "earliest")
.option("failOnDataLoss",false)
.load();
Dataset<Row> dg = df.selectExpr("CAST(value AS STRING)");
Dataset<Row> dz = dg.select(
        // assumes static imports: org.apache.spark.sql.functions.from_json,
        // org.apache.spark.sql.types.DataTypes.StringType and IntegerType
        from_json(dg.col("value"), DataTypes.createStructType(
                new StructField[] {
                        DataTypes.createStructField("Name", StringType, true)
                })).getField("Name").alias("Name"),
        from_json(dg.col("value"), DataTypes.createStructType(
                new StructField[] {
                        DataTypes.createStructField("Age", IntegerType, true)
                })).getField("Age").alias("Age")
);
StreamingQuery queryone = dz.writeStream()
        .format("json")
        .outputMode("append")
        .option("checkpointLocation", Hadoop_path)
        .option("path", Hadoop_path)
        .start();
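The value column does not have to be parsed once per field; a sketch that parses it once into a struct and then expands it (both fields kept as StringType to match the quoted values in the logs):
// Sketch only: single from_json call with a combined schema, then expand the struct.
StructType payloadSchema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("Name", DataTypes.StringType, true),
        DataTypes.createStructField("Age", DataTypes.StringType, true)
});
Dataset<Row> dz = dg
        .select(functions.from_json(dg.col("value"), payloadSchema).alias("data"))
        .select("data.*"); // expands to columns Name and Age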

Related

Exception: Complete output mode not supported

I created a Spark Streaming simulation for my tutorial. When I use outputMode("complete"), I get an error.
ERROR:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;
My dataset example:
2006-04-01 00:00:00.000 +0200,Partly Cloudy,rain,9.472222222222221,7.3888888888888875,0.89,14.1197,251.0,15.826300000000002,0.0,1015.13,Partly cloudy throughout the day.
First process code (partitioning by Summary):
System.setProperty("hadoop.home.dir","C:\\hadoop-common-2.2.0-bin-master");
SparkSession sparkSession = SparkSession.builder()
.appName("SparkStreamingMessageListener")
.master("local")
.getOrCreate();
StructType schema = new StructType()
.add("Formatted Date", "String")
.add("Summary","String")
.add("Precip Type", "String")
.add("Temperature", "Double")
.add("Apparent Temperature", "Double")
.add("Humidity","Double")
.add("Wind Speed (km/h)","Double")
.add("Wind Bearing (degrees)","Double")
.add("Visibility (km)","Double")
.add("Loud Cover","Double")
.add("Pressure(milibars)","Double")
.add("Dailiy Summary","String");
Dataset<Row> formatted_date = sparkSession.read().schema(schema).option("header", true).csv("C:\\Users\\Kaan\\Desktop\\Kaan Proje\\SparkStreamingListener\\archivecsv\\weatherHistory.csv");
Dataset<Row> avg = formatted_date.groupBy("Summary", "Precip Type").avg("Temperature").sort(functions.desc("avg(Temperature)"));
formatted_date.write().partitionBy("Summary").csv("C:\\Users\\Kaan\\Desktop\\Kaan Proje\\SparkStreamingListener\\archivecsv\\weatherHistoryFile\\");
Second listener process code:
SparkSession sparkSession = SparkSession.builder()
.appName("SparkStreamingMessageListener1")
.master("local")
.getOrCreate();
StructType schema1 = new StructType()
.add("Formatted Date", "String")
.add("Precip Type", "String")
.add("Temperature", "Double")
.add("Apparent Temperature", "Double")
.add("Humidity","Double")
.add("Wind Speed (km/h)","Double")
.add("Wind Bearing (degrees)","Double")
.add("Visibility (km)","Double")
.add("Loud Cover","Double")
.add("Pressure(milibars)","Double")
.add("Dailiy Summary","String");
Dataset<Row> rawData = sparkSession.readStream().schema(schema1).option("sep", ",").csv("C:\\Users\\Kaan\\Desktop\\Kaan Proje\\sparkStreamingWheather\\*");
Dataset<Row> heatData = rawData.select("Temperature", "Precip Type").where("Temperature>10");
StreamingQuery start = heatData.writeStream().outputMode("complete").format("console").start();
start.awaitTermination();
I created a streaming simulation by copying the partitioned files to the specified listener file path.
I would be glad if you could help. Thanks.
The error is pretty specific in telling what the actual problem is: the output mode complete is not supported for the type of your query.
As stated in the Structured Streaming Programming Guide on output modes:
"Complete mode not supported as it is infeasible to keep all unaggregated data in the Result Table."
This issue is solved by selecting append mode:
StreamingQuery start = heatData.writeStream().outputMode("append").format("console").start();
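For completeness, complete mode becomes valid as soon as the query contains a streaming aggregation; a hedged sketch, reusing heatData from the question:
// Sketch only: with an aggregation, outputMode("complete") is allowed.
Dataset<Row> avgHeat = heatData
        .groupBy("Precip Type")
        .avg("Temperature");
StreamingQuery aggQuery = avgHeat.writeStream()
        .outputMode("complete")
        .format("console")
        .start();
aggQuery.awaitTermination();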

Spark + Java - Get Results from a dataset

I have a small dataset with population data by country on HDFS. I have written the code to parse it and load it into a Dataset<Row>:
SparkConf conf = new SparkConf().setAppName("JavaWordCount").setMaster("local");
SparkContext context = new SparkContext(conf);
SparkSession sparkSession = new SparkSession(context);
Dataset<Row> df = sparkSession.read().format("com.databricks.spark.csv").option("header", true).option("inferSchema", true).load(args[1]);
System.out.println("========== Print Schema ============");
df.printSchema();
System.out.println("========== Print Data ==============");
df.show();
The console shows the data correctly -
+-----------------------+-------------------+-------------+---------------+----------+
|countriesAndTerritories| location| continent|population_year|population|
+-----------------------+-------------------+-------------+---------------+----------+
| Afghanistan| Afghanistan| Asia| 2020| 38928341|
| Albania| Albania| Europe| 2020| 2877800|
| Algeria| Algeria| Africa| 2020| 43851043|
| Andorra| Andorra| Europe| 2020| 77265|
However, I want to get the population of the United States into an int variable.
The query to select the population is:
Dataset<Row> xdc = df.select(col("population"))
        .where(col("location").equalTo("United States")).limit(1);
But how do I get its contents into an int variable?
You can try this:
int v = Integer.parseInt(
df.select(col("population"))
.where(col("location").equalTo("United States"))
.limit(1)
.first()
.get(0)
.toString()
);
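Alternatively, since inferSchema is enabled the population column has most likely been read as a numeric type, so the value can be taken from the Row without string parsing; a sketch under that assumption:
// Sketch only: assumes inferSchema typed "population" as an integer or long column.
Row first = df.select(col("population"))
        .where(col("location").equalTo("United States"))
        .first();
long population = ((Number) first.getAs("population")).longValue();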

Spark streaming data from kafka topic and write into the text files in external path

I want to read data from a Kafka topic, group it by key values, and write it into text files.
public static void main(String[] args) throws Exception {
SparkSession spark=SparkSession
.builder()
.appName("Sparkconsumer")
.master("local[*]")
.getOrCreate();
SQLContext sqlContext = spark.sqlContext();
SparkContext context = spark.sparkContext();
Dataset<Row>lines=spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe","test-topic")
.load();
Dataset<Row> r= lines.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");
r.printSchema();
r.createOrReplaceTempView("basicView");
sqlContext.sql("select * from basicView")
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream()
.outputMode("append")
.format("console")
.option("path","usr//path")
.start()
.awaitTermination();
The following points are misleading in your code:
To read from Kafka and write into a file, you do not need a SparkContext or SQLContext.
You are casting your key and value to a string twice.
The format of your output query should not be console if you want to store the data in a file.
An example can be found in the Spark Structured Streaming + Kafka Integration Guide and the Spark Structured Streaming Programming Guide:
public static void main(String[] args) throws Exception {
SparkSession spark = SparkSession
.builder()
.appName("Sparkconsumer")
.master("local[*]")
.getOrCreate();
Dataset<Row> lines = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe","test-topic")
.load();
Dataset<Row> r = lines
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
// do some more processing such as 'groupBy'
;
r.writeStream()
.format("parquet") // can be "orc", "json", "csv", etc.
.outputMode("append")
.option("path", "path/to/destination/dir")
.option("checkpointLocation", "/path/to/checkpoint/dir")
.start()
.awaitTermination();
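If the "group by key values" step from the question is added, keep in mind that the file sink only supports append output mode, so the streaming aggregation needs an event-time watermark. A sketch of what that could look like, using the Kafka source's built-in timestamp column (assumes the usual org.apache.spark.sql.functions import; window sizes are placeholders):
// Sketch only: count records per key in 10-minute event-time windows, then write files.
Dataset<Row> counts = lines
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
        .withWatermark("timestamp", "10 minutes")
        .groupBy(functions.col("key"),
                 functions.window(functions.col("timestamp"), "10 minutes"))
        .count();
counts.writeStream()
        .format("parquet")
        .outputMode("append")
        .option("path", "path/to/destination/dir")
        .option("checkpointLocation", "/path/to/checkpoint/dir")
        .start()
        .awaitTermination();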

Spark Streaming: Aggregating datastream multiple times

I have a Spark streaming job which reads data from multiple Kafka topics.
Now I want to aggregate the data over multiple window intervals and save it to the database.
Is it possible to do it this way, or will I need a separate Spark job to do another level of aggregation?
Would data be lost if the dataset is processed a second time?
Aggregation 1:
private void buildStream1(Dataset<Row> sourceDataset) {
Dataset<Row> query = sourceDataset
.withWatermark("timestamp", "120 seconds")
.select(
col("timestamp"),
col("datacenter"),
col("platform")
)
.groupBy(
functions.window(col("timestamp"), "120 seconds", "60 seconds").as("timestamp"),
col("datacenter"),
col("platform")
)
.agg(
count(lit(1)).as("count")
);
startKafkaStream(query);
}
Aggregation 2:
private void buildStream2(Dataset<Row> sourceDataset) {
Dataset<Row> query = sourceDataset
.withWatermark("timestamp", "10 minutes")
.select(
col("timestamp"),
col("datacenter"),
col("platform")
)
.groupBy(
functions.window(col("timestamp"), "10 minutes", "5 minutes").as("timestamp"),
col("datacenter"),
col("platform")
)
.agg(
count(lit(1)).as("count")
);
startKafkaStream(query);
}
Write both streams:
private void startKafkaStream(Dataset<Row> aggregatedDataset) {
aggregatedDataset
.select(to_json(struct("*")).as("value"))
.writeStream()
.outputMode(OutputMode.Append())
.option("truncate", false)
.format("console")
.trigger(Trigger.ProcessingTime("10 minutes"))
.start();
}

Call SparkSession in worker (Spark-SQL, Java)

I'm working with GraphX and Spark SQL and I'm trying to create a DataFrame (Dataset) in a graph node. To create a DataFrame I need the SparkSession (spark.createDataFrame(rows, schema)). Whatever I try, I get an error. This is my code:
SparkSession spark = SparkSession.builder()
.master("spark://home:7077")
.appName("testgraph")
.getOrCreate();
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
//read tree File
JavaRDD<String> tree_file = sc.textFile(args[1]);
JavaPairRDD<String[],Long> node_pair = tree_file.map(l-> l.split(" ")).zipWithIndex();
//Create vertex
RDD<Tuple2<Object, Tuple2<Dataset<Row>,Clauses>>> verteces = node_pair.map(t-> {
List<StructField> fields = new ArrayList<StructField>();
List<Row> rows = new ArrayList<>();
String[] vars = Arrays.copyOfRange(t._1(), 2,t._1().length);
for (int i = 0; i < vars.length; i++) {
fields.add(DataTypes.createStructField(vars[i], DataTypes.BooleanType, true));
}
StructType schema = DataTypes.createStructType(fields);
Dataset<Row> ds = spark.createDataFrame(rows,schema);
return new Tuple2<>((Object)(t._2+1),ds);
}).rdd();
This is the Error I'm getting:
16/08/23 15:25:36 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 3, 192.168.1.5): java.lang.NullPointerException
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:112)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:110)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:328)
at Main.lambda$main$e7daa47c$1(Main.java:62)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1028)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I also tried to get the session inside map() with:
SparkSession ss = SparkSession.builder()
.master("spark://home:7077")
.appName("testgraph")
.getOrCreate();
I also get an error:
16/08/23 15:00:29 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 7, 192.168.1.5): java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I hope someone can help me. I can't find a solution.
Thanks!
