I created a Spark Streaming simulation for my tutorial. When I run it with outputMode("complete"), I get an error.
ERROR:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;
My dataset example:
2006-04-01 00:00:00.000 +0200,Partly Cloudy,rain,9.472222222222221,7.3888888888888875,0.89,14.1197,251.0,15.826300000000002,0.0,1015.13,Partly cloudy throughout the day.
First process code (partitioning by Summary):
System.setProperty("hadoop.home.dir","C:\\hadoop-common-2.2.0-bin-master");
SparkSession sparkSession = SparkSession.builder()
.appName("SparkStreamingMessageListener")
.master("local")
.getOrCreate();
StructType schema = new StructType()
.add("Formatted Date", "String")
.add("Summary","String")
.add("Precip Type", "String")
.add("Temperature", "Double")
.add("Apparent Temperature", "Double")
.add("Humidity","Double")
.add("Wind Speed (km/h)","Double")
.add("Wind Bearing (degrees)","Double")
.add("Visibility (km)","Double")
.add("Loud Cover","Double")
.add("Pressure(milibars)","Double")
.add("Dailiy Summary","String");
Dataset<Row> formatted_date = sparkSession.read().schema(schema).option("header", true).csv("C:\\Users\\Kaan\\Desktop\\Kaan Proje\\SparkStreamingListener\\archivecsv\\weatherHistory.csv");
Dataset<Row> avg = formatted_date.groupBy("Summary", "Precip Type").avg("Temperature").sort(functions.desc("avg(Temperature)"));
formatted_date.write().partitionBy("Summary").csv("C:\\Users\\Kaan\\Desktop\\Kaan Proje\\SparkStreamingListener\\archivecsv\\weatherHistoryFile\\");
Second listener process code:
SparkSession sparkSession = SparkSession.builder()
.appName("SparkStreamingMessageListener1")
.master("local")
.getOrCreate();
StructType schema1 = new StructType()
.add("Formatted Date", "String")
.add("Precip Type", "String")
.add("Temperature", "Double")
.add("Apparent Temperature", "Double")
.add("Humidity","Double")
.add("Wind Speed (km/h)","Double")
.add("Wind Bearing (degrees)","Double")
.add("Visibility (km)","Double")
.add("Loud Cover","Double")
.add("Pressure(milibars)","Double")
.add("Dailiy Summary","String");
Dataset<Row> rawData = sparkSession.readStream().schema(schema1).option("sep", ",").csv("C:\\Users\\Kaan\\Desktop\\Kaan Proje\\sparkStreamingWheather\\*");
Dataset<Row> heatData = rawData.select("Temperature", "Precip Type").where("Temperature>10");
StreamingQuery start = heatData.writeStream().outputMode("complete").format("console").start();
start.awaitTermination();
I created a streaming simulation by copying the partitioned files into the listener's input path.
I would be glad if you could help. Thanks.
The error is pretty specific in telling you what the actual problem is: the complete output mode is not supported for this type of query.
As stated in the Structured Streaming Programming Guide on output modes:
"Complete mode not supported as it is infeasible to keep all unaggregated data in the Result Table."
The issue is solved by selecting the append output mode instead:
StreamingQuery start = heatData.writeStream().outputMode("append").format("console").start();
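If you do need the complete output mode (for example to keep a running aggregate), the error message tells you what is missing: a streaming aggregation. A minimal sketch, reusing the rawData stream defined above (the variable names are just for illustration):
// A streaming aggregation makes "complete" output mode valid.
Dataset<Row> avgTempByPrecip = rawData
    .groupBy("Precip Type")      // streaming aggregation
    .avg("Temperature");

StreamingQuery aggQuery = avgTempByPrecip.writeStream()
    .outputMode("complete")      // supported now, because the query aggregates
    .format("console")
    .start();
aggQuery.awaitTermination();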
Related
I have a small dataset containing population data by country on HDFS. I have written code to parse it and load it into a Dataset<Row>:
SparkConf conf = new SparkConf().setAppName("JavaWordCount").setMaster("local");
SparkContext context = new SparkContext(conf);
SparkSession sparkSession = new SparkSession(context);
Dataset<Row> df = sparkSession.read().format("com.databricks.spark.csv").option("header", true).option("inferSchema", true).load(args[1]);
System.out.println("========== Print Schema ============");
df.printSchema();
System.out.println("========== Print Data ==============");
df.show();
The console shows the data correctly:
+-----------------------+-------------------+-------------+---------------+----------+
|countriesAndTerritories| location| continent|population_year|population|
+-----------------------+-------------------+-------------+---------------+----------+
| Afghanistan| Afghanistan| Asia| 2020| 38928341|
| Albania| Albania| Europe| 2020| 2877800|
| Algeria| Algeria| Africa| 2020| 43851043|
| Andorra| Andorra| Europe| 2020| 77265|
However, I want to get the population of the United States into an int variable.
The query to select the population is:
Dataset<Row> xdc = df.select(col("population"))
.where(col("location").equalTo("United States")).limit(1);
But how do I get its contents into an int variable?
You can try this:
int v = Integer.parseInt(
df.select(col("population"))
.where(col("location").equalTo("United States"))
.limit(1)
.first()
.get(0)
.toString()
);
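Alternatively, since the CSV was read with inferSchema set to true, the population column is usually already numeric, so you can pull the value straight out of the Row without the String round-trip. A small sketch, assuming the df from above:
// Take the single-column result row and convert its numeric value to an int.
Row row = df.select(col("population"))
    .where(col("location").equalTo("United States"))
    .first();
int v = ((Number) row.get(0)).intValue(); // works whether the column was inferred as int or long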
I'm getting logs from Kafka sources and putting them into Spark.
The format of the logs saved in my hadoop_path looks like this:
{"value":"{\"Name\":\"Amy\",\"Age\":\"22\"}"}
{"value":"{\"Name\":\"Jin\",\"Age\":\"26\"}"}
But I want them to look like this:
{\"Name\":\"Amy\",\"Age\":\"22\"}
{\"Name\":\"Jin\",\"Age\":\"26\"}
Any kind of solution would be great (using pure Java code, Spark SQL, or Kafka). This is my current code:
SparkSession spark = SparkSession.builder()
.master("local")
.appName("MYApp").getOrCreate();
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", Kafka_source)
.option("subscribe", Kafka_topic)
.option("startingOffsets", "earliest")
.option("failOnDataLoss",false)
.load();
Dataset<Row> dg = df.selectExpr("CAST(value AS STRING)");
StreamingQuery queryone = dg.writeStream()
.format("json")
.outputMode("append")
.option("checkpointLocation",Hadoop_path)
.option("path",Hadoop_path)
.start();
Use the following:
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", Kafka_source)
.option("subscribe", Kafka_topic)
.option("startingOffsets", "earliest")
.option("failOnDataLoss",false)
.load();
df.printSchema();
StreamingQuery queryone = df.selectExpr("CAST(value AS STRING)")
.writeStream()
.format("json")
.outputMode("append")
.option("checkpointLocation",Hadoop_path)
.option("path",Hadoop_path)
.start();
Make sure the schema contains value as a column.
You can get the expected results using Spark as below:
SparkSession spark = SparkSession.builder()
.master("local")
.appName("MYApp").getOrCreate();
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", Kafka_source)
.option("subscribe", Kafka_topic)
.option("startingOffsets", "earliest")
.option("failOnDataLoss",false)
.load();
Dataset<Row> dg = df.selectExpr("CAST(value AS STRING)")
.withColumn("Name", functions.json_tuple(functions.col("value"),"Name"))
.withColumn("Age", functions.json_tuple(functions.col("value"),"Age"));
StreamingQuery queryone = dg.writeStream()
.format("json")
.outputMode("append")
.option("checkpointLocation",Hadoop_path)
.option("path",Hadoop_path)
.start();
Basically, you have to create a separate column for each of the fields inside the JSON string in the value column.
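If you only want the extracted fields (and not the original wrapper) in the output files, you can additionally drop the raw value column before writing. A small sketch based on the dg dataset above (queryTwo is just an illustrative name):
// Keep only the extracted Name/Age columns; the raw value string is no longer needed in the output.
Dataset<Row> out = dg.drop("value");

StreamingQuery queryTwo = out.writeStream()
    .format("json")
    .outputMode("append")
    .option("checkpointLocation", Hadoop_path)
    .option("path", Hadoop_path)
    .start();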
I have done it with the from_json function!
SparkSession spark = SparkSession.builder()
.master("local")
.appName("MYApp").getOrCreate();
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", Kafka_source)
.option("subscribe", Kafka_topic)
.option("startingOffsets", "earliest")
.option("failOnDataLoss",false)
.load();
Dataset<Row> dg = df.selectExpr("CAST(value AS STRING)");
Dataset<Row> dz = dg.select(
    from_json(dg.col("value"), DataTypes.createStructType(
        new StructField[] {
            DataTypes.createStructField("Name", DataTypes.StringType, true)
        })).getField("Name").alias("Name"),
    from_json(dg.col("value"), DataTypes.createStructType(
        new StructField[] {
            DataTypes.createStructField("Age", DataTypes.IntegerType, true)
        })).getField("Age").alias("Age"));
StreamingQuery queryone = dz.writeStream()
    .format("json")
    .outputMode("append")
    .option("checkpointLocation", Hadoop_path)
    .option("path", Hadoop_path)
    .start();
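A slightly more compact variant, sketched here under the same assumptions (a value column holding a JSON string with Name and Age; it needs org.apache.spark.sql.types.DataTypes/StructType and the static functions.from_json import), parses the string once into a struct and then flattens it:
// Parse the JSON string a single time into a struct column, then pull both fields out of it.
StructType payloadSchema = new StructType()
    .add("Name", DataTypes.StringType)
    .add("Age", DataTypes.IntegerType);

Dataset<Row> dz = dg
    .select(from_json(dg.col("value"), payloadSchema).alias("data"))
    .select("data.Name", "data.Age");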
I want to read data from a Kafka topic, group it by key, and write the result into text files.
public static void main(String[] args) throws Exception {
SparkSession spark = SparkSession
.builder()
.appName("Sparkconsumer")
.master("local[*]")
.getOrCreate();
SQLContext sqlContext = spark.sqlContext();
SparkContext context = spark.sparkContext();
Dataset<Row> lines = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe","test-topic")
.load();
Dataset<Row> r = lines.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");
r.printSchema();
r.createOrReplaceTempView("basicView");
sqlContext.sql("select * from basicView")
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream()
.outputMode("append")
.format("console")
.option("path","usr//path")
.start()
.awaitTermination();
The following points in your code are misleading:
To read from Kafka and write into a file, you do not need a SparkContext or an SQLContext.
You are casting your key and value into strings twice.
The format of your output query should not be console if you want to store the data in a file.
An example can be found in the Spark Structured Streaming + Kafka Integration Guide and the Spark Structured Streaming Programming Guide.
public static void main(String[] args) throws Exception {
SparkSession spark = SparkSession
.builder()
.appName("Sparkconsumer")
.master("local[*]")
.getOrCreate();
Dataset<Row> lines = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe","test-topic")
.load();
Dataset<Row> r = lines
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
// do some more processing such as 'groupBy'
;
r.writeStream()
.format("parquet") // can be "orc", "json", "csv", etc.
.outputMode("append")
.option("path", "path/to/destination/dir")
.option("checkpointLocation", "/path/to/checkpoint/dir")
.start()
.awaitTermination();
}
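If the extra processing you want is a plain count per key, keep in mind that the file sink only supports append mode, and a streaming aggregation in append mode needs a watermark; for a quick look at the running counts you can send the aggregation to the console instead. A rough sketch, reusing the r dataset from above:
// Count the records per Kafka key and keep printing the running totals.
Dataset<Row> countsByKey = r.groupBy("key").count();

countsByKey.writeStream()
    .outputMode("complete") // running aggregates, so "append" is not allowed here without a watermark
    .format("console")
    .start()
    .awaitTermination();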
I am using Spark 2.3.1 with Java.
I have encountered what I think is a known bug in Spark.
Here is my code:
public Dataset<Row> compute(Dataset<Row> df1, Dataset<Row> df2, List<String> columns){
Seq<String> columns_seq = JavaConverters.asScalaIteratorConverter(columns.iterator()).asScala().toSeq();
final Dataset<Row> join = df1.join(df2, columns_seq);
join.show();
join.withColumn("newColumn", abs(col("value1").minus(col("value2")))).show();
return join;
}
I call my code like this:
Dataset<Row> myNewDF = compute(MyDataset1, MyDataset2, Arrays.asList("field1","field2","field3","field4"));
Note : MyDataset1 and MyDataset2 are two datasets that come from the same Dataset MyDataset0 with multiple different transformations.
On the join.show() line, I get the following error:
2018-08-03 18:48:43 - ERROR main Logging$class - - - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 235, Column 21: Expression "project_isNull_2" is not an rvalue
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 235, Column 21: Expression "project_isNull_2" is not an rvalue
at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
at org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
at org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
...
2018-08-03 18:48:47 - WARN main Logging$class - - - Whole-stage codegen disabled for plan (id=7):
But it does not stop the execution and still displays the content of the dataset.
Then, on the line join.withColumn("newColumn", abs(col("value1").minus(col("value2")))).show();
I get the error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) 'value2,'value1 missing from field6#16,field7#3,field8#108,field5#0,field9#4,field10#28,field11#323,value1#298,field12#131,day#52,field3#119,value2#22,field2#35,field1#43,field4#144 in operator 'Project [field1#43, field2#35, field3#119, field4#144, field5#0, field6#16, value2#22, field7#3, field9#4, field10#28, day#52, field8#108, field12#131, value1#298, field11#323, abs(('value1 - 'value2)) AS newColumn#2579]. Attribute(s) with the same name appear in the operation: value2,value1. Please check if the right attribute(s) are used.;;
'Project [field1#43, field2#35, field3#119, field4#144, field5#0, field6#16, value2#22, field7#3, field9#4, field10#28, day#52, field8#108, field12#131, value1#298, field11#323, abs(('value1 - 'value2)) AS newColumn#2579]
+- AnalysisBarrier
...
This error ends the program.
The workaround proposed by Mijung Kim on the JIRA issue is to create a Dataset clone with toDF(columns). But in my case, where the column names used for the join are not known in advance (I only have a List), I can't use this workaround.
Is there another way to get around this very annoying bug?
Try to call this method:
private static Dataset<Row> cloneDataset(Dataset<Row> ds) {
List<Column> filterColumns = new ArrayList<>();
List<String> filterColumnsNames = new ArrayList<>();
scala.collection.Iterator<StructField> it = ds.exprEnc().schema().toIterator();
while (it.hasNext()) {
String columnName = it.next().name();
filterColumns.add(ds.col(columnName));
filterColumnsNames.add(columnName);
}
// Re-select every column and rebuild the Dataset with toDF, so the join sees fresh column references.
ds = ds.select(JavaConversions.asScalaBuffer(filterColumns).seq()).toDF(scala.collection.JavaConverters.asScalaIteratorConverter(filterColumnsNames.iterator()).asScala().toSeq());
return ds;
}
on both datasets just before the join, like this:
df1 = cloneDataset(df1);
df2 = cloneDataset(df2);
final Dataset<Row> join = df1.join(df2, columns_seq);
// or (based on Nakeuh's comment)
final Dataset<Row> join = cloneDataset(df1.join(df2, columns_seq));
Dataset<Row> dataSet = sqlContext.sql("some query");
dataSet.registerTempTable("temp_table");
dataSet.cache(); // cache 1
sqlContext.cacheTable("temp_table"); // cache 2
So my question is: will Spark cache the dataSet only once, or will there be two copies of the same data, one as a Dataset (cache 1) and the other as a table (cache 2)?
It will not, or at least it won't in any recent version:
scala> val df = spark.range(1)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> df.cache
res0: df.type = [id: bigint]
scala> df.createOrReplaceTempView("df")
scala> spark.catalog.cacheTable("df")
2018-01-23 12:33:48 WARN CacheManager:66 - Asked to cache already cached data.
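The same check written in Java (just a sketch, assuming an existing SparkSession named spark) triggers the identical CacheManager warning:
Dataset<Long> df = spark.range(1);
df.cache();                       // cache 1
df.createOrReplaceTempView("df");
spark.catalog().cacheTable("df"); // cache 2: logs "Asked to cache already cached data."
So the second call just detects that the underlying plan is already cached and does not create a second copy.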