Spark Streaming: Aggregating datastream multiple times

Spark Streaming: Aggregating datastream multiple times - java

I have a spark streaming job which reads the data from multiple kafka topics.
Now I want to aggregate data on multiple window intervals and save it to the database.
Is it possible to do this way? Or I will need a separate spark job to do another level of aggregation.
Would data be lost if dataset processed second time?
Aggregation 1:
private void buildStream1(Dataset<Row> sourceDataset) {
Dataset<Row> query = sourceDataset
.withWatermark("timestamp", "120 seconds")
.select(
col("timestamp"),
col("datacenter"),
col("platform")
)
.groupBy(
functions.window(col("timestamp"), 120 seconds, 60 seconds).as("timestamp"),
col("datacenter"),
col("platform")
)
.agg(
count(lit(1)).as("count")
);
startKafkaStream(query);
}
Aggregation 2:
private void buildStream1(Dataset<Row> sourceDataset) {
Dataset<Row> query = sourceDataset
.withWatermark("timestamp", "10 minutes")
.select(
col("timestamp"),
col("datacenter"),
col("platform")
)
.groupBy(
functions.window(col("timestamp"), 10 minutes, 5 minutes).as("timestamp"),
col("datacenter"),
col("platform")
)
.agg(
count(lit(1)).as("count")
);
startKafkaStream(query);
}
Write both streams:
private void startKafkaStream(Dataset<Row> aggregatedDataset) {
aggregatedDataset
.select(to_json(struct("*")).as("value"))
.writeStream()
.outputMode(OutputMode.Append())
.option("truncate", false)
.format("console")
.trigger(Trigger.ProcessingTime(10 minutes))
.start();
}

Related

How to track committed offset with Spark job for kafka batch

I have a use case where i am writing to a Kafka topic in batches using spark job (no streaming).Initially i pump-in suppose 10 records to Kafka topic and run the spark job which does some processing and finally write to another Kafka topic.
Next time when i push another 5 records and run the spark job, my requirement is to start processing these 5 records only not from starting offset. I need to maintain the committed offset so that spark job should run on next offset position and do the processing.
Here is code from kafka side to fetch the offset:
private static List<TopicPartition> getPartitions(KafkaConsumer consumer, String topic) {
List<PartitionInfo> partitionInfoList = consumer.partitionsFor(topic);
return partitionInfoList.stream().map(x -> new TopicPartition(topic, x.partition())).collect(Collectors.toList());
}
public static void getOffSet(KafkaConsumer consumer) {
List<TopicPartition> topicPartitions = getPartitions(consumer, topic);
consumer.assign(topicPartitions);
consumer.seekToBeginning(topicPartitions);
topicPartitions.forEach(x -> {
System.out.println("Partition-> " + x + " startingOffSet-> " + consumer.position(x));
});
consumer.assign(topicPartitions);
consumer.seekToEnd(topicPartitions);
topicPartitions.forEach(x -> {
System.out.println("Partition-> " + x + " endingOffSet-> " + consumer.position(x));
});
topicPartitions.forEach(x -> {
consumer.poll(1000) ;
OffsetAndMetadata offsetAndMetadata = consumer.committed(x);
long position = consumer.position(x);
System.out.printf("Committed: %s, current position %s%n", offsetAndMetadata == null ? null : offsetAndMetadata
.offset(), position);
});
}
Below code is for spark to load the messages from topic which is not working :
Dataset<Row> kafkaDataset = session.read().format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", topic)
.option("group.id", "test-consumer-group")
.option("startingOffsets","{\"Topic1\":{\"0\":2}}")
.option("endingOffsets", "{\"Topic1\":{\"0\":3}}")
.option("enable.auto.commit","true")
.load();
After above code executes i am again trying to get the offset by calling
getoffset(consumer)
from the topic which always reads from 0 offset and committed offset fetched initially keeps on increasing. I am new to kafka and still figuring out how to handle such scenarion.Please help here.
Initially i had 10 records in my topic, i published another 2 records and here is the o/p:
Output post getoffset method executes :
Partition-> Topic00-0 startingOffSet-> 0 Partition->
Topic00-0 endingOffSet-> 12 Committed: 12, current position
12
Output post spark code executes for loading messages.
Partition-> Topic00-0 startingOffSet-> 0 Partition->
Topic00-0 endingOffSet-> 12 Committed: 12, current position
12
I see no diff and . Please take a look and suggest resolution for this sceanario.

How to make spark streaming commit in each batch when limiting Kafka batch size?

To limit the batch size when using Spark streaming, I referenced this answer
There is about 50 millions records stocking (about to be consumed) in Kafka.
The topic is with 3 partitions.
zhihu_comment 0 10906153 28668062 17761909 - - -
zhihu_comment 1 10972464 30271728 19299264 - - -
zhihu_comment 2 10906395 28662007 17755612 - - -
My consumer app:
public final class SparkConsumer {
private static final Pattern SPACE = Pattern.compile(" ");
public static void main(String[] args) throws Exception {
String brokers = "device1:9092,device2:9092,device3:9092";
String groupId = "spark";
String topics = "zhihu_comment";
// Create context with a certain seconds batch interval
SparkConf sparkConf = new SparkConf().setAppName("TestKafkaStreaming");
sparkConf.set("spark.streaming.backpressure.enabled", "true");
sparkConf.set("spark.streaming.backpressure.initialRate", "10000");
sparkConf.set("spark.streaming.kafka.maxRatePerPartition", "10000");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(10));
Set<String> topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
kafkaParams.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
kafkaParams.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
kafkaParams.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
kafkaParams.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
kafkaParams.put("enable.auto.commit", true);
kafkaParams.put("max.poll.records", "500");
// Create direct kafka stream with brokers and topics
JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(
jssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.Subscribe(topicsSet, kafkaParams));
// Get the lines, split them into words, count the words and print
JavaDStream<String> lines = messages.map(ConsumerRecord::value);
lines.count().print();
jssc.start();
jssc.awaitTermination();
}
}
I have limited the consuming size of spark streaming, in my case, I set maxRatePerPartition to 10000, which means it consumed 300000 records per batch in my case.
The question is although spark streaming is able to handle records with specific limit, the current offset showing by kafka is not the offset that spark streaming is handling. As the kafka's current offset suddenly goes down to latest offset:
zhihu_comment 0 28700537 28700676 139 consumer-1-ddcb0abd-e206-470d-925a-63ca4dc1d62a /192.168.0.102 consumer-1
zhihu_comment 1 30305102 30305224 122 consumer-1-ddcb0abd-e206-470d-925a-63ca4dc1d62a /192.168.0.102 consumer-1
zhihu_comment 2 28695033 28695146 113 consumer-1-ddcb0abd-e206-470d-925a-63ca4dc1d62a /192.168.0.102 consumer-1
It appears that Spark streaming does not commit the offset in each batch, it commits the latest offset at the beginning when it starts to consume!
Is there any way to make spark streaming commit with each batch?
Spark streaming log, proving the records num it consumed each batch:
20/05/04 22:28:13 INFO scheduler.DAGScheduler: Job 15 finished: print at SparkConsumer.java:65, took 0.012606 s
-------------------------------------------
Time: 1588602490000 ms
-------------------------------------------
300000
20/05/04 22:28:13 INFO scheduler.JobScheduler: Finished job streaming job 1588602490000 ms.0 from job set of time 1588602490000 ms

You need to disable
kafkaParams.put("enable.auto.commit", false);
and rather use
messages.foreachRDD(rdd -> {
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
// do here some transformations and action on the rdd, typically like:
rdd.foreachPartition(it -> {
it.foreach(row -> ...)
})
// commit messages
((CanCommitOffsets) messages.inputDStream()).commitAsync(offsetRanges);
});
as described in the Spark + Kafka Integration Guide.
You could also use commitSync for synchronous commits.

MongoDB Java driver much slower than console with $gte/$lte

I'm using MongoDB 4.0.1 with Java driver (MongoDB-driver-sync) 3.8.0
My collection has 564'039 elements with 13 key-values, 2 of which are maps with 10 more key-values.
If I execute the following query in the console, it gives me the results in less than a second:
db.getCollection('tracking_points').find({c: 8, d: 11,
t: {$gte: new Date("2018-08-10"), $lte: new Date("2018-09-10")}
})
But if I execute this in Java it takes more than 30 seconds:
collection.find(
and(
eq("c", clientId),
eq("d", unitId),
gte("t", start),
lte("t", end)
)
).forEach((Block<Document>) document -> {
// nothing here
});
There is an index on "t" (timestamp) and without it, the console find takes few seconds.
How can this be fixed?
Edit: Here is the log from the DB after the java query:
"2018-09-21T08:06:38.842+0300 I COMMAND [conn9236] command fleetman_dev.tracking_points command: count { count: \"tracking_points\", query: {}, $db: \"fleetman_dev\", $readPreference: { mode: \"primaryPreferred\" } } planSummary: COUNT keysExamined:0 docsExamined:0 numYields:0 reslen:45 locks:{ Global: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_msg 0ms",
"2018-09-21T08:06:38.862+0300 I COMMAND [conn9236] command fleetman_dev.tracking_points command: find { find: \"tracking_points\", filter: { c: 8, d: 11, t: { $gte: new Date(1536526800000), $lte: new Date(1536613200000) } }, $db: \"fleetman_dev\", $readPreference: { mode: \"primaryPreferred\" } } planSummary: IXSCAN { t: 1 } cursorid:38396803834 keysExamined:101 docsExamined:101 numYields:0 nreturned:101 reslen:24954 locks:{ Global: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { ",
"2018-09-21T08:06:39.049+0300 I COMMAND [conn9236] command fleetman_dev.tracking_points command: getMore { getMore: 38396803834, collection: \"tracking_points\", $db: \"fleetman_dev\", $readPreference: { mode: \"primaryPreferred\" } } originatingCommand: { find: \"tracking_points\", filter: { c: 8, d: 11, t: { $gte: new Date(1536526800000), $lte: new Date(1536613200000) } }, $db: \"fleetman_dev\", $readPreference: { mode: \"primaryPreferred\" } } planSummary: IXSCAN { t: 1 } cursorid:38396803834 keysExamined:33810 doc",

You are using the Java driver correctly but your conclusion - that the Java driver is much slower than the console - is based on an invalid comparison. The two code blocks is your question are not equivalent. In the shell variant you retrieve a cursor. In the Java variant you retrieve a cursor and you walk over the contents of that cursor.
A valid comparison between the Mongo shell and the Java driver would either have to include walking over the cursor in the shell variant, for example:
db.getCollection('tracking_points').find({c: 8, d: 11,
t: {$gte: new Date("2018-08-10"), $lte: new Date("2018-09-10")}
}).forEach(
function(myDoc) {
// nothing here
}
)
Or it would have to remove walking over the cursor from the Java variant, for example:
collection.find(
and(
eq("c", clientId),
eq("d", unitId),
gte("t", start),
lte("t", end)
)
);
Both of these would be more valid forms of comparison. If you run either of those you'll see that the elapsed times are much closer to each other. The follow on question might be 'why does it take 30s to read this data?'. If so, the fact that you can get the cursor back sub second tells us that the issue is not about indexing, instead it is likely to be related to the amount of data being read by the query.
To isolate where the issue is occurring you could gather elasped times for the following:
read the data, iterating over each document but do not parse each document
read the data and parse each document while reading
If the elapsed time for no. 2 is not much more than the elapsed time for no. 1 then you know that the issue is not in parsing and is more likely to be in network transfer. If the elapsed time for no. 2 is much greater than no. 1 then you know that the issue is in parsing and you can dig into the parse call to attribute the elapsed time. It could be constrained resources on the client (CPU and/or memory) or a sub optimal parse implementation. I can't tell at this remove but using the above approach to isolate where the problem resides will at least help you to direct your investigation.

How can I print the content of rows in a Dataset using Java and the Spark SQL?

I would like to do a simple Spark SQL code that reads a file called u.data, that contains the movie ratings, creates a Dataset of Rows, and then print the first rows of the Dataset.
I've had as premise read the file to a JavaRDD, and map the RDD according to a ratingsObject(the object has two parameters, movieID and rating). So I just want to print the first Rows in this Dataset.
I'm using Java language and Spark SQL.
public static void main(String[] args){
App obj = new App();
SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example").getOrCreate();
Map<Integer,String> movieNames = obj.loadMovieNames();
JavaRDD<String> lines = spark.read().textFile("hdfs:///ml-100k/u.data").javaRDD();
JavaRDD<MovieRatings> movies = lines.map(line -> {
String[] parts = line.split(" ");
MovieRatings ratingsObject = new MovieRatings();
ratingsObject.setMovieID(Integer.parseInt(parts[1].trim()));
ratingsObject.setRating(Integer.parseInt(parts[2].trim()));
return ratingsObject;
});
Dataset<Row> movieDataset = spark.createDataFrame(movies, MovieRatings.class);
Encoder<Integer> intEncoder = Encoders.INT();
Dataset<Integer> HUE = movieDataset.map(
new MapFunction<Row, Integer>(){
private static final long serialVersionUID = -5982149277350252630L;
#Override
public Integer call(Row row) throws Exception{
return row.getInt(0);
}
}, intEncoder
);
HUE.show();
//stop the session
spark.stop();
}
I've tried a lot of possible solutions that I found, but all of them got the same error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, localhost, executor 1): java.lang.ArrayIndexOutOfBoundsException: 1
at com.ericsson.SparkMovieRatings.App.lambda$main$1e634467$1(App.java:63)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
And here is the sample of the u.data file:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
62 257 2 879372434
286 1014 5 879781125
200 222 5 876042340
210 40 3 891035994
224 29 3 888104457
303 785 3 879485318
122 387 5 879270459
194 274 2 879539794
Where the first column represents de UserID, the second MovieID, the third the rating,and the last one is the timestamp.

As mentioned before your data are not space separated.
I'll show you two possible solutions, the first one based on RDD and the second one based on spark sql which is, in general, the best solution in term of performance.
RDD (you should use built in types to reduce the overhead):
public class SparkDriver {
public static void main (String args[]) {
// Create a configuration object and set the name of
// the application
SparkConf conf = new SparkConf().setAppName("application_name");
// Create a spark Context object
JavaSparkContext context = new JavaSparkContext(conf);
// Create final rdd (suppose you have a text file)
JavaPairRDD<Integer,Integer> movieRatingRDD =
contextFile("u.data.txt")
.mapToPair(line -> {(
String[] tokens = line.split("\\s+");
int movieID = Integer.parseInt(tokens[0]);
int rating = Integer.parseInt(tokens[1]);
return new Tuple2<Integer, Integer>(movieID, rating);});
// Keep in mind that take operation takes the first n elements
// and the order is the order of the file.
ArrayList<Tuple2<Integer, Integer> list = new ArrayList<>(movieRatingRDD.take(10));
System.out.println("MovieID\tRating");
for(tuple : list) {
System.out.println(tuple._1 + "\t" + tuple._2);
}
context.close();
}}
SQL
public class SparkDriver {
public static void main(String[] args) {
// Create spark session
SparkSession session = SparkSession.builder().appName("[Spark app sql version]").getOrCreate();
Dataset<MovieRatings> personsDataframe = session.read()
.format("tct")
.option("header", false)
.option("inferSchema", true)
.option("delimiter", "\\s+")
.load("u.data.txt")
.map(row -> {
int movieID = row.getInteger(0);
int rating = row.getInteger(1);
return new MovieRatings(movieID, rating);
}).as(Encoders.bean(MovieRatings.class);
// Stop session
session.stop();
}
}

Spark DataFrame java.lang.OutOfMemoryError: GC overhead limit exceeded on long loop run

I'm running a Spark application (Spark 1.6.3 cluster), which does some calculations on 2 small data sets, and writes the result into an S3 Parquet file.
Here is my code:
public void doWork(JavaSparkContext sc, Date writeStartDate, Date writeEndDate, String[] extraArgs) throws Exception {
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
S3Client s3Client = new S3Client(ConfigTestingUtils.getBasicAWSCredentials());
boolean clearOutputBeforeSaving = false;
if (extraArgs != null && extraArgs.length > 0) {
if (extraArgs[0].equals("clearOutput")) {
clearOutputBeforeSaving = true;
} else {
logger.warn("Unknown param " + extraArgs[0]);
}
}
Date currRunDate = new Date(writeStartDate.getTime());
while (currRunDate.getTime() < writeEndDate.getTime()) {
try {
SparkReader<FirstData> sparkReader = new SparkReader<>(sc);
JavaRDD<FirstData> data1 = sparkReader.readDataPoints(
inputDir,
currRunDate,
getMinOfEndDateAndNextDay(currRunDate, writeEndDate));
// Normalize to 1 hours & 0.25 degrees
JavaRDD<FirstData> distinctData1 = data1.distinct();
// Floor all (distinct) values to 6 hour windows
JavaRDD<FirstData> basicData1BySixHours = distinctData1.map(d1 -> new FirstData(
d1.getId(),
TimeUtils.floorTimePerSixHourWindow(d1.getTimeStamp()),
d1.getLatitude(),
d1.getLongitude()));
// Convert Data1 to Dataframes
DataFrame data1DF = sqlContext.createDataFrame(basicData1BySixHours, FirstData.class);
data1DF.registerTempTable("data1");
// Read Data2 DataFrame
String currDateString = TimeUtils.getSimpleDailyStringFromDate(currRunDate);
String inputS3Path = basedirInput + "/dt=" + currDateString;
DataFrame data2DF = sqlContext.read().parquet(inputS3Path);
data2DF.registerTempTable("data2");
// Join data1 and data2
DataFrame mergedDataDF = sqlContext.sql("SELECT D1.Id,D2.beaufort,COUNT(1) AS hours " +
"FROM data1 as D1,data2 as D2 " +
"WHERE D1.latitude=D2.latitude AND D1.longitude=D2.longitude AND D1.timeStamp=D2.dataTimestamp " +
"GROUP BY D1.Id,D1.timeStamp,D1.longitude,D1.latitude,D2.beaufort");
// Create histogram per ID
JavaPairRDD<String, Iterable<Row>> mergedDataRows = mergedDataDF.toJavaRDD().groupBy(md -> md.getAs("Id"));
JavaRDD<MergedHistogram> mergedHistogram = mergedDataRows.map(new MergedHistogramCreator());
logger.info("Number of data1 results: " + data1DF.select("lId").distinct().count());
logger.info("Number of coordinates with data: " + data1DF.select("longitude","latitude").distinct().count());
logger.info("Number of results with beaufort histograms: " + mergedDataDF.select("Id").distinct().count());
// Save to parquet
String outputS3Path = basedirOutput + "/dt=" + TimeUtils.getSimpleDailyStringFromDate(currRunDate);
if (clearOutputBeforeSaving) {
writeWithCleanup(outputS3Path, mergedHistogram, MergedHistogram.class, sqlContext, s3Client);
} else {
write(outputS3Path, mergedHistogram, MergedHistogram.class, sqlContext);
}
} finally {
TimeUtils.progressToNextDay(currRunDate);
}
}
}
public void write(String outputS3Path, JavaRDD<MergedHistogram> outputRDD, Class outputClass, SQLContext sqlContext) {
// Apply a schema to an RDD of JavaBeans and save it as Parquet.
DataFrame fullDataDF = sqlContext.createDataFrame(outputRDD, outputClass);
fullDataDF.write().parquet(outputS3Path);
}
public void writeWithCleanup(String outputS3Path, JavaRDD<MergedHistogram> outputRDD, Class outputClass,
SQLContext sqlContext, S3Client s3Client) {
String fileKey = S3Utils.getS3Key(outputS3Path);
String bucket = S3Utils.getS3Bucket(outputS3Path);
logger.info("Deleting existing dir: " + outputS3Path);
s3Client.deleteAll(bucket, fileKey);
write(outputS3Path, outputRDD, outputClass, sqlContext);
}
public Date getMinOfEndDateAndNextDay(Date startTime, Date proposedEndTime) {
long endOfDay = startTime.getTime() - startTime.getTime() % MILLIS_PER_DAY + MILLIS_PER_DAY ;
if (endOfDay < proposedEndTime.getTime()) {
return new Date(endOfDay);
}
return proposedEndTime;
}
The size of data1 is around 150,000 and data2 is around 500,000.
What my code does is basically does some data manipulation, merges the 2 data objects, does a bit more manipulation, prints some statistics and saves to parquet.
The spark has 25GB of memory per server, and the code runs fine.
Each iteration takes about 2-3 minutes.
The problem starts when I run it on a large set of dates.
After a while, I get an OutOfMemory:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.List.$colon$colon$colon(List.scala:127)
at org.json4s.JsonDSL$JsonListAssoc.$tilde(JsonDSL.scala:98)
at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:139)
at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:72)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:164)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:38)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:87)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:72)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:72)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:71)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:70)
Last time it ran, it crashed after 233 iterations.
The line it crashed on was this:
logger.info("Number of coordinates with data: " + data1DF.select("longitude","latitude").distinct().count());
Can anyone please tell me what can be the reason for the eventual crashes?

I'm not sure that everyone will find this solution viable, but upgrading the Spark cluster to 2.2.0 seems to have resolved the issue.
I have ran my application for several days now, and had no crashes yet.

This error occurs when GC takes up over 98% of the total execution time of process. You can monitor the GC time in your Spark Web UI by going to stages tab in http://master:4040.
Try increasing the driver/executor(whichever is generating this error) memory using spark.{driver/executor}.memory by --conf while submitting the spark application.
Another thing to try is to change the garbage collector that the java is using. Read this article for that: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html. It very clearly explains why GC overhead error occurs and which garbage collector is best for your application.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Spark Streaming: Aggregating datastream multiple times - java

Related

How to track committed offset with Spark job for kafka batch

How to make spark streaming commit in each batch when limiting Kafka batch size?

MongoDB Java driver much slower than console with $gte/$lte

How can I print the content of rows in a Dataset using Java and the Spark SQL?

Spark DataFrame java.lang.OutOfMemoryError: GC overhead limit exceeded on long loop run

Categories

Resources