Spark + Java - Get Results from a dataset - java

I have a small dataset with population data by country on HDFS. I have written the code to parse it and load it into a Dataset<Row>:
SparkConf conf = new SparkConf().setAppName("JavaWordCount").setMaster("local");
SparkContext context = new SparkContext(conf);
SparkSession sparkSession = new SparkSession(context);
Dataset<Row> df = sparkSession.read().format("com.databricks.spark.csv").option("header", true).option("inferSchema", true).load(args[1]);
System.out.println("========== Print Schema ============");
df.printSchema();
System.out.println("========== Print Data ==============");
df.show();
The console shows the data correctly -
+-----------------------+-------------------+-------------+---------------+----------+
|countriesAndTerritories| location| continent|population_year|population|
+-----------------------+-------------------+-------------+---------------+----------+
| Afghanistan| Afghanistan| Asia| 2020| 38928341|
| Albania| Albania| Europe| 2020| 2877800|
| Algeria| Algeria| Africa| 2020| 43851043|
| Andorra| Andorra| Europe| 2020| 77265|
However, I want to get the population of United States into an int variable.
The query to choose the population is
Dataset<Row> xdc = df.select(col("population"))
        .where(col("location").equalTo("United States")).limit(1);
But how do I get its contents into an int variable?

You can try this:
int v = Integer.parseInt(
        df.select(col("population"))
          .where(col("location").equalTo("United States"))
          .limit(1)
          .first()
          .get(0)
          .toString()
);
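If the column's inferred type is integral, you can also skip the string round trip and read the value directly from the Row. A minimal sketch, assuming inferSchema gave "population" an integer type (use getLong if it came out as bigint):
Row row = df.select(col("population"))
        .where(col("location").equalTo("United States"))
        .first();
int population = row.getInt(0);
// long population = row.getLong(0);  // if the column was inferred as long/bigint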

Related

Exception: Complete output mode not supported

I created a Spark Streaming simulation for my tutorial. When I use outputMode("complete"), I get an error.
ERROR:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;
My dataset example:
2006-04-01 00:00:00.000 +0200,Partly Cloudy,rain,9.472222222222221,7.3888888888888875,0.89,14.1197,251.0,15.826300000000002,0.0,1015.13,Partly cloudy throughout the day.
First process code (partitioning by Summary):
System.setProperty("hadoop.home.dir","C:\\hadoop-common-2.2.0-bin-master");
SparkSession sparkSession = SparkSession.builder()
.appName("SparkStreamingMessageListener")
.master("local")
.getOrCreate();
StructType schema = new StructType()
.add("Formatted Date", "String")
.add("Summary","String")
.add("Precip Type", "String")
.add("Temperature", "Double")
.add("Apparent Temperature", "Double")
.add("Humidity","Double")
.add("Wind Speed (km/h)","Double")
.add("Wind Bearing (degrees)","Double")
.add("Visibility (km)","Double")
.add("Loud Cover","Double")
.add("Pressure(milibars)","Double")
.add("Dailiy Summary","String");
Dataset<Row> formatted_date = sparkSession.read().schema(schema).option("header", true).csv("C:\\Users\\Kaan\\Desktop\\Kaan Proje\\SparkStreamingListener\\archivecsv\\weatherHistory.csv");
Dataset<Row> avg = formatted_date.groupBy("Summary", "Precip Type").avg("Temperature").sort(functions.desc("avg(Temperature)"));
formatted_date.write().partitionBy("Summary").csv("C:\\Users\\Kaan\\Desktop\\Kaan Proje\\SparkStreamingListener\\archivecsv\\weatherHistoryFile\\");
Second listener process code:
SparkSession sparkSession = SparkSession.builder()
.appName("SparkStreamingMessageListener1")
.master("local")
.getOrCreate();
StructType schema1 = new StructType()
.add("Formatted Date", "String")
.add("Precip Type", "String")
.add("Temperature", "Double")
.add("Apparent Temperature", "Double")
.add("Humidity","Double")
.add("Wind Speed (km/h)","Double")
.add("Wind Bearing (degrees)","Double")
.add("Visibility (km)","Double")
.add("Loud Cover","Double")
.add("Pressure(milibars)","Double")
.add("Dailiy Summary","String");
Dataset<Row> rawData = sparkSession.readStream().schema(schema1).option("sep", ",").csv("C:\\Users\\Kaan\\Desktop\\Kaan Proje\\sparkStreamingWheather\\*");
Dataset<Row> heatData = rawData.select("Temperature", "Precip Type").where("Temperature>10");
StreamingQuery start = heatData.writeStream().outputMode("complete").format("console").start();
start.awaitTermination();
I created a Streaming simulation by copying the partitioned files to the specified Listener file path.
I would be glad if you could help. Thanks.
The error is pretty specific about the actual problem: the complete output mode is not supported for this type of query.
As stated in the Structured Streaming Guide on Output Modes:
"Complete mode not supported as it is infeasible to keep all unaggregated data in the Result Table."
Switching to the append output mode resolves the issue:
StreamingQuery start = heatData.writeStream().outputMode("append").format("console").start();
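If you actually want a running result table, complete mode is only supported when the streaming query aggregates. A minimal sketch of such a query, reusing rawData and the column names from the schema above:
Dataset<Row> avgHeat = rawData
        .groupBy("Precip Type")
        .avg("Temperature");

StreamingQuery aggQuery = avgHeat.writeStream()
        .outputMode("complete")
        .format("console")
        .start();
aggQuery.awaitTermination();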

Convert SQL data to JSON array [java spark]

I have a dataframe that I want to convert into a JSON array. Please find the example below.
Dataframe
+------------+----------------+----------+----------------+------------------+
|        Name|              id|request_id|create_timestamp|deadline_timestamp|
+------------+----------------+----------+----------------+------------------+
|    Freeform|59bbe3ad-f487-44| htvjiwmfe|   1589155200000|     1591272659556|
|         D23|59bbe3ad-f487-44| htvjiwmfe|   1589155200000|     1591272659556|
|      Stores|59bbe3ad-f487-44| htvjiwmfe|   1589155200000|     1591272659556|
|VacationClub|59bbe3ad-f487-44| htvjiwmfe|   1589155200000|     1591272659556|
+------------+----------------+----------+----------------+------------------+
Wanted in Json Like below:
[
  {
    "testname": "xyz",
    "systemResponse": [
      {
        "name": "FGH",
        "id": "59bbe3ad-f487-44",
        "request_id": 1590791280,
        "create_timestamp": 1590799280
      },
      {
        "name": "FGH",
        "id": "59bbe3ad-f487-44",
        "request_id": 1590791280,
        "create_timestamp": 1590799280
      }
    ]
  }
]
You can define two beans:
Create an array from the first DataFrame as an array of inner beans.
Define a parent bean with testname and the array of request details.
Please also see the inline code comments.
object DataToJsonArray {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
import spark.implicits._
//Load your dataframe
val requestDetailArray = List(
("Freeform", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000", "1591272659556"),
("D23", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000", "1591272659556"),
("Stores", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000", "1591272659556"),
("VacationClub", "59bbe3ad-f487-44", "htvjiwmfe", "1589155200000", "1591272659556")
).toDF
//Map your Dataframe to RequestDetails bean
.map(row => RequestDetails(row.getString(0), row.getString(1), row.getString(2), row.getString(3), row.getString(4)))
//Collect it as Array
.collect()
//Create another data frame with List[BaseClass] and set the (testname, Array[RequestDetails])
List(BaseClass("xyz", requestDetailArray)).toDF()
.write
//Output your Dataframe as JSON
.json("/json/output/path")
}
}
case class RequestDetails(Name: String, id: String, request_id: String, create_timestamp: String, deadline_timestamp: String)
case class BaseClass(testname: String = "xyz", systemResponse: Array[RequestDetails])
Check the code below.
import org.apache.spark.sql.functions._
df.withColumn("systemResponse",
array(
struct("id","request_id","create_timestamp","deadline_timestamp").as("data")
)
)
.select("systemResponse")
.toJSON
.select(col("value").as("json_data"))
.show(false)
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|json_data |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|{"systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
+-----------------------------------------------------------------------------------------------------------------------------------------------+
Updated
scala> :paste
// Entering paste mode (ctrl-D to finish)
df.withColumn("systemResponse",
array(
struct("id","request_id","create_timestamp","deadline_timestamp").as("data")
)
)
.withColumn("testname",lit("xyz"))
.select("testname","systemResponse")
.toJSON
.select(col("value").as("json_data"))
.show(false)
// Exiting paste mode, now interpreting.
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|json_data |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"testname":"xyz","systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"testname":"xyz","systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"testname":"xyz","systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
|{"testname":"xyz","systemResponse":[{"id":"59bbe3ad-f487-44","request_id":"htvjiwmfe","create_timestamp":"1589155200000","deadline_timestamp":"1591272659556"}]}|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
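Since the question is tagged for Java, here is a rough Java sketch of the same column-function approach (df is assumed to be the DataFrame shown in the question):
import static org.apache.spark.sql.functions.*;

Dataset<String> json = df
        .withColumn("systemResponse",
                array(struct("id", "request_id", "create_timestamp", "deadline_timestamp")))
        .withColumn("testname", lit("xyz"))
        .select("testname", "systemResponse")
        .toJSON();
json.show(false);
// Or write it out directly, since toJSON yields a single string column:
// json.write().text("/json/output/path");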

Java Apache Spark flatMaps & Data Wrangling

I have to pivot the data in a file and then store it in another file. I am having some difficulty pivoting the data.
I have multiple files that contain data which looks somewhat like what I show below. The columns have variable lengths. I am trying to merge the files first, but for some reason the output is not correct. I haven't tried the pivot method yet, and am not sure how to use it either.
How can this be achieved?
File 1:
0,26,27,30,120
201008,100,1000,10,400
201009,200,2000,20,500
201010,300,3000,30,600
File 2:
0,26,27,30,120,145
201008,100,1000,10,400,200
201009,200,2000,20,500,100
201010,300,3000,30,600,150
File 3:
0,26,27,120,145
201008,100,10,400,200
201009,200,20,500,100
201010,300,30,600,150
Output:
201008,26,100
201008,27,1000
201008,30,10
201008,120,400
201008,145,200
201009,26,200
201009,27,2000
201009,30,20
201009,120,500
201009,145,100
.....
I am not quite familiar with Spark, but am trying to use flatMap and flatMapValues. I am not sure how to use them yet, and would appreciate some guidance.
import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.SparkSession;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class ExecutionTest {
public static void main(String[] args) {
Logger.getLogger("org.apache").setLevel(Level.WARN);
Logger.getLogger("org.spark_project").setLevel(Level.WARN);
Logger.getLogger("io.netty").setLevel(Level.WARN);
log.info("Starting...");
// Step 1: Create a SparkContext.
boolean isRunLocally = Boolean.valueOf(args[0]);
String filePath = args[1];
SparkConf conf = new SparkConf().setAppName("Variable File").set("serializer",
"org.apache.spark.serializer.KryoSerializer");
if (isRunLocally) {
log.info("System is running in local mode");
conf.setMaster("local[*]").set("spark.executor.memory", "2g");
}
SparkSession session = SparkSession.builder().config(conf).getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());
jsc.textFile(filePath, 2)
.map(new Function<String, String[]>() {
private static final long serialVersionUID = 1L;
@Override
public String[] call(String v1) throws Exception {
return StringUtils.split(v1, ",");
}
})
.foreach(new VoidFunction<String[]>() {
private static final long serialVersionUID = 1L;
@Override
public void call(String[] t) throws Exception {
for (String string : t) {
log.info(string);
}
}
});
}
}
Here is a solution in Scala, as I am not a Java person; you should be able to adapt it (and add sorting, caching, etc.).
The data is as follows: 3 files, with a duplicate entry evident; get rid of that if you do not want it.
0, 5,10, 15, 20
202008, 5,10, 15, 20
202009,10,20,100,200
8 rows generated above.
0,888,999
202008, 5, 10
202009, 10, 20
4 rows generated above.
0, 5
202009,10
1 row, which is a duplicate.
// Bit lazy with columns names, but anyway.
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val inputPath: String = "/FileStore/tables/g*.txt"
val rdd = spark.read.text(inputPath)
.select(input_file_name, $"value")
.as[(String, String)]
.rdd
val rdd2 = rdd.zipWithIndex
val rdd3 = rdd2.map(x => (x._1._1, x._2, x._1._2.split(",").toList.map(_.toInt)))
val rdd4 = rdd3.map { case (pfx, pfx2, list) => (pfx,pfx2,list.zipWithIndex) }
val df = rdd4.toDF()
df.show(false)
df.printSchema()
val df2 = df.withColumn("rankF", row_number().over(Window.partitionBy($"_1").orderBy($"_2".asc)))
df2.show(false)
df2.printSchema()
val df3 = df2.withColumn("elements", explode($"_3"))
df3.show(false)
df3.printSchema()
val df4 = df3.select($"_1", $"rankF", $"elements".getField("_1"), $"elements".getField("_2")).toDF("fn", "line_num", "val", "col_pos")
df4.show(false)
df4.printSchema()
df4.createOrReplaceTempView("df4temp")
val df51 = spark.sql("""SELECT hdr.fn, hdr.line_num, hdr.val AS pfx, hdr.col_pos
FROM df4temp hdr
WHERE hdr.line_num <> 1
AND hdr.col_pos = 0
""")
df51.show(100,false)
val df52 = spark.sql("""SELECT t1.fn, t1.val AS val1, t1.col_pos, t2.line_num, t2.val AS val2
FROM df4temp t1, df4temp t2
WHERE t1.col_pos <> 0
AND t1.col_pos = t2.col_pos
AND t1.line_num <> t2.line_num
AND t1.line_num = 1
AND t1.fn = t2.fn
""")
df52.show(100,false)
df51.createOrReplaceTempView("df51temp")
df52.createOrReplaceTempView("df52temp")
val df53 = spark.sql("""SELECT DISTINCT t1.pfx, t2.val1, t2.val2
FROM df51temp t1, df52temp t2
WHERE t1.fn = t2.fn
AND t1.line_num = t2.line_num
""")
df53.show(false)
returns:
+------+----+----+
|pfx |val1|val2|
+------+----+----+
|202008|888 |5 |
|202009|999 |20 |
|202009|20 |200 |
|202008|5 |5 |
|202008|10 |10 |
|202009|888 |10 |
|202008|15 |15 |
|202009|5 |10 |
|202009|10 |20 |
|202009|15 |100 |
|202008|20 |20 |
|202008|999 |10 |
+------+----+----+
What we see is data wrangling: massaging the data into temp views and JOINing appropriately with SQL.
The key here is to know how to massage the data to make things easy. Note there is no groupBy, etc. Per file, with rows of varying length, a JOIN was not attempted at the RDD level, as that is too inflexible. The rank shows the line number, so you know which line is the first one (the header starting with 0).
This is what we call data wrangling. It is also what we call hard work for a few points on SO; this is one of my best efforts, and also one of the last of such efforts.
A weakness of the solution is that it takes a lot of work to get the first record of a file; there are alternatives (https://www.cyberciti.biz/faq/unix-linux-display-first-line-of-file/), and preprocessing is what I would realistically consider.
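Since the question asked for Java, here is a rough Java sketch of the same unpivot idea using a different route: wholeTextFiles pairs each file with its full content, so the header line (the one starting with 0) can be zipped against that file's own data rows. The input and output paths are placeholders, and jsc is the JavaSparkContext from the question's code.
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import scala.Tuple2;

// Turn one (fileName, fileContent) pair into "rowKey,columnId,value" lines.
FlatMapFunction<Tuple2<String, String>, String> unpivot = file -> {
    String[] lines = file._2.split("\\R");
    String[] header = lines[0].split(",");          // e.g. 0,26,27,30,120
    List<String> out = new ArrayList<>();
    for (int i = 1; i < lines.length; i++) {
        String[] values = lines[i].split(",");
        String rowKey = values[0];                  // e.g. 201008
        for (int c = 1; c < values.length && c < header.length; c++) {
            out.add(rowKey + "," + header[c] + "," + values[c]);
        }
    }
    return out.iterator();
};

JavaRDD<String> triples = jsc.wholeTextFiles("path/to/input/*")
        .flatMap(unpivot)
        .distinct();                                // optional: drop duplicate entries

triples.saveAsTextFile("path/to/output");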

Java Spark : Spark Bug Workaround for Datasets Joining with unknown Join Column Names

I am using Spark 2.3.1 with Java.
I have encountered what I think is this known bug of Spark.
Here is my code :
public Dataset<Row> compute(Dataset<Row> df1, Dataset<Row> df2, List<String> columns) {
    Seq<String> columns_seq = JavaConverters.asScalaIteratorConverter(columns.iterator()).asScala().toSeq();
    final Dataset<Row> join = df1.join(df2, columns_seq);
    join.show();
    join.withColumn("newColumn", abs(col("value1").minus(col("value2")))).show();
    return join;
}
I call my code like this :
Dataset<Row> myNewDF = compute(MyDataset1, MyDataset2, Arrays.asList("field1","field2","field3","field4"));
Note : MyDataset1 and MyDataset2 are two datasets that come from the same Dataset MyDataset0 with multiple different transformations.
On the join.show() line, I get the following error :
2018-08-03 18:48:43 - ERROR main Logging$class - - - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 235, Column 21: Expression "project_isNull_2" is not an rvalue
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 235, Column 21: Expression "project_isNull_2" is not an rvalue
at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
at org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
at org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
...
2018-08-03 18:48:47 - WARN main Logging$class - - - Whole-stage codegen disabled for plan (id=7):
But it does not stop the execution and still displays the content of the dataset.
Then, on the line join.withColumn("newColumn", abs(col("value1").minus(col("value2")))).show();
I get the error :
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) 'value2,'value1 missing from field6#16,field7#3,field8#108,field5#0,field9#4,field10#28,field11#323,value1#298,field12#131,day#52,field3#119,value2#22,field2#35,field1#43,field4#144 in operator 'Project [field1#43, field2#35, field3#119, field4#144, field5#0, field6#16, value2#22, field7#3, field9#4, field10#28, day#52, field8#108, field12#131, value1#298, field11#323, abs(('value1 - 'value2)) AS newColumn#2579]. Attribute(s) with the same name appear in the operation: value2,value1. Please check if the right attribute(s) are used.;;
'Project [field1#43, field2#35, field3#119, field4#144, field5#0, field6#16, value2#22, field7#3, field9#4, field10#28, day#52, field8#108, field12#131, value1#298, field11#323, abs(('value1 - 'value2)) AS newColumn#2579]
+- AnalysisBarrier
...
This error ends the program.
The workaround proposed by Mijung Kim on the Jira issue is to create a Dataset clone via toDF(columns). But in my case, where the column names used for the join are not known in advance (I only have a List), I can't use this workaround.
Is there another way to get around this very annoying bug?
Try to call this method:
// Needs: org.apache.spark.sql.{Column, Dataset, Row}, org.apache.spark.sql.types.StructField,
// java.util.{ArrayList, List}, scala.collection.JavaConversions and scala.collection.JavaConverters.
private static Dataset<Row> cloneDataset(Dataset<Row> ds) {
    List<Column> filterColumns = new ArrayList<>();
    List<String> filterColumnsNames = new ArrayList<>();
    scala.collection.Iterator<StructField> it = ds.exprEnc().schema().toIterator();
    while (it.hasNext()) {
        String columnName = it.next().name();
        filterColumns.add(ds.col(columnName));
        filterColumnsNames.add(columnName);
    }
    ds = ds.select(JavaConversions.asScalaBuffer(filterColumns).seq())
           .toDF(scala.collection.JavaConverters.asScalaIteratorConverter(filterColumnsNames.iterator()).asScala().toSeq());
    return ds;
}
on both datasets just before the join like this :
df1 = cloneDataset(df1);
df2 = cloneDataset(df2);
final Dataset<Row> join = df1.join(df2, columns_seq);
// or ( based on Nakeuh comment )
final Dataset<Row> join = cloneDataset(df1.join(df2, columns_seq));
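For completeness, a sketch of how this could fit into the compute() method from the question (same signature, nothing else changed):
public Dataset<Row> compute(Dataset<Row> df1, Dataset<Row> df2, List<String> columns) {
    Seq<String> columns_seq = JavaConverters.asScalaIteratorConverter(columns.iterator()).asScala().toSeq();
    // Clone both inputs so the join and the later withColumn resolve against fresh attribute ids.
    final Dataset<Row> join = cloneDataset(df1).join(cloneDataset(df2), columns_seq);
    join.show();
    join.withColumn("newColumn", abs(col("value1").minus(col("value2")))).show();
    return join;
}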

Spark DataFrame java.lang.OutOfMemoryError: GC overhead limit exceeded on long loop run

I'm running a Spark application (Spark 1.6.3 cluster), which does some calculations on 2 small data sets, and writes the result into an S3 Parquet file.
Here is my code:
public void doWork(JavaSparkContext sc, Date writeStartDate, Date writeEndDate, String[] extraArgs) throws Exception {
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
S3Client s3Client = new S3Client(ConfigTestingUtils.getBasicAWSCredentials());
boolean clearOutputBeforeSaving = false;
if (extraArgs != null && extraArgs.length > 0) {
if (extraArgs[0].equals("clearOutput")) {
clearOutputBeforeSaving = true;
} else {
logger.warn("Unknown param " + extraArgs[0]);
}
}
Date currRunDate = new Date(writeStartDate.getTime());
while (currRunDate.getTime() < writeEndDate.getTime()) {
try {
SparkReader<FirstData> sparkReader = new SparkReader<>(sc);
JavaRDD<FirstData> data1 = sparkReader.readDataPoints(
inputDir,
currRunDate,
getMinOfEndDateAndNextDay(currRunDate, writeEndDate));
// Normalize to 1 hours & 0.25 degrees
JavaRDD<FirstData> distinctData1 = data1.distinct();
// Floor all (distinct) values to 6 hour windows
JavaRDD<FirstData> basicData1BySixHours = distinctData1.map(d1 -> new FirstData(
d1.getId(),
TimeUtils.floorTimePerSixHourWindow(d1.getTimeStamp()),
d1.getLatitude(),
d1.getLongitude()));
// Convert Data1 to Dataframes
DataFrame data1DF = sqlContext.createDataFrame(basicData1BySixHours, FirstData.class);
data1DF.registerTempTable("data1");
// Read Data2 DataFrame
String currDateString = TimeUtils.getSimpleDailyStringFromDate(currRunDate);
String inputS3Path = basedirInput + "/dt=" + currDateString;
DataFrame data2DF = sqlContext.read().parquet(inputS3Path);
data2DF.registerTempTable("data2");
// Join data1 and data2
DataFrame mergedDataDF = sqlContext.sql("SELECT D1.Id,D2.beaufort,COUNT(1) AS hours " +
"FROM data1 as D1,data2 as D2 " +
"WHERE D1.latitude=D2.latitude AND D1.longitude=D2.longitude AND D1.timeStamp=D2.dataTimestamp " +
"GROUP BY D1.Id,D1.timeStamp,D1.longitude,D1.latitude,D2.beaufort");
// Create histogram per ID
JavaPairRDD<String, Iterable<Row>> mergedDataRows = mergedDataDF.toJavaRDD().groupBy(md -> md.getAs("Id"));
JavaRDD<MergedHistogram> mergedHistogram = mergedDataRows.map(new MergedHistogramCreator());
logger.info("Number of data1 results: " + data1DF.select("lId").distinct().count());
logger.info("Number of coordinates with data: " + data1DF.select("longitude","latitude").distinct().count());
logger.info("Number of results with beaufort histograms: " + mergedDataDF.select("Id").distinct().count());
// Save to parquet
String outputS3Path = basedirOutput + "/dt=" + TimeUtils.getSimpleDailyStringFromDate(currRunDate);
if (clearOutputBeforeSaving) {
writeWithCleanup(outputS3Path, mergedHistogram, MergedHistogram.class, sqlContext, s3Client);
} else {
write(outputS3Path, mergedHistogram, MergedHistogram.class, sqlContext);
}
} finally {
TimeUtils.progressToNextDay(currRunDate);
}
}
}
public void write(String outputS3Path, JavaRDD<MergedHistogram> outputRDD, Class outputClass, SQLContext sqlContext) {
// Apply a schema to an RDD of JavaBeans and save it as Parquet.
DataFrame fullDataDF = sqlContext.createDataFrame(outputRDD, outputClass);
fullDataDF.write().parquet(outputS3Path);
}
public void writeWithCleanup(String outputS3Path, JavaRDD<MergedHistogram> outputRDD, Class outputClass,
SQLContext sqlContext, S3Client s3Client) {
String fileKey = S3Utils.getS3Key(outputS3Path);
String bucket = S3Utils.getS3Bucket(outputS3Path);
logger.info("Deleting existing dir: " + outputS3Path);
s3Client.deleteAll(bucket, fileKey);
write(outputS3Path, outputRDD, outputClass, sqlContext);
}
public Date getMinOfEndDateAndNextDay(Date startTime, Date proposedEndTime) {
long endOfDay = startTime.getTime() - startTime.getTime() % MILLIS_PER_DAY + MILLIS_PER_DAY ;
if (endOfDay < proposedEndTime.getTime()) {
return new Date(endOfDay);
}
return proposedEndTime;
}
The size of data1 is around 150,000 and data2 is around 500,000.
What my code basically does is some data manipulation: it merges the 2 data objects, does a bit more manipulation, prints some statistics, and saves to Parquet.
The Spark cluster has 25GB of memory per server, and the code runs fine.
Each iteration takes about 2-3 minutes.
The problem starts when I run it on a large set of dates.
After a while, I get an OutOfMemory:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.collection.immutable.List.$colon$colon$colon(List.scala:127)
at org.json4s.JsonDSL$JsonListAssoc.$tilde(JsonDSL.scala:98)
at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:139)
at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:72)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:164)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:38)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:87)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:72)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:72)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:71)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:70)
Last time it ran, it crashed after 233 iterations.
The line it crashed on was this:
logger.info("Number of coordinates with data: " + data1DF.select("longitude","latitude").distinct().count());
Can anyone please tell me what can be the reason for the eventual crashes?
I'm not sure everyone will find this solution viable, but upgrading the Spark cluster to 2.2.0 seems to have resolved the issue.
I have run my application for several days now and have had no crashes yet.
This error occurs when GC takes up over 98% of the total execution time of the process. You can monitor the GC time in the Spark Web UI by going to the Stages tab at http://master:4040.
Try increasing the driver/executor memory (whichever is generating this error) using spark.driver.memory / spark.executor.memory, passed with --conf when submitting the Spark application.
Another thing to try is to change the garbage collector that the JVM is using. Read this article for that: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html. It clearly explains why the GC overhead error occurs and which garbage collector is best for your application.
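A minimal spark-submit sketch of both suggestions; the memory sizes, class name, and jar are placeholders to adjust for your cluster:
spark-submit \
  --conf spark.driver.memory=8g \
  --conf spark.executor.memory=8g \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  --class com.example.MyJob \
  my-job.jar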
