Spark fails with java.lang.OutOfMemoryError: GC overhead limit exceeded? - java

This is my Java code in which I am querying data from Hive using Apache Spark SQL.
JavaSparkContext ctx = new JavaSparkContext(new SparkConf().setAppName("LoadData").setMaster("MasterUrl"));
HiveContext sqlContext = new HiveContext(ctx.sc());
List<Row> result = sqlContext.sql("Select * from Tablename").collectAsList();
When I run this code it throws java.lang.OutOfMemoryError: GC overhead limit exceeded. How can I solve this, or how do I increase the memory in the Spark configuration?

If you are using the spark-shell to run it, then you can use --driver-memory to bump the driver's memory limit:
spark-shell --driver-memory Xg [other options]
If the executors are having problems, then you can adjust their memory limits with --executor-memory XG.
You can find more info on how exactly to set them in the guides: submission for executor memory, configuration for driver memory.
Edit: since you are running it from NetBeans, you should be able to pass them as JVM arguments: -Dspark.driver.memory=XG and -Dspark.executor.memory=XG. I think it was in Project Properties under Run.
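For illustration, here is a minimal sketch of where these settings live in code; the master URL and the 4g values are placeholders, not your actual configuration. spark.executor.memory can be set on the SparkConf, while the driver's heap is fixed when its JVM starts, so for a driver launched through spark-submit use --driver-memory, and for one launched from an IDE make sure the JVM itself gets enough heap alongside the -D properties:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class LoadData {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("LoadData")
                .setMaster("MasterUrl")              // placeholder, as in the question
                .set("spark.executor.memory", "4g"); // example value for the executors
        // The driver runs in this JVM when started from an IDE, so its heap is
        // governed by the JVM options it was launched with (e.g. -Xmx / the -D
        // properties mentioned above); spark-submit applies --driver-memory itself.
        JavaSparkContext ctx = new JavaSparkContext(conf);
        HiveContext sqlContext = new HiveContext(ctx.sc());
        long count = sqlContext.sql("Select * from Tablename").count(); // avoid collecting all rows
        System.out.println(count);
        ctx.stop();
    }
}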

Have you found any solution for your issue yet?
Please share it if you have :D
Here is my idea: RDD (and JavaRDD as well) has a method toLocalIterator(). The Spark documentation says that
The iterator will consume as much memory as the largest partition in this RDD.
This means the iterator will consume less memory than a List if the RDD is divided into many partitions. You can try it like this:
Iterator<Row> iter = sqlContext.sql("Select * from Tablename").javaRDD().toLocalIterator();
while (iter.hasNext()) {
    Row row = iter.next();
    // your code here
}
PS: it's just an idea and I haven't tested it yet.
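If the table has few partitions to begin with, one could also repartition before iterating, so that no single partition has to fit in driver memory at once. Again just an untested sketch; the partition count of 100 is an arbitrary example value:
Iterator<Row> iter = sqlContext.sql("Select * from Tablename")
        .javaRDD()
        .repartition(100)   // arbitrary example value; tune to your data size
        .toLocalIterator();
while (iter.hasNext()) {
    Row row = iter.next();
    // process one row at a time instead of collecting the whole table
}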

Related

How to investigate Java heap space errors in Jenkins

I have a matrix pipeline job which runs multiple stages (something like ~200), most of which are functional tests whose results are recorded by the following code:
stage('Report') {
    script {
        def summary = junit allowEmptyResults: true, testResults: "**/artifacts/${product}/test-reports/*.xml"
        def buildURL = "${env.BUILD_URL}"
        def TestAnalyzer = buildURL.replace("/${env.BUILD_NUMBER}", "/test_results_analyzer")
        def TestsURL = buildURL.replace("job/${env.JOB_NAME}/${env.BUILD_NUMBER}", "blue/organizations/jenkins/${env.JOB_NAME}/detail/${env.JOB_NAME}/${env.BUILD_NUMBER}/tests")
        slackSend (
            color: summary.failCount == 0 ? 'good' : 'warning',
            message: "*<${TestsURL}|Test Summary>* for *${env.JOB_NAME}* on *${env.HOSTNAME} - ${product}* <${env.BUILD_URL}| #${env.BUILD_NUMBER}> - <${TestAnalyzer}|${summary.totalCount} Tests>, Failed: ${summary.failCount}, Skipped: ${summary.skipCount}, Passed: ${summary.passCount}"
        )
    }
}
The problem is that this Report stage regularly fails with the following error:
> Archive JUnit-formatted test results 9m 25s
[2022-11-16T02:51:49.569Z] Recording test results
Java heap space
I have increased the heap space of the Jenkins server to 8GB by modifying the systemd service configuration this way:
software-dev@magnet:~$ sudo cat /etc/systemd/system/jenkins.service.d/override.conf
[Service]
Environment="JAVA_OPTS=-Djava.awt.headless=true -Xmx8g"
which was taken into account, as I verified with the following command:
software-dev@magnet:~$ tr '\0' '\n' < /proc/$(pidof java)/cmdline
/usr/bin/java
-Djava.awt.headless=true
-Xmx10g
-jar
/usr/share/java/jenkins.war
--webroot=/var/cache/jenkins/war
--httpPort=8080
I just increased the heap size to 10GB and will wait for the result of tonight's build, but this amount of heap space looks excessive, so I suspect that a plugin, maybe the JUnit one, may be buggy and consuming too much memory.
Is anyone aware of such a thing? Could there be workarounds?
More importantly, which methods could I use to track whether one plugin is consuming too much?
I have notions of Java from my CS degree, but I'm not familiar with the Jenkins development ecosystem.
Thank you in advance.
You can try splitting the tests into chunks/batches/groups, but this solution requires changes in the code (see the sketch after the links below).
More details
https://semaphoreci.com/community/tutorials/how-to-split-junit-tests-in-a-continuous-integration-environment
Grouping JUnit tests
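As a rough sketch of the grouping idea with JUnit 4 categories (all class and category names below are hypothetical, and each class would live in its own file): a suite runs only the tests tagged with a given category, so each Jenkins stage records a smaller report.
import org.junit.Test;
import org.junit.experimental.categories.Categories;
import org.junit.experimental.categories.Categories.IncludeCategory;
import org.junit.experimental.categories.Category;
import org.junit.runner.RunWith;
import org.junit.runners.Suite.SuiteClasses;

// Marker interface used as a category label.
public interface SlowTests {}

public class PaymentTest {
    @Category(SlowTests.class)
    @Test
    public void processesLargeBatch() { /* long-running functional test */ }

    @Test
    public void validatesInput() { /* quick unit test */ }
}

// Runs only the tests annotated with @Category(SlowTests.class);
// a sibling suite using Categories.ExcludeCategory can cover the rest.
@RunWith(Categories.class)
@IncludeCategory(SlowTests.class)
@SuiteClasses({PaymentTest.class})
public class SlowTestSuite {}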

Partitioning not working in MongoDB Spark read in Java connector

I was trying to read data using the MongoDB Spark connector and want to partition the dataset on a key, reading from a standalone mongod instance. I was looking at the MongoDB Spark docs, which mention various partitioner classes. I tried the MongoSamplePartitioner class, but it reads into just 1 partition. The MongoPaginateByCountPartitioner class likewise partitions into a fixed 66 partitions. This happens even when I configure "samplesPerPartition" and "numberOfPartitions" respectively in the two cases. I need to use a ReadConfig created via a map. My code:
SparkSession sparkSession = SparkSession.builder().appName("sampleRecords")
        .config("spark.driver.memory", "2g")
        .config("spark.driver.host", "127.0.0.1")
        .master("local[4]").getOrCreate();
Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://mongo-root:password#127.0.0.1:27017/importedDb.myNewCollection?authSource=admin");
readOverrides.put("numberOfPartitions", "16");
readOverrides.put("partitioner", "MongoPaginateByCountPartitioner");
ReadConfig readConfig = ReadConfig.create(readOverrides);
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Dataset<Row> dataset = MongoSpark.load(jsc, readConfig).toDF();
System.out.println(dataset.count()); //24576
System.out.println(dataset.rdd().getNumPartitions()); //66
Using the sample partitioner returns 1 partition every time.
Am I missing something here? Please help.
PS - I am reading 24576 records, mongod version v4.0.10, Mongo Spark connector 2.3.1, Java 8.
Edit:
I got it to work; the properties need to be given with a prefix, e.g. partitionerOptions.samplesPerPartition, in the map. But I am still facing an issue: partitionerOptions.samplesPerPartition : "1000" with MongoSamplePartitioner still only returns 1 partition. Any suggestions?
The number of partitions can be configured for MongoPaginateByCountPartitioner.
Supposing that we need to configure the target number of partitions to 16...
Please add partitionerOptions.numberOfPartitions -> 16 to the properties rather than just numberOfPartitions -> 16.
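A minimal sketch of the corrected map, based on the code in the question (only the option keys change; the URI is the same placeholder as above):
Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://mongo-root:password@127.0.0.1:27017/importedDb.myNewCollection?authSource=admin");
readOverrides.put("partitioner", "MongoPaginateByCountPartitioner");
// Partitioner-specific options need the "partitionerOptions." prefix:
readOverrides.put("partitionerOptions.numberOfPartitions", "16");
ReadConfig readConfig = ReadConfig.create(readOverrides);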

Native memory allocation (mmap) failed to map when using ojdbc8.jar and Oracle client 18

I have a Java app that extracts large XML documents from an Oracle 11g database, Release 11.2.0.4.0 (using Spring 4), and stores them in files. We had a problem extracting data containing multibyte characters: depending on where in the XML a multibyte character was, it would sometimes be split into 2 parts. It looked like the problem had to do with the version of the JDBC driver and the Oracle client installed, so we migrated to Oracle client 18 and ojdbc8.jar, leaving the code untouched. The multibyte problem was solved, but now we have memory issues that were not happening before. The error I get is:
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (mmap) failed to map 256376832 bytes for committing reserved memory.
I have played around with the java command parameters but to no avail. This is what I am running:
java -Dfile.encoding=UTF8 -Doracle.jdbc.timezoneAsRegion=false -Xmx10240m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -XX:NativeMemoryTracking=detail -XX:+StartAttachListener -XX:+UnlockDiagnosticVMOptions -XX:+PrintNMTStatistics
Decreasing -Xmx to 512m just ended up running out of heap memory.
I have monitored the application's memory using jcmd VM.native_memory baseline/summary.diff and GC.class_stats, and one of the biggest memory consumers is String objects. I was not able to make sense of the rest.
The SQL is:
SELECT XML_DATA
FROM table where....
The xml_data column is defined as:
XML_DATA NOT NULL SYS.XMLTYPE STORAGE BINARY
This is mapped in Java to oracle.xdb.XMLType:
public List<XMLType> extractXmlDataList(String sqlExtractionQuery, Key key) {
    MapSqlParameterSource namedSqlParams = createKeyParamMap(key);
    List<XMLType> dataList = namedParameterJdbcTemplate.queryForList(sqlExtractionQuery, namedSqlParams, XMLType.class);
    return dataList;
}

protected void extractXmlData(Key key) {
    List<XMLType> xmlRecs = producerDao.extractXmlDataList(sqlExtractionQuery, key);
    if (xmlRecs != null && xmlRecs.size() > 0) {
        for (XMLType xmlData : xmlRecs) {
            String xmlText = xmlData.getStringVal();
            // create nu.xom.Document
            Builder parser = new Builder();
            Document xmlDocument = parser.build(xmlText, null);
        }
    }
}
How can moving to Oracle client 18 and ojdbc8.jar have affected the memory consumption so much?
The memory issue was solved by using xmlserialize rather than the Oracle Java XMLType object. Still using Oracle client 18 and ojdbc8.jar; no further problems with multibyte UTF-8 splitting.
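A minimal sketch of what that can look like with the Spring NamedParameterJdbcTemplate from the question; the query text and the method name are illustrative, not the poster's confirmed code, and depending on driver settings very large CLOBs may still need to be streamed explicitly:
// Let the database serialize the XMLTYPE column to a CLOB so that plain character
// data comes back over JDBC instead of oracle.xdb.XMLType objects.
// Query used (table name and WHERE clause are placeholders, as in the question):
//   SELECT XMLSERIALIZE(DOCUMENT XML_DATA AS CLOB) FROM table WHERE ...

public List<String> extractXmlTextList(String sqlExtractionQuery, Key key) {
    MapSqlParameterSource namedSqlParams = createKeyParamMap(key);
    // Spring's single-column mapping returns each serialized document as a String.
    return namedParameterJdbcTemplate.queryForList(sqlExtractionQuery, namedSqlParams, String.class);
}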

How to analyze and limit the memory usage of GeoTrellis and Spark for tiling

Our main goal is to perform operations on a large amount of input data (around 80 GB). The problem is that even for smaller datasets we often get Java heap space or other memory-related errors.
Our temporary solution was simply to specify a higher maximum heap size (using -Xmx locally, or by setting spark.executor.memory and spark.driver.memory for our Spark instance), but this does not seem to generalize well; we still run into errors for bigger datasets or higher zoom levels.
For better understanding, here is the basic concept of what we do with our data:
1. Load the data using HadoopGeoTiffRDD.spatial(new Path(path)).
2. Map the data to the tiles of some zoom level:
   val extent = geotrellis.proj4.CRS.fromName("EPSG:4326").worldExtent
   val layout = layoutForZoom(zoom, extent)
   val metadata: TileLayerMetadata[SpatialKey] = dataSet.collectMetadata[SpatialKey](layout)
   val rdd = ContextRDD(dataSet.tileToLayout[SpatialKey](metadata), metadata)
   where layoutForZoom is basically the same as geotrellis.spark.tiling.ZoomedLayoutScheme.layoutForZoom.
3. Perform some operations on the entries of the RDD using rdd.map and rdd.foreach on the mapped RDDs.
4. Aggregate the results of four tiles which correspond to a single tile at a higher zoom level using groupByKey.
5. Go to 3 until we reach a certain zoom level.
The goal would be: Given a memory limit of X GB, partition and work on the data in a way that we consume at most X GB.
It seems like the tiling of the dataset via tileToLayout already takes too much memory at higher zoom levels (even for very small datasets). Are there any alternatives for tiling and loading the data according to some LayoutDefinition? As far as we understand, HadoopGeoTiffRDD.spatial already splits the dataset into small regions, which are then divided into tiles by tileToLayout. Is it somehow possible to directly load the dataset corresponding to the LayoutDefinition?
In our concrete scenario we have 3 workers with 2GB RAM and 2 cores each. The Spark master also runs on one of them and gets its work via spark-submit from a driver instance. We tried configurations like this:
val conf = new SparkConf().setAppName("Converter").setMaster("spark://IP-ADDRESS:PORT")
.set("spark.executor.memory", "900m") // to be below the available 1024 MB of default slave RAM
.set("spark.memory.fraction", "0.2") // to get more usable heap space
.set("spark.executor.cores", "2")
.set("spark.executor.instances", "3")
An example of a heap space error at the tiling step (step 2):
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 5, 192.168.0.2, executor 1): java.lang.OutOfMemoryError: Java heap space
at scala.collection.mutable.ArrayBuilder$ofByte.mkArray(ArrayBuilder.scala:128)
at scala.collection.mutable.ArrayBuilder$ofByte.resize(ArrayBuilder.scala:134)
at scala.collection.mutable.ArrayBuilder$ofByte.sizeHint(ArrayBuilder.scala:139)
at scala.collection.IndexedSeqOptimized$class.slice(IndexedSeqOptimized.scala:115)
at scala.collection.mutable.ArrayOps$ofByte.slice(ArrayOps.scala:198)
at geotrellis.util.StreamingByteReader.getBytes(StreamingByteReader.scala:98)
at geotrellis.raster.io.geotiff.LazySegmentBytes.getBytes(LazySegmentBytes.scala:104)
at geotrellis.raster.io.geotiff.LazySegmentBytes.readChunk(LazySegmentBytes.scala:81)
at geotrellis.raster.io.geotiff.LazySegmentBytes$$anonfun$getSegments$1.apply(LazySegmentBytes.scala:99)
at geotrellis.raster.io.geotiff.LazySegmentBytes$$anonfun$getSegments$1.apply(LazySegmentBytes.scala:99)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:185)
at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1012)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1010)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2118)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2118)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2119)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1026)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1008)
at geotrellis.spark.TileLayerMetadata$.collectMetadataWithCRS(TileLayerMetadata.scala:147)
at geotrellis.spark.TileLayerMetadata$.fromRdd(TileLayerMetadata.scala:281)
at geotrellis.spark.package$withCollectMetadataMethods.collectMetadata(package.scala:212)
...
Update:
I extracted an example from my code and uploaded it to the repository at https://gitlab.com/hwuerz/geotrellis-spark-example. You can run the example locally using sbt run and selecting the class demo.HelloGeotrellis. This will create the tiles for a tiny input dataset example.tif according to our layout definition, starting at zoom level 20 (using two cores by default; this can be adjusted in the file HelloGeotrellis.scala). If level 20 somehow still works, it will most likely fail for higher values of bottomLayer.
To run the code on the Spark cluster, I use the following command:
sbt package && bash submit.sh --dataLocation /mnt/glusterfs/example.tif --bottomLayer 20 --topLayer 10 --cesiumTerrainDir /mnt/glusterfs/terrain/ --sparkMaster spark://192.168.0.8:7077
where submit.sh basically runs spark-submit (see the file in the repo).
The example.tif is included in the repo within the directory DebugFiles. In my setup the file is distributed via GlusterFS, which is why the path points to this location. The cesiumTerrainDir is just a directory where we store our generated output.
We think the main problem might be that, with the given API calls, GeoTrellis loads the complete structure of the layout into memory, which is too big for higher zoom levels. Is there any way to avoid this?

Java Out of Memory exception in Ubuntu when using Flume/Hadoop

I'm getting an out-of-memory exception due to lack of Java heap space when I try to download tweets using Flume and pipe them into Hadoop.
I have currently set the heap space to 4GB in Hadoop's mapred-site.xml, like so:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx4096m</value>
</property>
I am hoping to download tweets continually for two days but can't get past 45 minutes without errors.
Since I do have the disk space to hold all of this, I am assuming the error is coming from Java having to handle so many things at once. Is there a way for me to slow down the speed at which these tweets are downloaded, or do something else to solve this problem?
Edit: flume.conf included
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <required>
TwitterAgent.sources.Twitter.consumerSecret = <required>
TwitterAgent.sources.Twitter.accessToken = <required>
TwitterAgent.sources.Twitter.accessTokenSecret = <required>
TwitterAgent.sources.Twitter.keywords = manchester united, man united, man utd, man u
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:50070/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
Edit 2
I've tried increasing the memory to 8GB which still doesn't help. I am assuming I am placing too many tweets in Hadoop at once and need to write them to disk and release the space again (or something to that effect). Is there a guide anywhere on how to do this?
Set the JAVA_OPTS value in flume-env.sh and start the Flume agent.
It appears the problem had to do with the batchSize and transactionCapacity: in the original configuration the HDFS sink's batchSize (1000) was larger than the memory channel's transactionCapacity (100), and a sink batch has to fit within a single channel transaction. I changed them to the following:
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
This works without me even needing to change the JAVA_OPTS value.
