Why does a small data size cause a "GC overhead limit exceeded" exception? - java

I use Spark to do some calculations.
Basically I do two things:
New files come into a folder periodically.
I turn each new file into a data frame and then union it into the previous data frame.
(You may ask why I read them in a loop. I do it for two reasons:
1. The files do not arrive all at once; they come in periodically, so I cannot read them in one go.
2. Spark Streaming could do this, but I do not want to use it, because Streaming would require setting up a long window, which is not convenient for debugging and testing.)
The code is like below:
# Get the file list in the HDFS directory
from hdfs import InsecureClient        # HDFS Python client
from pyspark.sql import Row
from pyspark.sql import functions as func
# sc and sqlContext come from the PySpark shell / application setup

client = InsecureClient('http://10.79.148.184:50070')
file_list = client.list('/test')

df_total = None
counter = 0
for file in file_list:
    counter += 1
    # Turn each file (CSV format) into a data frame
    lines = sc.textFile("/test/%s" % file)
    parts = lines.map(lambda l: l.split(","))
    rows = parts.map(lambda p: Row(router=p[0], interface=int(p[1]),
                                   protocol=p[7], bit=int(p[10])))
    df = sqlContext.createDataFrame(rows)
    # Do some transformation on the data frame
    df_protocol = df.groupBy(['protocol']).agg(func.sum('bit').alias('bit'))
    # Union the current data frame into the accumulated data frame
    if df_total is None:
        df_total = df_protocol
    else:
        df_total = df_total.unionAll(df_protocol)
    # Cache df_total
    df_total.cache()
    if counter % 5 == 0:
        df_total.rdd.checkpoint()
    # Show the df_total contents
    df_total.show()
I know that as time goes on, df_total could get big. But actually, before that point is reached, the above code already raises an exception.
After about 30 iterations of the loop, the code throws a "GC overhead limit exceeded" exception. The files are very small, so even after 300 iterations the total data size would only be a few MB. I do not know why it throws the GC error.
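A sketch of one possible workaround, assuming the ever-growing lineage and cached blocks are the cause: materialize the checkpoint and rebuild df_total from the checkpointed RDD, in place of the `if counter % 5 == 0:` block above (this assumes sc.setCheckpointDir() has already been called; the variable names are just illustrative):

    if counter % 5 == 0:
        old_total = df_total
        rdd = df_total.rdd
        rdd.checkpoint()
        rdd.count()  # checkpoint() is lazy; run an action so it actually happens
        # Rebuild the DataFrame from the now short-lineage checkpointed RDD
        df_total = sqlContext.createDataFrame(rdd, df_total.schema)
        old_total.unpersist()  # release the previously cached blocks
        df_total.cache()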
The exception detail is below:
Exception in thread "dispatcher-event-loop-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Integer.toString(Integer.java:331)
at java.lang.Integer.toString(Integer.java:739)
at java.lang.String.valueOf(String.java:2854)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
at org.apache.spark.storage.RDDBlockId.name(BlockId.scala:53)
at org.apache.spark.storage.BlockId.equals(BlockId.scala:46)
at java.util.HashMap.getEntry(HashMap.java:471)
at java.util.HashMap.get(HashMap.java:421)
at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$getLocations(BlockManagerMasterEndpoint.scala:371)
at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$org$apache$spark$storage$BlockManagerMasterEndpoint$$getLocationsMultipleBlockIds$1.apply(BlockManagerMasterEndpoint.scala:376)
at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$org$apache$spark$storage$BlockManagerMasterEndpoint$$getLocationsMultipleBlockIds$1.apply(BlockManagerMasterEndpoint.scala:376)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$getLocationsMultipleBlockIds(BlockManagerMasterEndpoint.scala:376)
at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:72)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
16/04/20 09:52:00 ERROR TaskSchedulerImpl: Lost executor 0 on ES01: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
16/04/20 09:52:12 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=4721950849479578179, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=47]}} to ES01/10.79.148.184:53059; closing connection
io.netty.handler.codec.EncoderException: java.lang.OutOfMemoryError: Java heap space
at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:107)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:691)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:626)
at io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:284)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.OutOfMemoryError: Java heap space
at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:602)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:204)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:256)
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:136)
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:127)
at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:77)
at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89)
... 13 more

Related

Dataproc Hive Job - Tez Java heap OOM

I have a problem with my cluster.
The cluster has:
2 primary workers
2 secondary workers
30 GB of RAM
The cluster runs correctly and launches the Hive jobs for at least about 10 hours.
After 10 hours I get a Java heap space error:
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236) ~[?:1.8.0_292]
at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191) ~[?:1.8.0_292]
at org.apache.hadoop.ipc.ResponseBuffer.toByteArray(ResponseBuffer.java:53) ~[hadoop-common-3.2.2.jar:?]
at org.apache.hadoop.ipc.Client$Connection$3.run(Client.java:1159) ~[hadoop-common-3.2.2.jar:?]
... 5 more
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
INFO : Completed executing command(queryId=hive_20210923102707_66b4cd11-7cfb-4910-87bc-7f062ce1b00e); Time taken: 75.101 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask (state=08S01,code=1)
I tried to set this configuration, but it didn't help.
SET hive.execution.engine=tez;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET mapreduce.job.reduces=1;
SET hive.auto.convert.join=false;
SET hive.stats.column.autogather=false;
SET hive.optimize.sort.dynamic.partition=true;
Is there any way to clean the Java heap space, or have I got some configuration wrong?
For now, the problem is solved by restarting the cluster.
It seems that the default Tez container and heap sizes set by Dataproc are too small for your job. You can update the following Hive properties to increase them:
hive.tez.container.size: The YARN container size in MB for Tez. If set to "-1" (default value), it picks the value of mapreduce.map.memory.mb. Consider increasing the value if the query / Tez app fails with something like "Container is running beyond physical memory limits. Current usage: 4.1 GB of 4 GB physical memory used; 6.0 GB of 20 GB virtual memory used. Killing container.". Example: SET hive.tez.container.size=8192 in Hive, or --properties hive:hive.tez.container.size=8192 when creating the cluster.
hive.tez.java.opts: The JVM options for the Tez YARN application. If not set, it picks up the value of mapreduce.map.java.opts. This value should be less than or equal to the container size. Consider increasing the JVM heap size if the query / Tez app fails with an OOM exception. Example: SET hive.tez.java.opts=-Xmx8g in Hive, or --properties hive:hive.tez.java.opts=-Xmx8g when creating the cluster.
You can check /etc/hadoop/conf/mapred-site.xml for the value of mapreduce.map.java.opts, and /etc/hive/conf/hive-site.xml for the two Hive properties mentioned above.

How to determine the maximum amount of data that can be handled by one run of an MR2 job?

I am running a YARN job on a CDH 5.3 cluster. I have the default configurations:
No of nodes=3
yarn.nodemanager.resource.cpu-vcores=8
yarn.nodemanager.resource.memory-mb=10GB
mapreduce.[map/reduce].cpu.vcores=1
mapreduce.[map/reduce].memory.mb=1GB
mapreduce.[map | reduce].java.opts.max.heap=756MB
While doing a run on 4.5 GB of CSV data spread over 11 files, I get the following error:
2015-10-12 05:21:04,507 FATAL [IPC Server handler 18 on 50388] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1444634391081_0005_r_000000_0 - exited : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#9
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:303)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:293)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:511)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:329)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
Then I changed mapreduce.reduce.memory.mb from 1 GB to 3 GB and the job ran fine.
So how do you decide how much data can be handled at most by one reducer, assuming that all the mapper output has to be processed by a single reducer?
Generally there is no hard limit on the amount of data that can be processed by a single reducer. A tight memory allocation can slow the process down, but it should not restrict or fail the processing of the data. After allocating a reasonable minimum of memory to the reducer, data volume by itself should not be an issue. Could you please share a code snippet so we can check for memory leak issues?
We used to process 6+ GB files in a single reducer without any issues, so I believe you might have a memory leak.
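That said, you can get a rough feel for how much map output fits in the reducer's memory during the shuffle. A back-of-the-envelope sketch in Python; the percentages below are assumptions taken from stock Hadoop 2 defaults (mapred-default.xml), so check your cluster's actual settings:

    heap_mb = 756                # mapreduce.[map|reduce].java.opts.max.heap from the question
    input_buffer_percent = 0.70  # assumed default of mapreduce.reduce.shuffle.input.buffer.percent
    memory_limit_percent = 0.25  # assumed default of mapreduce.reduce.shuffle.memory.limit.percent

    shuffle_budget_mb = heap_mb * input_buffer_percent                   # ~529 MB held in memory
    max_in_memory_segment_mb = shuffle_budget_mb * memory_limit_percent  # ~132 MB per map output

    print("In-memory shuffle budget: %.0f MB" % shuffle_budget_mb)
    print("Largest single map output kept in memory: %.0f MB" % max_in_memory_segment_mb)

Map outputs bigger than that limit are shuffled to disk instead of memory, and raising mapreduce.reduce.memory.mb (and with it the reducer heap) widens this budget, which matches what you observed.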

StorageLevel in Spark RDD: MEMORY_AND_DISK_2() throws an exception

Can anyone explain how the storage level of an RDD works?
I get a heap memory error when I use the persist method with storage level StorageLevel.MEMORY_AND_DISK_2().
However, my code works fine when I use the cache method.
According to the Spark documentation, cache persists an RDD with the default storage level (MEMORY_ONLY).
My code where I get the heap error:
JavaRDD<String> rawData = sparkContext
        .textFile(inputFile.getAbsolutePath())
        .setName("Input File")
        .persist(SparkToolConstant.rdd_stroage_level); // cache()
String[] headers = new String[0];
String headerStr = null;
if (headerPresent) {
    headerStr = rawData.first();
    headers = headerStr.split(delim);
    List<String> headersList = new ArrayList<String>();
    headersList.add(headerStr);
    JavaRDD<String> headerRDD = sparkContext.parallelize(headersList);
    JavaRDD<String> filteredRDD = rawData
            .subtract(headerRDD)
            .setName("Raw data without header")
            .persist(StorageLevel.MEMORY_AND_DISK_2());
    rawData = filteredRDD;
}
Stack trace
Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 10, localhost): java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:110)
at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1176)
at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1185)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:846)
at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:668)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:176)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:79)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
Spark version: 1.3.0
Seeing this go unanswered for so long, I am posting this for general information and for those like me whose searches lead here.
This type of question is hard to answer without more specifics about your application. In general, it does seem upside down that you would get a memory error when serializing to disk. I suggest you try Kryo serialization, and if you have a lot of extra memory somewhere, use Alluxio (the software formerly known as Tachyon :) for the "disk" serialization tier; this will speed things up.
More from the Spark docs on Tuning Data Storage, Serialized RDD Storage and (maybe helpful) GC Tuning:

"When your objects are still too large to efficiently store despite this tuning, a much simpler way to reduce memory usage is to store them in serialized form, using the serialized StorageLevels in the RDD persistence API, such as MEMORY_ONLY_SER. Spark will then store each RDD partition as one large byte array. The only downside of storing data in serialized form is slower access times, due to having to deserialize each object on the fly. We highly recommend using Kryo if you want to cache data in serialized form, as it leads to much smaller sizes than Java serialization (and certainly than raw Java objects)."
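For illustration, enabling Kryo together with a serialized storage level looks roughly like the PySpark sketch below (the input path is hypothetical; in the Java API you would set the same spark.serializer key on SparkConf and pass StorageLevel.MEMORY_ONLY_SER() to persist):

    from pyspark import SparkConf, SparkContext, StorageLevel

    conf = (SparkConf()
            .setAppName("serialized-cache-demo")
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
    sc = SparkContext(conf=conf)

    # Each partition is stored as one serialized byte array instead of raw
    # objects: slower to read back, but much smaller in memory.
    rdd = sc.textFile("/some/input/path")  # hypothetical path
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    print(rdd.count())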

How to resolve the MongoDB client "Timeout waiting for a pooled item after 120000 MILLISECONDS" exception

I have a Java class (not Java Spring or a server) which:
1) inserts documents into one collection,
2) reads documents from another collection,
3) inserts documents into another collection, and
4) deletes documents from another collection.
All four operations happen across three collections.
I get the following error.
Exception in thread "pool-1-thread-240" com.mongodb.MongoTimeoutException: Timeout waiting for a pooled item after 120000 MILLISECONDS
at com.mongodb.ConcurrentPool.get(ConcurrentPool.java:113)
at com.mongodb.PooledConnectionProvider.get(PooledConnectionProvider.java:75)
at com.mongodb.DefaultServer.getConnection(DefaultServer.java:73)
at com.mongodb.BaseCluster$WrappedServer.getConnection(BaseCluster.java:221)
at com.mongodb.DBTCPConnector$MyPort.getConnection(DBTCPConnector.java:508)
at com.mongodb.DBTCPConnector$MyPort.get(DBTCPConnector.java:456)
at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:414)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:176)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:159)
at com.mongodb.DBCollection.insert(DBCollection.java:93)
at com.mongodb.DBCollection.insert(DBCollection.java:78)
at com.mongodb.DBCollection.insert(DBCollection.java:120)
at MyProgram$MyClass.run(MyProgram.java:149)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
a) How can I fix it?
I am using mongod 2.6.3 on Mac OS.
b) Should I increase the MongoDB connection pool on my client side?
c) If yes, how should I do it?
d) What is the maximum value I can set it to?
I get this problem at the line in my Java code where I do the insert operation.
The default connection pool size is 100, and the maximum pool size is 20 for a single MongoDB instance.
To resolve your problem, take this article into consideration.
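If you do raise the pool size, it is a client-side option. For illustration, here is a sketch in Python with pymongo (the Java driver exposes the equivalent through MongoClientOptions.Builder.connectionsPerHost; the host, database, and collection names below are hypothetical):

    from pymongo import MongoClient

    # maxPoolSize is pymongo's counterpart of the Java driver's
    # connectionsPerHost; waitQueueTimeoutMS controls how long a thread
    # waits for a pooled connection before the timeout you are seeing.
    client = MongoClient(
        "mongodb://localhost:27017/",  # hypothetical host
        maxPoolSize=200,               # raise from the default if threads queue up
        waitQueueTimeoutMS=120000,
    )
    db = client["mydb"]                # hypothetical database name
    db["mycollection"].insert_one({"status": "ok"})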

Best approach to handle large volume of data returned from DB

One of the applications I work on generates reports from a large volume of data returned from the DB. The application receives an asynchronous AJAX request, executes a query based on the input parameters, stores the data in JSON format, generates an Excel file with the same contents, and then returns the JSON result to the browser to display the records.
At times, due to the high volume of data, the server crashes from high memory usage, and at times the server threads hang. The exception captured in the logs is below:
[3/20/14 21:55:07:051 IST] 0000001a ThreadMonitor W WSVR0605W: Thread "WebContainer : 6" (00000038) has been active for 761754 milliseconds and may be hung. There is/are 1 thread(s) in total in the server that may be hung.
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at com.ibm.db2.jcc.a.ab.b(ab.java:168)
at com.ibm.db2.jcc.a.ab.c(ab.java:222)
at com.ibm.db2.jcc.a.ab.c(ab.java:337)
at com.ibm.db2.jcc.a.ab.v(ab.java:1447)
at com.ibm.db2.jcc.a.db.a(db.java:42)
at com.ibm.db2.jcc.a.r.a(r.java:30)
at com.ibm.db2.jcc.a.sb.g(sb.java:152)
at com.ibm.db2.jcc.b.zc.n(zc.java:1186)
at com.ibm.db2.jcc.b.ad.db(ad.java:1761)
at com.ibm.db2.jcc.b.ad.d(ad.java:2203)
at com.ibm.db2.jcc.b.ad.U(ad.java:489)
at com.ibm.db2.jcc.b.ad.executeQuery(ad.java:472)
at com.ibm.ws.rsadapter.jdbc.WSJdbcPreparedStatement.pmiExecuteQuery(WSJdbcPreparedStatement.java:1102)
at com.ibm.ws.rsadapter.jdbc.WSJdbcPreparedStatement.executeQuery(WSJdbcPreparedStatement.java:723)
at com.abc(ABCDao.java:1250)
at com.abc(ABCFacade.java:159)
at com.abcservlet.doPost(ABCServlet.java:56)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:738)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:831)
Environment details are as follows:
1. Application Server - WebSphere Application Server 7.0.0.29
2. DB Server - DB2
3. JVM Memory - 512 MB to 2048 MB
I wanted to know the best approach to handle these kinds of scenarios. We use POI to generate the reports, and we use org.apache.poi.xssf.streaming.SXSSFWorkbook to keep less data on the server. But if the data exceeds 200,000 records, the server thread hangs.
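One common approach is to stream rows from the database cursor straight into a bounded-window spreadsheet writer instead of materializing the full result set (in JSON or otherwise). A sketch of the idea in Python; openpyxl's write-only mode plays the same role SXSSFWorkbook plays in POI, and the datasource, query, and file names are hypothetical:

    import sqlite3  # stand-in for the real DB2 datasource
    from openpyxl import Workbook

    conn = sqlite3.connect("reports.db")                        # hypothetical DB
    cursor = conn.cursor()
    cursor.execute("SELECT id, name, amount FROM report_rows")  # hypothetical query

    wb = Workbook(write_only=True)  # bounded-memory writer, like SXSSFWorkbook
    ws = wb.create_sheet("report")
    ws.append(["id", "name", "amount"])

    while True:
        batch = cursor.fetchmany(1000)  # fetch in bounded batches, never fetchall()
        if not batch:
            break
        for row in batch:
            ws.append(list(row))

    wb.save("report.xlsx")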
