I have been stuck on this problem for the last few days:
I am attempting to train a random forest with MLlib. It gets through most of the training, but breaks during a mapPartitions operation with the following stack trace:
: An error occurred while calling o94.trainRandomForestModel.
: java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2021)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
at org.apache.spark.mllib.tree.DecisionTree$.findBestSplits(DecisionTree.scala:625)
at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:235)
at org.apache.spark.mllib.tree.RandomForest$.trainClassifier(RandomForest.scala:291)
at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainRandomForestModel(PythonMLLibAPI.scala:742)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
It seems to me that it's trying to serialize the mapPartitions closure but runs out of space doing so. However, I don't understand how it could run out of space when I gave the driver ~190 GB for a file that's 45 MB.
I have a cluster set up on AWS with an r3.8xlarge master and two r3.4xlarge workers. I have the following configuration:
spark version: 1.5.0
-----------------------------------
spark.executor.memory 32000m
spark.driver.memory 230000m
spark.driver.cores 10
spark.executor.cores 5
spark.executor.instances 17
spark.driver.maxResultSize 0
spark.storage.safetyFraction 1
spark.storage.memoryFraction 0.9
spark.storage.shuffleFraction 0.05
spark.default.parallelism 128
The master machine has approximately 240 GB of RAM and each worker has about 120 GB of RAM.
I load a relatively tiny RDD of MLlib LabeledPoint objects, each holding a sparse vector. The RDD has a total size of roughly 45 MB. Each sparse vector has a length of ~15 million, with only about 3,000 non-zero entries.
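For context, the training call looks roughly like the sketch below. The vector dimensions match what I described above, but the dummy rows and the specific training parameters are only illustrative placeholders, not the exact ones from my job:

from pyspark import SparkContext
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

sc = SparkContext(appName="rf-oom-repro")

# Each row is a LabeledPoint wrapping a ~15-million-dimensional sparse
# vector with only a few thousand non-zero entries.
data = sc.parallelize([
    LabeledPoint(1.0, SparseVector(15000000, [0, 7, 42], [1.0, 0.5, 2.0])),
    LabeledPoint(0.0, SparseVector(15000000, [3, 19, 101], [1.0, 1.5, 0.25])),
])

# The failure happens inside trainClassifier, in DecisionTree.findBestSplits.
model = RandomForest.trainClassifier(
    data,
    numClasses=2,
    categoricalFeaturesInfo={},
    numTrees=100,
    featureSubsetStrategy="auto",
    impurity="gini",
    maxDepth=10,
    maxBins=32,
)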
Related
I have a problem with my cluster.
The cluster has:
2 primary workers
2 secondary workers
30 GB of RAM
The cluster runs correctly and launches the Hive jobs for at least about 10 hours. After roughly 10 hours I get a Java heap space error:
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236) ~[?:1.8.0_292]
at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191) ~[?:1.8.0_292]
at org.apache.hadoop.ipc.ResponseBuffer.toByteArray(ResponseBuffer.java:53) ~[hadoop-common-3.2.2.jar:?]
at org.apache.hadoop.ipc.Client$Connection$3.run(Client.java:1159) ~[hadoop-common-3.2.2.jar:?]
... 5 more
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask
INFO : Completed executing command(queryId=hive_20210923102707_66b4cd11-7cfb-4910-87bc-7f062ce1b00e); Time taken: 75.101 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask (state=08S01,code=1)
I tried setting the following configuration, but it didn't help:
SET hive.execution.engine = tez;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET mapreduce.job.reduces=1;
SET hive.auto.convert.join=false;
set hive.stats.column.autogather=false;
set hive.optimize.sort.dynamic.partition=true;
Is there any way to clear the Java heap space, or do I have some configuration wrong?
For now, the problem is only resolved by restarting the cluster.
It seems that the default Tez container and heap sizes set by Dataproc are too small for your job. You can update the following Hive properties to increase them:
hive.tez.container.size: the YARN container size in MB for Tez. If set to "-1" (the default), it picks up the value of mapreduce.map.memory.mb. Consider increasing it if the query / Tez app fails with something like "Container is running beyond physical memory limits. Current usage: 4.1 GB of 4 GB physical memory used; 6.0 GB of 20 GB virtual memory used. Killing container.". Example: SET hive.tez.container.size=8192 in Hive, or --properties hive:hive.tez.container.size=8192 when creating the cluster.
hive.tez.java.opts: the JVM options for the Tez YARN application. If not set, it picks up the value of mapreduce.map.java.opts. The heap size it specifies should be less than or equal to the container size. Consider increasing the JVM heap size if the query / Tez app fails with an OOM exception. Example: SET hive.tez.java.opts=-Xmx8g in Hive, or --properties hive:hive.tez.java.opts=-Xmx8g when creating the cluster.
You can check /etc/hadoop/conf/mapred-site.xml to get the value of mapreduce.map.java.opts, and /etc/hive/conf/hive-site.xml for the 2 Hive properties mentioned above.
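For example, to apply both settings for a Hive session (using the values from the examples above; tune them to your nodes and workload, keeping the -Xmx heap no larger than the container size):
SET hive.tez.container.size=8192;
SET hive.tez.java.opts=-Xmx8g;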
I am running a YARN job on a CDH 5.3 cluster with the default configuration:
No of nodes=3
yarn.nodemanager.resource.cpu-vcores=8
yarn.nodemanager.resource.memory-mb=10GB
mapreduce.[map/reduce].cpu.vcores=1
mapreduce.[map/reduce].memory.mb=1GB
mapreduce.[map | reduce].java.opts.max.heap=756MB
While doing a run on 4.5 GB of CSV data spread over 11 files, I get the following error:
2015-10-12 05:21:04,507 FATAL [IPC Server handler 18 on 50388] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1444634391081_0005_r_000000_0 - exited : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#9
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:303)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:293)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:511)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:329)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
Then I tuned mapreduce.reduce.memory.mb from 1 GB to 3 GB and the job ran fine; the change is sketched below.
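In mapred-site.xml terms, the change was roughly the following (3 GB expressed in MB; this is a sketch of the one property I changed, not a copy of my exact config):
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>3072</value>
</property>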
So how do I decide the maximum amount of data a single reducer can handle, assuming all of the mapper output has to be processed by just one reducer?
Generally there is no limit on the amount of data that can be processed by a single reducer. Having too little memory can slow the job down, but it should not prevent the data from being processed or cause it to fail. I believe that once the reducer has a reasonable minimum of memory allocated, processing the data should not be an issue. Could you please share a code snippet so we can check for memory leak issues?
We used to process files of 6+ GB in a single reducer without any issues, so I believe you may have a memory leak.
I am using Hazelcast v3.3. The Hazelcast server runs a few map store implementations. I have sporadically seen the following warning in the logs and need to understand what might be causing it, and whether it could lead to data loss in Hazelcast. I have a pretty small cluster at the moment for testing (4 VMs running Ubuntu 13, with 2 GB of RAM each).
2014-09-14 18:55:21,886 WARN c.h.s.i.BasicInvocation [main] [xxx.xxx.xxx.xxx]:5701 [testApp] [3.3] While asking 'is-executing': BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:
impl:mapService', op=com.hazelcast.spi.impl.PartitionIteratingOperation#285583d4, partitionId=-1, replicaIndex=0, tryCount=10, tryPauseMillis=300, invokeCount=1, callTimeout=60000, target=Address[xxx.xxx.xxx.xxx]:5701}, done=false} java.util.concurrent.TimeoutException: Call BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.spi.impl.IsStillExecutingOperation#5b202ff, partit
ionId=-1, replicaIndex=0, tryCount=0, tryPauseMillis=0, invokeCount=1, callTimeout=5000, target=Address[xxx.xxx.xxx.xxx]:5701} encountered a timeout
at com.hazelcast.spi.impl.BasicInvocationFuture.resolveApplicationResponse(BasicInvocationFuture.java:321)
at com.hazelcast.spi.impl.BasicInvocationFuture.resolveApplicationResponseOrThrowException(BasicInvocationFuture.java:289)
at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:181)
at com.hazelcast.spi.impl.BasicInvocationFuture.isOperationExecuting(BasicInvocationFuture.java:390)
at com.hazelcast.spi.impl.BasicInvocationFuture.waitForResponse(BasicInvocationFuture.java:228)
at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:180)
at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:160)
at com.hazelcast.spi.impl.BasicOperationService$InvokeOnPartitions.awaitCompletion(BasicOperationService.java:489)
at com.hazelcast.spi.impl.BasicOperationService$InvokeOnPartitions.invoke(BasicOperationService.java:458)
at com.hazelcast.spi.impl.BasicOperationService$InvokeOnPartitions.access$600(BasicOperationService.java:430)
at com.hazelcast.spi.impl.BasicOperationService.invokeOnAllPartitions(BasicOperationService.java:293)
at com.hazelcast.map.proxy.MapProxySupport.size(MapProxySupport.java:616)
at com.hazelcast.map.proxy.MapProxyImpl.size(MapProxyImpl.java:72)
at com.akkadian.wildmetrix.StartHcastServer.main(StartHcastServer.java:105)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)
I am running a single-node Hadoop environment. I have a MapReduce job that calculates averages of some monitored information for specific time periods, say hourly averages. The job writes its output to a path in HDFS, which is cleaned up each time before the job runs. It had been working fine for a month. Yesterday, while running the job, I got an exception from the JobClient saying:
File /user/root/out1/_temporary/_attempt_201401141113_0007_r_000000_0/hi/130-r-00000 could only be replicated to 0 nodes, instead of 1
The full stack trace is as follows:
..........
14/01/17 12:00:09 INFO mapred.JobClient: map 100% reduce 32%
14/01/17 12:00:12 INFO mapred.JobClient: map 100% reduce 74%
14/01/17 12:00:17 INFO mapred.JobClient: Task Id : attempt_201401141113_0007_r_000000_0, Status : FAILED
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/root/out1/_temporary/_attempt_201401141113_0007_r_000000_0/hi/130-r-00000 could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1558)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
at org.apache.hadoop.ipc.Client.call(Client.java:1070)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at $Proxy2.addBlock(Unknown Source)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy2.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3510)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3373)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2600(DFSClient.java:2589)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2829)
An initial Google search suggests a storage space issue, but I don't think that's it, because my whole input data should be less than 600 MB and there is around 1.5 GB of free space available on the node. I ran the hadoop dfsadmin -report command and it returned the following:
$hadoop dfsadmin -report
Configured Capacity: 11353194496 (10.57 GB)
Present Capacity: 2354425856 (2.19 GB)
DFS Remaining: 1633726464 (1.52 GB)
DFS Used: 720699392 (687.31 MB)
DFS Used%: 30.61%
Under replicated blocks: 49
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)
Name: 192.168.1.149:50010
Decommission Status : Normal
Configured Capacity: 11353194496 (10.57 GB)
DFS Used: 720699392 (687.31 MB)
Non DFS Used: 8998768640 (8.38 GB)
DFS Remaining: 1633726464(1.52 GB)
DFS Used%: 6.35%
DFS Remaining%: 14.39%
Last contact: Fri Jan 17 04:36:55 GMT+05:30 2014
Please give me a solution. Could this be a configuration issue? I don't know much about Hadoop configuration. Please help.
I think that your problem may actually be a space problem. With replication, your 600 MB of input can take around 1.2 GB on your cluster.
That leaves you only about 300 MB free, which is probably not enough to move the data around.
My advice is to use a smaller dataset, around 300 MB or less, to check whether this is the problem. If that doesn't solve it, try lowering the replication factor to 1 in conf/hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
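Also note that dfs.replication only affects files written after the change; to reclaim space from files already in HDFS, you would have to lower their replication explicitly, for example with hadoop fs -setrep -R 1 /user/root (the path here is just an example).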
I encountered the following problem when starting a Hama BSP job. The exception occurs when Hama tries to load and partition the input data, before it actually runs my own code. This is a known problem discussed on a few websites, but unfortunately without a known cause (e.g., see here).
My BSP job works perfectly fine when I run only part of the data set. However, when I run the full data set, the problem occurs :(
How can I resolve or avoid this problem?
13/11/18 01:19:30 INFO bsp.FileInputFormat: Total input paths to process : 32
13/11/18 01:19:30 INFO bsp.FileInputFormat: Total input paths to process : 32
13/11/18 01:19:30 INFO bsp.BSPJobClient: Running job: job_201311180115_0002
13/11/18 01:19:33 INFO bsp.BSPJobClient: Current supersteps number: 0
13/11/18 01:19:33 INFO bsp.BSPJobClient: Job failed.
13/11/18 01:19:33 ERROR bsp.BSPJobClient: Error partitioning the input path.
java.io.IOException: Runtime partition failed for the job.
at org.apache.hama.bsp.BSPJobClient.partition(BSPJobClient.java:465)
at org.apache.hama.bsp.BSPJobClient.submitJobInternal(BSPJobClient.java:333)
at org.apache.hama.bsp.BSPJobClient.submitJob(BSPJobClient.java:293)
at org.apache.hama.bsp.BSPJob.submit(BSPJob.java:228)
at org.apache.hama.bsp.BSPJob.waitForCompletion(BSPJob.java:235)
at edu.wisc.cs.db.opener.hama.ConnectedEntityBspDriver.main(ConnectedEntityBspDriver.java:183)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hama.util.RunJar.main(RunJar.java:146)
After being stuck on this problem for several hours, I found that this error occurs whenever the number of input files is greater than the number of allowed BSP tasks. I think it is probably a bug that Hama should fix in the future.
A quick fix is to increase the maximum number of BSP tasks, specified by the bsp.tasks.maximum property in hama-site.xml. For example, the following sets it to 10 instead of the default of 3:
<property>
<name>bsp.tasks.maximum</name>
<value>10</value>
<description>The maximum number of BSP tasks that will be run simultaneously
by a groom server.</description>
</property>
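Note that bsp.tasks.maximum is read when the Hama daemons start, so you will most likely need to restart the groom servers after editing hama-site.xml for the new value to take effect.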