warn message using hazelcast 3.3 during mapstore loadAll - java

I am using Hazelcast v3.3. The Hazelcast server runs a few map store implementations. I have sporadically seen the following warning in the logs and need to understand what might be causing this and whether it could lead to data loss in hazelcast. I have a pretty small cluster at the moment for testing (4VMs running Ubuntu13 - 2GB RAM each).
2014-09-14 18:55:21,886 WARN c.h.s.i.BasicInvocation [main] [xxx.xxx.xxx.xxx]:5701 [testApp] [3.3] While asking 'is-executing': BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:
impl:mapService', op=com.hazelcast.spi.impl.PartitionIteratingOperation#285583d4, partitionId=-1, replicaIndex=0, tryCount=10, tryPauseMillis=300, invokeCount=1, callTimeout=60000, target=Address[xxx.xxx.xxx.xxx]:5701}, done=false} java.util.concurrent.TimeoutException: Call BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.spi.impl.IsStillExecutingOperation#5b202ff, partit
ionId=-1, replicaIndex=0, tryCount=0, tryPauseMillis=0, invokeCount=1, callTimeout=5000, target=Address[xxx.xxx.xxx.xxx]:5701} encountered a timeout
at com.hazelcast.spi.impl.BasicInvocationFuture.resolveApplicationResponse(BasicInvocationFuture.java:321)
at com.hazelcast.spi.impl.BasicInvocationFuture.resolveApplicationResponseOrThrowException(BasicInvocationFuture.java:289)
at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:181)
at com.hazelcast.spi.impl.BasicInvocationFuture.isOperationExecuting(BasicInvocationFuture.java:390)
at com.hazelcast.spi.impl.BasicInvocationFuture.waitForResponse(BasicInvocationFuture.java:228)
at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:180)
at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:160)
at com.hazelcast.spi.impl.BasicOperationService$InvokeOnPartitions.awaitCompletion(BasicOperationService.java:489)
at com.hazelcast.spi.impl.BasicOperationService$InvokeOnPartitions.invoke(BasicOperationService.java:458)
at com.hazelcast.spi.impl.BasicOperationService$InvokeOnPartitions.access$600(BasicOperationService.java:430)
at com.hazelcast.spi.impl.BasicOperationService.invokeOnAllPartitions(BasicOperationService.java:293)
at com.hazelcast.map.proxy.MapProxySupport.size(MapProxySupport.java:616)
at com.hazelcast.map.proxy.MapProxyImpl.size(MapProxyImpl.java:72)
at com.akkadian.wildmetrix.StartHcastServer.main(StartHcastServer.java:105)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)

Related

Pentaho Data Integration - General Error in Dialog

When I try to edit my database connections, I get the "General Error in Dialog" message. Upon clicking details, I get the following:
java.lang.NullPointerException at
org.pentaho.di.ui.core.database.dialog.XulDatabaseDialog.open(XulDatabaseDialog.java:113)
at
org.pentaho.di.ui.core.database.dialog.DatabaseDialog.open(DatabaseDialog.java:61)
at
org.pentaho.di.ui.spoon.delegates.SpoonDBDelegate.editConnection(SpoonDBDelegate.java:96)
at org.pentaho.di.ui.spoon.Spoon.editConnection(Spoon.java:2725) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
org.pentaho.ui.xul.impl.AbstractXulDomContainer.invoke(AbstractXulDomContainer.java:313)
at
org.pentaho.ui.xul.impl.AbstractXulComponent.invoke(AbstractXulComponent.java:157)
at
org.pentaho.ui.xul.impl.AbstractXulComponent.invoke(AbstractXulComponent.java:141)
at
org.pentaho.ui.xul.jface.tags.JfaceMenuitem.access$100(JfaceMenuitem.java:43)
at
org.pentaho.ui.xul.jface.tags.JfaceMenuitem$1.run(JfaceMenuitem.java:106)
at org.eclipse.jface.action.Action.runWithEvent(Action.java:498) at
org.eclipse.jface.action.ActionContributionItem.handleWidgetSelection(ActionContributionItem.java:545)
at
org.eclipse.jface.action.ActionContributionItem.access$2(ActionContributionItem.java:490)
at
org.eclipse.jface.action.ActionContributionItem$5.handleEvent(ActionContributionItem.java:402)
at org.eclipse.swt.widgets.EventTable.sendEvent(EventTable.java:89)
at org.eclipse.swt.widgets.Display.sendEvent(Display.java:4385) at
org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1512) at
org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1535) at
org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1520) at
org.eclipse.swt.widgets.Widget.notifyListeners(Widget.java:1324) at
org.eclipse.swt.widgets.Display.runDeferredEvents(Display.java:4172)
at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3789)
at org.pentaho.di.ui.spoon.Spoon.readAndDispatch(Spoon.java:1385) at
org.pentaho.di.ui.spoon.Spoon.waitForDispose(Spoon.java:7968) at
org.pentaho.di.ui.spoon.Spoon.start(Spoon.java:9350) at
org.pentaho.di.ui.spoon.Spoon.main(Spoon.java:711) at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
org.pentaho.commons.launcher.Launcher.main(Launcher.java:92)
I'm using Java version 8 update 261. I know that PDI requires Java 8 so I don't understand why I'm having this issue. Currently, I have to delete and then create a new database connection for each job, which is really tedious.
How do I resolve this issue? Or is there another way to set my database connections so that I don't have to create a new one for each job?
I don't know about your error message, but to share connections IN YOUR PC you have the option in the View tab:
I have an empty transformation, so all the DB connections shown are the shared ones, if I had a new connection in the transformation it would be also shown but the connection name wouldn't be in bold letters.
Beware, the shared connections information are only available in your PC, and the connection information is copied to the transformation or job, so if you change the information in your shared connection, you need to reopen all the transformations and jobs that use it and save them with the new information.
In the .kettle folder in your user folder there's a shared.xml file with the shared connections information.

Spark possible race condition in driver

I have a Spark job that processes several folders on S3 per run and stores its state on DynamoDB. In other words, we're running the job once per day, it looks for new folders added by another job, transforms them one-by-one and writes state to DynamoDB. Here's rough pseudocode:
object App {
val allFolders = S3Folders.list()
val foldersToProcess = DynamoDBState.getFoldersToProcess(allFolders)
Transformer.run(foldersToProcess)
}
object Transformer {
def run(folders: List[String]): Unit = {
val sc = new SparkContext()
folders.foreach(process(sc, _))
}
def process(sc: SparkContext, folder: String): Unit = ??? // transform and write to S3
}
This approach works well if S3Folders.list() returns relatively small amount of folders (up to few thousands), if it returns more (4-8K) very often we see following error (that in first glance has nothing to do with Spark):
17/10/31 08:38:20 ERROR ApplicationMaster: User class threw exception: shadeaws.SdkClientException: Failed to sanitize XML document destined for handler class shadeaws.services.s3.model.transform.XmlResponses
SaxParser$ListObjectsV2Handler
shadeaws.SdkClientException: Failed to sanitize XML document destined for handler class shadeaws.services.s3.model.transform.XmlResponsesSaxParser$ListObjectsV2Handler
at shadeaws.services.s3.model.transform.XmlResponsesSaxParser.sanitizeXmlDocument(XmlResponsesSaxParser.java:214)
at shadeaws.services.s3.model.transform.XmlResponsesSaxParser.parseListObjectsV2Response(XmlResponsesSaxParser.java:315)
at shadeaws.services.s3.model.transform.Unmarshallers$ListObjectsV2Unmarshaller.unmarshall(Unmarshallers.java:88)
at shadeaws.services.s3.model.transform.Unmarshallers$ListObjectsV2Unmarshaller.unmarshall(Unmarshallers.java:77)
at shadeaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
at shadeaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
at shadeaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:70)
at shadeaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1553)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1271)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1055)
at shadeaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
at shadeaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at shadeaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at shadeaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at shadeaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at shadeaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4247)
at shadeaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4194)
at shadeaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4188)
at shadeaws.services.s3.AmazonS3Client.listObjectsV2(AmazonS3Client.java:865)
at me.chuwy.transform.S3Folders$.com$chuwy$transform$S3Folders$$isGlacierified(S3Folders.scala:136)
at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:267)
at scala.collection.AbstractTraversable.filterNot(Traversable.scala:104)
at me.chuwy.transform.S3Folders$.list(S3Folders.scala:112)
at me.chuwy.transform.Main$.main(Main.scala:22)
at me.chuwy.transform.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
Caused by: shadeaws.AbortedException:
at shadeaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:53)
at shadeaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:81)
at shadeaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.read1(BufferedReader.java:210)
at java.io.BufferedReader.read(BufferedReader.java:286)
at java.io.Reader.read(Reader.java:140)
at shadeaws.services.s3.model.transform.XmlResponsesSaxParser.sanitizeXmlDocument(XmlResponsesSaxParser.java:186)
... 36 more
For big amount of folders (~20K) this happens all the time and job cannot start.
Previously we had very similar, but much more frequent error when getFoldersToProcess did GetItem for every folder from allFolders and therefore took much longer:
17/09/30 14:46:07 ERROR ApplicationMaster: User class threw exception: shadeaws.AbortedException:
shadeaws.AbortedException:
at shadeaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:51)
at shadeaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
at shadeaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.ensureLoaded(ByteSourceJsonBootstrapper.java:489)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.detectEncoding(ByteSourceJsonBootstrapper.java:126)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.constructParser(ByteSourceJsonBootstrapper.java:215)
at com.fasterxml.jackson.core.JsonFactory._createParser(JsonFactory.java:1240)
at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:802)
at shadeaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:109)
at shadeaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:43)
at shadeaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:70)
at shadeaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1503)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1226)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1030)
at shadeaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:742)
at shadeaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:716)
at shadeaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at shadeaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at shadeaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at shadeaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at shadeaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:2089)
at shadeaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:2065)
at shadeaws.services.dynamodbv2.AmazonDynamoDBClient.executeGetItem(AmazonDynamoDBClient.java:1173)
at shadeaws.services.dynamodbv2.AmazonDynamoDBClient.getItem(AmazonDynamoDBClient.java:1149)
at me.chuwy.tranform.sdk.Manifest$.contains(Manifest.scala:179)
at me.chuwy.tranform.DynamoDBState$$anonfun$getUnprocessed$1.apply(ProcessManifest.scala:44)
at scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filterNot(TraversableLike.scala:267)
at scala.collection.AbstractTraversable.filterNot(Traversable.scala:104)
at me.chuwy.transform.DynamoDBState$.getFoldersToProcess(DynamoDBState.scala:44)
at me.chuwy.transform.Main$.main(Main.scala:19)
at me.chuwy.transform.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
I believe that current error has nothing to do with XML parsing or invalid response, but originate from some race condition inside Spark, because:
There's clear connection between amount of time "state-fetching" takes and chance of failure
Tracebacks have underlying AbortedException, which AFAIK caused by swallowed InterruptedException, which can mean something inside JVM (spark-submit or even YARN) calls Thread.sleep for main thread.
Right now I'm using EMR AMI 5.5.0, Spark 2.1.0 and shaded AWS SDK 1.11.208, but had similar error with AWS SDK 1.10.75.
I'm deploying this job on EMR via command-runner.jar spark-submit --deploy-mode cluster --class ....
Does anyone have any idea where does this exception originate from and how to fix it?
foreach does not guarantee orderly computations and it applies the operation(s) to each element of an RDD, meaning that it will instantiate for every element which, in turn, may overwhelm the executor.
The problem was that getFoldersToProcess is a blocking (and very long) operation, which prevents SparkContext from being instantiated. SpackContext itself should signal about own instantiation to YARN and if it doesn't help in a certain amount of time - YARN assumes that driver node has fallen off and kills the whole cluster.

BeanManager method getReference() is not available during application initialization

I am trying to deploy a war file into Wildfly 10.0.0.Final, the deployment fails with the following error :
"{\"WFLYCTL0080: Failed services\" => {\"jboss.deployment.unit.\\\"management-app.war\\\".WeldStartService\" => \"org.jboss.msc.service.StartException in service jboss.deployment.unit.\\\"management-app.war\\\".WeldStartService: Failed to start service
Caused by: org.jboss.weld.exceptions.DefinitionException: Exception List with 1 exceptions:
Exception 0 :
org.jboss.weld.exceptions.IllegalStateException: WELD-001332: BeanManager method getReference() is not available during application initialization
at org.jboss.weld.bean.builtin.BeanManagerProxy.checkContainerState(BeanManagerProxy.java:242)
at org.jboss.weld.bean.builtin.BeanManagerProxy.getReference(BeanManagerProxy.java:84)
at morpho.mcp.adapter.cdi.AdapterExtension.getAdapterDefinition(AdapterExtension.java:159)
at morpho.mcp.adapter.cdi.AdapterExtension.afterBeanDiscovery(AdapterExtension.java:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.jboss.weld.injection.StaticMethodInjectionPoint.invoke(StaticMethodInjectionPoint.java:88)
at org.jboss.weld.injection.MethodInvocationStrategy$SpecialParamPlusBeanManagerStrategy.invoke(MethodInvocationStrategy.java:144)
at org.jboss.weld.event.ObserverMethodImpl.sendEvent(ObserverMethodImpl.java:309)
at org.jboss.weld.event.ExtensionObserverMethodImpl.sendEvent(ExtensionObserverMethodImpl.java:124)
at org.jboss.weld.event.ObserverMethodImpl.sendEvent(ObserverMethodImpl.java:287)
at org.jboss.weld.event.ObserverMethodImpl.notify(ObserverMethodImpl.java:265)
at org.jboss.weld.event.ObserverNotifier.notifySyncObservers(ObserverNotifier.java:271)
at org.jboss.weld.event.ObserverNotifier.notify(ObserverNotifier.java:260)
at org.jboss.weld.event.ObserverNotifier.fireEvent(ObserverNotifier.java:154)
at org.jboss.weld.event.ObserverNotifier.fireEvent(ObserverNotifier.java:148)
at org.jboss.weld.bootstrap.events.AbstractContainerEvent.fire(AbstractContainerEvent.java:53)
at org.jboss.weld.bootstrap.events.AbstractDefinitionContainerEvent.fire(AbstractDefinitionContainerEvent.java:42)
at org.jboss.weld.bootstrap.events.AfterBeanDiscoveryImpl.fire(AfterBeanDiscoveryImpl.java:61)
at org.jboss.weld.bootstrap.WeldStartup.deployBeans(WeldStartup.java:423)
at org.jboss.weld.bootstrap.WeldBootstrap.deployBeans(WeldBootstrap.java:83)
at org.jboss.as.weld.WeldStartService.start(WeldStartService.java:95)
at org.jboss.msc.service.ServiceControllerImpl$StartTask.startService(ServiceControllerImpl.java:1948)
at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1881)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
\"}}"
Assuming that the war was working fine for JBoss AS7. What could be the possible root cause for such behavior ?
It seems that a WAR which is not a bean archive is missing the org.jboss.as.weld.WeldStartService dependency. But how can this issue be resolved ?
But how can this issue be resolved ?
Do not use BeanManager.getReference before AfterDeploymentValidation. Refactor your code accordingly. The reason is - getReference might give you incomplete results. At that point, the bootstrap is not yet done and getReference method might fail to retrieve a reference as there might not be one at a given time (although it would be there later on).
As for how to workaround this:
Use Weld non-portable mode although that might have more undesirable effects. You should firstly try to refactor the code before falling back to this configuration.

Hadoop Container Cleanup Timeout on lost node

I am Working on Multinode Cluster with four slaves node named as slave01,slave02,slave03 and slave04 and one master node as master
when i remove out the network cable during map task hadoop wait for status update for 100 seconds (due to property of whose value is 100000)
after that i can see that maptask get failed and hadoop start container cleanup which takes more than 10 minutes and it also doesn't schedule failed task anywhere.the i get error of no Route to host exception from application master to lost node.After which task get schedule on another node.
i want to reduce the time for trying container cleanup so that task can be schedule just after timeout of maptask on any node.
please help me that how can i do that by setting configuration.
I am attaching application master log in which i have remove slave01 during map task,in this case no of reduce task running is 1.
AttemptID:attempt_1463201584280_0004_m_000002_0 Timed out after 100 secs Container released on a lost node cleanup failed for container container_1463201584280_0004_01_000004 : java.net.NoRouteToHostException: No Route to Host from slave02/172.31.132.107 to slave01:58838 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost at sun.reflect.GeneratedConstructorAccessor51.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:757) at org.apache.hadoop.ipc.Client.call(Client.java:1473) at org.apache.hadoop.ipc.Client.call(Client.java:1400) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy37.stopContainers(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:110) at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy38.stopContainers(Unknown Source) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:206) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:373) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522) at org.apache.hadoop.ipc.Client.call(Client.java:1439) ... 15 more
This happened because of an bug in hadoop 2.6.3 in which connection retry in done at two levels ipc and yarn try to use 2.6.4 or download patch it will get resolved.

Spark Closure cleaning and serialization OOM

I have been stuck on this problem for the last few days:
I am attempting to run random forest from MLLIB, it gets through most of it, but breaks when doing a mapPartition operation. The following stack trace is shown:
: An error occurred while calling o94.trainRandomForestModel.
: java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2021)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
at org.apache.spark.mllib.tree.DecisionTree$.findBestSplits(DecisionTree.scala:625)
at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:235)
at org.apache.spark.mllib.tree.RandomForest$.trainClassifier(RandomForest.scala:291)
at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainRandomForestModel(PythonMLLibAPI.scala:742)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
It seems to me that it's trying to serialize the mapPartitions closure, but runs out of space doing so. However I don't understand how it could run out of space when I gave the driver ~190GB for a file that's 45MB.
I have a cluster setup on AWS such that my master is a r3.8xlarge along with two r3.4xlarge workers. I have the following configurations:
spark version: 1.5.0
-----------------------------------
spark.executor.memory 32000m
spark.driver.memory 230000m
spark.driver.cores 10
spark.executor.cores 5
spark.executor.instances 17
spark.driver.maxResultSize 0
spark.storage.safetyFraction 1
spark.storage.memoryFraction 0.9
spark.storage.shuffleFraction 0.05
spark.default.parallelism 128
The master machine has approximately 240 GB of ram and each worker has about 120GB of ram.
I load in a relatively tiny RDD of MLLIB LabeledPoint objects, with each holding sparse vectors inside. This RDD has a total size of roughly 45MB. My sparse vector has a total length of ~15 million while only about 3000 or so are non-zeros.

Categories

Resources