Unable to kill YARN job - java

I have a simple Samza job, which I submit to our YARN cluster. The job allocates a single container and runs without any issues.
When trying to kill the job, however, both the AM and job containers are left running on the NM, even though the RM claims that the job was killed successfully:
$ yarn application -kill application_1461969364354_5761
Killing application application_1461969364354_5761
16/05/19 06:48:49 INFO impl.YarnClientImpl: Killed application application_1461969364354_5761
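For reference, the CLI kill above goes through the same RM call as this minimal YarnClient sketch (Hadoop 2.6 client API; the class name is illustrative and the application ID is reconstructed from the one above):

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class KillAppSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration()); // picks up yarn-site.xml from the classpath
        yarn.start();
        // application_1461969364354_5761 == (cluster timestamp, sequence number)
        ApplicationId appId = ApplicationId.newInstance(1461969364354L, 5761);
        // Asks the RM to kill the app; as the question shows, the NM-side
        // container cleanup is a separate step and can still get stuck.
        yarn.killApplication(appId);
        yarn.stop();
    }
}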
From the NM log, I'm able to see:
2016-05-19 06:48:50,051 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e34_1461969364354_5761_01_000002 transitioned from RUNNING to KILLING
2016-05-19 06:48:50,051 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_e34_1461969364354_5761_01_000002
2016-05-19 06:48:50,060 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1461969364354_5761 transitioned from RUNNING to FINISHING_CONTAINERS_WAIT
The status never transitions out of FINISHING_CONTAINERS_WAIT and I had to kill -9 the container process.
I'm using Samza version 0.10.0 and YARN version Hadoop 2.6.0-cdh5.4.9.
Any idea?
UPDATE:
After some digging, I'm able to see this:
2016-05-20 03:14:59,497 INFO org.apache.hadoop.io.retry.RetryInvocationHandler: Exception while invoking finishApplicationMaster of class ApplicationMasterProtocolPBClientImpl over rm157 after 2326 fail over attempts. Trying to fail over immediately.
org.apache.hadoop.security.token.SecretManager$InvalidToken: appattempt_1463512986427_0017_000001 not found in AMRMTokenSecretManager.
at sun.reflect.GeneratedConstructorAccessor13.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:94)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy16.finishApplicationMaster(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.unregisterApplicationMaster(AMRMClientImpl.java:378)
at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl.unregisterApplicationMaster(AMRMClientAsyncImpl.java:157)
at org.apache.samza.job.yarn.SamzaAppMasterLifecycle.onShutdown(SamzaAppMasterLifecycle.scala:63)
at org.apache.samza.job.yarn.SamzaAppMaster$$anonfun$run$5.apply(SamzaAppMaster.scala:133)
at org.apache.samza.job.yarn.SamzaAppMaster$$anonfun$run$5.apply(SamzaAppMaster.scala:132)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.samza.job.yarn.SamzaAppMaster$.run(SamzaAppMaster.scala:132)
at org.apache.samza.job.yarn.SamzaAppMaster$.main(SamzaAppMaster.scala:104)
at org.apache.samza.job.yarn.SamzaAppMaster.main(SamzaAppMaster.scala)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): appattempt_1463512986427_0017_000001 not found in AMRMTokenSecretManager.
at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy15.finishApplicationMaster(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:91)
... 15 more
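The failing frames show Samza's AM shutdown path (SamzaAppMasterLifecycle.onShutdown) trying to unregister with the RM after the attempt's AMRM token has already been invalidated, so every retry is rejected and the AM never finishes shutting down, which would match the NM staying in FINISHING_CONTAINERS_WAIT. For orientation, a minimal sketch of that register/unregister lifecycle with the plain AMRMClient API (not Samza's actual code; names and statuses are illustrative):

import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmShutdownSketch {
    public static void main(String[] args) throws Exception {
        // Only works inside an AM container launched by YARN, where the AMRM
        // token for the current attempt is present in the UGI credentials.
        AMRMClient<AMRMClient.ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new YarnConfiguration());
        rm.start();
        rm.registerApplicationMaster("", 0, "");
        // ... run containers ...
        // This is the call failing above with InvalidToken: once the RM has
        // discarded the attempt's token, every retry of finishApplicationMaster
        // is rejected and the AM never completes its shutdown.
        rm.unregisterApplicationMaster(FinalApplicationStatus.KILLED, "killed by user", null);
        rm.stop();
    }
}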

Related

Cannot use storm command line interface to kill topology

I have a Storm cluster, version 0.10.1, and it runs well. I can kill a topology from the UI in the browser, which works fine. But sometimes I want to kill a topology from the shell. When I run bin/storm kill topology_name, I get this error:
Exception in thread "main" NotAliveException(msg:com_adsame_mydmp is not alive)
at backtype.storm.generated.Nimbus$killTopologyWithOpts_result$killTopologyWithOpts_resultStandardScheme.read(Nimbus.java:7482)
at backtype.storm.generated.Nimbus$killTopologyWithOpts_result$killTopologyWithOpts_resultStandardScheme.read(Nimbus.java:7468)
at backtype.storm.generated.Nimbus$killTopologyWithOpts_result.read(Nimbus.java:7410)
at org.apache.thrift7.TServiceClient.receiveBase(TServiceClient.java:78)
at backtype.storm.generated.Nimbus$Client.recv_killTopologyWithOpts(Nimbus.java:283)
at backtype.storm.generated.Nimbus$Client.killTopologyWithOpts(Nimbus.java:269)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
at backtype.storm.command.kill_topology$_main.doInvoke(kill_topology.clj:27)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at backtype.storm.command.kill_topology.main(Unknown Source)
But the topology shows as active when I execute bin/storm list:
Topology_name Status Num_tasks Num_workers Uptime_secs
-------------------------------------------------------------------
topology_name ACTIVE 73 4 423894
Any idea? Thanks.
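For comparison, bin/storm kill is a thin wrapper around the Nimbus Thrift API, so the same call can be made from Java; a hedged sketch against the Storm 0.10.x client classes (the name passed must match what bin/storm list prints exactly; note that the exception above names com_adsame_mydmp while the listing shows topology_name, which hints at a name mismatch):

import java.util.Map;
import backtype.storm.generated.KillOptions;
import backtype.storm.generated.NotAliveException;
import backtype.storm.utils.NimbusClient;
import backtype.storm.utils.Utils;

public class KillTopologySketch {
    public static void main(String[] args) throws Exception {
        Map conf = Utils.readStormConfig(); // must point at the same nimbus the UI uses
        NimbusClient nimbus = NimbusClient.getConfiguredClient(conf);
        KillOptions opts = new KillOptions();
        opts.set_wait_secs(0); // kill immediately instead of the default wait
        try {
            nimbus.getClient().killTopologyWithOpts("topology_name", opts);
        } catch (NotAliveException e) {
            // The same failure the CLI reports: nimbus has no topology by that name.
            System.err.println("not alive: " + e.get_msg());
        }
    }
}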

Spark not able to access hbase but able to access with java code

I am using Spark 1.3.0 with HBase 1.0. HBase ran successfully with Java code for about a week, but then accessing HBase from Spark started giving the errors below. The HBase shell still works fine, and so does plain Java access; Spark also worked fine until these errors appeared after that long stretch of normal operation.
I have already checked that the Hadoop and HBase clusters are healthy.
From the Spark UI:
Caused by: java.io.IOException: Enable/Disable failed
at org.apache.hadoop.hbase.client.ZooKeeperRegistry.isTableOnlineState(ZooKeeperRegistry.java:110)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.isTableDisabled(ConnectionManager.java:907)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.relocateRegion(ConnectionManager.java:1076)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.relocateRegion(ConnectionManager.java:1064)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getRegionLocation(ConnectionManager.java:885)
at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:78)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
... 23 more
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/table/my_sample_table
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:360)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:685)
at org.apache.hadoop.hbase.zookeeper.ZKTableStateClientSideReader.getTableState(ZKTableStateClientSideReader.java:186)
at org.apache.hadoop.hbase.zookeeper.ZKTableStateClientSideReader.isDisabledTable(ZKTableStateClientSideReader.java:60)
at org.apache.hadoop.hbase.client.ZooKeeperRegistry.isTableOnlineState(ZooKeeperRegistry.java:108)
... 29 more
In some cases
WARN ZKUtil:484 - hconnection-0x792dd4040x0, quorum=Megatron:2222, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:222)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:481)
at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:86)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.retrieveClusterId(ConnectionManager.java:833)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:623)
at sun.reflect.GeneratedConstructorAccessor530.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:218)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
at org.apache.hadoop.hbase.mapreduce.TableInputFormat.initialize(TableInputFormat.java:183)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:230)
at org.apache.hadoop.hbase.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:237)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:95)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1511)
at org.apache.spark.rdd.RDD.count(RDD.scala:1006)
at org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:412)
at org.apache.spark.api.java.JavaRDD.count(JavaRDD.scala:32)
The first error occurred when trying to put something into HBase; the second when trying to read from HBase.
The problem is with HBase table locking: the table is in a hung-up or limbo state.
There are some suggested solutions.
Run the following command:
hbase hbck -fix
then reboot the HMaster.
If the above fails, restart the Master.
If that also fails, restart ZooKeeper.
If everything fails, restart the cluster.
More details can be found at this URL:
JIRA Ticket
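Before restarting services, it can also help to confirm what table state the client actually sees; a small diagnostic sketch against the HBase 1.0 client API (the table name is taken from the znode in the error above, and an hbase-site.xml with the right quorum is assumed on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class TableStateCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName table = TableName.valueOf("my_sample_table");
            // Goes through the same ZooKeeper table-state lookup that is
            // failing in the trace above, so it reproduces the problem cheaply.
            System.out.println("enabled:  " + admin.isTableEnabled(table));
            System.out.println("disabled: " + admin.isTableDisabled(table));
        }
    }
}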

Not able to get rid of famous DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: error

I am running a Hadoop cluster with an Ubuntu host acting as both master and slave, and a virtual machine running on it as another slave (a 2-node cluster).
The solutions to the problem described at "No data nodes are started" do not seem to work for me; I tried both of the fixes explained there.
When I manually set the namespace IDs of the affected datanodes to match the namenode and start the cluster (solution 2 in the linked post), I still get the same error (DataStreamer Exception).
Next, the log of one of the datanodes shows the same Incompatible namespaceIDs error, but the namespace ID of the datanode shown in the log is different from the one in my tmp/dfs/data/current/VERSION file (which is unchanged and matches tmp/dfs/name/current/VERSION).
After many hours of debugging I am still clueless :(.
PS:
There is no connection problem from my host to the slave.
When I start the cluster using start-dfs.sh, the datanodes on both nodes start, which is normal, just to clarify.
The error occurs when I copy a file from local to HDFS.
After all this I performed a simple test:
Deleted the tmp/dfs/data and tmp/dfs/name folders on the master
Deleted tmp/dfs/data on the slave
Formatted the namenode using hadoop namenode -format
Started the cluster using start-dfs.sh; all nodes started normally on the master, and the datanode is also up on the slave
Ran the copyFromLocal command, and it gave me the same error as below
But this time there is no namespace-mismatch error in any of the datanode logs, master or slave:
14/05/04 04:12:54 WARN hdfs.DFSClient: DataStreamer Exception:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/dsingh/mysample could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1920)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:783)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1432)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1428)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1426)
    at org.apache.hadoop.ipc.Client.call(Client.java:1113)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
    at com.sun.proxy.$Proxy1.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
    at com.sun.proxy.$Proxy1.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3720)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3580)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2600(DFSClient.java:2783)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:3023)
14/05/04 04:12:54 WARN hdfs.DFSClient: Error Recovery for null bad datanode[0] nodes == null
14/05/04 04:12:54 WARN hdfs.DFSClient: Could not get block locations. Source file "/user/dsingh/mysample" - Aborting...
put: java.io.IOException: File /user/dsingh/mysample could only be replicated to 0 nodes, instead of 1
14/05/04 04:12:54 ERROR hdfs.DFSClient: Failed to close file /user/dsingh/mysample
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/dsingh/mysample could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1920)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:783)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1432)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1428)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1426)
    at org.apache.hadoop.ipc.Client.call(Client.java:1113)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
    at com.sun.proxy.$Proxy1.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
    at com.sun.proxy.$Proxy1.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3720)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3580)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2600(DFSClient.java:2783)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:3023)
Any clue will help.
After working for a few hours on this issue, I finally gave up; it is still unresolved in my universe of knowledge.
The good news is that instead of using a VirtualBox VM as a slave on the same machine, I connected another Ubuntu machine to my master, and everything worked like a charm :)
My guess is that the problem was related to the limited virtual disk allocated for storage in the virtual machine (it was less than 500 MB in my case); I have read somewhere that each node in the cluster should have at least 10 GB of free space to keep HDFS happy.
My takeaway: if possible, run the Hadoop cluster on two separate machines rather than in a virtual machine on the same host.
After you ran -copyFromLocal, it seems the datanode was up and received the request to write the file, but it wasn't able to allocate the blocks needed for the file. Check the datanode log to see exactly what happened, and run "hdfs dfsadmin -report" to make sure you have enough space on the datanode.
I have faced the same problem. It is all about the lack of space dedicated to HDFS.
I have 10 virtual machine (VMware) nodes with an average of 3.5 GB of storage for HDFS, and I am using Hadoop 2.6.
You can decrease the replication factor by changing the value of the dfs.replication property in your _hadoop_location/etc/hadoop/hdfs-site.xml configuration file (for Hadoop 2.6). Decrease it to a small number (like 1 or 2) and then try to keep files smaller than your total space.
If the same problem shows up, try a file size smaller than the one you used last time, or recreate the machine with a larger disk.
Maybe late, but it may help others who face the same problem :)
Thank you.
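As a client-side alternative to editing hdfs-site.xml, the replication factor can also be overridden per client; a minimal sketch (the paths reuse the ones from the question and are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowReplicationPut {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "1"); // client-side override of hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        // New blocks written through this FileSystem ask for a single replica,
        // so one live datanode with free space is enough.
        fs.copyFromLocalFile(new Path("/tmp/mysample"), new Path("/user/dsingh/mysample"));
        fs.close();
    }
}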

hadoop running application- ERROR security.UserGroupInformation: PriviledgedActionException

I have written the Hadoop WordCount code as a Java application in Eclipse to test running applications on Hadoop, but when I try to run it as the hdfs user, this error appears:
./hadoop jar /home/masi/eclipse_workspace/WordCount_apacheSample/bin/test2.jar WordCountApacheSample /user/hdfs/wordCountInput /user/hdfs/wordCountOutput
13/10/02 17:14:50 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
13/10/02 17:14:50 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
13/10/02 17:14:50 ERROR security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.net.ConnectException: Call From virtual-machine/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Exception in thread "main" java.net.ConnectException: Call From virtual-machine/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:780)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:727)
at org.apache.hadoop.ipc.Client.call(Client.java:1239)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
at sun.proxy.$Proxy9.getFileInfo(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
at sun.proxy.$Proxy9.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:630)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1559)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:811)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1345)
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:140)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:418)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:333)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1218)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1215)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1215)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1236)
at WordCountApacheSample.main(WordCountApacheSample.java:71)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:597)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:526)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:490)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:508)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:603)
at org.apache.hadoop.ipc.Client$Connection.access$2100(Client.java:253)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1288)
at org.apache.hadoop.ipc.Client.call(Client.java:1206)
... 29 more
Although I have tested the input and output paths with hdfs://localhost:9000/, it makes no difference!
BTW, I have studied many posts related to my problem, but they were not useful.
Any help is appreciated.
Thanks.
Finally I solved the problem myself, and decided to share the reason here to help others :) The reason sounds somewhat silly, but the problem was this: the Hadoop daemons were stopped! My VM shut down suddenly, and after restarting it I had forgotten to start the daemons (datanode, namenode, ...) again. So the cause of this problem was simply that the datanode, namenode, and other daemons were not running.
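A quick way to confirm the daemons are back before submitting a job is to probe the NameNode from a small client; a sketch assuming the same hdfs://localhost:9000 address as in the error above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        // Fails fast with the same "Connection refused" if the namenode is down.
        try (FileSystem fs = FileSystem.get(conf)) {
            System.out.println("/ exists: " + fs.exists(new Path("/")));
        }
    }
}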
If you discover that your HDFS is corrupt, you can do the following:
sudo -su hdfs
hadoop fsck /
hadoop dfsadmin -safemode leave
Then delete the corrupted files, if any, using:
hadoop fs -rmr -skipTrash <folder with your files>
hadoop fsck -files delete /
Check the status:
hadoop fsck /
The status should be HEALTHY after this. Then manually restart everything in Ambari.
I tried this on a small cluster and managed to get it up and running again after having a similar error to the one mentioned above.

Solaris - Why is java.lang.UNIXProcess.forkAndExec(Native Method) hanging

I have a Java application running on Solaris. This application regularly launches external processes using Runtime.exec. It seems that after a while, having successfully launched such processes many times over, the launching of a process will hang. A thread dump taken at this point (and several minutes later) reveals that java.lang.UNIXProcess.forkAndExec is "stuck". Following is the top of the relevant stack trace taken from the thread dump:
"Thread-85305" prio=3 tid=0x0000000102aae800 nid=0x21499 runnable [0x7fffffff2a3fe000]
java.lang.Thread.State: RUNNABLE
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(Unknown Source)
at java.lang.ProcessImpl.start(Unknown Source)
at java.lang.ProcessBuilder.start(Unknown Source)
at java.lang.Runtime.exec(Unknown Source)
at java.lang.Runtime.exec(Unknown Source)
I have read through some forums where others have experienced forkAndExec throwing an IOException due to not enough space or not enough memory, but I'm not getting that error here. I'm now waiting for the results of pstack in the hope that it will reveal more information.
Does anyone have any idea on how to resolve this issue?
thanks,
Mike
Installing the Sun Alert Patch Cluster will do the trick.
Since the thread lives as long as your executable is running, maybe it's only your external executable that is hanging.
You should try to find the parameters passed to your executable and try launching it manually.
Are stdout and stderr from the process being consumed? You might be looking at the result of a full buffer.
You can test this by redirecting the spawned command's output to a temp file.
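To test that theory without a temp file, the standard pattern is to drain stdout and stderr on dedicated threads so the child can never block on a full pipe; a minimal sketch (the ls command is just a stand-in for the real executable):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class ExecDrain {
    // Drain a stream on its own thread so the child cannot block on a full pipe.
    static Thread drain(final InputStream in, final String tag) {
        Thread t = new Thread(new Runnable() {
            public void run() {
                try (BufferedReader r = new BufferedReader(new InputStreamReader(in))) {
                    String line;
                    while ((line = r.readLine()) != null) {
                        System.out.println(tag + ": " + line);
                    }
                } catch (IOException ignored) { }
            }
        });
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        Process p = Runtime.getRuntime().exec(new String[] {"ls", "-l", "/tmp"});
        Thread out = drain(p.getInputStream(), "stdout");
        Thread err = drain(p.getErrorStream(), "stderr");
        int rc = p.waitFor();
        out.join();
        err.join();
        System.out.println("exit=" + rc);
    }
}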
