My hama throws the following exception during the input data partition phase before actually running my BSP job. Can I know what are the possible root causes of this exception? Any suggestions about how to find out the root cause is appreciated. Thank you!
13/11/06 03:50:50 WARN snappy.LoadSnappy: Snappy native library not loaded
13/11/06 03:50:50 INFO sync.ZKSyncClient: Initializing ZK Sync Client
13/11/06 03:50:50 INFO sync.ZooKeeperSyncClientImpl: Start connecting to Zookeeper! At masked-addr:33960
13/11/06 03:50:52 ERROR bsp.BSPTask: Error running bsp setup and bsp function.
java.lang.NullPointerException
at java.lang.Class.isAssignableFrom(Native Method)
at org.apache.hadoop.io.serializer.WritableSerialization.accept(WritableSerialization.java:100)
at org.apache.hadoop.io.serializer.SerializationFactory.getSerialization(SerializationFactory.java:83)
at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:963)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:896)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:393)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:284)
at org.apache.hama.bsp.PartitioningRunner.bsp(PartitioningRunner.java:217)
at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:177)
at org.apache.hama.bsp.BSPTask.run(BSPTask.java:146)
at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1246)
13/11/06 03:50:52 INFO ipc.Server: Stopping server on 33960
13/11/06 03:50:52 INFO ipc.Server: IPC Server handler 0 on 33960: exiting
13/11/06 03:50:52 INFO ipc.Server: Stopping IPC Server listener on 33960
Found the root cause. This exception happens when at least one of the input files specified in the input paths is size 0.
Related
I am trying tranquility with Druid 0.11 and Kafka. When tranquility receive new data it throw the following exception:
2018-01-12 18:27:34,010 [Curator-ServiceCache-0] INFO c.m.c.s.net.finagle.DiscoResolver - Updating instances for service[firehose:druid:overlord:flow-018-0000-0000] to Set(ServiceInstance{name='firehose:druid:overlord:flow-018-0000-0000', id='ea85b248-0c53-4ec1-94a6-517525f72e31', address='druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local', port=8100, sslPort=-1, payload=null, registrationTimeUTC=1515781653895, serviceType=DYNAMIC, uriSpec=null})
Jan 12, 2018 6:27:37 PM com.twitter.finagle.netty3.channel.ChannelStatsHandler exceptionCaught
WARNING: ChannelStatsHandler caught an exception
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:108)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779)
at org.jboss.netty.channel.SimpleChannelHandler.connectRequested(SimpleChannelHandler.java:306)
The worker was created by middle Manager:
2018-01-12T18:27:25,704 INFO [WorkerTaskMonitor] io.druid.indexing.worker.WorkerTaskMonitor - Submitting runnable for task[index_realtime_flow_2018-01-12T18:00:00.000Z_0_0]
2018-01-12T18:27:25,719 INFO [WorkerTaskMonitor] io.druid.indexing.worker.WorkerTaskMonitor - Affirmative. Running task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0]
And tranquility talk with overlord fine... I think by the following logs:
2018-01-12T18:27:25,268 INFO [qtp271944754-62] io.druid.indexing.overlord.TaskLockbox - Adding task[index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] to activeTasks
2018-01-12T18:27:25,272 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_realtime_flow_2018-01-12T18:00:00.000Z_0_0
2018-01-12T18:27:25,272 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Added pending task index_realtime_flow_2018-01-12T18:00:00.000Z_0_0
2018-01-12T18:27:25,279 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - No worker selection strategy set. Using default of [EqualDistributionWorkerSelectStrategy]
2018-01-12T18:27:25,294 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local:8091] to add task[index_realtime_flow_2018-01-12T18:00:00.000Z_0_0]
2018-01-12T18:27:25,334 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - Task index_realtime_flow_2018-01-12T18:00:00.000Z_0_0 switched from pending to running (on [druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local:8091])
2018-01-12T18:27:25,336 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] status changed to [RUNNING].
2018-01-12T18:27:25,747 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local:8091] wrote RUNNING status for task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] on [TaskLocation{host='null', port=-1, tlsPort=-1}]
2018-01-12T18:27:25,829 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local:8091] wrote RUNNING status for task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] on [TaskLocation{host='druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local', port=8100, tlsPort=-1}]
2018-01-12T18:27:25,829 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] location changed to [TaskLocation{host='druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local', port=8100, tlsPort=-1}].
What's wrong? I tried a thousand things and nothing solves it ...
Thanks a lot
UnresolvedAddressException being hit by Druid broker
You have to have all the druid cluster information set in you servers running tranquility.
It's because you only get DNS of you druid cluster from zookeeper, not the IP.
For example, on linux server, save you cluster information in /etc/hosts.
I have a neo4j 3.2.1 multi-labeled multi-properties graph database which has 4M nodes, 15M edges, and 4.8M distinct labels with ~6GB size on the disk.
I've imported the dataset using "neo4j-import" tool using a linux machine.
I can open the dataset, traverse the nodes, edges, and their descriptions well using the Java API. However, once I want to shut it down, it takes a lot of time and finally, it gives me the following log file error:
2017-08-04 07:07:38.189+0000 INFO [o.n.k.i.f.GraphDatabaseFacadeFactory] Shutdown started
2017-08-04 07:07:38.190+0000 INFO [o.n.k.i.f.GraphDatabaseFacadeFactory] Database is now unavailable
2017-08-04 07:07:38.198+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by database shutdown [5399]: Starting check pointing...
2017-08-04 07:07:38.198+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by database shutdown [5399]: Starting store flush...
2017-08-04 07:23:35.022+0000 ERROR [o.n.k.i.t.l.c.CheckPointerImpl] Error performing check point Failed to rotate logs. Expected version: 5399, actual version: 5274, wait timeout (ms): 956815
org.neo4j.kernel.impl.store.kvstore.RotationTimeoutException: Failed to rotate logs. Expected version: 5399, actual version: 5274, wait timeout (ms): 956815
at org.neo4j.kernel.impl.store.kvstore.RotationState$Rotation.rotate(RotationState.java:79)
at org.neo4j.kernel.impl.store.kvstore.RotationState$Rotation.rotate(RotationState.java:52)
at org.neo4j.kernel.impl.store.kvstore.AbstractKeyValueStore$RotationTask.rotate(AbstractKeyValueStore.java:311)
at org.neo4j.kernel.impl.store.kvstore.AbstractKeyValueStore$RotationTask.rotate(AbstractKeyValueStore.java:288)
at org.neo4j.kernel.impl.store.counts.CountsTracker.rotate(CountsTracker.java:154)
at org.neo4j.kernel.impl.store.NeoStores.flush(NeoStores.java:242)
at org.neo4j.kernel.impl.storageengine.impl.recordstorage.RecordStorageEngine.flushAndForce(RecordStorageEngine.java:480)
at org.neo4j.kernel.impl.transaction.log.checkpoint.CheckPointerImpl.doCheckPoint(CheckPointerImpl.java:160)
at org.neo4j.kernel.impl.transaction.log.checkpoint.CheckPointerImpl.forceCheckPoint(CheckPointerImpl.java:88)
at org.neo4j.kernel.NeoStoreDataSource$3.shutdown(NeoStoreDataSource.java:794)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.shutdown(LifeSupport.java:489)
at org.neo4j.kernel.lifecycle.LifeSupport.shutdown(LifeSupport.java:206)
at org.neo4j.kernel.NeoStoreDataSource.stop(NeoStoreDataSource.java:766)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:458)
at org.neo4j.kernel.lifecycle.LifeSupport.stopInstances(LifeSupport.java:161)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:143)
at org.neo4j.kernel.impl.transaction.state.DataSourceManager.stop(DataSourceManager.java:120)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:458)
at org.neo4j.kernel.lifecycle.LifeSupport.stopInstances(LifeSupport.java:161)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:143)
at org.neo4j.kernel.lifecycle.LifeSupport.shutdown(LifeSupport.java:191)
at org.neo4j.kernel.impl.factory.ClassicCoreSPI.shutdown(ClassicCoreSPI.java:159)
at org.neo4j.kernel.impl.factory.GraphDatabaseFacade.shutdown(GraphDatabaseFacade.java:366)
at experiment.caseStudy.TestDatasetHealth.run(TestDatasetHealth.java:70)
at experiment.caseStudy.TestDatasetHealth.main(TestDatasetHealth.java:29)
2017-08-04 07:23:35.665+0000 INFO [o.n.k.i.DiagnosticsManager] --- STOPPING diagnostics START ---
2017-08-04 07:23:35.666+0000 INFO [o.n.k.i.DiagnosticsManager] --- STOPPING diagnostics END ---
In the Java itself, I get the following exception:
Exception in thread "main" org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.kernel.NeoStoreDataSource$3#3101ffd3' failed to transition from stopped to shutting_down. Please see the attached cause exception "Failed to rotate logs. Expected version: 5399, actual version: 5274, wait timeout (ms): 956815".
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.shutdown(LifeSupport.java:497)
at org.neo4j.kernel.lifecycle.LifeSupport.shutdown(LifeSupport.java:206)
at org.neo4j.kernel.NeoStoreDataSource.stop(NeoStoreDataSource.java:766)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:458)
at org.neo4j.kernel.lifecycle.LifeSupport.stopInstances(LifeSupport.java:161)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:143)
at org.neo4j.kernel.impl.transaction.state.DataSourceManager.stop(DataSourceManager.java:120)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:458)
at org.neo4j.kernel.lifecycle.LifeSupport.stopInstances(LifeSupport.java:161)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:143)
at org.neo4j.kernel.lifecycle.LifeSupport.shutdown(LifeSupport.java:191)
at org.neo4j.kernel.impl.factory.ClassicCoreSPI.shutdown(ClassicCoreSPI.java:159)
at org.neo4j.kernel.impl.factory.GraphDatabaseFacade.shutdown(GraphDatabaseFacade.java:366)
at experiment.caseStudy.TestDatasetHealth.run(TestDatasetHealth.java:70)
at experiment.caseStudy.TestDatasetHealth.main(TestDatasetHealth.java:29)
Caused by: org.neo4j.kernel.impl.store.kvstore.RotationTimeoutException: Failed to rotate logs. Expected version: 5399, actual version: 5274, wait timeout (ms): 956815
at org.neo4j.kernel.impl.store.kvstore.RotationState$Rotation.rotate(RotationState.java:79)
at org.neo4j.kernel.impl.store.kvstore.RotationState$Rotation.rotate(RotationState.java:52)
at org.neo4j.kernel.impl.store.kvstore.AbstractKeyValueStore$RotationTask.rotate(AbstractKeyValueStore.java:311)
at org.neo4j.kernel.impl.store.kvstore.AbstractKeyValueStore$RotationTask.rotate(AbstractKeyValueStore.java:288)
at org.neo4j.kernel.impl.store.counts.CountsTracker.rotate(CountsTracker.java:154)
at org.neo4j.kernel.impl.store.NeoStores.flush(NeoStores.java:242)
at org.neo4j.kernel.impl.storageengine.impl.recordstorage.RecordStorageEngine.flushAndForce(RecordStorageEngine.java:480)
at org.neo4j.kernel.impl.transaction.log.checkpoint.CheckPointerImpl.doCheckPoint(CheckPointerImpl.java:160)
at org.neo4j.kernel.impl.transaction.log.checkpoint.CheckPointerImpl.forceCheckPoint(CheckPointerImpl.java:88)
at org.neo4j.kernel.NeoStoreDataSource$3.shutdown(NeoStoreDataSource.java:794)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.shutdown(LifeSupport.java:489)
In fact, in the Java program, I just read the information and do not write anything on the dataset.
Furthermore, to open the database using the following line of code, it takes 80 seconds on a 3.1GHz Core i7 MacBook with 16GB of Ram with 10GB of JVM arguments.
Is it normal to take this much of time for a dataset with the mentioned size?
GraphDatabaseService dataGraph = new GraphDatabaseFactory().newEmbeddedDatabase(storeDir);
Could you please guide me how can I repair the dataset to be easily shut down?
I'm running a standalone Spark cluster on EC2, and I'm writing a application using Spark-Cassandra connector driver and try to submit job to Spark cluster programmatically.
The job itself is simple:
public static void main(String[] args) {
SparkConf conf;
JavaSparkContext sc;
conf = new SparkConf()
.set("spark.cassandra.connection.host", host);
conf.set("spark.driver.host", "[my_public_ip]");
conf.set("spark.driver.port", "15000");
sc = new JavaSparkContext("spark://[spark_master_host]","test",conf);
CassandraJavaRDD<CassandraRow> rdd = javaFunctions(sc).cassandraTable(
"keyspace", "table");
System.out.println(rdd.first().toString());
sc.stop();
}
Which runs fine when I run that in the Spark Master node of my EC2 cluster.
I'm trying to running this in a remote Windows client.
The problem was from these two lines:
conf.set("spark.driver.host", "[my_public_ip]");
conf.set("spark.driver.port", "15000");
First, if i comment out these 2 lines, application would not throw a exception, but the Executor is not running, with following log:
14/12/06 22:40:03 INFO client.AppClient$ClientActor: Executor updated: app-20141207033931-0021/3 is now LOADING
14/12/06 22:40:03 INFO client.AppClient$ClientActor: Executor updated: app-20141207033931-0021/0 is now EXITED (Command exited with code 1)
14/12/06 22:40:03 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141207033931-0021/0 removed: Command exited with code 1
Which never ends, when I check the worker node log, I found:
14/12/06 22:40:21 ERROR security.UserGroupInformation: PriviledgedActionException as:[username] cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
... 7 more
I've no idea what that's about, my guess is that probably worker node could not connect to driver, which probably initially set as:
14/12/06 22:39:30 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#[some_host_name]:52660]
14/12/06 22:39:30 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver#[some_host_name]:52660]
Obviously, no DNS is going to resolve my host name...
Since I can't set deploy mode to "client" or "cluster", if not via ./spark-submit script.(Which I think that's absurd...). I try to add a host resolution "XX.XXX.XXX.XX [host-name]" in /etc/hosts of all Spark Master Worker nodes.
No luck of course...
That leads me to the second, un-comment that two line;
Which gives me:
14/12/06 22:59:41 INFO Remoting: Starting remoting
14/12/06 22:59:41 ERROR Remoting: Remoting error: [Startup failed] [
akka.remote.RemoteTransportException: Startup failed
at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129)
at akka.remote.Remoting.start(Remoting.scala:194)
...
Cause:
Caused by: org.jboss.netty.channel.ChannelException: Failed to bind to: /[my_public_ip]:15000
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:391)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:388)
I double checked my firewall setting and router setting, confirm that my firewall is diabled; and netstat -an to confirm port 15000 is not in use (in fact I tried to change to several available port, no luck); and I ping my public ip from both other machine and machine from my cluster, no problem.
Now I'm utterly screw up, I just run out of idea try to fix this. Any suggestions? Any help is appreciated!
Please check if 15000 is in your security group.
My hadoop version is 0.20.203.0. The namenode running on my hadoop clulser was shut down. I checked the logs, and found the error message only in the secondary name logs:
2014-09-27 22:18:54,930 WARN org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done. New Image Size: 29552383
2014-09-27 22:19:42,792 INFO org.mortbay.log: org.mortbay.io.nio.SelectorManager$SelectSet#8135daf JVM BUG(s) - injecting delay2 times
2014-09-27 22:19:42,792 INFO org.mortbay.log: org.mortbay.io.nio.SelectorManager$SelectSet#8135daf JVM BUG(s) - recreating selector 2 times, canceled keys 38 times
2014-09-27 23:18:55,508 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions: 0 Total time for transactions(ms): 0Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0
2014-09-27 23:18:55,508 FATAL org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error : All storage directories are inaccessible.
2014-09-27 23:18:55,509 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG:
There was anonther error message apprears in one of my datanodes:
2014-09-27 01:03:58,535 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.75.6.51:50010, storageID=DS-532990984-10.75.6.51-50010-1343295370699, infoPort=50075, ipcPort=50020):DataXceiver
org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: No space left on device
at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:770)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:475)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:528)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:397)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
at java.lang.Thread.run(Thread.java:662)
Not sure whether this is the root cause of the namenode shutting down issue?
New error raised when I was trying to restart the namenode?
2014-09-28 11:25:06,202 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.IOException: Incorrect data format. logVersion is -31 but writables.length is 0.
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:542)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1009)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:827)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:365)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:97)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:379)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:353)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:254)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:434)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1153)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1162)
Is there anyone knows about it ? Is it possible to fix the imagne and editor files as I don't want to lose the data?
I encountered the following problem when start running a hama BSP job. This exception occurs when hama tries to load and partition the input data before it actually runs my own code. This is a known problem discussed in some websites but unfortunate without a known cause (eg. see here).
My BSP job works perfectly ok when I only runs part of the data set. However, when I run the full data set, the problem occurs :(
Can I know how to resolve or avoid this problem?
13/11/18 01:19:30 INFO bsp.FileInputFormat: Total input paths to process : 32
13/11/18 01:19:30 INFO bsp.FileInputFormat: Total input paths to process : 32
13/11/18 01:19:30 INFO bsp.BSPJobClient: Running job: job_201311180115_0002
13/11/18 01:19:33 INFO bsp.BSPJobClient: Current supersteps number: 0
13/11/18 01:19:33 INFO bsp.BSPJobClient: Job failed.
13/11/18 01:19:33 ERROR bsp.BSPJobClient: Error partitioning the input path.
java.io.IOException: Runtime partition failed for the job.
at org.apache.hama.bsp.BSPJobClient.partition(BSPJobClient.java:465)
at org.apache.hama.bsp.BSPJobClient.submitJobInternal(BSPJobClient.java:333)
at org.apache.hama.bsp.BSPJobClient.submitJob(BSPJobClient.java:293)
at org.apache.hama.bsp.BSPJob.submit(BSPJob.java:228)
at org.apache.hama.bsp.BSPJob.waitForCompletion(BSPJob.java:235)
at edu.wisc.cs.db.opener.hama.ConnectedEntityBspDriver.main(ConnectedEntityBspDriver.java:183)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hama.util.RunJar.main(RunJar.java:146)
After stuck at this problem for several hours, I found that once the number of input files is greater than the number of allowed bsp tasks, then this error will occur. I think it is probably a bug that Hama should fix in the future.
A quick fix to this problem is to increase the number of maximum bsp tasks, specified by the variable bsp.tasks.maximum in the hama-site.xml file. For example, the following uses 10 instead of the default setting 3:
<property>
<name>bsp.tasks.maximum</name>
<value>10</value>
<description>The maximum number of BSP tasks that will be run simultaneously
by a groom server.</description>
</property>