I have deployed a Spring Boot application with a database-backed job queue on App Service.
Yesterday I performed a few scale-out and scale-in operations while the application was running, to see how it would behave.
At some point (not necessarily related to the scaling operations) the application started throwing Hikari errors.
com.zaxxer.hikari.pool.PoolBase : HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection#1ae66f34 (This connection has been closed.). Possibly consider using a shorter maxLifetime value.
com.zaxxer.hikari.pool.ProxyConnection : HikariPool-1 - Connection org.postgresql.jdbc.PgConnection#1ef85079 marked as broken because of SQLSTATE(08006), ErrorCode(0)
The following are stack traces from my scheduled job in Spring, along with other information:
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
Caused by: javax.net.ssl.SSLException: Connection reset by peer (Write failed)
Suppressed: java.net.SocketException: Broken pipe (Write failed)
Caused by: java.net.SocketException: Connection reset by peer (Write failed)
Next came the following errors:
WARN 1 --- [ scheduling-1] com.zaxxer.hikari.pool.PoolBase : HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection#48d0d6da (This connection has been closed.).
Possibly consider using a shorter maxLifetime value.
org.springframework.jdbc.support.MetaDataAccessException: Error while extracting DatabaseMetaData; nested exception is java.sql.SQLException: Connection is closed
Caused by: java.sql.SQLException: Connection is closed
The code, which is invoked periodically (every 500 milliseconds), is here:
@Scheduled(fixedDelayString = "${worker.delay}")
@Transactional
public void execute() {
    // Pick up the next pending job, if any, and process it.
    jobManager.next(jobClass).ifPresent(this::handleJob);
}
Update: the code above does nothing almost all of the time, since there was no traffic on the website.
Update 2: I've checked the Postgres logs and found this:
2020-07-11 22:48:09 UTC-5f0866f0.f0-LOG: checkpoint starting: immediate force wait
2020-07-11 22:48:10 UTC-5f0866f0.f0-LOG: checkpoint complete (240): wrote 30 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.046 s, sync=0.046 s, total=0.437 s; sync files=13, longest=0.009 s, average=0.003 s; distance=163 kB, estimate=13180 kB
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: received immediate shutdown request
2020-07-11 22:48:10 UTC-5f0a3f41.8914-WARNING: terminating connection because of crash of another server process
2020-07-11 22:48:10 UTC-5f0a3f41.8914-DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
// Same text about 10 times
2020-07-11 22:48:10 UTC-5f0866f2.7c-HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: src/port/kill.c(84): Process (272) exited OOB of pgkill.
2020-07-11 22:48:10 UTC-5f0866f1.fc-WARNING: terminating connection because of crash of another server process
2020-07-11 22:48:10 UTC-5f0866f1.fc-DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-07-11 22:48:10 UTC-5f0866f1.fc-HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: archiver process (PID 256) exited with exit code 1
2020-07-11 22:48:11 UTC-5f0866ee.68-LOG: database system is shut down
It looks like the problem is with the Azure PostgreSQL server itself: it crashed and shut down. Am I reading this right?
As the log message itself suggests, have you tried setting the maxLifetime property for the Hikari connection pool? Setting that property should resolve this issue.
From the HikariCP documentation (https://github.com/brettwooldridge/HikariCP):
maxLifetime
This property controls the maximum lifetime of a connection in the pool. An in-use connection will never be retired, only when it is closed will it then be removed. On a connection-by-connection basis, minor negative attenuation is applied to avoid mass-extinction in the pool. We strongly recommend setting this value, and it should be several seconds shorter than any database or infrastructure imposed connection time limit. A value of 0 indicates no maximum lifetime (infinite lifetime), subject of course to the idleTimeout setting. The minimum allowed value is 30000ms (30 seconds). Default: 1800000 (30 minutes)
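Note that Azure and similar managed platforms can close idle connections on their side; keeping the pool's maxLifetime below that limit stops Hikari from handing out connections that are already dead. A minimal sketch of configuring this directly (the JDBC URL, credentials, and the 5-minute value are illustrative placeholders, not values from the question):

import java.util.concurrent.TimeUnit;

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolConfigSketch {
    public static HikariDataSource dataSource() {
        HikariConfig config = new HikariConfig();
        // Placeholder connection details -- substitute your own.
        config.setJdbcUrl("jdbc:postgresql://example-host:5432/exampledb");
        config.setUsername("user");
        config.setPassword("secret");
        // Retire pooled connections well before any database- or
        // infrastructure-imposed limit closes them from the other side.
        config.setMaxLifetime(TimeUnit.MINUTES.toMillis(5));
        // Fail fast instead of blocking on a broken connection.
        config.setConnectionTimeout(TimeUnit.SECONDS.toMillis(10));
        return new HikariDataSource(config);
    }
}

In a Spring Boot application the same setting is usually applied through the spring.datasource.hikari.max-lifetime property (in milliseconds) rather than by building the pool by hand.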
I'm using Apache RocketMQ to send messages, but I got an exception. I tried many solutions from CSDN, but none of them work, and now I have no idea what to do.
This is a Linux server running RocketMQ 4.2.0, Java 8, and Tomcat 8.
2019-05-06 15:16:01,440 WARN RocketmqClient(256) - doRebalance, eventConsumer, but the topic[alarm_event] not exist.
2019-05-06 15:16:09,230 INFO RocketmqRemoting(454) - createChannel: begin to connect remote host[122.114.164.162:24314] asynchronously
2019-05-06 15:16:09,232 INFO RocketmqRemoting(615) - NETTY CLIENT PIPELINE: CONNECT UNKNOW => /122.114.164.162:24314
2019-05-06 15:16:12,232 WARN RocketmqRemoting(477) - createChannel: connect remote host[122.114.164.162:24314] timeout 3000ms, DefaultChannelPromise#59d7f51c(uncancellable)
2019-05-06 15:16:12,234 INFO RocketmqRemoting(640) - NETTY CLIENT PIPELINE: CLOSE
2019-05-06 15:16:12,234 INFO RocketmqRemoting(286) - closeChannel: the channel[122.114.164.162:24314] was removed from channel table
2019-05-06 15:16:12,234 INFO RocketmqRemoting(640) - NETTY CLIENT PIPELINE: CLOSE
2019-05-06 15:16:12,234 INFO RocketmqRemoting(280) - eventCloseChannel: the channel[null] has been removed from the channel table before
2019-05-06 15:16:12,235 INFO RocketmqRemoting(203) - closeChannel: close the connection to remote address[] result: true
2019-05-06 15:16:21,441 WARN RocketmqClient(256) - doRebalance, eventConsumer, but the topic[alarm_event] not exist.
2019-05-06 15:16:29,558 WARN RocketmqClient(1212) - get Topic [alarm_event] RouteInfoFromNameServer is not exist value
2019-05-06 15:16:29,558 WARN RocketmqClient(658) - updateTopicRouteInfoFromNameServer Exception
org.apache.rocketmq.client.exception.MQClientException: CODE: 17 DESC: No topic route info in name server for the topic: alarm_event
See http://rocketmq.apache.org/docs/faq/ for further details.
at org.apache.rocketmq.client.impl.MQClientAPIImpl.getTopicRouteInfoFromNameServer(MQClientAPIImpl.java:1227)
at org.apache.rocketmq.client.impl.MQClientAPIImpl.getTopicRouteInfoFromNameServer(MQClientAPIImpl.java:1197)
at org.apache.rocketmq.client.impl.factory.MQClientInstance.updateTopicRouteInfoFromNameServer(MQClientInstance.java:605)
at org.apache.rocketmq.client.impl.factory.MQClientInstance.updateTopicRouteInfoFromNameServer(MQClientInstance.java:492)
at org.apache.rocketmq.client.impl.factory.MQClientInstance.updateTopicRouteInfoFromNameServer(MQClientInstance.java:361)
at org.apache.rocketmq.client.impl.factory.MQClientInstance$3.run(MQClientInstance.java:278)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
You should run the rocketmq-console-ng application to manage your topics; with it you can create the topic that your program needs.
rocketmq-console-ng is available at: https://github.com/apache/rocketmq-externals
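Once the topic exists on the broker (created through the console, or created automatically if the broker runs with autoCreateTopicEnable=true), a minimal producer along these lines should be able to send to it. This is only a sketch; the producer group and name server address are placeholders:

import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.client.producer.SendResult;
import org.apache.rocketmq.common.message.Message;

public class AlarmEventProducer {
    public static void main(String[] args) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("example_producer_group");
        producer.setNamesrvAddr("127.0.0.1:9876"); // placeholder name server address
        producer.start();
        // send() fails with "No topic route info" if the name server
        // has no route data for this topic, as in the logs above.
        Message message = new Message("alarm_event", "TagA", "hello".getBytes("UTF-8"));
        SendResult result = producer.send(message);
        System.out.println(result);
        producer.shutdown();
    }
}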
I am trying Tranquility with Druid 0.11 and Kafka. When Tranquility receives new data, it throws the following exception:
2018-01-12 18:27:34,010 [Curator-ServiceCache-0] INFO c.m.c.s.net.finagle.DiscoResolver - Updating instances for service[firehose:druid:overlord:flow-018-0000-0000] to Set(ServiceInstance{name='firehose:druid:overlord:flow-018-0000-0000', id='ea85b248-0c53-4ec1-94a6-517525f72e31', address='druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local', port=8100, sslPort=-1, payload=null, registrationTimeUTC=1515781653895, serviceType=DYNAMIC, uriSpec=null})
Jan 12, 2018 6:27:37 PM com.twitter.finagle.netty3.channel.ChannelStatsHandler exceptionCaught
WARNING: ChannelStatsHandler caught an exception
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:108)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779)
at org.jboss.netty.channel.SimpleChannelHandler.connectRequested(SimpleChannelHandler.java:306)
The worker was created by the Middle Manager:
2018-01-12T18:27:25,704 INFO [WorkerTaskMonitor] io.druid.indexing.worker.WorkerTaskMonitor - Submitting runnable for task[index_realtime_flow_2018-01-12T18:00:00.000Z_0_0]
2018-01-12T18:27:25,719 INFO [WorkerTaskMonitor] io.druid.indexing.worker.WorkerTaskMonitor - Affirmative. Running task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0]
And Tranquility seems to talk to the Overlord fine, judging by the following logs:
2018-01-12T18:27:25,268 INFO [qtp271944754-62] io.druid.indexing.overlord.TaskLockbox - Adding task[index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] to activeTasks
2018-01-12T18:27:25,272 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_realtime_flow_2018-01-12T18:00:00.000Z_0_0
2018-01-12T18:27:25,272 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Added pending task index_realtime_flow_2018-01-12T18:00:00.000Z_0_0
2018-01-12T18:27:25,279 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - No worker selection strategy set. Using default of [EqualDistributionWorkerSelectStrategy]
2018-01-12T18:27:25,294 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local:8091] to add task[index_realtime_flow_2018-01-12T18:00:00.000Z_0_0]
2018-01-12T18:27:25,334 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - Task index_realtime_flow_2018-01-12T18:00:00.000Z_0_0 switched from pending to running (on [druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local:8091])
2018-01-12T18:27:25,336 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] status changed to [RUNNING].
2018-01-12T18:27:25,747 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local:8091] wrote RUNNING status for task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] on [TaskLocation{host='null', port=-1, tlsPort=-1}]
2018-01-12T18:27:25,829 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local:8091] wrote RUNNING status for task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] on [TaskLocation{host='druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local', port=8100, tlsPort=-1}]
2018-01-12T18:27:25,829 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] location changed to [TaskLocation{host='druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local', port=8100, tlsPort=-1}].
What's wrong? I've tried a thousand things and nothing solves it...
Thanks a lot.
UnresolvedAddressException being hit by Druid broker
You have to have all of the Druid cluster's host information set on the servers running Tranquility.
That's because you only get the DNS names of your Druid cluster from ZooKeeper, not the IPs.
For example, on a Linux server, put your cluster's hostnames in /etc/hosts.
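For instance, using the worker hostname from the logs above (the IP is a placeholder; use whatever address is actually reachable from the Tranquility host):

# /etc/hosts on every server running Tranquility -- the IP below is a placeholder
10.0.0.42   druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local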
When I start HBase on my cluster, the HMaster and HQuorumPeer processes start on the master node, while only the HQuorumPeer process starts on the slaves.
In the GUI console, in the tasks section, I can see the master (node0) in the RUNNING state with the status "Waiting for region servers count to settle; currently checked in 0, slept for 250920 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms".
In the software attributes section I can find all my nodes in the ZooKeeper quorum with the description "Addresses of all registered ZK servers".
So it seems that ZooKeeper is up, but the log files below suggest it is the source of the problem.
Log hbase-clusterhadoop-master:
2016-09-08 12:26:14,875 INFO [main-SendThread(node0:2181)] zookeeper.ClientCnxn: Opening socket connection to server node0/192.168.1.113:2181. Will not attempt to authenticate using SASL (java.lang.SecurityException: Impossibile trovare una configurazione di login [Italian: unable to find a login configuration])
2016-09-08 12:26:14,882 WARN [main-SendThread(node0:2181)] zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-09-08 12:26:14,994 WARN [main] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=node3:2181,node2:2181,node1:2181,node0:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
........
2016-09-08 12:32:53,063 INFO [master:node0:60000] zookeeper.ZooKeeper: Initiating client connection, connectString=node3:2181,node2:2181,node1:2181,node0:2181 sessionTimeout=90000 watcher=replicationLogCleaner0x0, quorum=node3:2181,node2:2181,node1:2181,node0:2181, baseZNode=/hbase
2016-09-08 12:32:53,064 INFO [master:node0:60000-SendThread(node3:2181)] zookeeper.ClientCnxn: Opening socket connection to server node3/192.168.1.112:2181. Will not attempt to authenticate using SASL (java.lang.SecurityException: Impossibile trovare una configurazione di login)
2016-09-08 12:32:53,065 INFO [master:node0:60000-SendThread(node3:2181)] zookeeper.ClientCnxn: Socket connection established to node3/192.168.1.112:2181, initiating session
2016-09-08 12:32:53,069 INFO [master:node0:60000-SendThread(node3:2181)] zookeeper.ClientCnxn: Session establishment complete on server node3/192.168.1.112:2181, sessionid = 0x357095a4b940001, negotiated timeout = 90000
2016-09-08 12:32:53,072 INFO [master:node0:60000] zookeeper.RecoverableZooKeeper: Node /hbase/replication/rs already exists and this is not a retry
2016-09-08 12:32:53,072 DEBUG [master:node0:60000] cleaner.CleanerChore: initialize cleaner=org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner
2016-09-08 12:32:53,075 DEBUG [master:node0:60000] cleaner.CleanerChore: initialize cleaner=org.apache.hadoop.hbase.master.snapshot.SnapshotLogCleaner
2016-09-08 12:32:53,076 DEBUG [master:node0:60000] cleaner.CleanerChore: initialize cleaner=org.apache.hadoop.hbase.master.cleaner.HFileLinkCleaner
2016-09-08 12:32:53,077 DEBUG [master:node0:60000] cleaner.CleanerChore: initialize cleaner=org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner
2016-09-08 12:32:53,078 DEBUG [master:node0:60000] cleaner.CleanerChore: initialize cleaner=org.apache.hadoop.hbase.master.cleaner.TimeToLiveHFileCleaner
2016-09-08 12:32:53,078 INFO [master:node0:60000] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2016-09-08 12:32:54,607 INFO [master:node0:60000] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 1529 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
2016-09-08 12:32:56,137 INFO [master:node0:60000] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 3059 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
Log hbase-clusterhadoop-zookeeper-node0 (master):
2016-09-08 12:26:18,315 WARN [WorkerSender[myid=0]] quorum.QuorumCnxManager: Cannot open channel to 1 at election address node1/192.168.1.156:3888
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:382)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:241)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:228)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:431)
at java.net.Socket.connect(Socket.java:527)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:341)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:449)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:430)
at java.lang.Thread.run(Thread.java:695)
Log hbase-clusterhadoop-regionserver-node1 (one of the slaves):
2016-09-08 12:33:32,690 INFO [regionserver60020-SendThread(node3:2181)] zookeeper.ClientCnxn: Opening socket connection to server node3/192.168.1.112:2181. Will not attempt to authenticate using SASL (java.lang.SecurityException: Impossibile trovare una configurazione di login)
2016-09-08 12:33:32,691 INFO [regionserver60020-SendThread(node3:2181)] zookeeper.ClientCnxn: Socket connection established to node3/192.168.1.112:2181, initiating session
2016-09-08 12:33:32,692 INFO [regionserver60020-SendThread(node3:2181)] zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2016-09-08 12:33:32,793 WARN [regionserver60020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=node3:2181,node2:2181,node1:2181,node0:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
2016-09-08 12:33:32,794 ERROR [regionserver60020] zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 4 attempts
2016-09-08 12:33:32,794 WARN [regionserver60020] zookeeper.ZKUtil: regionserver:600200x0, quorum=node3:2181,node2:2181,node1:2181,node0:2181, baseZNode=/hbase Unable to set watcher on znode /hbase/master
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:222)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:427)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:778)
at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:751)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:884)
at java.lang.Thread.run(Thread.java:695)
2016-09-08 12:33:32,794 ERROR [regionserver60020] zookeeper.ZooKeeperWatcher: regionserver:600200x0, quorum=node3:2181,node2:2181,node1:2181,node0:2181, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:222)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:427)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:778)
at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:751)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:884)
at java.lang.Thread.run(Thread.java:695)
2016-09-08 12:33:32,795 FATAL [regionserver60020] regionserver.HRegionServer: ABORTING region server node1,60020,1473330794709: Unexpected exception during initialization, aborting
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:222)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:427)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:778)
at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:751)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:884)
at java.lang.Thread.run(Thread.java:695)
2016-09-08 12:33:32,798 FATAL [regionserver60020] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2016-09-08 12:33:32,798 INFO [regionserver60020] regionserver.HRegionServer: STOPPED: Unexpected exception during initialization, aborting
2016-09-08 12:33:32,867 INFO [regionserver60020-SendThread(node0:2181)] zookeeper.ClientCnxn: Opening socket connection to server node0/192.168.1.113:2181. Will not attempt to authenticate using SASL (java.lang.SecurityException: Impossibile trovare una configurazione di login)
Log hbase-clusterhadoop-zookeeper-node1:
2016-09-08 12:33:32,075 WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0%0:2181] quorum.Learner: Unexpected exception, tries=0, connecting to node3/192.168.1.112:2888
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:382)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:241)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:228)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:431)
at java.net.Socket.connect(Socket.java:527)
at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:225)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
2016-09-08 12:33:32,227 INFO [node1/192.168.1.156:3888] quorum.QuorumCnxManager: Received connection request /192.168.1.113:49844
2016-09-08 12:33:32,233 INFO [WorkerReceiver[myid=1]] quorum.FastLeaderElection: Notification: 1 (message format version), 0 (n.leader), 0x10000002d (n.zxid), 0x1 (n.round), LOOKING (n.state), 0 (n.sid), 0x1 (n.peerEpoch) FOLLOWING (my state)
2016-09-08 12:33:32,239 INFO [WorkerReceiver[myid=1]] quorum.FastLeaderElection: Notification: 1 (message format version), 3 (n.leader), 0x10000002d (n.zxid), 0x1 (n.round), LOOKING (n.state), 0 (n.sid), 0x1 (n.peerEpoch) FOLLOWING (my state)
2016-09-08 12:33:32,725 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxnFactory: Accepted socket connection from /192.168.1.111:49534
2016-09-08 12:33:32,725 WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2016-09-08 12:33:32,725 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxn: Closed socket connection for client /192.168.1.111:49534 (no session established for client)
The hbase-site.xml configuration file:
<configuration>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://node0:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>node0,node1,node2,node3</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/Users/clusterhadoop/usr/local/zookeeper</value>
</property>
<property>
<name>hbase.tmp.dir</name>
<value>/Users/clusterhadoop/usr/local/hbtmp</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>hbase.master</name>
<value>node0:60000</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
<property>
<name>hbase.zookeeper.property.maxClientCnxns</name>
<value>1000</value>
</property>
</configuration>
Hosts file:
127.0.0.1 localhost
127.0.0.1 node3
192.168.1.112 node3
192.168.1.156 node1
192.168.1.111 node2
192.168.1.113 node0
Any idea what the problem is and how to solve it?
My Hama job throws the following exception during the input data partitioning phase, before actually running my BSP job. What are the possible root causes of this exception? Any suggestions on how to track down the root cause are appreciated. Thank you!
13/11/06 03:50:50 WARN snappy.LoadSnappy: Snappy native library not loaded
13/11/06 03:50:50 INFO sync.ZKSyncClient: Initializing ZK Sync Client
13/11/06 03:50:50 INFO sync.ZooKeeperSyncClientImpl: Start connecting to Zookeeper! At masked-addr:33960
13/11/06 03:50:52 ERROR bsp.BSPTask: Error running bsp setup and bsp function.
java.lang.NullPointerException
at java.lang.Class.isAssignableFrom(Native Method)
at org.apache.hadoop.io.serializer.WritableSerialization.accept(WritableSerialization.java:100)
at org.apache.hadoop.io.serializer.SerializationFactory.getSerialization(SerializationFactory.java:83)
at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:963)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:896)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:393)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:284)
at org.apache.hama.bsp.PartitioningRunner.bsp(PartitioningRunner.java:217)
at org.apache.hama.bsp.BSPTask.runBSP(BSPTask.java:177)
at org.apache.hama.bsp.BSPTask.run(BSPTask.java:146)
at org.apache.hama.bsp.GroomServer$BSPPeerChild.main(GroomServer.java:1246)
13/11/06 03:50:52 INFO ipc.Server: Stopping server on 33960
13/11/06 03:50:52 INFO ipc.Server: IPC Server handler 0 on 33960: exiting
13/11/06 03:50:52 INFO ipc.Server: Stopping IPC Server listener on 33960
Found the root cause: this exception happens when at least one of the input files specified in the input paths has size 0.
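As a defensive check, you can skip zero-byte files when assembling the input paths. This is only a sketch against the standard Hadoop FileSystem API (the class and method names are mine, not part of Hama):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NonEmptyInputs {
    // Returns only the non-empty regular files directly under inputDir.
    public static List<Path> list(Configuration conf, Path inputDir) throws IOException {
        FileSystem fs = inputDir.getFileSystem(conf);
        List<Path> result = new ArrayList<Path>();
        for (FileStatus status : fs.listStatus(inputDir)) {
            // Skip directories and the zero-byte files that break partitioning.
            if (!status.isDir() && status.getLen() > 0) {
                result.add(status.getPath());
            }
        }
        return result;
    }
}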