I set up a private Ethereum network with go-ethereum and configured EthereumJ to connect to it. I can see EthereumJ listed as a peer in the go-ethereum console:
> admin.peers
[{
caps: ["eth/62", "eth/63"],
id: "e084894a3b72e8a990710a8f84b2d6f99ac15c0a1d0d7f1a6510769633b64067f9c2df2074e920a4e46fc7d7eb1b211c06f189e5325f0856d326e32d87f49d20",
name: "Ethereum(J)/v1.5.0/Windows/Dev/Java/Dev",
network: {
localAddress: "127.0.0.1:30303",
remoteAddress: "127.0.0.1:18499"
},
protocols: {
eth: {
difficulty: 7746910281,
head: "0x97568a8b38cce14776d5daee5169954f76007a79d7329f71e48c673e6e533215",
version: 63
}
}
}]
>
Then I run the contract-deployment sample (CreateContractSample.java) while go-ethereum is mining on the private network, but I get the following output:
14:10:37.969 INFO [sample] [v] Available Eth nodes found.
14:10:37.969 INFO [sample] Searching for peers to sync with...
14:10:40.970 INFO [sample] [v] At least one sync peer found.
14:10:40.970 INFO [sample] Current BEST block: #10105 (0fc0c0 <~ e6a78f) Txs:0, Unc: 0
14:10:40.970 INFO [sample] Waiting for blocks start importing (may take a while)...
14:10:46.973 INFO [sample] [v] Blocks import started.
14:10:46.973 INFO [sample] Waiting for the whole blockchain sync (will take up to several hours for the whole chain)...
14:10:56.974 INFO [sample] [v] Sync complete! The best block: #10109 (90766e <~ 46ebc3) Txs:0, Unc: 0
14:10:56.974 INFO [sample] Compiling contract...
14:10:57.078 INFO [sample] Sending contract to net and waiting for inclusion
cd2a3d9f938e13cd947ec05abc7fe734df8dd826
14:10:57.093 INFO [sample] <=== Sending transaction: TransactionData [hash= nonce=00, gasPrice=104c533c00, gas=2dc6c0, receiveAddress=, sendAddress=cd2a3d9f938e13cd947ec05abc7fe734df8dd826, value=, data=6060604052346000575b6096806100176000396000f300606060405263ffffffff60e060020a600035041663623845d88114602c5780636d4ce63c14603b575b6000565b3460005760396004356057565b005b3460005760456063565b60408051918252519081900360200190f35b60008054820190555b50565b6000545b905600a165627a7a72305820f4c00cf17626a18f19d4bb01d62482537e347bbc8c2ae2b0a464dbf1794f7c260029, signatureV=28, signatureR=33b12df9ac0351f1caa816161f5cc1dec30e288d97c02aedd3aafc59e9faafd1, signatureS=457954d2d34e88cdd022100dde72042c79e918ff6d5ddd372334a614a03a331a]
[... some unnecessary output omitted ...]
java.lang.RuntimeException: The transaction was not included during last 16 blocks: ed2b6b59
at org.ethereum.samples.CreateContractSample.waitForTx(CreateContractSample.java:142)
at org.ethereum.samples.CreateContractSample.sendTxAndWait(CreateContractSample.java:116)
at org.ethereum.samples.CreateContractSample.onSyncDone(CreateContractSample.java:73)
at org.ethereum.samples.BasicSample.run(BasicSample.java:148)
at java.lang.Thread.run(Thread.java:745)
I would like to know the reason for this error.
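For context, the exception above is thrown by the sample's waitForTx loop. Below is a rough sketch of that wait-for-inclusion pattern (an assumption about how the check works, not the sample's actual code), using EthereumJ's block-listener API:

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

import org.ethereum.core.Block;
import org.ethereum.core.TransactionReceipt;
import org.ethereum.facade.Ethereum;
import org.ethereum.listener.EthereumListenerAdapter;

public class TxInclusionWatcher {

    // Wait until a transaction with the given hash shows up in an imported block,
    // or give up after roughly 16 new blocks (mirroring the error message above).
    public static void waitForTx(Ethereum ethereum, byte[] txHash) throws InterruptedException {
        CountDownLatch included = new CountDownLatch(1);
        AtomicLong blocksSeen = new AtomicLong();

        ethereum.addListener(new EthereumListenerAdapter() {
            @Override
            public void onBlock(Block block, List<TransactionReceipt> receipts) {
                for (TransactionReceipt receipt : receipts) {
                    if (Arrays.equals(receipt.getTransaction().getHash(), txHash)) {
                        included.countDown();   // transaction made it into a block
                    }
                }
                blocksSeen.incrementAndGet();
            }
        });

        while (included.getCount() > 0 && blocksSeen.get() < 16) {
            Thread.sleep(1000);
        }
        if (included.getCount() > 0) {
            throw new RuntimeException("The transaction was not included during last 16 blocks");
        }
    }
}

So the error simply means that 16 blocks were imported after the transaction was sent without any of them containing it.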
Related
I am trying Tranquility with Druid 0.11 and Kafka. When Tranquility receives new data, it throws the following exception:
2018-01-12 18:27:34,010 [Curator-ServiceCache-0] INFO c.m.c.s.net.finagle.DiscoResolver - Updating instances for service[firehose:druid:overlord:flow-018-0000-0000] to Set(ServiceInstance{name='firehose:druid:overlord:flow-018-0000-0000', id='ea85b248-0c53-4ec1-94a6-517525f72e31', address='druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local', port=8100, sslPort=-1, payload=null, registrationTimeUTC=1515781653895, serviceType=DYNAMIC, uriSpec=null})
Jan 12, 2018 6:27:37 PM com.twitter.finagle.netty3.channel.ChannelStatsHandler exceptionCaught
WARNING: ChannelStatsHandler caught an exception
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:108)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779)
at org.jboss.netty.channel.SimpleChannelHandler.connectRequested(SimpleChannelHandler.java:306)
The worker was created by the Middle Manager:
2018-01-12T18:27:25,704 INFO [WorkerTaskMonitor] io.druid.indexing.worker.WorkerTaskMonitor - Submitting runnable for task[index_realtime_flow_2018-01-12T18:00:00.000Z_0_0]
2018-01-12T18:27:25,719 INFO [WorkerTaskMonitor] io.druid.indexing.worker.WorkerTaskMonitor - Affirmative. Running task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0]
And Tranquility seems to talk to the Overlord fine, judging by the following logs:
2018-01-12T18:27:25,268 INFO [qtp271944754-62] io.druid.indexing.overlord.TaskLockbox - Adding task[index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] to activeTasks
2018-01-12T18:27:25,272 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_realtime_flow_2018-01-12T18:00:00.000Z_0_0
2018-01-12T18:27:25,272 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Added pending task index_realtime_flow_2018-01-12T18:00:00.000Z_0_0
2018-01-12T18:27:25,279 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - No worker selection strategy set. Using default of [EqualDistributionWorkerSelectStrategy]
2018-01-12T18:27:25,294 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local:8091] to add task[index_realtime_flow_2018-01-12T18:00:00.000Z_0_0]
2018-01-12T18:27:25,334 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - Task index_realtime_flow_2018-01-12T18:00:00.000Z_0_0 switched from pending to running (on [druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local:8091])
2018-01-12T18:27:25,336 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] status changed to [RUNNING].
2018-01-12T18:27:25,747 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local:8091] wrote RUNNING status for task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] on [TaskLocation{host='null', port=-1, tlsPort=-1}]
2018-01-12T18:27:25,829 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local:8091] wrote RUNNING status for task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] on [TaskLocation{host='druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local', port=8100, tlsPort=-1}]
2018-01-12T18:27:25,829 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] location changed to [TaskLocation{host='druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local', port=8100, tlsPort=-1}].
What's wrong? I've tried a thousand things and nothing has solved it.
Thanks a lot
UnresolvedAddressException being hit by Druid broker
You have to have all the Druid cluster host information set on the servers running Tranquility.
This is because ZooKeeper only gives you the DNS names of your Druid cluster, not the IPs.
For example, on a Linux server, put your cluster information in /etc/hosts.
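As an illustration, a hypothetical /etc/hosts entry on the machine running Tranquility might look like this (the IP is a placeholder; add one line per Druid host name advertised via ZooKeeper):

# /etc/hosts on the host running Tranquility (IP is a placeholder)
10.0.0.11  druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local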
I have a Neo4j 3.2.1 multi-label, multi-property graph database with 4M nodes, 15M edges, and 4.8M distinct labels, taking up ~6GB on disk.
I imported the dataset with the "neo4j-import" tool on a Linux machine.
I can open the database and traverse the nodes, edges, and their properties just fine using the Java API. However, when I try to shut it down, it takes a very long time and finally gives me the following error in the log file:
2017-08-04 07:07:38.189+0000 INFO [o.n.k.i.f.GraphDatabaseFacadeFactory] Shutdown started
2017-08-04 07:07:38.190+0000 INFO [o.n.k.i.f.GraphDatabaseFacadeFactory] Database is now unavailable
2017-08-04 07:07:38.198+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by database shutdown [5399]: Starting check pointing...
2017-08-04 07:07:38.198+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by database shutdown [5399]: Starting store flush...
2017-08-04 07:23:35.022+0000 ERROR [o.n.k.i.t.l.c.CheckPointerImpl] Error performing check point Failed to rotate logs. Expected version: 5399, actual version: 5274, wait timeout (ms): 956815
org.neo4j.kernel.impl.store.kvstore.RotationTimeoutException: Failed to rotate logs. Expected version: 5399, actual version: 5274, wait timeout (ms): 956815
at org.neo4j.kernel.impl.store.kvstore.RotationState$Rotation.rotate(RotationState.java:79)
at org.neo4j.kernel.impl.store.kvstore.RotationState$Rotation.rotate(RotationState.java:52)
at org.neo4j.kernel.impl.store.kvstore.AbstractKeyValueStore$RotationTask.rotate(AbstractKeyValueStore.java:311)
at org.neo4j.kernel.impl.store.kvstore.AbstractKeyValueStore$RotationTask.rotate(AbstractKeyValueStore.java:288)
at org.neo4j.kernel.impl.store.counts.CountsTracker.rotate(CountsTracker.java:154)
at org.neo4j.kernel.impl.store.NeoStores.flush(NeoStores.java:242)
at org.neo4j.kernel.impl.storageengine.impl.recordstorage.RecordStorageEngine.flushAndForce(RecordStorageEngine.java:480)
at org.neo4j.kernel.impl.transaction.log.checkpoint.CheckPointerImpl.doCheckPoint(CheckPointerImpl.java:160)
at org.neo4j.kernel.impl.transaction.log.checkpoint.CheckPointerImpl.forceCheckPoint(CheckPointerImpl.java:88)
at org.neo4j.kernel.NeoStoreDataSource$3.shutdown(NeoStoreDataSource.java:794)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.shutdown(LifeSupport.java:489)
at org.neo4j.kernel.lifecycle.LifeSupport.shutdown(LifeSupport.java:206)
at org.neo4j.kernel.NeoStoreDataSource.stop(NeoStoreDataSource.java:766)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:458)
at org.neo4j.kernel.lifecycle.LifeSupport.stopInstances(LifeSupport.java:161)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:143)
at org.neo4j.kernel.impl.transaction.state.DataSourceManager.stop(DataSourceManager.java:120)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:458)
at org.neo4j.kernel.lifecycle.LifeSupport.stopInstances(LifeSupport.java:161)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:143)
at org.neo4j.kernel.lifecycle.LifeSupport.shutdown(LifeSupport.java:191)
at org.neo4j.kernel.impl.factory.ClassicCoreSPI.shutdown(ClassicCoreSPI.java:159)
at org.neo4j.kernel.impl.factory.GraphDatabaseFacade.shutdown(GraphDatabaseFacade.java:366)
at experiment.caseStudy.TestDatasetHealth.run(TestDatasetHealth.java:70)
at experiment.caseStudy.TestDatasetHealth.main(TestDatasetHealth.java:29)
2017-08-04 07:23:35.665+0000 INFO [o.n.k.i.DiagnosticsManager] --- STOPPING diagnostics START ---
2017-08-04 07:23:35.666+0000 INFO [o.n.k.i.DiagnosticsManager] --- STOPPING diagnostics END ---
In the Java program itself, I get the following exception:
Exception in thread "main" org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.kernel.NeoStoreDataSource$3#3101ffd3' failed to transition from stopped to shutting_down. Please see the attached cause exception "Failed to rotate logs. Expected version: 5399, actual version: 5274, wait timeout (ms): 956815".
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.shutdown(LifeSupport.java:497)
at org.neo4j.kernel.lifecycle.LifeSupport.shutdown(LifeSupport.java:206)
at org.neo4j.kernel.NeoStoreDataSource.stop(NeoStoreDataSource.java:766)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:458)
at org.neo4j.kernel.lifecycle.LifeSupport.stopInstances(LifeSupport.java:161)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:143)
at org.neo4j.kernel.impl.transaction.state.DataSourceManager.stop(DataSourceManager.java:120)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.stop(LifeSupport.java:458)
at org.neo4j.kernel.lifecycle.LifeSupport.stopInstances(LifeSupport.java:161)
at org.neo4j.kernel.lifecycle.LifeSupport.stop(LifeSupport.java:143)
at org.neo4j.kernel.lifecycle.LifeSupport.shutdown(LifeSupport.java:191)
at org.neo4j.kernel.impl.factory.ClassicCoreSPI.shutdown(ClassicCoreSPI.java:159)
at org.neo4j.kernel.impl.factory.GraphDatabaseFacade.shutdown(GraphDatabaseFacade.java:366)
at experiment.caseStudy.TestDatasetHealth.run(TestDatasetHealth.java:70)
at experiment.caseStudy.TestDatasetHealth.main(TestDatasetHealth.java:29)
Caused by: org.neo4j.kernel.impl.store.kvstore.RotationTimeoutException: Failed to rotate logs. Expected version: 5399, actual version: 5274, wait timeout (ms): 956815
at org.neo4j.kernel.impl.store.kvstore.RotationState$Rotation.rotate(RotationState.java:79)
at org.neo4j.kernel.impl.store.kvstore.RotationState$Rotation.rotate(RotationState.java:52)
at org.neo4j.kernel.impl.store.kvstore.AbstractKeyValueStore$RotationTask.rotate(AbstractKeyValueStore.java:311)
at org.neo4j.kernel.impl.store.kvstore.AbstractKeyValueStore$RotationTask.rotate(AbstractKeyValueStore.java:288)
at org.neo4j.kernel.impl.store.counts.CountsTracker.rotate(CountsTracker.java:154)
at org.neo4j.kernel.impl.store.NeoStores.flush(NeoStores.java:242)
at org.neo4j.kernel.impl.storageengine.impl.recordstorage.RecordStorageEngine.flushAndForce(RecordStorageEngine.java:480)
at org.neo4j.kernel.impl.transaction.log.checkpoint.CheckPointerImpl.doCheckPoint(CheckPointerImpl.java:160)
at org.neo4j.kernel.impl.transaction.log.checkpoint.CheckPointerImpl.forceCheckPoint(CheckPointerImpl.java:88)
at org.neo4j.kernel.NeoStoreDataSource$3.shutdown(NeoStoreDataSource.java:794)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.shutdown(LifeSupport.java:489)
In fact, in the Java program I only read data and never write anything to the dataset.
Furthermore, opening the database with the following line of code takes 80 seconds on a 3.1 GHz Core i7 MacBook with 16 GB of RAM and 10 GB of heap given to the JVM:
GraphDatabaseService dataGraph = new GraphDatabaseFactory().newEmbeddedDatabase(storeDir);
Is it normal for a dataset of the mentioned size to take this long to open?
Could you please guide me on how I can repair the dataset so that it shuts down cleanly?
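For reference, here is a minimal sketch of the embedded open/shutdown pattern in question (the store path is a placeholder; the shutdown hook is the pattern the Neo4j manual suggests so the store is closed even when the JVM is interrupted):

import java.io.File;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class OpenAndClose {
    public static void main(String[] args) {
        // Placeholder path: the directory produced by neo4j-import.
        File storeDir = new File("/path/to/graph.db");

        GraphDatabaseService dataGraph = new GraphDatabaseFactory().newEmbeddedDatabase(storeDir);

        // Ensure the database is shut down cleanly, even on Ctrl-C or kill.
        Runtime.getRuntime().addShutdownHook(new Thread(dataGraph::shutdown));

        // ... read-only traversals go here ...

        dataGraph.shutdown();
    }
}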
I'm currently developing a custom handler to deliver Oracle change logs.
When an error occurs, I can normally throw a RuntimeException or return Status.ABEND; OGG then logs the error and stops the process.
The following code works well when operationAdded() fails (i.e., the Extract process reports abend, and when the Extract restarts after the error, the operations of the whole failed transaction are resent to the handler):
@Override
public Status operationAdded(DsEvent e, DsTransaction tx,
        DsOperation dsOperation) {
    Status status = super.operationAdded(e, tx, dsOperation);
    ...
    // Throwing here stops the Extract as expected:
    //throw new RuntimeException("op add runtime error");
    return status;
}
However, when an error occurs in the transactionCommit() function, OGG doesn't work as expected: neither throwing a RuntimeException nor returning Status.ABEND stops the Extract. OGG just keeps working as if nothing happened. (Code below.)
@Override
public Status transactionCommit(DsEvent e, DsTransaction tx) {
    super.transactionCommit(e, tx);
    Status status = sendEvents();
    handlerProperties.totalTxns++;
    // Neither of these stops the Extract:
    //throw new RuntimeException("tx ci runtime error");
    return Status.ABEND;
}
I tried killing and restarting the Extract process. The failed transactions were not resent to the handler. It seems that all the failed transaction data was lost!
The following are the logs when returning Status.ABEND in transactionCommit():
...
DEBUG [main] (AbstractHandler.java:509) - Event: handler=ggdatahub, transactionCommit ( Commit transaction ) DsTransaction [ops=1, buffered=1, state=BEGIN, start=2015-08-21 20:04:25.842275, end=2015-08-21 20:04:25.842275]
WARN [main] (DsEventManager.java:231) - Error sending event to handler: status=ABEND, event=Commit transaction, handler=ggdatahub
Exception in thread "main" com.goldengate.atg.util.GGException: Unable to commit transaction, STATUS=ABEND
at com.goldengate.atg.datasource.UserExitDataSource.commitActiveTransaction(UserExitDataSource.java:1392)
at com.goldengate.atg.datasource.UserExitDataSource.commitTx(UserExitDataSource.java:1326)
Error occured in javawriter.c[752]:
***********************************************************************
Exception received committing transaction: com.goldengate.atg.util.GGException: Unable to commit transaction, STATUS=ABEND
DEBUG [main] (UserExitDataSource.java:504) - (JNI) C-user-exit checkpoint event
DEBUG [main] (UserExitDataSource.java:1364) - UserExitDataSource.CommitActiveTransaction: Same transaction committed more than once (possibly due to commit-on-checkpoint).
DEBUG [main] (UserExitDataSource.java:516) - UserExitDataSource.userExitCheckpoint: incrementing the flush counter
DEBUG [main] (PendingOpGroup.java:315) - now ready to checkpoint? false (was ready? false): {pendingOps=1, groupSize=0, timer=0:00:00.000 [total = 0 ms ]}
DEBUG [main] (UserExitDataSource.java:504) - (JNI) C-user-exit checkpoint event
DEBUG [main] (UserExitDataSource.java:1364) - UserExitDataSource.CommitActiveTransaction: Same transaction committed more than once (possibly due to commit-on-checkpoint).
DEBUG [main] (UserExitDataSource.java:516) - UserExitDataSource.userExitCheckpoint: incrementing the flush counter
DEBUG [pool-1-thread-1] (AbstractDataSource.java:737) - [2] getStatusReport: Mon Aug 24 10:51:14 CST 2015
DEBUG [Thread-1] (UserExitDataSource.java:1601) - UserExitDataSource closing, #1 of class=UserExitDataSource
DEBUG [main] (PendingOpGroup.java:315) - now ready to checkpoint? false (was ready? false): {pendingOps=3, groupSize=0, timer=0:00:00.000 [total = 0 ms ]}
DEBUG [Thread-1] (UserExitDataSource.java:1608) - Shutting down data source; attempting a final checkpoint.
INFO [pool-1-thread-1] (AbstractDataSource.java:730) - Memory at Status : Max: 455.00 MB, Total: 60.50 MB, Free: 27.54 MB, Used: 32.96 MB
DEBUG [pool-1-thread-1] (UserExitDataSource.java:1637) - time spent checkpointing: 0:00:00.000 [total = 0 ms ]
DEBUG [Thread-1] (UserExitDataSource.java:1668) - doCheckpoint() called
INFO [pool-1-thread-1] (AbstractDataSource.java:980) - Status report: Mon Aug 24 10:51:14 CST 2015
*************************************************
Status Report for UserExit
*************************************************
Total elapsed time: 2 days 14:47:06.139 [total = 226026 sec = 3767 min = 62 hr ] => Total time since first event
Event processing time: 0:00:12.692 [total = 12 sec ] => Time spent sending msgs (max: 4795 ms)
Metadata process time: 0:00:02.159 [total = 2 sec ] => Time spent receiving metadata (1 tables, 3 columns)
Operations Received/Sent: 3 / 3
Rate (overall): 0 op/s (peak: 0 op/s)
(per event): 0 op/s
Transactions Received/Sent: 2 / 0
Rate (overall): 0 tx/s (peak: 0 tx/s)
(per event): 0 tx/s
3 records processed as of Mon Aug 24 10:51:14 CST 2015 (rate 0/sec, delta 3)
*************************************************
Does anybody know how to fix this? Thanks in advance!
For others who may encounter this problem:
It turns out to be a bug...
I switched from Version 12.1.2.1.4 20470586 OGGCORE_12.1.2.1.0OGGBP_PLATFORMS_150303.1209 to Version 11.2.1.0.1 OGGCORE_11.2.1.0.1_PLATFORMS_120423.0230. Everything works fine now.
I'm testing Couchbase Server 2.5. I have a cluster with 7 nodes and 3 replicas. Under normal conditions, the system works fine.
But I failed with this test case:
The Couchbase cluster is serving 40,000 ops and I stop the couchbase service on one server, so one node is down. After that, the entire cluster's performance drops painfully; it can only serve under 1,000 ops. When I trigger fail-over, the entire cluster returns to healthy.
I thought that when a node goes down, only a portion of the requests would be affected. Is that right?
And in reality, does one node going down have a big impact on the entire cluster?
Updated:
I wrote a load-test tool using spymemcached. The tool creates multiple threads that connect to the Couchbase cluster. Each thread sets a key and then immediately gets it back to check it; on success it continues with another key, and on failure it retries the Set/Get and skips the key after 5 failed attempts.
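For illustration, the per-thread loop described above might look roughly like this with spymemcached (a simplified sketch, not the actual tool's code; the retry count of 5 matches the description):

import net.spy.memcached.MemcachedClient;

public class SetGetCheck {

    // Set a key, read it straight back, and retry up to 5 times before skipping it.
    static boolean setAndVerify(MemcachedClient client, String key, Object value) {
        for (int attempt = 0; attempt < 5; attempt++) {
            try {
                boolean stored = client.set(key, 0, value).get();  // block until the set completes
                if (stored && value.equals(client.get(key))) {     // read the key back immediately
                    return true;
                }
            } catch (Exception e) {
                // timeout or node failure: fall through and retry
            }
        }
        return false;  // give up on this key after 5 failed attempts
    }
}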
This is the log for a key whose Set/Get fails:
2014-04-16 16:22:20.405 INFO net.spy.memcached.MemcachedConnection: Reconnection due to exception handling a memcached operation on {QA sa=/10.0.0.23:11234, #Rops=2, #Wops=0, #iq=0, topRop=Cmd: 1 Opaque: 2660829 Key: test_key_2681412 Cas: 0 Exp: 0 Flags: 0 Data Length: 800, topWop=null, toWrite=0, interested=1}. This may be due to an authentication failure.
OperationException: SERVER: Internal error
at net.spy.memcached.protocol.BaseOperationImpl.handleError(BaseOperationImpl.java:192)
at net.spy.memcached.protocol.binary.OperationImpl.getStatusForErrorCode(OperationImpl.java:244)
at net.spy.memcached.protocol.binary.OperationImpl.finishedPayload(OperationImpl.java:201)
at net.spy.memcached.protocol.binary.OperationImpl.readPayloadFromBuffer(OperationImpl.java:196)
at net.spy.memcached.protocol.binary.OperationImpl.readFromBuffer(OperationImpl.java:139)
at net.spy.memcached.MemcachedConnection.readBufferAndLogMetrics(MemcachedConnection.java:825)
at net.spy.memcached.MemcachedConnection.handleReads(MemcachedConnection.java:804)
at net.spy.memcached.MemcachedConnection.handleReadsAndWrites(MemcachedConnection.java:684)
at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:647)
at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:418)
at net.spy.memcached.MemcachedConnection.run(MemcachedConnection.java:1400)
2014-04-16 16:22:20.405 WARN net.spy.memcached.MemcachedConnection: Closing, and reopening {QA sa=/10.0.0.23:11234, #Rops=2, #Wops=0, #iq=0, topRop=Cmd: 1 Opaque: 2660829 Key: test_key_2681412 Cas: 0 Exp: 0 Flags: 0 Data Length: 800, topWop=null, toWrite=0, interested=1}, attempt 0.
2014-04-16 16:22:20.406 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 1 Opaque: 2660829 Key: test_key_2681412 Cas: 0 Exp: 0 Flags: 0 Data Length: 800
2014-04-16 16:22:20.406 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 0 Opaque: 2660830 Key: test_key_2681412
Cancelled
2014-04-16 16:22:20.407 ERROR net.spy.memcached.protocol.binary.StoreOperationImpl: Error: Internal error
2014-04-16 16:22:20.407 INFO net.spy.memcached.MemcachedConnection: Reconnection due to exception handling a memcached operation on {QA sa=/10.0.0.24:11234, #Rops=2, #Wops=0, #iq=0, topRop=Cmd: 1 Opaque: 2660831 Key: test_key_2681412 Cas: 0 Exp: 0 Flags: 0 Data Length: 800, topWop=null, toWrite=0, interested=1}. This may be due to an authentication failure.
OperationException: SERVER: Internal error
at net.spy.memcached.protocol.BaseOperationImpl.handleError(BaseOperationImpl.java:192)
at net.spy.memcached.protocol.binary.OperationImpl.getStatusForErrorCode(OperationImpl.java:244)
at net.spy.memcached.protocol.binary.OperationImpl.finishedPayload(OperationImpl.java:201)
at net.spy.memcached.protocol.binary.OperationImpl.readPayloadFromBuffer(OperationImpl.java:196)
at net.spy.memcached.protocol.binary.OperationImpl.readFromBuffer(OperationImpl.java:139)
at net.spy.memcached.MemcachedConnection.readBufferAndLogMetrics(MemcachedConnection.java:825)
at net.spy.memcached.MemcachedConnection.handleReads(MemcachedConnection.java:804)
at net.spy.memcached.MemcachedConnection.handleReadsAndWrites(MemcachedConnection.java:684)
at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:647)
at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:418)
at net.spy.memcached.MemcachedConnection.run(MemcachedConnection.java:1400)
2014-04-16 16:22:20.407 WARN net.spy.memcached.MemcachedConnection: Closing, and reopening {QA sa=/10.0.0.24:11234, #Rops=2, #Wops=0, #iq=0, topRop=Cmd: 1 Opaque: 2660831 Key: test_key_2681412 Cas: 0 Exp: 0 Flags: 0 Data Length: 800, topWop=null, toWrite=0, interested=1}, attempt 0.
2014-04-16 16:22:20.408 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 1 Opaque: 2660831 Key: test_key_2681412 Cas: 0 Exp: 0 Flags: 0 Data Length: 800
2014-04-16 16:22:20.408 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 0 Opaque: 2660832 Key: test_key_2681412
Cancelled
You should find that 6/7 (i.e. ~85%) of your operations continue to run at the same performance. However, the ~15% of operations directed at the vBuckets owned by the now-downed node will never complete and will likely time out, so depending on how your application handles these timeouts you may see a greater performance drop overall.
How are you benchmarking / measuring the performance?
Update: OP's extra details
I wrote a load-test tool using spymemcached. The tool creates multiple threads that connect to the Couchbase cluster. Each thread sets a key and then immediately gets it back to check it; on success it continues with another key, and on failure it retries the Set/Get and skips the key after 5 failed attempts.
The Java SDK is designed to make use of async operations for maximum performance, and this is particularly true when the cluster is degraded and some operations will time out. I'd suggest starting by running in a single thread but using Futures to handle the get after the set. For example:
client.set("key", document).addListener(new OperationCompletionListener() {
#Override
public void onComplete(OperationFuture<?> future) throws Exception {
System.out.println("I'm done!");
}
});
This is an extract from the Understanding and Using Asynchronous Operations section of the Java Developer guide.
There's essentially no reason why, given the right code, your performance with 85% of nodes up shouldn't be close to 85% of the maximum during a short downtime.
Note that if a node is down for a long time, the replication queues on the other nodes will start to back up, and that can impact performance; hence the recommendation to use auto-failover / rebalance to get back to 100% active vBuckets and re-create replicas, so that any further node failures don't cause data loss.
I have a Camel process (that I run from the command line) whose route is similar to this one:
public class ProfilerRoute extends RouteBuilder {

    @Override
    public void configure() {
        from("kestrel://my_queue?concurrentConsumers=10&waitTimeMs=500")
            .unmarshal().json(JsonLibrary.Jackson, MyClass.class)
            .process(new Processor() {
                @Override
                public void process(Exchange exchange) throws Exception {
                    /* Do the real processing [...] */
                    exchange.getIn().setBody(null);
                }
            })
            .filter(body().isNotNull())
            .to("file://nowhere");
    }
}
Note that I'm discarding every message after processing it, since this is a pure consumer process.
The process runs on its own; no other process is writing to the queue, and the queue is empty.
However, when I try to kill the process, it does not die.
From the logs I see the following lines (indented for readability):
[ Thread-1] MainSupport$HangupInterceptor INFO
Received hang up - stopping the main instance.
[ Thread-1] MainSupport INFO
Apache Camel stopping
[ Thread-1] GuiceCamelContext INFO
Apache Camel 2.11.1 (CamelContext: camel-1)
is shutting down
[ Thread-1] DefaultShutdownStrategy INFO
Starting to graceful shutdown 1 routes
(timeout 300 seconds)
[l-1) thread #12 - ShutdownTask] DefaultShutdownStrategy INFO
Waiting as there are still 10 inflight and
pending exchanges to complete,
timeout in 300 seconds.
And so on, with the timeout decreasing. When the timeout expires, I get the following in the logs:
[l-1) thread #12 - ShutdownTask] DefaultShutdownStrategy INFO
Waiting as there are still 10 inflight and
pending exchanges to complete,
timeout in 1 seconds.
[ Thread-1] DefaultShutdownStrategy WARN
Timeout occurred.
Now forcing the routes to be shutdown now.
[l-1) thread #12 - ShutdownTask] DefaultShutdownStrategy WARN
Interrupted while waiting during graceful
shutdown, will force shutdown now.
[ Thread-1] KestrelConsumer INFO
Stopping consumer for
kestrel://localhost:22133/my_queue?concurrentConsumers=10&waitTimeMs=500
But the process will not die anyway (even if I try to kill it at this point).
I would have expected that after the waiting time all the threads would realise that a shutdown is going on and stop.
I've read the "Graceful Shutdown" document, however I could not find something that explains the behaviour I'm facing.
As you can see from logs I'm using the 2.11.1 version of Apache Camel.
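As a side note (not a fix for the hang itself), the 300 seconds seen in the logs is Camel's graceful-shutdown timeout, which can be shortened from the route builder; a minimal sketch:

import org.apache.camel.builder.RouteBuilder;

public class ProfilerRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Reduce the graceful-shutdown timeout from 300s (seen in the logs) to 30s.
        getContext().getShutdownStrategy().setTimeout(30);

        // ... existing route definition ...
    }
}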
UPDATE: According to Claus Ibsen, it might be a problem with the camel-kestrel component. I filed an issue on the ASF JIRA for Camel: CAMEL-6632
This is a bug in camel-kestrel, and a JIRA ticket has been logged to fix this: https://issues.apache.org/jira/browse/CAMEL-6632