ElasticSearch LockObtainFailedException on restoring index from s3 repository - java

I keep getting the following exception when trying to restore a snapshot from an S3 repository using the cloud_aws plugin:
[WARN ][cluster.action.shard ] [Landslide] [index_name][0] received shard failed for target shard [[index_name][0], node[U2w_femBQYO3f5TuOI5daw], [P], v[109], restoring[elasticsearch:backup_name], s[INITIALIZING], a[id=gJBMpmcVT6G132-h1ONGgw], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-09-01T13:01:53.147Z], details[failed to create shard, failure ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [index_name][0], timed out after 5000ms]; ]]], indexUUID [7quZdjJqRRmhzr7WBXqlgQ], message [failed to create shard], failure [ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [index_name][0], timed out after 5000ms]; ]
[index_name][[index_name][0]] ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [index_name][0], timed out after 5000ms];
at org.elasticsearch.index.IndexService.createShard(IndexService.java:389)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:601)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:501)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:166)
at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:610)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.store.LockObtainFailedException: Can't lock shard [index_name][0], timed out after 5000ms
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:609)
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:537)
at org.elasticsearch.index.IndexService.createShard(IndexService.java:306)
... 10 more
ES version 2.4.0.
There is no other operation going on on the Elasticsearch server. Also note that the restore operation does complete despite the above exception, but it takes very long to restore an index of less than 1 GB over a 5-6 Mbps connection. Any suggestions?
EDIT: It seems to be a problem with S3 buckets. Finding no particular solution to this, I have moved on to an FS repository.
I'd still be interested in a solution to this problem.
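For reference, registering the FS repository and running the restore over the REST API can be done with nothing more than plain java.net; this is a minimal sketch (host, repository, snapshot, index and path names are placeholders, and the location must be listed under path.repo in elasticsearch.yml on every node):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FsRestoreSketch {
    static final String ES = "http://localhost:9200"; // placeholder host

    public static void main(String[] args) throws Exception {
        // 1) Register a shared-filesystem repository.
        call("PUT", ES + "/_snapshot/fs_backup",
             "{\"type\":\"fs\",\"settings\":{\"location\":\"/mnt/es_backups\",\"compress\":true}}");

        // 2) Restore a snapshot from it, waiting for completion.
        call("POST", ES + "/_snapshot/fs_backup/backup_name/_restore?wait_for_completion=true",
             "{\"indices\":\"index_name\"}");
    }

    static void call(String method, String url, String body) throws Exception {
        HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
        c.setRequestMethod(method);
        c.setDoOutput(true);
        c.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = c.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(method + " " + url + " -> " + c.getResponseCode());
        c.disconnect();
    }
}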

Related

How to deal with the H2 database's inability to handle interrupts

I was occasionally seeing that my embedded H2 database appeared to get corrupted. In particular, I had amended an ExecutorService so that if a task took too long it would cancel the task. The task would be cancelled okay, but subsequent database access then failed with exceptions such as
23/07/2019 14.23.31:BST:DeleteDuplicatesController:start:SEVERE: commit failed
org.hibernate.TransactionException: commit failed
at org.hibernate.engine.transaction.spi.AbstractTransactionImpl.commit(AbstractTransactionImpl.java:187)
at com.jthink.songkong.db.ReportCache.save(ReportCache.java:46)
at com.jthink.songkong.reports.AbstractReport.setReportDatabaseObject(AbstractReport.java:365)
at com.jthink.songkong.reports.DeleteDuplicatesReport.setReportDatabaseObject(DeleteDuplicatesReport.java:333)
at com.jthink.songkong.reports.DeleteDuplicatesReport.closeReport(DeleteDuplicatesReport.java:377)
at com.jthink.songkong.analyse.toplevelanalyzer.DeleteDuplicatesController.deleteAnyDups(DeleteDuplicatesController.java:606)
at com.jthink.songkong.analyse.toplevelanalyzer.DeleteDuplicatesController.start(DeleteDuplicatesController.java:665)
at com.jthink.songkong.ui.swingworker.DeleteDuplicates.doInBackground(DeleteDuplicates.java:43)
at com.jthink.songkong.ui.swingworker.DeleteDuplicates.doInBackground(DeleteDuplicates.java:20)
at javax.swing.SwingWorker$1.call(SwingWorker.java:295)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at javax.swing.SwingWorker.run(SwingWorker.java:334)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.hibernate.TransactionException: unable to commit against JDBC connection
at org.hibernate.engine.transaction.internal.jdbc.JdbcTransaction.doCommit(JdbcTransaction.java:116)
at org.hibernate.engine.transaction.spi.AbstractTransactionImpl.commit(AbstractTransactionImpl.java:180)
... 14 more
Caused by: org.h2.jdbc.JdbcSQLNonTransientException: General error: "java.lang.IllegalStateException: Reading from nio:C:/Users/Paul/AppData/Roaming/SongKong/Database/Database.mv.db failed; file length -1 read length 4096 at 1541494 [1.4.199/1]"; SQL statement:
COMMIT [50000-199]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:502)
at org.h2.message.DbException.getJdbcSQLException(DbException.java:427)
at org.h2.message.DbException.get(DbException.java:194)
at org.h2.message.DbException.convert(DbException.java:347)
at org.h2.command.Command.executeUpdate(Command.java:280)
at org.h2.jdbc.JdbcConnection.commit(JdbcConnection.java:542)
at com.mchange.v2.c3p0.impl.NewProxyConnection.commit(NewProxyConnection.java:1284)
at org.hibernate.engine.transaction.internal.jdbc.JdbcTransaction.doCommit(JdbcTransaction.java:112)
... 15 more
Caused by: java.lang.IllegalStateException: Reading from nio:C:/Users/Paul/AppData/Roaming/SongKong/Database/Database.mv.db failed; file length -1 read length 4096 at 1541494 [1.4.199/1]
at org.h2.mvstore.DataUtils.newIllegalStateException(DataUtils.java:883)
at org.h2.mvstore.DataUtils.readFully(DataUtils.java:420)
at org.h2.mvstore.FileStore.readFully(FileStore.java:98)
at org.h2.mvstore.MVStore.readBufferForPage(MVStore.java:1048)
at org.h2.mvstore.MVStore.readPage(MVStore.java:2186)
at org.h2.mvstore.MVMap.readPage(MVMap.java:554)
at org.h2.mvstore.Page$NonLeaf.getChildPage(Page.java:1086)
at org.h2.mvstore.Page.get(Page.java:221)
at org.h2.mvstore.MVMap.get(MVMap.java:402)
at org.h2.mvstore.MVMap.get(MVMap.java:389)
at org.h2.mvstore.MVStore.getMapName(MVStore.java:2737)
at org.h2.mvstore.MVStore.renameMap(MVStore.java:2650)
at org.h2.mvstore.tx.TransactionStore.commit(TransactionStore.java:453)
at org.h2.mvstore.tx.Transaction.commit(Transaction.java:389)
at org.h2.engine.Session.commit(Session.java:691)
at org.h2.command.dml.TransactionCommand.update(TransactionCommand.java:46)
at org.h2.command.CommandContainer.update(CommandContainer.java:133)
at org.h2.command.Command.executeUpdate(Command.java:267)
... 18 more
Caused by: java.nio.channels.ClosedChannelException
at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110)
at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:721)
at org.h2.store.fs.FileNio.read(FilePathNio.java:74)
at org.h2.mvstore.DataUtils.readFully(DataUtils.java:406)
... 34 more
23/07/2019 14.23.31:BST:Errors:addError:SEVERE: Adding Error:commit failed
I have since found this issue.
Basically, if using H2 in embedded mode and it receives an interrupt, then all subsequent access fails until the pool is closed and reopened. In the example I give, of a process having to be cancelled because it appears to be stuck, there is no alternative to interrupting it.
I also have another case where the controller thread does not usually do any database work itself, so I was struggling to see why an interrupt on it would cause database errors. I have now worked out that the issue is that I am using an ExecutorService with a fixed-size BlockingQueue (so that we don't build up a big queue in memory); if the queue gets full, the new task is actually executed by the controller thread (because of CallerRunsPolicy), so the controller thread can be making calls to the database after all, as shown in the sketch below.
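A minimal sketch of that setup (pool sizes and task bodies are illustrative): with a bounded queue and CallerRunsPolicy, once the queue fills up the submitting controller thread runs the task itself, so it can end up doing the database work and then being interrupted mid-transaction.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CallerRunsSketch {
    public static void main(String[] args) {
        // Fixed-size pool with a small bounded queue so tasks don't pile up in memory.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(10),
                new ThreadPoolExecutor.CallerRunsPolicy());

        for (int i = 0; i < 100; i++) {
            final int task = i;
            // Once the queue is full, CallerRunsPolicy makes the submitting
            // (controller) thread execute the task, so that thread can end up
            // doing the database work itself.
            pool.execute(() -> {
                System.out.println("task " + task + " on " + Thread.currentThread().getName());
                // ... Hibernate/H2 work would happen here ...
            });
        }
        pool.shutdown();
    }
}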
I am using H2 with Hibernate, and in both cases calling the following immediately after the interrupt
HibernateUtil.closeFactory();
seems to solve the issue. However, I guess this means that any other threads with open Hibernate sessions will be broken, but at least newly opened sessions will be okay. So I'm not particularly happy with this workaround; any other ideas?
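For illustration, such a close-and-rebuild helper might look roughly like this (a hypothetical sketch; the real HibernateUtil in my code differs in detail):

import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

// Hypothetical sketch of a close-and-rebuild helper around the SessionFactory.
public final class HibernateUtil {
    private static SessionFactory factory;

    public static synchronized SessionFactory getFactory() {
        if (factory == null || factory.isClosed()) {
            // Legacy bootstrap: reads hibernate.cfg.xml from the classpath.
            factory = new Configuration().configure().buildSessionFactory();
        }
        return factory;
    }

    public static synchronized void closeFactory() {
        if (factory != null && !factory.isClosed()) {
            factory.close();   // existing sessions/connections become unusable
            factory = null;    // rebuilt lazily on next getFactory() call
        }
    }
}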
Using H2 as a server is not a solution, since the whole point of H2 was an embedded database self-contained within the application.
Although not properly documented, using the async protocol allows a connection to be interrupted without breaking all other connections.
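Assuming this refers to H2's async: file store, the only change needed is the prefix in the JDBC URL; a minimal sketch (path and credentials are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;

public class H2AsyncSketch {
    public static void main(String[] args) throws Exception {
        // Same database file as before, but opened through the async: file store
        // instead of the default nio: one, so an interrupted thread does not
        // close the underlying channel for all other users of the database.
        String url = "jdbc:h2:async:C:/Users/Paul/AppData/Roaming/SongKong/Database/Database";
        try (Connection conn = DriverManager.getConnection(url, "sa", "")) {
            System.out.println("connected: " + !conn.isClosed());
        }
    }
}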

In an Azure event hub created with 2 partitions, the event hub consumer tries to connect to the 3rd and 4th partitions

I have set up an Azure event hub with 2 partitions. I am using the Microsoft-published code (https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-java-get-started-send) to consume messages from the event hub. The consumer tries to connect to the 3rd and 4th partitions even though they don't exist.
Error message:
2019-06-14 09:34:34,998 [ault|host]-1-13] - ERROR PartitionPump - host host: 3: PartitionReceiver creation failed
java.util.concurrent.CompletionException: com.microsoft.azure.eventhubs.EventHubException: The specified partition is invalid for an EventHub partition sender or receiver. It should be between 0 and 1. Parameter name: PartitionId TrackingId:11c0c687fa3146ffaa97749d23abeae5_G27, SystemTracker:gateway5, Timestamp:2019-06-14T04:04:34, errorContext[NS: dev-sams-iot.servicebus.windows.net, PATH: test/ConsumerGroups/$Default/Partitions/3, REFERENCE_ID: 6d2875_ae5_G27_1560485074529]
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:647)
at java.util.concurrent.CompletableFuture$UniAccept.tryFire$$$capture(CompletableFuture.java:632)
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at com.microsoft.azure.eventhubs.impl.ExceptionUtil.completeExceptionally(ExceptionUtil.java:104)
at com.microsoft.azure.eventhubs.impl.MessageReceiver.cancelOpen(MessageReceiver.java:361)
at com.microsoft.azure.eventhubs.impl.MessageReceiver.onOpenComplete(MessageReceiver.java:351)
at com.microsoft.azure.eventhubs.impl.MessageReceiver.onError(MessageReceiver.java:418)
at com.microsoft.azure.eventhubs.impl.MessageReceiver.onClose(MessageReceiver.java:740)
at com.microsoft.azure.eventhubs.impl.BaseLinkHandler.processOnClose(BaseLinkHandler.java:74)
at com.microsoft.azure.eventhubs.impl.BaseLinkHandler.handleRemoteLinkClosed(BaseLinkHandler.java:113)
at com.microsoft.azure.eventhubs.impl.BaseLinkHandler.onLinkRemoteClose(BaseLinkHandler.java:48)
at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:176)
at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:291)
at com.microsoft.azure.eventhubs.impl.MessagingFactory$RunReactor.run(MessagingFactory.java:507)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
at java.util.concurrent.FutureTask.run(FutureTask.java)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
When creating an event hub, a storage account is also needed. Details such as the partition count and the consumer's progress through each partition are kept in that storage account.
The above happened because I initially created an event hub with four partitions, and its details were saved in the storage account. I then deleted that event hub and created another one with two partitions, and both event hubs used the same storage account.
Deleting the contents of the storage account solved the issue.
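If deleting everything in the account feels too heavy-handed, a narrower option is to drop just the lease/checkpoint container that the consumer host uses. A sketch with the azure-storage SDK (the connection string and container name are placeholders; use whatever was passed to EventProcessorHost):

import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.blob.CloudBlobClient;
import com.microsoft.azure.storage.blob.CloudBlobContainer;

public class ClearEventHubLeases {
    public static void main(String[] args) throws Exception {
        // Placeholders: the storage connection string and the container name
        // used by the event hub consumer host.
        String connectionString = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...";
        String leaseContainer = "eventhub-leases";

        CloudStorageAccount account = CloudStorageAccount.parse(connectionString);
        CloudBlobClient blobClient = account.createCloudBlobClient();
        CloudBlobContainer container = blobClient.getContainerReference(leaseContainer);

        // Dropping the container removes the stale per-partition lease/checkpoint
        // blobs left over from the old 4-partition hub; the host recreates them
        // for the current partition count on next start.
        boolean deleted = container.deleteIfExists();
        System.out.println("lease container deleted: " + deleted);
    }
}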

Hadoop Container Cleanup Timeout on lost node

I am working on a multi-node cluster with four slave nodes, named slave01, slave02, slave03 and slave04, and one master node named master.
When I pull out the network cable during a map task, Hadoop waits 100 seconds for a status update (due to a property whose value is 100000).
After that I can see that the map task fails and Hadoop starts container cleanup, which takes more than 10 minutes, and it does not reschedule the failed task anywhere. Then I get a NoRouteToHostException from the application master to the lost node, after which the task gets scheduled on another node.
I want to reduce the time spent retrying the container cleanup so that the task can be rescheduled on another node right after the map task times out.
Please help me understand how I can do that through configuration.
I am attaching the application master log; I removed slave01 during a map task, and in this case the number of running reduce tasks is 1.
AttemptID:attempt_1463201584280_0004_m_000002_0 Timed out after 100 secs
Container released on a lost node
cleanup failed for container container_1463201584280_0004_01_000004 : java.net.NoRouteToHostException: No Route to Host from slave02/172.31.132.107 to slave01:58838 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost
at sun.reflect.GeneratedConstructorAccessor51.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:757)
at org.apache.hadoop.ipc.Client.call(Client.java:1473)
at org.apache.hadoop.ipc.Client.call(Client.java:1400)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy37.stopContainers(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:110)
at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy38.stopContainers(Unknown Source)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:206)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:373)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522)
at org.apache.hadoop.ipc.Client.call(Client.java:1439)
... 15 more
This happened because of a bug in Hadoop 2.6.3 in which the connection retry is done at two levels (IPC and YARN). Try using 2.6.4 or applying the patch; that should resolve it.
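If upgrading is not immediately possible, the relevant retry windows can be tightened in the job configuration. A hedged sketch (these are standard Hadoop/YARN property names, but the values are illustrative and should be verified against your version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShortRetryJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Fail a hung map task sooner (default is 600000 ms; the cluster in the
        // question already uses 100000 ms).
        conf.setInt("mapreduce.task.timeout", 100000);

        // Give up on an unreachable NodeManager faster when the AM tries to
        // stop containers on the lost node (values are illustrative).
        conf.setInt("yarn.client.nodemanager-connect.max-wait-ms", 60000);
        conf.setInt("yarn.client.nodemanager-connect.retry-interval-ms", 10000);

        // Fewer low-level IPC retries on connect timeouts (illustrative).
        conf.setInt("ipc.client.connect.max.retries.on.timeouts", 3);

        Job job = Job.getInstance(conf, "short-retry-job");
        // ... set mapper/reducer, input/output paths, then job.waitForCompletion(true)
    }
}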

org.elasticsearch.common.jackson.core.JsonGenerationException: No current event to copy

We are using Apache Flume for indexing data into Elasticsearch.
Recently we have been facing this exception.
org.elasticsearch.common.jackson.core.JsonGenerationException: No current event to copy
at org.elasticsearch.common.jackson.core.JsonGenerator._reportError(JsonGenerator.java:1487)
at org.elasticsearch.common.jackson.core.JsonGenerator.copyCurrentStructure(JsonGenerator.java:1390)
at org.elasticsearch.common.xcontent.json.JsonXContentGenerator.copyCurrentStructure(JsonXContentGenerator.java:332)
at org.elasticsearch.common.xcontent.XContentBuilder.copyCurrentStructure(XContentBuilder.java:1105)
at org.apache.flume.sink.elasticsearch.ContentBuilderUtil.addComplexField(ContentBuilderUtil.java:63)
at org.apache.flume.sink.elasticsearch.ContentBuilderUtil.appendField(ContentBuilderUtil.java:47)
at org.jai.flume.sinks.elasticsearch.serializer.ElasticSearchJsonBodyEventSerializer.appendHeaders(ElasticSearchJsonBodyEventSerializer.java:48)
at org.jai.flume.sinks.elasticsearch.serializer.ElasticSearchJsonBodyEventSerializer.getContentBuilder(ElasticSearchJsonBodyEventSerializer.java:35)
at org.apache.flume.sink.elasticsearch.client.ElasticSearchTransportClient.addEvent(ElasticSearchTransportClient.java:164)
at org.css.cssElasticsearchSink.process(cssElasticsearchSink.java:121)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:745)
2015-10-15 12:32:36,402 ERROR org.apache.flume.SinkRunner: Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: Failed to commit transaction. Transaction rolled back.
at org.css.cssElasticsearchSink.process(cssElasticsearchSink.java:166)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.common.jackson.core.JsonGenerationException: No current event to copy
at org.elasticsearch.common.jackson.core.JsonGenerator._reportError(JsonGenerator.java:1487)
at org.elasticsearch.common.jackson.core.JsonGenerator.copyCurrentStructure(JsonGenerator.java:1390)
at org.elasticsearch.common.xcontent.json.JsonXContentGenerator.copyCurrentStructure(JsonXContentGenerator.java:332)
at org.elasticsearch.common.xcontent.XContentBuilder.copyCurrentStructure(XContentBuilder.java:1105)
at org.apache.flume.sink.elasticsearch.ContentBuilderUtil.addComplexField(ContentBuilderUtil.java:63)
at org.apache.flume.sink.elasticsearch.ContentBuilderUtil.appendField(ContentBuilderUtil.java:47)
at org.jai.flume.sinks.elasticsearch.serializer.ElasticSearchJsonBodyEventSerializer.appendHeaders(ElasticSearchJsonBodyEventSerializer.java:48)
at org.jai.flume.sinks.elasticsearch.serializer.ElasticSearchJsonBodyEventSerializer.getContentBuilder(ElasticSearchJsonBodyEventSerializer.java:35)
at org.apache.flume.sink.elasticsearch.client.ElasticSearchTransportClient.addEvent(ElasticSearchTransportClient.java:164)
at org.css.cssElasticsearchSink.process(cssElasticsearchSink.java:121)
What is the cause of this exception? Is it a malformed JSON object?
Flume version : 1.5.0-cdh5.3.2
Elasticsearch version : "1.5.2"
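One way to test the malformed-JSON theory is to check whether the offending event bodies parse as JSON at all; "No current event to copy" is typically what Jackson reports when the generator is asked to copy from a parser that has no current token, e.g. an empty or non-JSON body. A standalone sketch (plain Jackson, nothing Flume-specific; names are illustrative):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonBodyCheck {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Returns true if the given event body parses as JSON.
    static boolean isJson(byte[] body) {
        if (body == null || body.length == 0) {
            return false;
        }
        try {
            JsonNode node = MAPPER.readTree(body);
            return node != null && !node.isMissingNode();
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isJson("{\"field\":\"value\"}".getBytes())); // true
        System.out.println(isJson("plain text body".getBytes()));       // false
        System.out.println(isJson(new byte[0]));                        // false
    }
}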

Cassandra AssertionError

I received an OOM exception at one point in Cassandra. Mine is a single instance running on a modestly powered server, and I was doing some load testing, so no surprise there.
But I have subsequently been unable to use the instance. When I list the keyspaces, only "system" is shown. And when I try to recreate the keyspace I was testing with, Hector responds with the dreaded "All host pools marked down. Retry burden pushed out to client." message, and the Cassandra log has the following stack trace:
ERROR [MigrationStage:1] 2012-04-27 20:47:00,863 AbstractCassandraDaemon.java (line 134) Exception in thread Thread[MigrationStage:1,5,main]
java.lang.AssertionError
at org.apache.cassandra.db.DefsTable.updateKeyspace(DefsTable.java:441)
at org.apache.cassandra.db.DefsTable.mergeKeyspaces(DefsTable.java:339)
at org.apache.cassandra.db.DefsTable.mergeSchema(DefsTable.java:269)
at org.apache.cassandra.service.MigrationManager$1.call(MigrationManager.java:214)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
ERROR [Thrift:9] 2012-04-27 20:47:00,864 CustomTThreadPoolServer.java (line 204) Error occurred during processing of message.
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.AssertionError
at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:372)
at org.apache.cassandra.service.MigrationManager.announce(MigrationManager.java:191)
at org.apache.cassandra.service.MigrationManager.announceNewKeyspace(MigrationManager.java:129)
at org.apache.cassandra.thrift.CassandraServer.system_add_keyspace(CassandraServer.java:987)
at org.apache.cassandra.thrift.Cassandra$Processor$system_add_keyspace.getResult(Cassandra.java:3370)
at org.apache.cassandra.thrift.Cassandra$Processor$system_add_keyspace.getResult(Cassandra.java:3358)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:186)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.concurrent.ExecutionException: java.lang.AssertionError
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:368)
... 11 more
Caused by: java.lang.AssertionError
at org.apache.cassandra.db.DefsTable.updateKeyspace(DefsTable.java:441)
at org.apache.cassandra.db.DefsTable.mergeKeyspaces(DefsTable.java:339)
at org.apache.cassandra.db.DefsTable.mergeSchema(DefsTable.java:269)
at org.apache.cassandra.service.MigrationManager$1.call(MigrationManager.java:214)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
... 3 more
The old keyspace was still in the data dir, so I moved it, but that didn't help. It seems that the system data still has an invalid reference somewhere. Does anyone know how to fix this?
Edit: from the CLI, "describe cluster;" only describes the "system" keyspace. But when I "use system;" and then "list schema_keyspaces;", the following is displayed:
Using default limit of 100
-------------------
RowKey: mango
=> (column=durable_writes, value=true, timestamp=29127788177516974)
=> (column=name, value=mango, timestamp=29127788177516974)
=> (column=strategy_class, value=org.apache.cassandra.locator.SimpleStrategy, timestamp=29127788177516974)
=> (column=strategy_options, value={"replication_factor":"1"}, timestamp=29127788177516974)
1 Row Returned.
Elapsed time: 1107 msec(s).
"mango" is the keyspace that i can no longer access, but it is still in there to some degree. Is there any way to fix it?
The problem almost certainly is that the recreated keyspace is inconsistent with the commit log or data stored with the original definition. Shut down the Cassandra server and clear out the commitlog, saved_caches, and data directory corresponding to the keyspace. The locations of these directories are in cassandra.yaml - look for data_file_directories, saved_caches_directory, and commitlog_directory.
This problem is due to an inconsistency, and you can take the following steps.
1) In your case it is OK to clear the "data", "saved_caches" and "commitlog" directories, as you don't have any critical data or other keyspaces.
2) In scenarios where you have critical data and cannot delete the above-mentioned directories, do the following.
Use nodetool drain to empty the commitlog on all the nodes of the cluster.
Then delete all the "LocationInfo*" files from the "/data/system" directories and restart the cluster.
