I'm using Cassandra on a CentOS machine. After it failed some time ago, I restarted it via
sudo service cassandra restart
and started getting Connection refused errors all over the place - I couldn't even run nodetool status without running into that issue.
After some digging and subsequent restarts, I noticed in the debug.log that the startup sequence gets stuck at the following:
INFO [main] 2018-04-03 09:40:15,156 ColumnFamilyStore.java:389 - Initializing system.IndexInfo
INFO [SSTableBatchOpen:1] 2018-04-03 09:40:15,851 BufferPool.java:226 - Global buffer pool is enabled, when pool is exahusted (max is 512 mb) it will allocate on heap
DEBUG [SSTableBatchOpen:1] 2018-04-03 09:40:15,873 SSTableReader.java:479 - Opening <path>/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-300-big (58 bytes)
DEBUG [SSTableBatchOpen:2] 2018-04-03 09:40:15,873 SSTableReader.java:479 - Opening <path>3/system/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-301-big (59 bytes)
DEBUG [SSTableBatchOpen:3] 2018-04-03 09:40:15,873 SSTableReader.java:479 - Opening <path>/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-299-big (302 bytes)
Opening these files should take about a second - instead it has been stuck there for ages (as in, it never moves beyond this point). I suspect that some of the files involved must be corrupted (although I'm surprised Java doesn't throw some sort of exception here).
What should I do? If I delete these folders, would that result in me losing data? What other diagnostics can I run to establish the source of the problem? For the record, any sort of nodetool command exits with a "Connection Refused" error.
Version numbers:
Cassandra: 3.0.9
Java: 1.8.0_162
CentOS: 6.9
Thanks for the help!
It turns out that the issue was the files involved becoming corrupted - running touch on all the files in the data folder (Data, CompressionInfo, Index, etc.) and erasing the post-crash commitlogs allowed Cassandra to come back up. A few hundred data rows were lost (probably due to me deleting the commitlogs), but at least the database is back up!
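For anyone hitting the same thing, here is a minimal sketch of that recovery, assuming the default package paths /var/lib/cassandra/data and /var/lib/cassandra/commitlog (adjust to your install, and back everything up first - deleting the commitlogs is what likely cost the lost rows):
sudo service cassandra stop
# Back up first - these steps are destructive
sudo cp -a /var/lib/cassandra /var/lib/cassandra.bak
# Refresh timestamps on every SSTable component file (Data, CompressionInfo, Index, ...)
sudo find /var/lib/cassandra/data -type f -exec touch {} +
# Remove the post-crash commitlogs (this discards any writes not yet flushed to SSTables)
sudo rm -f /var/lib/cassandra/commitlog/CommitLog-*.log
sudo service cassandra start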
Related
We have a 3-node Cassandra cluster running the following version:
[cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4]
Node1 stopped communicating with the rest of the cluster this morning; the logs showed this:
ERROR [CompactionExecutor:242] 2020-09-15 19:24:48,753 CassandraDaemon.java:235 - Exception in thread Thread[CompactionExecutor:242,1,main]
ERROR [MutationStage-2] 2020-09-15 19:24:54,749 AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread Thread[MutationStage-2,5,main]
ERROR [MutationStage-2] 2020-09-15 19:24:54,771 StorageService.java:466 - Stopping gossiper
ERROR [MutationStage-2] 2020-09-15 19:24:56,791 StorageService.java:476 - Stopping native transport
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,541 LogTransaction.java:277 - Transaction log [md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log in /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377] indicates txn was not completed, trying to abort it now
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,545 LogTransaction.java:280 - Failed to abort transaction log [md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log in /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377]
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,566 LogTransaction.java:225 - Unable to delete /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377/md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log as it does not exist, see debug log file for stack trace
Cassandra starts up fine on the "broken node", but refuses to rejoin the cluster.
When I do a nodetool status I get this:
**Error: The node does not have system_traces yet, probably still bootstrapping**
Gossip is not running; I've tried disabling and re-enabling it, no joy.
I've also tried both a repair and a rebuild; both came back with no errors at all.
Any and all help would be appreciated.
Thanks.
The symptoms you described indicate to me that the node had some form of hardware failure and the data/ disk is possibly inaccessible.
In instances like this, the disk failure policy in cassandra.yaml kicked in:
disk_failure_policy: stop
This would explain why gossip is unavailable (on default port 7000) and the node would not be accepting any client connections either (on default CQL port 9042).
If there is an impending hardware failure, there's a good chance the disk/volume is mounted as read-only. There's also the possibility that the disk is full. Check the operating system logs for clues and you will likely need to escalate the issue to your sysadmin team. Cheers!
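A few quick checks along those lines (a rough sketch - the mount point /mnt/cass-a comes from the paths in your logs, so substitute your actual data volume):
# Is the data volume mounted read-only?
mount | grep '(ro'
# Is the disk (or its inodes) full?
df -h /mnt/cass-a
df -i /mnt/cass-a
# Any kernel-level I/O or filesystem errors?
dmesg | grep -iE 'i/o error|read-only|xfs|ext4' | tail -20
# Is the node actually listening on gossip (7000) and CQL (9042)?
netstat -tlnp | grep -E ':(7000|9042) '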
I'm using Debezium 0.7 to read from MySQL but am getting flush timeout and OutOfMemoryError errors during the initial snapshot phase. Looking at the logs below, it seems the connector is trying to write too many messages in one go:
WorkerSourceTask{id=accounts-connector-0} flushing 143706 outstanding messages for offset commit [org.apache.kafka.connect.runtime.WorkerSourceTask]
WorkerSourceTask{id=accounts-connector-0} Committing offsets [org.apache.kafka.connect.runtime.WorkerSourceTask]
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
WorkerSourceTask{id=accounts-connector-0} Failed to flush, timed out while waiting for producer to flush outstanding 143706 messages [org.apache.kafka.connect.runtime.WorkerSourceTask]
I wonder what the correct settings are (http://debezium.io/docs/connectors/mysql/#connector-properties) for sizeable databases (>50 GB). I didn't have this issue with smaller databases. Simply increasing the timeout doesn't seem like a good strategy. I'm currently using the default connector settings.
Update
Changed the settings as suggested below and it fixed the problem:
OFFSET_FLUSH_TIMEOUT_MS: 60000 # default 5000
OFFSET_FLUSH_INTERVAL_MS: 15000 # default 60000
MAX_BATCH_SIZE: 32768 # default 2048
MAX_QUEUE_SIZE: 131072 # default 8192
HEAP_OPTS: '-Xms2g -Xmx2g' # default '-Xms1g -Xmx1g'
This is a very complex question - first of all, the default memory settings for the Debezium Docker images are quite low, so if you are using them it might be necessary to increase them.
Next, there are multiple factors at play. I recommend the following steps:
Increase max.batch.size and max.queue.size - this reduces the number of commits
Increase offset.flush.timeout.ms - gives Connect time to process accumulated records
Decrease offset.flush.interval.ms - should reduce the amount of accumulated offsets
Unfortunately, there is an issue, KAFKA-6551, lurking in the background that can still play havoc.
I can confirm that the answer posted above by Jiri Pechanec solved my issues. These are the configurations I am using:
Kafka Connect worker configs, set in the worker.properties config file:
offset.flush.timeout.ms=60000
offset.flush.interval.ms=10000
max.request.size=10485760
Debezium configs passed through the curl request to initialize it:
max.queue.size = 81290
max.batch.size = 20480
We didn't run into this issue with our staging MySQL db (~8GB) because the dataset is a lot smaller. For the production dataset (~80GB), we had to adjust these configurations.
Hope this helps.
To add onto what Jiri said:
There is now an open issue in the Debezium bugtracker, if you have any more information about root causes, logs or reproduction, feel free to provide them there.
For me, changing the values that Jiri mentioned in his comment did not solve the issue. The only working workaround was to create multiple connectors on the same worker, each responsible for a subset of all tables. For this to work, you need to start connector 1, wait for its snapshot to complete, then start connector 2, and so on. In some cases, an earlier connector will fail to flush when a later connector starts to snapshot. In those cases, you can just restart the worker once all snapshots are completed and the connectors will pick up from the binlog again (make sure your snapshot mode is "when_needed"!).
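As a rough sketch of that sequencing (connector names and config files are placeholders; the per-connector configs would differ only in their table whitelist and would both use "snapshot.mode": "when_needed"):
# Register the first connector, which snapshots only its subset of tables
curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d @connector-part1.json
# Poll its status and wait until the snapshot has completed
curl -s http://localhost:8083/connectors/connector-part1/status
# Only then register the next connector for the next subset of tables
curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d @connector-part2.json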
I am getting this error when trying to launch Cassandra. I ran rm -rf * on /var/log/cassandra and /var/lib/cassandra, and tried to run cassandra again, with no success. Any idea? I have been looking at similar cases, but none at the launch of Cassandra, and nothing I found helped me solve this problem.
root#test # cassandra
(...)
16:38:38.551 [MemtableFlushWriter:1] ERROR o.a.c.service.CassandraDaemon - Exception in thread Thread[MemtableFlushWriter:1,5,main]
java.lang.RuntimeException: Insufficient disk space to write 572 bytes
at org.apache.cassandra.db.Directories.getWriteableLocation(Directories.java:349) ~[apache-cassandra-2.2.7.jar:2.2.7]
at org.apache.cassandra.db.Memtable.flush(Memtable.java:324) ~[apache-cassandra-2.2.7.jar:2.2.7]
at org.apache.cassandra.db.ColumnFamilyStore$Flush.run(ColumnFamilyStore.java:1165) ~[apache-cassandra-2.2.7.jar:2.2.7]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_65]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_65]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_65]
16:38:41.785 [ScheduledTasks:1] INFO o.a.cassandra.locator.TokenMetadata - Updating topology for all endpoints that have changed
FYI, if I try to run cassandra again right after, I get
[main] ERROR o.a.c.service.CassandraDaemon - Port already in use: 7199; nested exception is:
java.net.BindException: Address already in use
So it seems that "CassandraDaemon" is alive; but if I try to run cqlsh I get this error:
root#test # cqlsh
Warning: custom timestamp format specified in cqlshrc, but local timezone could not be detected.
Either install Python 'tzlocal' module for auto-detection or specify client timezone in your cqlshrc.
Connection error: ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
Finally, the free -m command gives me
total used free shared buff/cache available
Mem: 15758 8308 950 1440 6499 5576
Swap: 4095 2135 1960
Thanks for your help!
EDIT: Here are the WARN messages I got during the Cassandra launch:
11:36:14.622 [main] WARN o.a.c.config.DatabaseDescriptor - Small commitlog volume detected at /var/lib/cassandra/commitlog; setting commitlog_total_space_in_mb to 2487. You can override this in cassandra.yaml
11:36:14.623 [main] WARN o.a.c.config.DatabaseDescriptor - Only 17 MB free across all data volumes. Consider adding more capacity to your cluster or removing obsolete snapshots
11:36:14.943 [main] WARN o.a.cassandra.service.StartupChecks - jemalloc shared library could not be preloaded to speed up memory allocations
11:36:14.943 [main] WARN o.a.cassandra.service.StartupChecks - JMX is not enabled to receive remote connections. Please see cassandra-env.sh for more info.
11:36:14.943 [main] WARN o.a.cassandra.service.StartupChecks - OpenJDK is not recommended. Please upgrade to the newest Oracle Java release
11:36:14.954 [main] WARN o.a.cassandra.utils.SigarLibrary - Cassandra server running in degraded mode. Is swap disabled? : false, Address space adequate? : true, nofile limit adequate? : false, nproc limit adequate? : true
As I said in a previous comment, it seems that the issue was triggered by the /var folder reaching its size limit, because NiFi, which I use for ingesting data, generates large logs in my configuration. Removing the useless logs and keeping an eye on the /var folder therefore seems to prevent the error.
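A quick sketch of how to spot this before it bites again (the NiFi log path is an assumption - adjust it to your install):
# How full is the volume holding /var?
df -h /var
# Which directories under /var are eating the space?
du -xsh /var/* | sort -rh | head
# Likely culprits in this setup: NiFi logs and obsolete Cassandra snapshots
du -sh /var/log/nifi
nodetool clearsnapshot   # only possible once Cassandra is running again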
I don't understand why EARs are being undeployed automatically in jboss-as-7.1.1.Final.
I can see these logs:
ERROR org.apache.tomcat.util.net.JIoEndpoint$Acceptor [run] Socket accept failed: java.net.SocketException: Too many open files
WARN com.kpn.tie.ejbs.dao.webservice.tt.WebServiceProcessor [invoke] WebService unavailable. The request could not be completed due to technical problems. ; nested exception is: java.net.SocketException: Too many open files
Can somebody tell me the root cause of this behavior and also suggest a solution?
As a workaround, would restarting JBoss at a particular time interval resolve this issue?
The reason could be that the application is overloaded or the file descriptor limit is too low. Because of this, the JVM cannot open any new file handles, so you are getting Socket accept failed for incoming requests.
After a while the deployment scanner comes into play (5 seconds is the default) and tries to check the deployments folder, which is not possible as it cannot open any file handles. So it gets confused and stops the deployed apps.
First solution could be:
Deactivate the scanner so that it only checks once during boot or remove the deployment scanner subsystem and use only CLI to deploy.
Second solution could be:
Increase the file-handle limit (open files limit)
java.net.SocketException: Too many open files
On Linux you can increase the number of concurrently open files with
ulimit -n 2048
This would allow 2048 open files at the same time in the current session. The command should either be added to the session configuration (e.g. .bashrc or similar, depending on the shell you use) or to the JBoss start script.
To show the current limit you can use
ulimit -n
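A few more hedged diagnostics to confirm it really is a descriptor leak and to make the fix permanent (the pgrep pattern and the 'jboss' user in limits.conf are assumptions about your setup):
# Limit in effect for the current session
ulimit -n
# How many descriptors the running JBoss JVM actually holds
JBOSS_PID=$(pgrep -f jboss | head -1)
ls /proc/$JBOSS_PID/fd | wc -l
# Break the descriptors down by type (sockets, regular files, ...) to spot the leak
lsof -p $JBOSS_PID | awk '{print $5}' | sort | uniq -c | sort -rn | head
# To raise the limit permanently, add lines like these to /etc/security/limits.conf:
#   jboss  soft  nofile  4096
#   jboss  hard  nofile  8192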
When I connect to the RabbitMQ server, my connection shows up in a blocking state and I am not able to publish new messages.
I have about 6 GB of RAM free and also about 8 GB of free disk space.
How do I configure the disk space limit in RabbitMQ?
I had the same problem. It seemed like the RabbitMQ server was using more memory than the threshold:
http://www.rabbitmq.com/memory.html
I ran following command to unblock these connections:
rabbitmqctl set_vm_memory_high_watermark 0.6
(default value is 0.4)
By default, disk_free_limit (source: [1]) must exceed 1.0 times the available RAM. This is true in your case, so you may want to check what exactly is blocking the flow. To do that, read the rabbitmqctl man page (source: [2]) and run the last_blocked_by command. That should tell you the cause of the blocking.
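For example (a sketch - the exact info items available depend on your RabbitMQ version):
# Show each connection's state and what last blocked it (a memory or disk alarm)
rabbitmqctl list_connections name state last_blocked_by last_blocked_age
# Broader view: current alarms plus the configured memory and disk limits
rabbitmqctl status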
Assuming it is memory (and you somehow didn't calculate your free disk space correctly): to change disk_free_limit, read about configuring rabbitmq.config (source: [1]), then open your rabbitmq.config file and add the line {rabbit, [{disk_free_limit, {mem_relative, 0.1}}]} inside the config declaration. My rabbitmq.config file looks as follows:
[
{rabbit, [{disk_free_limit, {mem_relative, 0.1}}]}
].
The specific number is up to you, of course.
Sources
http://www.rabbitmq.com/configure.html#configuration-file
http://www.rabbitmq.com/man/rabbitmqctl.1.man.html