We have a 3-node Cassandra cluster running the following version:
[cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4]
Node1 stopped communicating with the rest of the cluster this morning. The logs showed this:
ERROR [CompactionExecutor:242] 2020-09-15 19:24:48,753 CassandraDaemon.java:235 - Exception in thread Thread[CompactionExecutor:242,1,main]
ERROR [MutationStage-2] 2020-09-15 19:24:54,749 AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread Thread[MutationStage-2,5,main]
ERROR [MutationStage-2] 2020-09-15 19:24:54,771 StorageService.java:466 - Stopping gossiper
ERROR [MutationStage-2] 2020-09-15 19:24:56,791 StorageService.java:476 - Stopping native transport
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,541 LogTransaction.java:277 - Transaction log [md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log in /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377] indicates txn was not completed, trying to abort it now
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,545 LogTransaction.java:280 - Failed to abort transaction log [md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log in /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377]
ERROR [CompactionExecutor:242] 2020-09-15 19:24:58,566 LogTransaction.java:225 - Unable to delete /mnt/cass-a/data/system/local-7ad54392bcdd35a684174e047860b377/md_txn_compaction_c2dbca00-f780-11ea-95eb-cf88b1cae05a.log as it does not exist, see debug log file for stack trace
Cassandra starts up fine on the "broken node", but refuses to rejoin the cluster.
When I run nodetool status, I get this:
**Error: The node does not have system_traces yet, probably still bootstrapping**
Gossip is not running. I've tried disabling and re-enabling it, with no luck.
I've also tried both a repair and a rebuild. Both came back with no errors at all.
Any and all help would be appreciated.
Thanks.
The symptoms you described indicate to me that the node had some form of hardware failure and the data/ disk is possibly inaccessible.
In instances like this, the disk failure policy in cassandra.yaml kicks in:
disk_failure_policy: stop
This would explain why gossip is unavailable (on default port 7000) and why the node is not accepting client connections either (on default CQL port 9042).
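To confirm from another machine that both listeners are really down, a quick TCP probe can help. The following is a minimal sketch using only the Python standard library. The function name port_open is made up for illustration, and the ports are the defaults mentioned above, which your cassandra.yaml may override.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe_cassandra(host: str) -> dict:
    """Probe the default gossip (7000) and CQL (9042) ports on one node."""
    # Port numbers are the defaults named in the answer, not read from config.
    return {
        "gossip_7000": port_open(host, 7000),
        "cql_9042": port_open(host, 9042),
    }
```

Calling probe_cassandra with the address of the broken node returns a small dict of booleans. If both are False while the Cassandra process is still running, that is consistent with the stop policy having shut down gossip and the native transport.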
If there is an impending hardware failure, there's a good chance the disk/volume is mounted as read-only. There's also the possibility that the disk is full. Check the operating system logs for clues and you will likely need to escalate the issue to your sysadmin team. Cheers!
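The two conditions mentioned here (a read-only mount and a full disk) can be checked roughly as follows. This is a sketch only: check_data_dir and the 5% threshold are illustrative assumptions, and the check should run as the same OS user as Cassandra so that permissions match.

```python
import shutil
import tempfile


def check_data_dir(path: str, min_free_ratio: float = 0.05) -> dict:
    """Report whether path is writable and how much free space its volume has."""
    usage = shutil.disk_usage(path)  # raises FileNotFoundError if path is missing
    free_ratio = usage.free / usage.total if usage.total else 0.0

    writable = True
    try:
        # Creating (and auto-deleting) a temp file fails with OSError on a
        # read-only filesystem, a full disk, or missing permissions.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError:
        writable = False

    return {
        "writable": writable,
        "free_ratio": free_ratio,
        "low_space": free_ratio < min_free_ratio,  # threshold is an assumption
    }
```

For the node in the question, the path to test would be the data directory that appears in the log, /mnt/cass-a/data. A result with writable set to False points toward the read-only or permissions case, and low_space set to True points toward a full volume. The operating system logs remain the authoritative source.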
I'm using Cassandra on a CentOS machine. After it failed some time ago, I restarted it via
sudo service cassandra restart
and started getting "Connection refused" errors all over the place. I couldn't even run nodetool status without running into that issue.
After some digging and subsequent restarts, I noticed in the debug.log that the startup sequence gets stuck at the following:
INFO [main] 2018-04-03 09:40:15,156 ColumnFamilyStore.java:389 - Initializing system.IndexInfo
INFO [SSTableBatchOpen:1] 2018-04-03 09:40:15,851 BufferPool.java:226 - Global buffer pool is enabled, when pool is exahusted (max is 512 mb) it will allocate on heap
DEBUG [SSTableBatchOpen:1] 2018-04-03 09:40:15,873 SSTableReader.java:479 - Opening <path>/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-300-big (58 bytes)
DEBUG [SSTableBatchOpen:2] 2018-04-03 09:40:15,873 SSTableReader.java:479 - Opening <path>3/system/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-301-big (59 bytes)
DEBUG [SSTableBatchOpen:3] 2018-04-03 09:40:15,873 SSTableReader.java:479 - Opening <path>/IndexInfo-9f5c6374d48532299a0a5094af9ad1e3/mc-299-big (302 bytes)
Opening these files should take about a second, but startup stays stuck there indefinitely (it never moves beyond this point). I suspect that some of the files involved are corrupted (although I'm surprised Java doesn't throw some sort of exception here).
What should I do? If I delete these folders, would that result in me losing data? What other diagnostics can I run to establish the source of the problem? For the record, any sort of nodetool command exits with a "Connection Refused" error.
Version numbers:
Cassandra: 3.0.9
Java: 1.8.0_162
CentOS: 6.9
Thanks for help!
It turns out that the files involved had become corrupted. Running touch on all the files in the data folder (Data, CompressionInfo, Index, etc.) and erasing the post-crash commitlogs allowed Cassandra to start. A few hundred rows were lost (probably because I deleted the commitlogs), but at least the database is back up!
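For anyone who wants to script the touch step rather than run it by hand, here is a rough Python equivalent. It is a sketch under assumptions: touch_tree is a made-up helper name, the data directory location depends on your installation, and it only updates timestamps, so it does not repair file contents. Taking a copy of the directory first is a sensible precaution.

```python
import os


def touch_tree(root: str) -> int:
    """Set access and modification times to 'now' for every file under root.

    Returns the number of files touched. File contents are not modified.
    """
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # os.utime with times=None behaves like the touch command
            # on an existing file.
            os.utime(os.path.join(dirpath, name), None)
            count += 1
    return count
```

With Cassandra stopped, calling touch_tree on the data directory (for example /var/lib/cassandra/data on many package installs, which is an assumption here) mirrors what the answer describes. Deleting the commitlogs is a separate step and is the part that loses unflushed writes.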
I'm trying to recover the connection in RabbitMQ (RMQ) in a clustered environment, but the connection is not recovered in my code, and no exception is caught either.
For example: initially node 1 is connected and our messages flow successfully. To test failover, we brought up node 2 and stopped node 1. The connections are lost, which is expected, but no retry happens even though node 2 is up.
When I restart my service, I get this exception:
"Rabbit MQ Message Exception : Error = 'connection is already closed due to
connection error; cause: java.net.SocketException: Connection reset'"
Can anyone please suggest how to recover in this case?
I have used the configuration below in my code (AMQP client):
factory.setAutomaticRecoveryEnabled(true);
factory.setNetworkRecoveryInterval(5000);
factory.setTopologyRecoveryEnabled(true);
factory.setRequestedHeartbeat(60);
Using Lyra, connection recovery occurs with the following config:
.withRetryPolicy(new RetryPolicy()
.withMaxAttempts(30)
.withInterval(Duration.seconds(1))
.withMaxDuration(Duration.minutes(5)));
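The idea behind such a retry policy (bounded attempts, a fixed interval, an overall time limit, and more than one node to try) can be illustrated independently of any client library. The sketch below is plain Python and is not the Lyra or RabbitMQ API. connect_with_retry, its parameters, and the connect callback are all hypothetical names chosen for illustration, with defaults that mirror the numbers in the config above.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def connect_with_retry(
    hosts: Sequence[str],
    connect: Callable[[str], T],
    max_attempts: int = 30,
    interval: float = 1.0,
    max_duration: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each host in round-robin order until one connects or limits are hit."""
    if not hosts or max_attempts < 1:
        raise ValueError("need at least one host and one attempt")

    deadline = time.monotonic() + max_duration
    last_error: Optional[OSError] = None
    attempts = 0
    while attempts < max_attempts and time.monotonic() < deadline:
        host = hosts[attempts % len(hosts)]  # rotate through cluster nodes
        attempts += 1
        try:
            return connect(host)
        except OSError as exc:  # includes connection reset and refused
            last_error = exc
            sleep(interval)
    raise ConnectionError(f"no node reachable after {attempts} attempts") from last_error
```

The important detail for the failover test in the question is that the list of hosts contains both nodes. A retry loop that only knows about node 1 keeps failing after node 1 is stopped. As far as I know, the RabbitMQ Java client can likewise be given several addresses when the connection is created, which lets automatic recovery try the other node.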
I have a strange issue. Until 2 days ago, my setup had been working for almost 2 years.
The setup that always worked:
1. I use PostgreSQL 9.3 for development on my laptop.
2. I use the IntelliJ IDEA database tools to browse data.
3. Sometimes I use pgAdmin.
4. I run a Maven build of our application.
For the past 2 days, when I run maven install, the build fails with
[main] ERROR SqlExceptionHelper.logExceptions(146) | Cannot create PoolableConnectionFactory (The connection attempt failed.)
But my IDEA client and pgAdmin have no problem connecting.
The PostgreSQL log has the following:
2015-11-06 11:08:02 CST LOG: could not receive data from client: An operation was attempted on something that is not a socket.
2015-11-06 11:08:02 CST LOG: incomplete startup packet
2015-11-06 11:08:17 CST LOG: could not receive data from client: An operation was attempted on something that is not a socket.
2015-11-06 11:08:17 CST LOG: incomplete startup packet
2015-11-06 11:08:35 CST WARNING: pgstat wait timeout
2015-11-06 11:08:46 CST WARNING: pgstat wait timeout
2015-11-06 11:09:35 CST WARNING: pgstat wait timeout
2015-11-06 11:09:45 CST WARNING: pgstat wait timeout
I also tried with the firewall disabled. It still fails.
Any hints on what could be happening?
It looks like something's wrong with the Windows TCP/IP stack.
See:
https://support.microsoft.com/en-us/kb/817571
PostgreSQL error: could not receive data from client: An operation was attempted on something that is not a socket
The mailing list thread http://www.postgresql.org/message-id/AANLkTimgyax1LUU85caP2FfwKTWkdAwDIE3S7zI-6oea@mail.gmail.com
Possible actions include:
Completely uninstall your antivirus product or (worse) "internet security" suite if you have one. Disabling it is not likely to be enough. You can reinstall or install a better one once you've confirmed it's the cause.
Do a Windows TCP/IP stack reset. Exact steps depend on your Windows version, which you did not mention, so you might need to look around a bit. This might have side effects like clearing settings that break other applications. You should know what you're doing, be willing to accept possible problems, or get professional support from a tech with Windows skills.
Check the system for possible rootkit or undetected virus activity.
Uninstall any unnecessary drivers for 3G modems, cable providers, etc.
If you installed any software recently, remove it.
I am using Apache 2.4 on Fedora, and when I try to open my page, it is completely blank. I checked the Apache logs, and there are many lines like this: [core:notice] [pid 1483] AH00052: child pid 1486 exit signal Segmentation fault (11).
I have no idea what is causing this error, or whether it is the reason for the blank page.
Thanks in advance.
I saw this on a server configured to require client certificates to grant access to the site. The cause was an expired CRL, and installing a new CRL fixed the issue.
The reason becomes apparent by increasing the log level for the relevant vhost to info. By default it only logs error or worse, which only produces a cryptic message about failure to re-negotiate a handshake.
Today I managed to recreate the farms with Scalr.net, and after restarting Tomcat a few times and fixing issues, I am getting this error once again. I am using MySQL on a clean install of the entire server, which includes Java 6.1_24, Tomcat 5.5.33, and Sakai 2.7.1. The issue I keep running into is an access-denied error for a user, even though that user exists in the MySQL instance and I have given it complete remote access with sakai@%. It was working about an hour before this post was made.
... Continued from above log, everything before logs just fine
2011-03-31 18:31:14,120 WARN main org.springframework.jdbc.datasource.LazyConnectionDataSourceProxy - Could not retrieve default auto-commit and transaction isolation settings
org.apache.commons.dbcp.SQLNestedException: Error preloading the connection pool
... continued over 400+ lines...
Here is another error in regards to the access denied error...
2011-03-31 18:31:16,854 WARN main org.hibernate.cfg.SettingsFactory - Could not obtain connection metadata
java.sql.SQLException: Access denied for user 'sakai'@'ec2-50-17-184-70.compute-1.amazonaws.com' (using password: YES)
.... continued....
I now get this error whenever I start up. This is with a fresh install of Tomcat/Sakai:
SEVERE: Unable to set localhost. This prevents creation of a GUID. Cause was: ec2-72-44-56-167.compute-1.amazonaws.com: ec2-72-44-56-167.compute-1.amazonaws.com
java.net.UnknownHostException: ec2-72-44-56-167.compute-1.amazonaws.com: ec2-72-44-56-167.compute-1.amazonaws.com
(This most recent error (localhost) was fixed simply by restarting the Amazon AWS instance, thankfully.) However, I keep getting the same errors even with a fresh install, almost as if the information is being refreshed from a cache or something.
As with the last question you posted on this topic, the error message seems very clear: the user 'sakai'@... does not have access to log in to the database you have configured. I recommend taking a look at the MySQL documentation on administering user accounts to find out whether you've missed a setting that would allow this account to have access.
I believe I may have figured out how to fix this problem. It has nothing to do with MySQL or the Apache server itself. It has to do with Scalr.net failing to initialize the IP, or something of that sort. After doing some research, I found some issues with HostInit, such as:
Cannot deliver message 'HostInit' (message_id: af9dcfdb-a09e-4971-bdb7-7871b3f7e21c) via REST to server '50.17.135.98' (server_id: e49cfec9-5bcb-44d1-bbc5-fde32450fc89). Error: 0 Timeout was reached; connect() timed out! (http://50.17.135.98:8013/control)
Cannot deliver message 'BlockDeviceAttached' (message_id: a153d83f-3d96-4d53-920a-ccb80701675a) via REST to server '50.17.135.98' (server_id: e49cfec9-5bcb-44d1-bbc5-fde32450fc89). Error: 0 Timeout was reached; connect() timed out! (http://50.17.135.98:8013/control)
Cannot deliver message 'HostUp' (message_id: 1adde27c-9982-4551-b266-c3c432d1dd44) via REST to server '50.17.135.98' (server_id: e49cfec9-5bcb-44d1-bbc5-fde32450fc89). Error: 0 Timeout was reached; connect() timed out! (http://50.17.135.98:8013/control)
Cannot deliver message 'HostInit' (message_id: f1aa4b14-ef57-4361-ae56-87702d674b11) via REST to server '50.17.135.98' (server_id: e49cfec9-5bcb-44d1-bbc5-fde32450fc89). Error: 0 Timeout was reached; connect() timed out! (http://50.17.135.98:8013/control)
So I made a snapshot image of the Apache server, MySQL, etc. and terminated them, allowing the instances to be recreated, and this solved the problem in one way.