Tomcat JDBC stalls during Amazon Multi-AZ failover - java

SOLVED
Ultimately the solution described in the similar ticket linked below worked for us. When we first tried it, our configuration parser was mangling the URL and stripping &connectTimeout=15000&socketTimeout=60000 from it, which invalidated that test.
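For anyone hitting the same thing, the fix boils down to making sure both driver-level timeouts actually survive into the URL the pool sees. A minimal sketch (host and database name are placeholders; p is the PoolProperties instance shown further down):

p.setUrl("jdbc:mysql://host:3306/dbname"
    + "?connectTimeout=15000"    // give up on TCP connect attempts after 15s
    + "&socketTimeout=60000");   // give up on blocked socket reads after 60s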
I'm having trouble getting a Tomcat JDBC connection pool to fail over to a new DB server using Amazon RDS's Multi-AZ feature. When fail-over occurs, the application server gets hung up trying to borrow a connection from the pool. It's similar to this question; however, the solution to that user's problem did not help me, so I suspect mine is not quite the same: Configure GlassFish JDBC connection pool to handle Amazon RDS Multi-AZ failover
The sequence goes a bit like this:
A successful request is made; log output is as expected.
Fail-over is initiated (via "reboot with failover" in RDS).
A request is made that times out; log messages from the request appear as expected up until a connection is borrowed from the pool.
Subsequent requests generate no log messages; they time out as well.
After some amount of time, the daemon eventually starts printing log messages again, as if it had successfully connected to the database and was performing operations. This can take just over 16 minutes, by which point the client has long since timed out.
If I wait about 50 minutes and try again, the system finally accepts connections again.
Notes
If I tell Tomcat to shut down, there are exceptions in the logs about it being unable to clean up resources and about my servlet still processing a request. Most notable (in my opinion) is the following stack trace, indicating that at least one thread is stuck waiting on a socket read from MySQL.
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
java.net.SocketInputStream.read(SocketInputStream.java:171)
java.net.SocketInputStream.read(SocketInputStream.java:141)
com.mysql.jdbc.util.ReadAheadInputStream.fill(ReadAheadInputStream.java:114)
com.mysql.jdbc.util.ReadAheadInputStream.readFromUnderlyingStreamIfNecessary(ReadAheadInputStream.java:161)
com.mysql.jdbc.util.ReadAheadInputStream.read(ReadAheadInputStream.java:189)
com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3116)
com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3573)
com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3562)
com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4113)
com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2570)
com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2731)
com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2812)
com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2761)
com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:894)
com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:732)
org.apache.tomcat.jdbc.pool.PooledConnection.validate(PooledConnection.java:441)
org.apache.tomcat.jdbc.pool.PooledConnection.validate(PooledConnection.java:384)
org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:716)
org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:579)
org.apache.tomcat.jdbc.pool.ConnectionPool.getConnection(ConnectionPool.java:174)
org.apache.tomcat.jdbc.pool.DataSourceProxy.getConnection(DataSourceProxy.java:111)
<...>
If I restart Tomcat, things return to normal as soon as it comes back up. Obviously, restarting isn't a workable long-term solution for maintaining uptime.
Environment Details
Database: MySQL 5.5.53 (using mysql-connector-java 5.1.26)
Tomcat 8.0.45
I've gone through several changes to my configuration while trying to solve this issue. At the time of this writing the following related settings are in place:
jre/lib/security/java.security -- I'm under the impression that the default DNS cache TTL for Oracle Java 1.8 is 30s when no security manager is installed. I set these to zero just to be certain this isn't the issue (a programmatic equivalent is sketched after the two values below).
networkaddress.cache.ttl: 0
networkaddress.cache.negative.ttl: 0
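For reference, the same settings can be applied programmatically before the first lookup; a minimal sketch, equivalent to editing java.security:

import java.security.Security;

// Disable the JVM-level DNS cache so a fail-over DNS change is seen immediately
Security.setProperty("networkaddress.cache.ttl", "0");
Security.setProperty("networkaddress.cache.negative.ttl", "0");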
connection pool settings
testOnBorrow: true
testOnConnect: true
testOnReturn: true
jdbc url parameters
connectTimeout:60000
socketTimeout:60000
autoReconnect:true
Update
Still no solution found.
Added logging to confirm that this was not a DNS caching issue: the IP address logged immediately before the timeout matches the IP address of the 'new' master RDS host.
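The check itself was simple; roughly this, run immediately before borrowing a connection (the endpoint name is a placeholder):

import java.net.InetAddress;
import java.net.UnknownHostException;

try {
    // Log what the RDS endpoint resolves to right now, to compare against the failover target
    InetAddress resolved = InetAddress.getByName("mydb.cluster-xyz.us-east-1.rds.amazonaws.com");
    System.out.println("RDS endpoint resolves to " + resolved.getHostAddress());
} catch (UnknownHostException e) {
    System.out.println("RDS endpoint did not resolve: " + e.getMessage());
}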
For reference, the following block shows the properties I'm using to initialize my data source. I'm configuring the pool in code rather than via JNDI, with some values pulled from our app's config file. The code is pasted below with comments indicating the config values used for the tests I've been running.
PoolProperties p = new PoolProperties();
p.setUrl(config.get("JDBC_URL")); // jdbc:mysql://host:3306/dbname
p.setDriverClassName("com.mysql.jdbc.Driver");
p.setUsername(config.get("JDBC_USER"));
p.setPassword(config.get("JDBC_PASSWORD"));
p.setJmxEnabled(true);
p.setTestWhileIdle(false);
p.setTestOnBorrow(true);
p.setValidationQuery("SELECT 1");
p.setValidationInterval(30000);
p.setTimeBetweenEvictionRunsMillis(30000);
p.setMaxActive(Integer.parseInt(config.get("MAX_ACTIVE"))); //45
p.setInitialSize(10);
p.setMaxWait(5);
p.setRemoveAbandonedTimeout(Integer.parseInt(config.get("REMOVE_ABANDONED_TIMEOUT"))); //600
p.setMinEvictableIdleTimeMillis(Integer.parseInt(config.get("DB_EVICTION_TIMEOUT"))); //60000
p.setMinIdle(Integer.parseInt(config.get("DB_MIN_IDLE"))); //50
p.setLogAbandoned(Boolean.parseBoolean(config.get("LOG_ABANDONED"))); //true
p.setRemoveAbandoned(true);
p.setJdbcInterceptors(
    "org.apache.tomcat.jdbc.pool.interceptor.ConnectionState;" +
    "org.apache.tomcat.jdbc.pool.interceptor.StatementFinalizer;" +
    "org.apache.tomcat.jdbc.pool.interceptor.ResetAbandonedTimer");
// make sure new connections have auto commit true
p.setDefaultAutoCommit(true);
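One side note on the shutdown stack trace above: the borrow path is blocked inside PooledConnection.validate(), and tomcat-jdbc can bound the validation query itself. This is a mitigation sketch, not the fix that ultimately worked (see the note at the top):

// tomcat-jdbc applies this via Statement.setQueryTimeout() on the validation
// query, so "SELECT 1" against a dead socket fails after ~5s instead of blocking
p.setValidationQueryTimeout(5); // seconds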

Related

Postgres query executed using Hibernate gets dropped without any timeout exception if the query takes a long time

I am running a Postgres query that takes more than two hours.
This query is executed using Hibernate in a Java program.
After about 1.5 hours, the query stops showing up in the server status in pgAdmin.
Since the query disappeared from the list of active queries on the database, I was expecting either success or a timeout exception. But I get neither (no exception), and my thread is stuck in the wait state.
I know the query has not finished because it was supposed to do some inserts in a table and I cannot find the expected rows in the table.
I am using pgbouncer for the connection pooling and the query_timeout is disabled.
Had it been a Hibernate timeout, I should have gotten an exception.
OS parameters on the DB machine and the client machine (the machine running the Java program):
tcp_keepalive_time = 7200 (seconds)
tcp_keepalive_intvl = 75
tcp_keepalive_probes = 9 (number of probes)
Both the machines run RHEL operating system.
I am unable to put my finger on the issue.
I found that the issue was caused by the TCP connection getting dropped while the client kept hanging, waiting for the response.
I altered the following parameter at the OS level:
/proc/sys/net/ipv4/tcp_keepalive_time = 2700
Default value was 7200.
This causes a keep-alive check every 2700 seconds instead of every 7200 seconds.
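For completeness, the change itself looks like this (run as root; adding the same line to /etc/sysctl.conf makes it survive reboots):

sysctl -w net.ipv4.tcp_keepalive_time=2700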
I am sure you would have already looked at the following resources:
PostgreSQL Timeout Docs
PgBouncer timeout (which you already mention).
Hibernate timeout parameters, if any.
Once that is done (just like triaging permission issues during a new installation), I recommend that you try the following SQL from the different scenarios given below and ascertain what is actually causing this timeout:
SELECT pg_sleep(7200);
Log in to the server directly (via psql) and see whether this SQL times out.
Log in via PgBouncer (again with psql) and see whether PgBouncer times out.
Execute this SQL via Hibernate (through PgBouncer) and see whether there is a timeout.
This should allow you to clearly isolate the cause for this.
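For the third scenario, plain JDBC can stand in for Hibernate, since Hibernate ultimately issues the same statement through the driver. A minimal sketch (connection details are placeholders; 6432 is PgBouncer's default port):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class SleepTest {
    public static void main(String[] args) throws SQLException {
        try (Connection c = DriverManager.getConnection(
                 "jdbc:postgresql://pgbouncer-host:6432/mydb", "user", "secret");
             Statement s = c.createStatement()) {
            // If the TCP connection is silently dropped mid-query, this is where the hang shows up
            s.execute("SELECT pg_sleep(7200)");
            System.out.println("pg_sleep completed without a timeout");
        }
    }
}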

Unable to fill pool (no buffer space available)

I'm using Wildfly 8.2 and fire a series of DB requests when a certain web page is opened. All queries are invoked through the JPA Criteria API and return results as expected, and none of them produces a warning, error, or exception. It all runs in Parallel Plesk.
Now, I noticed that within 2 to 3 days the following error appears and the site becomes unresponsive. I restart, then wait approximately another 3 days until it happens again (depending on the number of requests I get).
I checked tcpsndbuf on my Linux server and noticed it is constantly at its maximum unless I restart Wildfly. Apparently Wildfly fails to release the connections.
The connections are managed by JPA/Hibernate and the Wildfly container. I don't do any special or custom transaction handling (e.g. open, close, etc.); I leave it all to Wildfly.
The MySQL Driver I'm using is 5.1.21 (mysql-connector-java-5.1.21-bin.jar)
In standalone.xml I have defined the following datasource values (among others):
<transaction-isolation>TRANSACTION_READ_COMMITTED</transaction-isolation>
<pool>
    <min-pool-size>3</min-pool-size>
    <max-pool-size>10</max-pool-size>
</pool>
<statement>
    <prepared-statement-cache-size>32</prepared-statement-cache-size>
    <shared-prepared-statements>true</shared-prepared-statements>
</statement>
Has anyone experienced the same rise in tcpsndbuf values (or this error)? In case you require more config or log files, let me know. Thanks!
UPDATE
Despite additional timeout settings, it still runs into the hang, and it will then use 100% CPU time whenever the tcpsndbuf maximum is reached.
Try adding this Hibernate property:
<property name="hibernate.connection.release_mode">after_transaction</property>
By default, JTA mandates that the connection be released after each statement, which is undesirable for most use cases. Most drivers don't allow multiplexing a connection over multiple XA transactions anyway.
Do you use OpenVZ? I think this question should be asked on Server Fault, since it is related to Linux configuration. You can read up on tcpsndbuf. You should count the open sockets and check them against the tcpsndbuf limit.

java.net.SocketTimeoutException: Read timed out under Tomcat

I have a Tomcat-based web application. I am intermittently getting the following exception:
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:150)
at java.net.SocketInputStream.read(SocketInputStream.java:121)
at org.apache.coyote.http11.InternalInputBuffer.fill(InternalInputBuffer.java:532)
at org.apache.coyote.http11.InternalInputBuffer.fill(InternalInputBuffer.java:501)
at org.apache.coyote.http11.InternalInputBuffer$InputStreamInputBuffer.doRead(InternalInputBuffer.java:563)
at org.apache.coyote.http11.filters.IdentityInputFilter.doRead(IdentityInputFilter.java:124)
at org.apache.coyote.http11.AbstractInputBuffer.doRead(AbstractInputBuffer.java:346)
at org.apache.coyote.Request.doRead(Request.java:422)
at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:290)
at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:431)
at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:315)
at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:200)
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
Unfortunately I don't have access to the client, so I am just trying to confirm the various reasons this can happen:
The server is trying to read data from the request, but it is taking longer than the timeout value for the data to arrive from the client. The timeout here would typically be the Tomcat connector's connectionTimeout attribute.
The client has a read timeout set, and the server is taking longer than that to respond.
One of the threads I went through said this can happen with high concurrency and keepalive enabled.
For #1, the initial value I had set was 20 sec; I have bumped this up to 60 sec and will test to see if there are any changes.
Meanwhile, if any of you can provide your expert opinion on this, that'll be really helpful. Or, for that matter, any other reason you can think of that might cause this issue.
The server is trying to read data from the request, but it is taking longer than the timeout value for the data to arrive from the client. The timeout here would typically be the Tomcat connector's connectionTimeout attribute.
Correct.
Client has a read timeout set, and server is taking longer than that to respond.
No. That would cause a timeout at the client.
One of the threads I went through said this can happen with high concurrency and if keepalive is enabled.
That is obviously guesswork, and completely incorrect. It happens if and only if no data arrives within the timeout period. Load, keepalive, and concurrency have nothing to do with it whatsoever.
It just means the client isn't sending. You don't need to worry about it. Browser clients come and go in all sorts of strange ways.
Here are the basic instructions:
Locate the "server.xml" file in the "conf" folder beneath Tomcat's base directory (i.e. %CATALINA_HOME%/conf/server.xml).
Open the file in an editor and search for <Connector.
Locate the relevant connector that is timing out - this will typically be the HTTP connector, i.e. the one with protocol="HTTP/1.1".
If a connectionTimeout value is set on the connector, it may need to be increased - e.g. from 20000 milliseconds (= 20 seconds) to 120000 milliseconds (= 2 minutes). If no connectionTimeout property value is set on the connector, the default is 60 seconds - if this is insufficient, the property may need to be added.
Restart Tomcat
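For example, the resulting connector entry from step 4 would look something like this (values illustrative):

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="120000"
           redirectPort="8443" />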
Connection.Response resp = Jsoup.connect(url)
    .timeout(20000)
    .method(Connection.Method.GET)
    .execute();
Actually, the error occurs when you have a slow connection, so try increasing the timeout; that made my code work reliably.
I had the same problem while trying to read data from the request body. In my case it occurred randomly, and only for mobile client devices, so I increased connectionUploadTimeout to 1 min as suggested by this link.
I have the same issue. The java.net.SocketTimeoutException: Read timed out error happens on Tomcat under Mac 11.1, but it works perfectly on Mac 10.13. Same Tomcat folder, same WAR file. I have tried setting the timeout values higher, but nothing I do works.
If I run the same Spring Boot code as a regular Java application (outside Tomcat 9.0.41; I tried other versions too), then it also works.
Mac 11.1 appears to be interfering with Tomcat.
As another test, if I copy the WAR file to an AWS EC2 instance, it works fine there too.
I have spent several days trying to figure this out, but cannot resolve it.
Suggestions very welcome! :)
This happened to my application: I was using a single object that was called from multiple functions, and those calls were not thread-safe.
Something like this:
class A {
    private B b = new B(); // one shared instance, used concurrently by multiple callers

    void function1() {
        b.doSomething();
    }

    void function2() {
        b.doSomething();
    }
}
As they were not thread-safe, I was getting these errors:
redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketException: Socket is closed
and
redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketTimeoutException: Read timed out
This is how I fixed it:
class A {
    void function1() {
        B b = new B(); // a fresh, local instance per call, so no shared state
        b.doSomething();
    }

    void function2() {
        B b = new B();
        b.doSomething();
    }
}
Hope it helps.
It means the response from your server timed out. It can be caused by server configuration or by network conditions.
I am using 11.2 and received timeouts.
I resolved them by using the version of jsoup below.
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.2</version>
    <scope>compile</scope>
</dependency>

How to reconnect when the LDAP server is restarted?

I have a situation where, through a Java program, I create a javax.naming.ldap.LdapContext and do a search() operation on it, which opens the underlying connection. Then I put the Java app thread to sleep, and while it sleeps I restart the LDAP server (OpenLDAP, just to note). When the app thread wakes up and tries to do any operation on the LdapContext created earlier, it throws "CommunicationException: Connection is closed".
What I want is to be able to re-establish the connection.
I see that LdapContext has a reconnect() method, to which I pass null for the controls. However, this does not have any effect. What I saw in the Sun LDAP implementation is that while the LDAP server was being restarted, the connection pool maintained by the Sun implementation marked the underlying com.sun.jndi.ldap.LdapClient instance with "usable=false". A reconnect() call simply calls ensureOpen(), which again checks the usable flag; if it is false, it throws CommunicationException. So, back to square one.
My question is: how does a Java app survive an external LDAP server restart? Is creating a new LdapContext the only way out?
Appreciate any insights.
Here is the stacktrace of the exception:
javax.naming.CommunicationException: connection closed [Root exception is java.io.IOException: connection closed]; remaining name 'uid=foo,ou=People,dc=example,dc=com'
at com.sun.jndi.ldap.LdapCtx.doSearch(LdapCtx.java:1979)
at com.sun.jndi.ldap.LdapCtx.searchAux(LdapCtx.java:1824)
at com.sun.jndi.ldap.LdapCtx.c_search(LdapCtx.java:1749)
at com.sun.jndi.toolkit.ctx.ComponentDirContext.p_search(ComponentDirContext.java:368)
at com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.search(PartialCompositeDirContext.java:338)
at com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.search(PartialCompositeDirContext.java:321)
at javax.naming.directory.InitialDirContext.search(InitialDirContext.java:248)
Caused by: java.io.IOException: connection closed
at com.sun.jndi.ldap.LdapClient.ensureOpen(LdapClient.java:1558)
at com.sun.jndi.ldap.LdapClient.search(LdapClient.java:504)
at com.sun.jndi.ldap.LdapCtx.doSearch(LdapCtx.java:1962)
... 26 more
Just enable JNDI connection pooling and it will all be taken care of for you behind the scenes. See the JNDI Guide to Features and the LDAP Provider documentation. It's controlled by just a couple of properties.
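A minimal sketch of the properties involved (values illustrative; env is the Hashtable passed to the initial context, and the pool.timeout system property is in milliseconds):

// Per-context: ask JNDI to take this connection from the shared pool
env.put("com.sun.jndi.ldap.connect.pool", "true");
// JVM-wide: close pooled connections that have been idle longer than 5 minutes
System.setProperty("com.sun.jndi.ldap.connect.pool.timeout", "300000");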
The UnboundID LDAP SDK provides an auto-reconnect capability in which the reconnect operation is invisible to the client.
We had this problem at work. The solution we came up with (which may not be the best answer) was to create a watchdog thread that would check the connection at some fixed rate; if the connection did not work, it would re-initialize the connection to LDAP.
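Roughly like this (a sketch of the watchdog idea; ctx and reconnect() are stand-ins for whatever holds and re-creates your LdapContext):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.naming.NamingException;

ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
watchdog.scheduleAtFixedRate(() -> {
    try {
        ctx.getAttributes(""); // cheap probe against the root DSE
    } catch (NamingException e) {
        reconnect(); // hypothetical: tear down and re-create the LdapContext
    }
}, 0, 30, TimeUnit.SECONDS);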
You should note that this is essentially related to LDAP connection pooling. As defined here:
A connection is retrieved from the pool, used, returned to the pool, and then retrieved again from the pool for another Context instance.
Thus, the reuse of a previous connection may cause this problem.
You may test the behavior without LDAP connection pooling by setting:
com.sun.jndi.ldap.connect.pool=false
Also, another possible cause may be the read timeout of LDAP operations: a blocked read operation is not notified of the LDAP server's closure until the configured timeout elapses. For more information, you may take a look at this link.
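Putting both suggestions together, a minimal sketch of the environment for such a test (the URL is a placeholder; com.sun.jndi.ldap.read.timeout bounds how long a blocked read waits, in milliseconds):

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.ldap.InitialLdapContext;
import javax.naming.ldap.LdapContext;

Hashtable<String, Object> env = new Hashtable<>();
env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
env.put(Context.PROVIDER_URL, "ldap://ldap.example.com:389");
env.put("com.sun.jndi.ldap.connect.pool", "false"); // bypass the JNDI pool for this test
env.put("com.sun.jndi.ldap.read.timeout", "5000");  // fail blocked reads after 5s instead of hanging
LdapContext ctx = new InitialLdapContext(env, null);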

How to reestablish a JDBC connection after a timeout?

I have a long-running method which executes a large number of native SQL queries through the EntityManager (TopLink Essentials). Each query takes only milliseconds to run, but there are many thousands of them, and this all happens within a single EJB transaction. After 15 minutes, the database closes the connection, which results in the following error:
Exception [TOPLINK-4002] (Oracle TopLink Essentials - 2.1 (Build b02-p04 (04/12/2010))): oracle.toplink.essentials.exceptions.DatabaseException
Internal Exception: java.sql.SQLException: Closed Connection
Error Code: 17008
Call: select ...
Query: DataReadQuery()
at oracle.toplink.essentials.exceptions.DatabaseException.sqlException(DatabaseException.java:319)
.
.
.
RAR5031:System Exception.
javax.resource.ResourceException: This Managed Connection is not valid as the phyiscal connection is not usable
at com.sun.gjc.spi.ManagedConnection.checkIfValid(ManagedConnection.java:612)
In the JDBC connection pool I set is-connection-validation-required="true" and connection-validation-method="table", but this did not help.
I assumed that JDBC connection validation is there to deal with precisely this kind of error. I also looked at the TopLink extensions (http://www.oracle.com/technetwork/middleware/ias/toplink-jpa-extensions-094393.html) for some kind of timeout setting but found nothing. There is also the TopLink session configuration file (http://download.oracle.com/docs/cd/B14099_19/web.1012/b15901/sessions003.htm), but I don't think there is anything useful there either.
I don't have access to the Oracle DBA tables, but I think that Oracle closes connections after 15 minutes, according to the setting in the CONNECT_TIME profile variable.
Is there any other way to make TopLink or the JDBC pool reestablish a closed connection?
The database is Oracle 10g; the application server is Sun GlassFish 2.1.1.
All JPA implementations (running on a Java EE container) use a datasource with an associated connection pool to manage connectivity with the database.
The persistence context itself is associated with the datasource via an appropriate entry in persistence.xml. If you wish to change the connection timeout settings on the client-side, then the associated connection pool must be re-configured.
In Glassfish, the timeout settings associated with the connection pool can be reconfigured by editing the pool settings, as listed in the following links:
Changing timeout settings in GlassFish 3.1
Changing timeout settings in GlassFish 2.1
On the server side (whose settings, if lower than the client settings, take precedence), the Oracle database can be configured with database profiles associated with user accounts. The IDLE_TIME and CONNECT_TIME parameters of a profile constitute the timeout settings that matter in this part of the client-server interaction. If no profile has been set, the timeout is unlimited by default.
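On the database side, those profile settings look like this (a sketch; the names are illustrative, and both limits are expressed in minutes):

-- Hypothetical profile: disconnect sessions idle > 30 min or connected > 8 hours
CREATE PROFILE app_profile LIMIT IDLE_TIME 30 CONNECT_TIME 480;
ALTER USER app_user PROFILE app_profile;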
Unless you've got some sort of RAC failover, when the connection is terminated, it will end the session and transaction.
The admins may have put some limits in place to prevent runaway transactions or a single job 'hogging' a connection in the pool. You generally don't want to lock a connection in a pool for an extended period.
If these queries aren't necessarily part of the same transaction, then you could try terminating the connection and starting a new one.
Are you able to restructure your code so that it completes in under 15 minutes? A stored procedure running in the background may be able to do the job a lot quicker than dragging the results of thousands of operations over the network.
I see you set connection-validation-method="table" and is-connection-validation-required="true", but you do not mention specifying the table to validate against. Did you set validation-table-name="any_table_you_know_exists" with the name of an existing table? validation-table-name is required.
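In pool-config terms, the validation settings end up looking something like this (a sketch; DUAL exists in any Oracle schema, so it is a safe choice here):

<jdbc-connection-pool name="MyPool" ...
    is-connection-validation-required="true"
    connection-validation-method="table"
    validation-table-name="DUAL" />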
See this article for more details on connection validation.
A related Stack Overflow question with a similar problem: the asker wants to flush the entire invalid connection pool.
