Play framework resource starvation after a few days - java

I am experiencing an issue in Play 2.5.8 (Java) where database-related service endpoints start timing out after a few days, even though the server's CPU & memory usage looks fine. Endpoints that do not access the DB continue to work perfectly.
The application runs on a t2.medium EC2 instance with a t2.medium MySQL RDS, both in the same availability zone. Most HTTP calls do lookups/updates against the database at around 8-12 requests per second, and there are also ±800 WebSocket connections/actors handling ±8 requests/second (90% of the WebSocket messages do not access the database). DB operations are mostly simple lookups & updates taking around 100ms.
When using only the default thread pool it took about 2 days to reach the deadlock; after moving the database requests to a separate thread pool as per https://www.playframework.com/documentation/2.5.x/ThreadPools#highly-synchronous, it improved, but only to about 4 days.
This is my current thread config in application.conf:
akka {
  actor {
    guardian-supervisor-strategy = "actors.RootSupervisionStrategy"
  }
  loggers = ["akka.event.Logging$DefaultLogger",
             "akka.event.slf4j.Slf4jLogger"]
  loglevel = WARNING

  ## This pool handles all HTTP & WebSocket requests
  default-dispatcher {
    executor = "thread-pool-executor"
    throughput = 1
    thread-pool-executor {
      fixed-pool-size = 64
    }
  }

  db-dispatcher {
    type = Dispatcher
    executor = "thread-pool-executor"
    throughput = 1
    thread-pool-executor {
      fixed-pool-size = 210
    }
  }
}
Database configuration:
play.db.pool="default"
play.db.prototype.hikaricp.maximumPoolSize=200
db.default.driver=com.mysql.jdbc.Driver
I have played around with the number of connections in the DB pool & the sizes of the default & db-dispatcher pools, but it doesn't seem to make any difference. It feels like I'm missing something fundamental about Play's thread pools & configuration, as I don't think this load should be an issue for Play to handle.
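For completeness, this is roughly how I hand the blocking DB work to the db-dispatcher from a controller (a minimal sketch; the controller and lookup method names are made up, not my actual code):

import akka.actor.ActorSystem;
import play.mvc.Controller;
import play.mvc.Result;
import scala.concurrent.ExecutionContextExecutor;

import javax.inject.Inject;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

public class AccountController extends Controller {

    private final ExecutionContextExecutor dbDispatcher;

    @Inject
    public AccountController(ActorSystem actorSystem) {
        // Full config path of the dispatcher defined in application.conf
        this.dbDispatcher = actorSystem.dispatchers().lookup("akka.db-dispatcher");
    }

    public CompletionStage<Result> lookup(long id) {
        // Run the blocking JDBC call on the db-dispatcher instead of the default pool
        return CompletableFuture
                .supplyAsync(() -> blockingDbLookup(id), dbDispatcher)
                .thenApply(row -> ok(row));
    }

    private String blockingDbLookup(long id) {
        // placeholder for the real JDBC lookup
        return "...";
    }
}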

After more investigation I found that the issue is not related to the thread pool configuration at all, but rather to TCP connections that build up due to WebSocket reconnections until the server (or the Play framework) cannot accept any more connections. When this happens, only already-established TCP connections are serviced, which mostly means the established WebSocket connections.
I have not yet determined why these connections are not managed/closed properly.
My issue relates to this question:
Play 2.5 WebSocket Connection Build

Related

gRPC connection cycling

We are setting up a cluster to handle inferencing (with TensorFlow Serving) over gRPC. We intend to use a layer-7 load balancer (AWS ALB) to distribute the load. For our workload, inferencing will occur many times per minute from each client account. It is my understanding that gRPC holds connection state for each of these channels. As a result, in order for the ALB to do its job, we need to periodically tear down and rebuild the connection on the client instance.
My question: what is the best practice for cycling a connection in Java?
Below is my proposed code, which would be called every couple of minutes on each client channel. I assume that while the first connection is being shut down, we can go ahead and create a new one and immediately issue a request on it; or do we need to wait until the prior channel has shut down first? In our situation, the channel will (very likely) be idle, since the previous request will have been 10 seconds earlier.
if (mChannel != null) {
    mChannel.shutdown();
}
mChannel = ManagedChannelBuilder.forAddress(mHost, mPort).usePlaintext().build();
mStub = PredictionServiceGrpc.newBlockingStub(mChannel);
The best practice is to use Lookaside Load Balancing.
However, you can do a few tweaks to terminate client connections.
var builder = ManagedChannelBuilder.forAddress(mHost, mPort)
        .keepAliveTime(15, TimeUnit.SECONDS)
        .keepAliveTimeout(5, TimeUnit.SECONDS);
The above config ensures that sticky gRPC connections are terminated, so the AWS ALB can do its job and load balance requests uniformly.
There are other options you can try depending on your use case, e.g. retries. See ManagedChannelBuilder.
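If you still prefer to cycle channels explicitly on a timer, a rough sketch of a graceful swap could look like the following (the class name, the two-minute interval and the 10-second drain are placeholders, not from your code; the idea is to build the replacement first, then shut the old channel down so in-flight calls can finish):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ChannelCycler {

    private volatile ManagedChannel channel;
    private final String host;
    private final int port;
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public ChannelCycler(String host, int port) {
        this.host = host;
        this.port = port;
        this.channel = newChannel();
        // Rebuild the channel every couple of minutes so the ALB can rebalance
        scheduler.scheduleAtFixedRate(this::cycle, 2, 2, TimeUnit.MINUTES);
    }

    private ManagedChannel newChannel() {
        return ManagedChannelBuilder.forAddress(host, port)
                .usePlaintext()
                .keepAliveTime(15, TimeUnit.SECONDS)
                .keepAliveTimeout(5, TimeUnit.SECONDS)
                .build();
    }

    public ManagedChannel channel() {
        return channel;
    }

    private void cycle() {
        ManagedChannel old = channel;
        // Build the new channel first; callers pick it up immediately
        channel = newChannel();
        // Graceful shutdown lets in-flight RPCs on the old channel finish
        old.shutdown();
        try {
            if (!old.awaitTermination(10, TimeUnit.SECONDS)) {
                old.shutdownNow();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            old.shutdownNow();
        }
    }
}

Stubs are cheap to create, so callers can simply build a fresh blocking stub from channel() for each request.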

Slow message consumption using AmazonSQSClient

So, I used concurrency 50-100 in Spring JMS, allowing max connections up to 200. Everything works as expected, but when I try to retrieve 100k messages from the queue (i.e. there are 100k messages on my SQS queue and I am reading them through the normal Spring JMS approach):
@JmsListener
public void process(String message) {
    count++;
    System.out.println(count);
    // code
}
I see all the logs in my console, but after around 17k messages it starts throwing exceptions, something like: AWS SDK exception: port already in use.
Why do I see this exception and how do I get rid of it?
I tried looking on the internet for it but couldn't find anything.
My settings:
Concurrency: 50-100
Messages per task: 50
Acknowledge mode: CLIENT_ACKNOWLEDGE
timestamp=10:27:57.183, level=WARN , logger=c.a.s.j.SQSMessageConsumerPrefetch, message={ConsumerPrefetchThread-30} Encountered exception during receive in ConsumerPrefetch thread,
javax.jms.JMSException: AmazonClientException: receiveMessage.
at com.amazon.sqs.javamessaging.AmazonSQSMessagingClientWrapper.handleException(AmazonSQSMessagingClientWrapper.java:422)
at com.amazon.sqs.javamessaging.AmazonSQSMessagingClientWrapper.receiveMessage(AmazonSQSMessagingClientWrapper.java:339)
at com.amazon.sqs.javamessaging.SQSMessageConsumerPrefetch.getMessages(SQSMessageConsumerPrefetch.java:248)
at com.amazon.sqs.javamessaging.SQSMessageConsumerPrefetch.run(SQSMessageConsumerPrefetch.java:207)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Address already in use: connect
Update: I looked into the problem and it seems that new sockets keep being created until every socket is exhausted.
My Spring JMS version is 4.3.10.
To replicate this problem, use the above configuration with max connections set to 200 and concurrency set to 50-100, and push some 40k messages to the SQS queue. You can use https://github.com/adamw/elasticmq as a local stack server that replicates Amazon SQS. Once that is set up, comment out the @JmsListener and use SoapUI load testing to call send-message and fire many messages; because the @JmsListener annotation is commented out, nothing consumes from the queue. Once you see that you have sent 40k messages, stop, uncomment @JmsListener, and restart the server.
Update:
DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
factory.setConnectionFactory(connectionFactory);
factory.setDestinationResolver(new DynamicDestinationResolver());
factory.setErrorHandler(Throwable::printStackTrace);
factory.setConcurrency("50-100");
factory.setSessionAcknowledgeMode(Session.CLIENT_ACKNOWLEDGE);
return factory;
Update:
SQSConnectionFactory connectionFactory = new SQSConnectionFactory(new ProviderConfiguration(), amazonSQSclient);
Update:
Client configuration details:
Protocol: HTTP
Max connections: 200
Update:
I used the CachingConnectionFactory class, but it seems (from what I read on Stack Overflow and in the official documentation) that one should not use CachingConnectionFactory together with DefaultJmsListenerContainerFactory:
https://stackoverflow.com/a/21989895/5871514
It gives the same error that I got before, though.
Update:
My goal is to reach 500 TPS, i.e. I should be able to consume that much. I tried this approach and it seems I can reach 100-200, but not more than that. Plus, this thing becomes a blocker at high concurrency if you use it. If you have a better solution to achieve it, I am all ears.
Update:
I am using AmazonSQSClient.
Starvation on the Consumer
One possible optimization that JMS clients tend to implement is a message consumption buffer, or "prefetch". This buffer is sometimes tunable via the number of messages or via a buffer size in bytes.
The intention is to prevent the consumer from going to the server every single time it wants a message, by pulling multiple messages in a batch instead.
In an environment where you have many "fast consumers" (which is the opinionated view these libraries may take), this prefetch is set to a somewhat high default in order to minimize these round trips.
However, in an environment with slow message consumers, this prefetch can be a problem. A slow consumer holds up consumption of the messages it has prefetched away from the faster consumers. In a highly concurrent environment, this can cause starvation quickly.
That being the case, the SQSConnectionFactory has a property for this:
SQSConnectionFactory sqsConnectionFactory = new SQSConnectionFactory( new ProviderConfiguration(), amazonSQSclient);
sqsConnectionFactory.setNumberOfMessagesToPrefetch(0);
Starvation on the Producer (i.e. via JmsTemplate)
It's very common for these JMS implementations to expect to be interfaced with the broker via some intermediary. These intermediaries actually cache and reuse connections, or use a pooling mechanism to reuse them. In the Java EE world, this is usually taken care of by a JCA adapter or some other facility of the Java EE server.
Because of the way Spring JMS works, it expects an intermediary delegate for the ConnectionFactory to exist to do this caching/pooling. Otherwise, when Spring JMS wants to talk to the broker, it will open a new connection and session (!) every time you want to do something with it.
To solve this, Spring provides a few options. The simplest is the CachingConnectionFactory, which caches a single Connection and allows many Sessions to be opened on that Connection. A simple way to add this to your @Configuration above would be something like:
@Bean
public ConnectionFactory connectionFactory(AmazonSQSClient amazonSQSclient) {
    SQSConnectionFactory sqsConnectionFactory =
            new SQSConnectionFactory(new ProviderConfiguration(), amazonSQSclient);
    // Doing the following is key!
    CachingConnectionFactory connectionFactory = new CachingConnectionFactory();
    connectionFactory.setTargetConnectionFactory(sqsConnectionFactory);
    // Set the connectionFactory properties to your liking here...
    return connectionFactory;
}
If you want something fancier as a JMS pooling solution (which will pool Connections and MessageProducers for you in addition to multiple Sessions), you can use the reasonably new PooledJMS project's JmsPoolConnectionFactory, or the like, from their library.
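For example, wiring the SQSConnectionFactory through pooled-jms might look roughly like this (a sketch only; the bean name and max-connections value are examples, and it assumes the pooled-jms artifact is on your classpath):

import org.messaginghub.pooled.jms.JmsPoolConnectionFactory;

import javax.jms.ConnectionFactory;

@Bean
public ConnectionFactory pooledConnectionFactory(AmazonSQSClient amazonSQSclient) {
    SQSConnectionFactory sqsConnectionFactory =
            new SQSConnectionFactory(new ProviderConfiguration(), amazonSQSclient);

    // Wrap the SQS factory so Connections (and Sessions) are reused instead of
    // being created for every receive/send
    JmsPoolConnectionFactory poolingFactory = new JmsPoolConnectionFactory();
    poolingFactory.setConnectionFactory(sqsConnectionFactory);
    poolingFactory.setMaxConnections(8); // upper bound on pooled Connections (example value)
    return poolingFactory;
}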

Gremlin server withRemote connection closed - how to reconnect automatically?

I am using withRemote to connect my Java application to a Gremlin Server running in AWS with the DynamoDB storage backend. I am getting a connection timeout after a few seconds (~3.3 seconds):
org.apache.tinkerpop.gremlin.process.remote.RemoteConnectionException: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.nio.channels.ClosedChannelException]]
I need to figure out how to reconnect, which means detecting that the connection is closed, and I am not sure how to detect that. I get the above exception when I use the graph traversal. Is there a way to discover it beforehand and reconnect, or is there a configuration option that allows reconnecting automatically (like creating a new connection before this one closes) so my application is always connected?
In case you need it, this is how I am making the connection - currently the connection part is a singleton created when the application starts:
this.graph = EmptyGraph.instance();
GryoMessageSerializerV1d0 gryoMessageSerializerV1d0 = new GryoMessageSerializerV1d0(
        GryoMapper.build().addRegistry(JanusGraphIoRegistry.getInstance()));
this.cluster = Cluster.build()
        .serializer(gryoMessageSerializerV1d0)
        .addContactPoint(configuration.getString("graphDb.host", "localhost"))
        .port(configuration.getInt("graphDb.port", 8182))
        .create();
this.graphTraversalSource = this.graph.traversal().withRemote(DriverRemoteConnection.using(cluster));
I feel like this problem is already solved by the connection.keepAlive configuration option. It defaults to 180 seconds, so it's longer than the 60-second timeout in your load balancer, which is why it gives up.
That said, the driver should be reconnecting on its own. It constantly tries to do that, given the connectionPool.reconnectInterval, but perhaps there is a condition where you're quickly exhausting all the connections to the point of getting that error... not sure. Either way, hopefully the
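To be explicit, those knobs can be set right on the Cluster builder from your snippet (a sketch; the values are examples, and it assumes a driver version whose builder exposes keepAliveInterval/reconnectInterval):

Cluster cluster = Cluster.build()
        .serializer(gryoMessageSerializerV1d0)
        .addContactPoint(configuration.getString("graphDb.host", "localhost"))
        .port(configuration.getInt("graphDb.port", 8182))
        .keepAliveInterval(30 * 1000)   // ms; send keep-alives well inside the load balancer's idle timeout
        .reconnectInterval(1000)        // ms; how often the pool retries a dead host
        .create();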

Timeout implementation of JPA transactions and Session invalidation

I have been maintaining an application which uses Wicket + JPA + Spring. Recently we got many 5XX errors in the logs (more than our threshold). During that time there were some general problems due to unstable response times of the mainframe DB2, which is the backend for our application.
But even after the mainframe was OK again, the application servers did not return to normal.
There are a lot of hanging transactions (from my application).
There are many threads in the server that may be hung.
As users keep logging in or accessing links in the application during that time, the situation becomes worse.
When I look at the WebSphere logs I find the following exceptions:
00000035 ThreadMonitor W WSVR0605W: Thread "WebContainer : 88" (000005ac)
has been active for 637111 milliseconds and may be hung.
There is/are 43 thread(s) in total in the server that may be hung.
In application logs i found following exceptions:
-->CouldNotLockPageException: Could not lock page 4. Attempt lasted 3 minutes
-->DefaultExceptionMapper - Connection lost, give up responding.
org.apache.wicket.protocol.http.servlet.ResponseIOException:
com.ibm.wsspi.webcontainer.ClosedConnectionException: OutputStream encountered error during
write.
--> JDBCExceptionReporter - [jcc][t4][2030][11211][3.67.27] A communication error occurred
during operations on the connection's underlying socket, socket input stream,
or socket output stream.
Error location: Reply.fill() - socketInputStream.read (-1). Message:
Connection reset. ERRORCODE=-4499, SQLSTATE=08001DSRA0010E: SQL State = 08001, Error Code = - 4.499
Now we are working on solutions to this problem. The following are the two solutions we are considering as of now.
1. I have gone through many forums and found that whenever we get a CouldNotLockPageException it would be better to invalidate the session and force the user back to the login page. Currently we do not have a session invalidation (logout) mechanism, so we will implement one.
2. We need to implement transaction timeouts so that we can stop hanging transactions.
I need a solution for this problem from the Java or server side. Here we are using the Wicket, JPA and Spring frameworks. I have a few questions:
1. How can we implement transaction timeouts in the above frameworks?
2. Will invalidating the session stop hanging transactions or threads that may be hung?
Since you are already using Spring, it's as simple as this:
@Transactional(timeout = 300)
The @Transactional annotation allows you to supply a timeout value (in seconds), and the transaction manager will forward it to the JTA transaction manager or your DataSource connection pool. It works nicely with the Bitronix Transaction Manager, which automatically picks it up.
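For example, on a Spring service method (a minimal sketch; the service name is made up):

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class AccountService {

    // Abort the transaction if the unit of work takes longer than 5 minutes
    @Transactional(timeout = 300)
    public void updateAccount(Long accountId) {
        // JPA work here; a hanging DB2 call will now time out instead of blocking forever
    }
}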
You also need to make sure java.sql.Connection objects are always closed and transactions are always committed (when all operations succeed) or rolled back on failure.
Invalidating the user HTTP session has nothing to do with JDBC connections. Your JDBC connection should always be committed/rolled back and closed (which, in the case of connection pooling, releases the connection back to the pool).
And make sure the max pool size is not greater than your DB's max concurrent connections setting.

Spring Integration - Reliable TCP for high volume application

I'm using Spring Integration for a TCP server which keeps connections to a few thousand clients. I need the server to throttle clients in case of excessive load and not to lose messages.
My server configuration:
<task:executor id="myTaskExecutor"
pool-size="4-8"
queue-capacity="0"
rejection-policy="CALLER_RUNS" />
<int-ip:tcp-connection-factory id="serverTcpConFact"
type="server"
port="60000"
using-nio="true"
single-use="false"
so-timeout="300000"
task-executor="myTaskExecutor" />
<int-ip:tcp-inbound-channel-adapter id="tcpInboundAdapter"
channel="tcpInbound"
connection-factory="serverTcpConFact" />
<channel id="tcpInbound" />
<service-activator input-channel="tcpInbound"
ref="myService"
method="test" />
<beans:bean id="myService" class="org.test.tcpserver.MyService" />
Since the default task executor for the connection factory is unbounded, I use a pooled task executor to prevent out of memory errors.
A simple client for load testing:
import java.io.DataOutputStream;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

public class TCPClientTest {

    static Socket socket;
    static List<Socket> sl = new ArrayList<>();
    static DataOutputStream out;

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 10000; i++) {
            socket = new Socket("localhost", 60000);
            sl.add(socket);
            out = new DataOutputStream(socket.getOutputStream());
            out.writeBytes("connection " + i + "\r\n");
            System.out.println("Using connection #" + i);
        }
        System.in.read();
    }
}
When I run it, the server only receives about 10-20 messages, and then the client gets a "Connection refused: connect" exception. After that the server can't accept any new connections anymore, even after the connection timeout. Increasing the pool size only helps to get a few more messages through.
EDIT
I'm using Spring Integration 3.0.2.RELEASE. For production I'm using 8-40 threads, but that only makes this test fail later, after several hundred connections.
MyService.test() doesn't do much...
public class MyService {

    public void test(byte[] input) {
        System.out.println("Received: " + new String(input));
    }
}
Here is the log with trace level logging.
Sources
I see what the problem is, please open a JIRA issue.
The issue is the CALLER_RUNS rejection policy with a 0 length queue in the executor.
There is one thread that handles all IO events (usually myTaskExecutor-1); when a read event fires, it queues a task to read the data; the reader thread then queues a task to assemble the data (which blocks until a complete message - in your case terminated by the CRLF - arrives).
In this case, when there are no threads available, the CALLER_RUNS policy means the IO selector thread does the read itself and becomes the assembler thread, which then blocks waiting for data that will never arrive, because the selector is the very thread that would have read that data (normally it would read it after scheduling a different thread to do the blocking assembly). Because it is blocked, it can't handle new accept events.
Here is a log from my test showing the issue...
TRACE: [May-18 10:43:38,923][myTaskExecutor-1] tcp.connection.TcpNioServerConnectionFactory - Port 60000 SelectionCount: 2
DEBUG: [May-18 10:43:38,923][myTaskExecutor-1] tcp.connection.TcpNioConnection - localhost:58509:60000:bdc36c59-c31b-470e-96c3-6270e7c46a2f Reading...
DEBUG: [May-18 10:43:38,924][myTaskExecutor-1] tcp.connection.TcpNioConnection - localhost:58509:60000:bdc36c59-c31b-470e-96c3-6270e7c46a2f Running an assembler
TRACE: [May-18 10:43:38,924][myTaskExecutor-1] tcp.connection.TcpNioConnection - localhost:58509:60000:bdc36c59-c31b-470e-96c3-6270e7c46a2f Nio message assembler running...
DEBUG: [May-18 10:43:38,926][myTaskExecutor-1] tcp.serializer.ByteArrayCrLfSerializer - Available to read:0
The second line shows the selector thread being used to do the read; it detects that an assembler is needed for this socket and becomes the assembler itself, blocking while it waits for data.
Do you really believe there will be an issue using an unbounded task executor? These events are generally pretty short lived so threads will be recycled pretty quickly.
Increasing the executor's queue capacity above 0 should help too, but it won't completely assure the problem won't happen (although a large queue size is unlikely to be hit).
I am not yet sure how to fix this, aside from using a dedicated task executor for the IO selector and reader threads so they will never be used as an assembler.
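For reference, the Java-config equivalent of your <task:executor> with a non-zero queue would be something like this (a sketch only; the capacity value is just an example):

import org.springframework.context.annotation.Bean;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

import java.util.concurrent.ThreadPoolExecutor;

@Bean
public ThreadPoolTaskExecutor myTaskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(4);
    executor.setMaxPoolSize(8);
    // A non-zero queue means a busy pool queues the work instead of forcing the
    // IO selector thread to run it (and block) via CALLER_RUNS
    executor.setQueueCapacity(1000);
    executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
    return executor;
}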
Yesterday I wrote a sample to create high-performance TCP server code using Spring Integration. I tested it successfully with 1000 concurrent client requests using the JMeter TCP sampler.
Here is the code - https://github.com/rajeshgheware/spring-integration-samples - including the JMeter test config file.
I successfully tested with 1000 concurrent TCP client requests on a 64-bit laptop with an Intel Core i5 M520 2.4GHz (both the server code and the JMeter test running on this machine).
I also tried with 1500 concurrent client requests but observed that the server could not honor many of them. I will keep trying to enhance this code to serve 10000 concurrent client requests (I know I may need to get a good EC2 machine from Amazon for this test :) )
