I'm using Spring Integration for a TCP server that keeps connections to a few thousand clients. I need the server to throttle clients under excessive load without losing messages.
My server configuration:
<task:executor id="myTaskExecutor"
    pool-size="4-8"
    queue-capacity="0"
    rejection-policy="CALLER_RUNS" />
<int-ip:tcp-connection-factory id="serverTcpConFact"
    type="server"
    port="60000"
    using-nio="true"
    single-use="false"
    so-timeout="300000"
    task-executor="myTaskExecutor" />
<int-ip:tcp-inbound-channel-adapter id="tcpInboundAdapter"
    channel="tcpInbound"
    connection-factory="serverTcpConFact" />
<channel id="tcpInbound" />
<service-activator input-channel="tcpInbound"
    ref="myService"
    method="test" />
<beans:bean id="myService" class="org.test.tcpserver.MyService" />
Since the default task executor for the connection factory is unbounded, I use a pooled task executor to prevent out of memory errors.
A simple client for load testing:
public class TCPClientTest {
    static Socket socket;
    static List<Socket> sl = new ArrayList<>();
    static DataOutputStream out;

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 10000; i++) {
            socket = new Socket("localhost", 60000);
            sl.add(socket);
            out = new DataOutputStream(socket.getOutputStream());
            out.writeBytes("connection " + i + "\r\n");
            System.out.println("Using connection #" + i);
        }
        System.in.read();
    }
}
When I run it, the server only receives about 10-20 messages, and then the client gets a "Connection refused: connect" exception. After that the server can't accept any new connections, even after the connection timeout. Increasing the pool size only lets a few more messages through.
EDIT
I'm using Spring Integration 3.0.2.RELEASE. In production I'm using 8-40 threads, but that only makes this test fail later, after several hundred connections.
MyService.test() doesn't do much...
public class MyService {
    public void test(byte[] input) {
        System.out.println("Received: " + new String(input));
    }
}
Here is the log with trace level logging.
Sources
I see what the problem is, please open a JIRA issue.
The issue is the CALLER_RUNS rejection policy with a 0 length queue in the executor.
There is one thread that handles all IO events (usually myTaskExecutor-1); when a read event fires, it queues a task to read the data, and the reader thread in turn queues a task to assemble the data (which blocks until a complete message - in your case terminated by the CRLF - arrives).
In this case, when there are no threads available, the CALLER_RUNS policy means the IO selector thread does the read itself and then becomes the assembler thread, blocking while it waits for data that will never arrive - it is the very thread that would have read that data after scheduling a different thread to do the blocking. Because it is blocked, it can't handle new accept events.
Here is a log from my test showing the issue...
TRACE: [May-18 10:43:38,923][myTaskExecutor-1] tcp.connection.TcpNioServerConnectionFactory - Port 60000 SelectionCount: 2
DEBUG: [May-18 10:43:38,923][myTaskExecutor-1] tcp.connection.TcpNioConnection - localhost:58509:60000:bdc36c59-c31b-470e-96c3-6270e7c46a2f Reading...
DEBUG: [May-18 10:43:38,924][myTaskExecutor-1] tcp.connection.TcpNioConnection - localhost:58509:60000:bdc36c59-c31b-470e-96c3-6270e7c46a2f Running an assembler
TRACE: [May-18 10:43:38,924][myTaskExecutor-1] tcp.connection.TcpNioConnection - localhost:58509:60000:bdc36c59-c31b-470e-96c3-6270e7c46a2f Nio message assembler running...
DEBUG: [May-18 10:43:38,926][myTaskExecutor-1] tcp.serializer.ByteArrayCrLfSerializer - Available to read:0
The second line shows the selector thread being used to do the read; it detects that an assembler is needed for this socket and becomes the assembler, blocking while it waits for data.
Do you really believe there will be an issue using an unbounded task executor? These events are generally pretty short lived so threads will be recycled pretty quickly.
Increasing the executor's queue capacity above 0 should help too, but it won't completely guarantee the problem can't happen (although a large queue is unlikely to fill up).
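For illustration only, here is a hedged Java-config sketch of the same executor with a bounded queue instead of a zero-length one; the pool and queue sizes are placeholders, not recommendations:
@Bean
public ThreadPoolTaskExecutor myTaskExecutor() {
    // Equivalent of the <task:executor> above, but with a queue capacity > 0 so the
    // CALLER_RUNS policy is far less likely to run work on the IO selector thread.
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(4);
    executor.setMaxPoolSize(8);
    executor.setQueueCapacity(1000); // illustrative; anything > 0 reduces the risk
    executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
    return executor;
}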
I am not yet sure how to fix this, aside from using a dedicated task executor for the IO selector and reader threads so they can never be used as an assembler.
Yesterday I wrote a sample of high-performance TCP server code using Spring Integration. I tested it successfully with 1000 concurrent client requests using the JMeter TCP sampler.
Here is the code - https://github.com/rajeshgheware/spring-integration-samples - including the JMeter test config file.
I successfully tested with 1000 concurrent TCP client requests on a 64-bit laptop with an Intel Core i5 M520 2.4 GHz (both the server code and the JMeter test running on this machine).
I also tried 1500 concurrent client requests but observed that the server could not honor many of them. I will keep trying to enhance this code to serve 10000 concurrent client requests (I know I may need a good EC2 machine from Amazon for this test :) ).
Related
So, I set the Spring JMS concurrency to 50-100, allowing max connections up to 200. Everything works as expected, but the problem appears when I try to retrieve 100k messages from the queue - i.e. there are 100k messages on my SQS queue and I read them through the normal Spring JMS approach.
@JmsListener
public void process(String message) {
    count++;
    System.out.println(count);
    // code
}
I see all the logs in my console, but after around 17k messages it starts throwing exceptions,
something like: AWS SDK exception: port already in use.
Why do I see this exception, and how do I get rid of it?
I tried looking on the internet but couldn't find anything.
My settings:
Concurrency: 50-100
Messages per task: 50
Acknowledge mode: client acknowledge
timestamp=10:27:57.183, level=WARN , logger=c.a.s.j.SQSMessageConsumerPrefetch, message={ConsumerPrefetchThread-30} Encountered exception during receive in ConsumerPrefetch thread,
javax.jms.JMSException: AmazonClientException: receiveMessage.
at com.amazon.sqs.javamessaging.AmazonSQSMessagingClientWrapper.handleException(AmazonSQSMessagingClientWrapper.java:422)
at com.amazon.sqs.javamessaging.AmazonSQSMessagingClientWrapper.receiveMessage(AmazonSQSMessagingClientWrapper.java:339)
at com.amazon.sqs.javamessaging.SQSMessageConsumerPrefetch.getMessages(SQSMessageConsumerPrefetch.java:248)
at com.amazon.sqs.javamessaging.SQSMessageConsumerPrefetch.run(SQSMessageConsumerPrefetch.java:207)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Address already in use: connect
Update: I looked into the problem and it seems that new sockets keep being created until every socket is exhausted.
My Spring JMS version is 4.3.10.
To replicate the problem, use the configuration above with max connections set to 200 and concurrency set to 50-100, and push some 40k messages to the SQS queue. You can use https://github.com/adamw/elasticmq as a local stack server that replicates Amazon SQS. Then comment out the JMS listener and use SoapUI load testing to call the send-message endpoint and fire many messages; because the @JmsListener annotation is commented out, nothing will consume messages from the queue. Once you see that you have sent 40k messages, stop, uncomment @JmsListener and restart the server.
Update :
DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
factory.setConnectionFactory(connectionFactory);
factory.setDestinationResolver(new DynamicDestinationResolver());
factory.setErrorHandler(Throwable::printStackTrace);
factory.setConcurrency("50-100");
factory.setSessionAcknowledgeMode(Session.CLIENT_ACKNOWLEDGE);
return factory;
Update :
SQSConnectionFactory connectionFactory = new SQSConnectionFactory( new ProviderConfiguration(), amazonSQSclient);
Update :
Client configuration details :
Protocol : HTTP
Max connections : 200
Update :
I used the CachingConnectionFactory class, but I have since read on Stack Overflow and in the official documentation not to use CachingConnectionFactory together with DefaultJmsListenerContainerFactory:
https://stackoverflow.com/a/21989895/5871514
It gives the same error that I got before, though.
Update:
My goal is 500 TPS, i.e. I should be able to consume that many messages per second. I tried this method and it seems I can reach 100-200, but not more than that, plus it becomes a blocker at high concurrency. If you have a better solution to achieve this, I am all ears.
Update:
I am using AmazonSQSClient.
Starvation on the Consumer
One possible optimization that JMS clients tend to implement is a message consumption buffer, or "prefetch". This buffer is sometimes tunable via a number of messages or a buffer size in bytes.
The intention is to prevent the consumer from going to the server every single time it receives a message; instead it pulls multiple messages in a batch.
In an environment where you have many "fast consumers" (which is the opinionated view these libraries may take), this prefetch is set to a somewhat high default in order to minimize these round trips.
However, in an environment with slow message consumers, this prefetch can be a problem: a slow consumer holds up consumption of the messages it has prefetched, keeping them away from faster consumers. In a highly concurrent environment this can quickly cause starvation.
That being the case, the SQSConnectionFactory has a property for this:
SQSConnectionFactory sqsConnectionFactory = new SQSConnectionFactory( new ProviderConfiguration(), amazonSQSclient);
sqsConnectionFactory.setNumberOfMessagesToPrefetch(0);
Starvation on the Producer (i.e. via JmsTemplate)
It's very common for these JMS implementations to expect to be interfaced to the broker via some intermediary. These intermediaries cache and reuse connections or use a pooling mechanism to reuse them. In the Java EE world this is usually taken care of by a JCA adapter or some other mechanism on the application server.
Because of the way Spring JMS works, it expects an intermediary delegate for the ConnectionFactory to exist to do this caching/pooling. Otherwise, when Spring JMS wants to connect to the broker, it will attempt to open a new connection and session (!) every time you want to do something with the broker.
To solve this, Spring provides a few options. The simplest is the CachingConnectionFactory, which caches a single Connection and allows many Sessions to be opened on that Connection. A simple way to add this to your @Configuration above would be something like:
@Bean
public ConnectionFactory connectionFactory(AmazonSQSClient amazonSQSclient) {
    SQSConnectionFactory sqsConnectionFactory = new SQSConnectionFactory(new ProviderConfiguration(), amazonSQSclient);
    // Doing the following is key!
    CachingConnectionFactory connectionfactory = new CachingConnectionFactory();
    connectionfactory.setTargetConnectionFactory(sqsConnectionFactory);
    // Set the connectionfactory properties to your liking here...
    return connectionfactory;
}
If you want something more fancy as a JMS pooling solution (which will pool Connections and MessageProducers for you in addition to multiple Sessions), you can use the reasonably new PooledJMS project's JmsPoolConnectionFactory, or the like, from their library.
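As a rough sketch of that wiring (assuming PooledJMS's JmsPoolConnectionFactory from the org.messaginghub:pooled-jms artifact; the pool size is illustrative), it might look like:
@Bean
public ConnectionFactory pooledConnectionFactory(AmazonSQSClient amazonSQSclient) {
    SQSConnectionFactory sqsConnectionFactory = new SQSConnectionFactory(new ProviderConfiguration(), amazonSQSclient);
    JmsPoolConnectionFactory pooled = new JmsPoolConnectionFactory();
    pooled.setConnectionFactory(sqsConnectionFactory); // delegate the real connections to SQS
    pooled.setMaxConnections(8);                       // cap physical connections instead of opening one per operation
    return pooled;
}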
I am experiencing an issue in Play 2.5.8 (Java) where database-related service endpoints start timing out after a few days, even though the server CPU & memory usage seems fine. Endpoints that do not access the DB continue to work perfectly.
The application runs on a t2.medium EC2 instance with a t2.medium MySQL RDS, both in the same availability zone. Most HTTP calls do lookups/updates to the database with around 8-12 requests per second, and there are also ±800 WebSocket connections/actors with ±8 requests/second (90% of the WebSocket messages does not access the database). DB operations are mostly simple lookups & updates taking around 100ms.
When using only the default thread pool it took about 2 days to reach the deadlock, and after moving the database requests to a separate thread pool as per https://www.playframework.com/documentation/2.5.x/ThreadPools#highly-synchronous, it improved but only to about 4 days.
This is my current thread config in application.conf:
akka {
  actor {
    guardian-supervisor-strategy = "actors.RootSupervisionStrategy"
  }
  loggers = ["akka.event.Logging$DefaultLogger",
             "akka.event.slf4j.Slf4jLogger"]
  loglevel = WARNING

  ## This pool handles all HTTP & WebSocket requests
  default-dispatcher {
    executor = "thread-pool-executor"
    throughput = 1
    thread-pool-executor {
      fixed-pool-size = 64
    }
  }

  db-dispatcher {
    type = Dispatcher
    executor = "thread-pool-executor"
    throughput = 1
    thread-pool-executor {
      fixed-pool-size = 210
    }
  }
}
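For completeness, this is roughly how the blocking DB calls get pushed onto the db-dispatcher (a hedged sketch rather than my exact code; the controller and the blocking lookup are placeholders):
import akka.actor.ActorSystem;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.Executor;
import javax.inject.Inject;
import play.mvc.Controller;
import play.mvc.Result;
import play.mvc.Results;

public class UserController extends Controller {
    private final ActorSystem actorSystem;

    @Inject
    public UserController(ActorSystem actorSystem) {
        this.actorSystem = actorSystem;
    }

    public CompletionStage<Result> show(long id) {
        // Look up the dedicated dispatcher from application.conf and run the
        // blocking JDBC work there, keeping the default pool free.
        Executor dbExecutor = actorSystem.dispatchers().lookup("db-dispatcher");
        return CompletableFuture
                .supplyAsync(() -> findUserBlocking(id), dbExecutor)
                .thenApply(body -> Results.ok(body));
    }

    private String findUserBlocking(long id) {
        return "user-" + id; // stands in for a blocking JDBC lookup (~100 ms)
    }
}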
Database configuration:
play.db.pool="default"
play.db.prototype.hikaricp.maximumPoolSize=200
db.default.driver=com.mysql.jdbc.Driver
I have played around with the number of connections in the DB pool and with the sizes of the default and db-dispatcher pools, but it doesn't seem to make any difference. It feels like I'm missing something fundamental about Play's thread pools & configuration, as I don't think this load should be an issue for Play to handle.
After more investigation I found that the issue is not related to thread pool configuration at all, but rather to TCP connections that build up due to WebSocket reconnections until the server (or the Play framework) cannot accept any more connections. When this happens, only established TCP connections are serviced, which mostly means the established WebSocket connections.
I have not yet determined why the connections are not managed/closed properly.
My issue relates to this question:
Play 2.5 WebSocket Connection Build
I have a Spring application that consumes messages on a specific port (say 9001), restructures them and then forwards to a Rabbit MQ server. The code segment is:
private void send(String routingKey, String message) throws Exception {
    String exchange = applicationConfiguration.getAMQPExchange();
    String exchangeType = applicationConfiguration.getAMQPExchangeType();
    Connection connection = myConnection.getConnection();
    Channel channel = connection.createChannel();
    channel.exchangeDeclare(exchange, exchangeType);
    channel.basicPublish(exchange, routingKey, null, message.getBytes());
    log.debug(" [CORE: AMQP] Sent message with key {} : {}", routingKey, message);
}
If the RabbitMQ server fails (crashes, runs out of RAM, is turned off, etc.) the code above blocks, preventing the upstream service from receiving messages (a bad thing). I am looking for a way to prevent this behaviour while not losing messages, so that at some point in the future they can be resent.
I am not sure how best to address this. One option may be to queue the messages to a disk file and then use a separate thread to read them and forward them to the RabbitMQ server?
If I understand correctly, the issue you are describing is a known JDK socket behaviour when the connection is lost mid-write. See this mailing list thread: http://markmail.org/thread/3vw6qshxsmu7fv6n.
Note that if RabbitMQ is shut down, the TCP connection should be closed in a way that's quickly observable by the client. However, it is true that stale TCP connections can take a while to be detected, which is why RabbitMQ's core protocol has heartbeats. Set the heartbeat interval to a low value (say, 6-8 seconds) and the client itself will notice an unresponsive peer within that amount of time.
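For example, with the RabbitMQ Java client (a sketch; the 6-second value just mirrors the suggestion above):
ConnectionFactory factory = new ConnectionFactory(); // com.rabbitmq.client.ConnectionFactory
factory.setHost("localhost");
factory.setRequestedHeartbeat(6); // seconds; an unresponsive peer is noticed within a couple of missed heartbeats
Connection connection = factory.newConnection();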
You need to use Publisher confirms [1], but also account for the fact that the app itself can go down right before sending a message. As you rightly point out, having a disk-based WAL (write-ahead log) is a common solution for this problem. Note that it is both quite tricky to get right and still leaves some time window where your app process shutting down can result in an unpublished and unlogged message.
No promises on the time frame but the idea of adding WAL to the Java client has been discussed.
[1] http://www.rabbitmq.com/confirms.html
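As a rough illustration, publisher confirms in the send(...) method from the question might look something like this (the timeout and the retry handling are placeholders):
Channel channel = connection.createChannel();
channel.confirmSelect(); // put the channel into confirm mode
channel.exchangeDeclare(exchange, exchangeType);
channel.basicPublish(exchange, routingKey, null, message.getBytes());
// waitForConfirms(5000) throws TimeoutException if no confirm arrives in time
if (!channel.waitForConfirms(5000)) {
    // the broker nacked the message: persist it (e.g. to a write-ahead log) and retry later
}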
I have been developing my first TCP/socket-based application with Apache Mina; it looks great and makes things easy. I just want to ask a question here about Mina.
The server imposes an idle timeout of 5 seconds, after which it terminates the socket connection, so we have to send a periodic heartbeat (echo message / keepalive) to make sure the connection stays alive - a sort of keepalive mechanism.
One way is to blindly send an echo/heartbeat message just before every 5-second mark. I am thinking there should be a smarter, more intelligent "idle monitor": if I am sending my business messages and never reach the 5-second idle time, I should not issue a heartbeat at all. A heartbeat should be sent only when the whole connection is idle, so that we save bandwidth and keep reads and writes on the socket fast.
You can achieve this by using the KeepAliveFilter (already present in Mina).
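A hedged sketch of wiring that filter on the client connector; the "PING"/"PONG" payloads and the interval values are placeholders for whatever your codec uses:
KeepAliveMessageFactory heartbeatFactory = new KeepAliveMessageFactory() {
    public boolean isRequest(IoSession session, Object message) { return "PING".equals(message); }
    public boolean isResponse(IoSession session, Object message) { return "PONG".equals(message); }
    public Object getRequest(IoSession session) { return "PING"; }  // sent only when the session has been idle
    public Object getResponse(IoSession session, Object request) { return "PONG"; }
};
KeepAliveFilter keepAlive = new KeepAliveFilter(heartbeatFactory, IdleStatus.READER_IDLE);
keepAlive.setRequestInterval(3); // send a ping only after 3 idle seconds
keepAlive.setRequestTimeout(5);  // treat the peer as dead if no pong arrives within 5 seconds
connector.getFilterChain().addLast("keep-alive", keepAlive);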
Alternatively, you can achieve a smarter way of sending the echo/heartbeat by setting the client's session idle timeout a bit smaller than the server's idle timeout. For example:
For the server side:
NioSocketAcceptor.getSessionConfig().setIdleTime(IdleStatus.BOTH_IDLE, 5);
and for the client side it would be:
NioSocketConnector.getSessionConfig().setIdleTime(IdleStatus.BOTH_IDLE, 3);
Now, if there is no communication for, let's say, 3 seconds, sessionIdle will be triggered on the client side (and not on the server side, since the timeout there is 5 seconds) and you can send an echo. This keeps the session alive, and the echo is sent only when the session is idle.
Note: I am assuming that on session idle the session is closed by the server. If it is the other way around, you will need to swap the idle timeout values (e.g. 3 seconds for the server and 5 seconds for the client) and the echo will be sent from the server.
(I hope I'm understanding the question correctly)
I was having trouble keeping my session alive and this question came up on Google search results so I'm hoping someone else will find it useful:
@Test
public void testClientWithHeartBeat() throws Exception {
    SshClient client = SshClient.setUpDefaultClient();
    client.getProperties().put(ClientFactoryManager.HEARTBEAT_INTERVAL, "500");
    client.start();
    ClientSession session = client.connect("localhost", port).await().getSession();
    session.authPassword("smx", "smx").await().isSuccess();
    ClientChannel channel = session.createChannel(ClientChannel.CHANNEL_SHELL);
    int state = channel.waitFor(ClientChannel.CLOSED, 2000);
    assertTrue((state & ClientChannel.CLOSED) == 0);
    channel.close(false);
    client.stop();
}
(Source: https://issues.apache.org/jira/browse/SSHD-185)
In newer versions (e.g. version 2.8.0), enabling heartbeats changed to CoreModuleProperties.HEARTBEAT_INTERVAL.set(client, Duration.ofMillis(500));
I'm not sure I totally understand your question, but you can send a heartbeat in an overridden sessionIdle method of the IoHandlerAdapter. You don't necessarily need to close a session just because Mina on the server side fires an idle event. As for a more intelligent way of maintaining an active connection between a server and a client without this type of heartbeat communication, I have never heard of one.
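A minimal sketch of that sessionIdle approach (the "heartbeat" payload is a placeholder for whatever message your codec understands):
public class HeartbeatHandler extends IoHandlerAdapter {
    @Override
    public void sessionIdle(IoSession session, IdleStatus status) {
        // fires only when the connection has been idle, so business traffic suppresses heartbeats
        session.write("heartbeat");
    }
}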
Here is an interesting read on how Microsoft handles their heartbeat in ActiveSync. I personally used this methodology when using Mina in my client/server application. Hope this helps you some.
As the topic suggests I have a server and some clients.
The server accepts I/O connections concurrently (no queueing in socket connections) but I have this troubling issue and I do not know how to bypass it!
If I force a client to throw an I/O exception, the server detects it and terminates the client thread correctly (verified with Task Manager on Windows and System Monitor on Ubuntu). But if I emulate an I/O operation that "hangs", e.g. Thread.sleep(60*1000); or
private static Object lock = new Object();

synchronized (lock) {
    while (true) {
        try {
            lock.wait();
        } catch (InterruptedException e) {
            /* Foo */
        }
    }
}
then all subsequent I/O operations (connection & data transfer) seem to block or wait until the "hanging" client is terminated. The application makes use of an ExecutorService, so if the "hanging" client does not complete its operations within the given time limit, the task times out and the client is forced to exit. The "blocked" I/Os then resume, but I wonder why the server doesn't accept any connections or perform any I/O operations while a client "hangs"?
NOTE: The client threading takes place in the server's main loop like this:
while (true) {
    accept client connection;
    submit client task;
        // i.e., the ExecutorService here, in the form
        // spService.submit(new Callable<Tuple<String[], BigDecimal[]>>() {
        //     ... code ... }}).get(taskTimeout, taskTimeUnit);
    check task result & perform cleanup if result is null;
    otherwise continue;
}
The Problem :
This may very well indicate that your server ACCEPTS client connections concurrently, but handles them synchronously. That means that even if a million clients connect successfully at any given time, if any one of them takes a long time (or hangs), it will hold up the others.
The TEST:
To verify this: I would vary the amount of time a client takes to connect by adding Thread.sleep(1000) statements in your clients.
Expected result :
I believe you will see that even adding a single Thread.sleep(1000) statement in your client delays all other connecting clients by 1000 ms.
I think I have found the source of my problems!
I do use a thread-per-client model, but I run my tests locally, i.e. on the same machine, which means all the clients have the same IP - each client shares the same IP as the server! I guess this leaves the server and clients differing only in port number, but since each client is mapped to a different local port for each server connection, the server shouldn't block. I have confirmed that each client and the server use different I/O objects (compared references), and I wrap their sockets via <Input/Output>Streams into BufferedReaders & PrintWriters, but still, when one client hangs all the other clients hang too (so maybe the I/O channels are indeed the same???). I will test this on another machine and check the results back with you! :)
EDIT: Confirmed the erratic behaviour. Even with remote clients, if one hangs the other clients hang too! :/
I don't know why, but I am determined to fix this. It's just pretty weird, since I am quite sure I use one thread per client (the I/O objects differ, the client sockets differ, IPs don't seem to be a problem, I even map each client in the server to a local port of my choice ...).
Maybe I'll switch to NIO if I don't find a solution soon enough.
SOLUTION: Solved the problem! It turned out that the ExecutorService had to be run in a separate thread; otherwise, if a client's I/O blocked, all I/Os would block! That's strange, given that I tried both Executors.newFixedThreadPool(<nThreads>); and Executors.newCachedThreadPool();, and the client actions (i.e. the I/Os) should take place on a new thread for each client.
In any case, I wrapped the calls in a method so that each client instance uses a final ExecutorService baseWorker = Executors.newSingleThreadExecutor();, and I create a new Thread explicitly each time using <Thread instance>.start(); so each one runs in the background. :)
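A hedged sketch of that arrangement (handleClient and the 30-second limit are placeholders): the accept loop stays free, and each client's work runs on its own single-thread executor inside a background thread, so one blocked client can no longer stall the others.
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class PerClientServer {
    public static void main(String[] args) throws IOException {
        ServerSocket serverSocket = new ServerSocket(60000);
        while (true) {
            final Socket client = serverSocket.accept();
            // One background thread per client; the accept loop never blocks on client I/O.
            new Thread(() -> {
                ExecutorService baseWorker = Executors.newSingleThreadExecutor();
                try {
                    Future<?> result = baseWorker.submit(() -> handleClient(client));
                    result.get(30, TimeUnit.SECONDS); // illustrative per-client time limit
                } catch (Exception e) {
                    // timeout or I/O failure: clean up this client only
                } finally {
                    baseWorker.shutdownNow();
                    try { client.close(); } catch (IOException ignored) { }
                }
            }).start();
        }
    }

    private static void handleClient(Socket client) {
        // placeholder for the real per-client protocol handling
    }
}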