Hazelcast Performance Test with Spring Batch throws TargetDisconnectedException - java

I have a Hazelcast server setup with 2 nodes; the server has been healthy and has never shown any issues. My clients are multiple instances of Spring Batch. Each instance has 120 threads and there are five instances, which means there might be around 120*5=600 threads trying to access the IMap to set/get values.
When I start around 3 instances of the batch, the time taken to set values in the IMap goes beyond 50 seconds, and eventually the following exception is thrown:
com.hazelcast.spi.exception.TargetDisconnectedException
at com.hazelcast.client.spi.impl.ClientCallFuture.get(ClientCallFuture.java:128)
at com.hazelcast.client.spi.impl.ClientCallFuture.get(ClientCallFuture.java:111)
at com.hazelcast.client.spi.ClientProxy.invoke(ClientProxy.java:110)
at com.hazelcast.client.proxy.ClientMapProxy.set(ClientMapProxy.java:380)
at com.ebay.app.raptor.dfmailbat.components.cache.client.processor.CacheClient.setDealsResponseInCache(CacheClient.java:101)
at com.ebay.app.raptor.dfmailbat.components.deal.finder.service.manager.DealFinderManager.getDeals(DealFinderManager.java:59)
Each Spring Batch instance has one static IMap instance and uses that same instance to set/get values. Something like this:
static {
ClientConfig clientConfig = new ClientConfig();
List<String> addresses = getClusterAddresses();
if (addresses != null) {
clientConfig.getNetworkConfig().setAddresses(addresses);
s_client = HazelcastClient.newHazelcastClient(clientConfig);
}
else {
s_logger.log(LogLevel.ERROR, "No host in Database for hazelcast client set up");
}
if (s_client != null) {
s_map = s_client.getMap("ItemDealsMap");
}
}
The s_map is used by all the threads to set/get entries from the IMap. I set an eviction time of 12 hours when I put entries into the IMap. I am using Hazelcast 3.3 on both the server and the client. This issue is consistently reproducible when the number of concurrent threads is increased. When I shut down the Spring Batch instances and start them again, it works well. Could you please help me with this?

Related

Hazelcast in SpringBoot Admin to run 2 instances via Docker Swarm

I'm very new to Spring Boot Admin, Hazelcast and Docker Swarm...
What I'm trying to do is run 2 instances of Spring Boot Admin Server in Docker Swarm.
It works fine with one instance. Every feature of SBA works well.
If I set the number of replicas to "2", with the following in swarm, then the login page doesn't work (it shows up but I can't log in, and there is no error in the console):
mode: replicated
replicas: 2
update_config:
parallelism: 1
delay: 60s
failure_action: rollback
order: start-first
monitor: 60s
rollback_config:
parallelism: 1
delay: 60s
failure_action: pause
order: start-first
monitor: 60s
restart_policy:
condition: any
delay: 60s
max_attempts: 3
window: 3600s
My current Hazelcast config is the following (as specified in the Spring Boot Admin docs):
@Bean
public Config hazelcast() {
// This map is used to store the events.
// It should be configured to reliably hold all the data,
// Spring Boot Admin will compact the events, if there are too many
MapConfig eventStoreMap = new MapConfig(DEFAULT_NAME_EVENT_STORE_MAP).setInMemoryFormat(InMemoryFormat.OBJECT)
.setBackupCount(1).setEvictionPolicy(EvictionPolicy.NONE)
.setMergePolicyConfig(new MergePolicyConfig(PutIfAbsentMapMergePolicy.class.getName(), 100));
// This map is used to deduplicate the notifications.
// If data in this map gets lost it should not be a big issue, as it will at most
// lead to the same notification being sent by multiple instances.
MapConfig sentNotificationsMap = new MapConfig(DEFAULT_NAME_SENT_NOTIFICATIONS_MAP)
.setInMemoryFormat(InMemoryFormat.OBJECT).setBackupCount(1).setEvictionPolicy(EvictionPolicy.LRU)
.setMergePolicyConfig(new MergePolicyConfig(PutIfAbsentMapMergePolicy.class.getName(), 100));
Config config = new Config();
config.addMapConfig(eventStoreMap);
config.addMapConfig(sentNotificationsMap);
config.setProperty("hazelcast.jmx", "true");
// WARNING: This sets up a local cluster; change it to fit your needs.
config.getNetworkConfig().getJoin().getMulticastConfig().setEnabled(true);
TcpIpConfig tcpIpConfig = config.getNetworkConfig().getJoin().getTcpIpConfig();
tcpIpConfig.setEnabled(true);
// NetworkConfig network = config.getNetworkConfig();
// InterfacesConfig interfaceConfig = network.getInterfaces();
// interfaceConfig.setEnabled( true )
// .addInterface( "192.168.1.3" );
// tcpIpConfig.setMembers(singletonList("127.0.0.1"));
return config;
}
I guess this information is not enough for you to properly help, but since I don't really understand well how Hazelcast works, I don't know what is useful and what is not. So please don't hesitate to ask me for whatever is needed to help! :)
Do you have any idea what I'm doing wrong?
Many thanks!
Multicast does not work in Docker Swarm with the default overlay driver (at least, that is what is stated here).
I tried to make it run with the Weave network plugin, but without luck.
In my case, it was enough to switch Hazelcast to TCP mode and specify the network on which to search for the other replicas.
Something like this:
-Dhz.discovery.method=TCP
-Dhz.network.interfaces=10.0.251.*
-Dhz.discovery.tcp.members=10.0.251.*

Slow message consumption using AmazonSQSClient

So, I used a concurrency of 50-100 in Spring JMS, allowing up to 200 max connections. Everything works as expected until I try to retrieve 100k messages from the queue, i.e. there are 100k messages on my SQS queue and I am reading them through the normal Spring JMS approach.
@JmsListener
public void process(String message) {
    count++;
    System.out.println(count);
    // code
}
I see all the logs in my console, but after around 17k messages it starts throwing exceptions.
Something like: AWS SDK exception: port already in use.
Why do I see this exception and how do I get rid of it?
I tried looking on the internet for it but couldn't find anything.
My setting :
Concurrency 50-100
Set messages per task :50
Client acknowledged
timestamp=10:27:57.183, level=WARN , logger=c.a.s.j.SQSMessageConsumerPrefetch, message={ConsumerPrefetchThread-30} Encountered exception during receive in ConsumerPrefetch thread,
javax.jms.JMSException: AmazonClientException: receiveMessage.
at com.amazon.sqs.javamessaging.AmazonSQSMessagingClientWrapper.handleException(AmazonSQSMessagingClientWrapper.java:422)
at com.amazon.sqs.javamessaging.AmazonSQSMessagingClientWrapper.receiveMessage(AmazonSQSMessagingClientWrapper.java:339)
at com.amazon.sqs.javamessaging.SQSMessageConsumerPrefetch.getMessages(SQSMessageConsumerPrefetch.java:248)
at com.amazon.sqs.javamessaging.SQSMessageConsumerPrefetch.run(SQSMessageConsumerPrefetch.java:207)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Address already in use: connect
Update : I looked into the problem, and it seems that new sockets keep being created until all sockets are exhausted.
My Spring JMS version is 4.3.10.
To replicate this problem, use the above configuration with max connections set to 200 and concurrency set to 50-100, and push some 40k messages to the SQS queue. One can use https://github.com/adamw/elasticmq as a local server that emulates Amazon SQS. Once that is set up, comment out the @JmsListener annotation and use SoapUI load testing to call send message and fire many messages. Because the @JmsListener annotation is commented out, the application won't consume messages from the queue. Once you see that you have sent 40k messages, stop. Uncomment @JmsListener and restart the server.
Update :
DefaultJmsListenerContainerFactory factory =
new DefaultJmsListenerContainerFactory();
factory.setConnectionFactory(connectionFactory);
factory.setDestinationResolver(new DynamicDestinationResolver());
factory.setErrorHandler(Throwable::printStackTrace);
factory.setConcurrency("50-100");
factory.setSessionAcknowledgeMode(Session.CLIENT_ACKNOWLEDGE);
return factory;
Update :
SQSConnectionFactory connectionFactory = new SQSConnectionFactory( new ProviderConfiguration(), amazonSQSclient);
Update :
Client configuration details :
Protocol : HTTP
Max connections : 200
Update :
I tried the caching connection factory class. I read on Stack Overflow and in the official documentation that the caching connection factory class should not be used with the default JMS listener container factory.
https://stackoverflow.com/a/21989895/5871514
It gives the same error that I got before, though.
Update :
My goal is to reach 500 TPS, i.e. I should be able to consume that many messages per second. I tried this method and it seems I can reach 100-200, but not more than that. Plus, this issue is a blocker at high concurrency. If you have a better solution to achieve this, I am all ears.
Update :
I am using AmazonSQSClient.
Starvation on the Consumer
One possible optimization that JMS clients tend to implement is a message consumption buffer, or "prefetch". This buffer is sometimes tunable via the number of messages or via a buffer size in bytes.
The intention is to prevent the consumer from going to the server every single time it receives a message; instead, it pulls multiple messages in a batch.
In an environment where you have many "fast consumers" (which is the opinionated view these libraries may take), this prefetch is set to a somewhat high default in order to minimize these round trips.
However, in an environment with slow message consumers, this prefetch can be a problem. A slow consumer holds on to the messages it has prefetched, so faster consumers cannot pick them up. In a highly concurrent environment, this can cause starvation quickly.
That being the case, the SQSConnectionFactory has a property for this:
SQSConnectionFactory sqsConnectionFactory = new SQSConnectionFactory( new ProviderConfiguration(), amazonSQSclient);
sqsConnectionFactory.setNumberOfMessagesToPrefetch(0);
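To make the starvation effect concrete, here is a minimal sketch of a toy model. It is not how the SQS library schedules work; the class name PrefetchModel, the batch sizes and the per-message costs are all assumptions chosen for illustration. Each consumer claims a batch whenever it runs dry, and the model reports when the last message is finished.

```java
import java.util.Arrays;

/** Toy model of prefetching; an illustration only, not the SQS library's actual behaviour. */
final class PrefetchModel {

    /**
     * Returns the simulated time (in ms) at which the last message is finished.
     *
     * @param messages       total number of messages waiting in the queue
     * @param batchSize      messages a consumer claims each time it runs dry
     * @param costPerMessage processing time in ms per message, one entry per consumer
     */
    static long makespan(int messages, int batchSize, long[] costPerMessage) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        long[] freeAt = new long[costPerMessage.length]; // when each consumer is idle again
        int remaining = messages;
        while (remaining > 0) {
            // The consumer that becomes idle first claims the next batch.
            int next = 0;
            for (int i = 1; i < freeAt.length; i++) {
                if (freeAt[i] < freeAt[next]) {
                    next = i;
                }
            }
            int claimed = Math.min(batchSize, remaining);
            remaining -= claimed;
            // Claimed messages stay with this consumer, even if another one is idle.
            freeAt[next] += claimed * costPerMessage[next];
        }
        return Arrays.stream(freeAt).max().orElse(0L);
    }
}
```

With one fast consumer (10 ms per message) and one slow consumer (100 ms per message), makespan(100, 50, new long[] {10, 100}) comes out at 5000 ms, because the slow consumer sits on 50 messages while the fast one goes idle. Claiming one message at a time, makespan(100, 1, ...), finishes much sooner in this model.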
Starvation on the Producer (i.e. via JmsTemplate)
It's very common for these JMS implementations to expect to be interfaced to the broker via some intermediary. These intermediaries cache connections or use a pooling mechanism to reuse them. In the Java EE world, this is usually taken care of by a JCA adapter or another mechanism on the Java EE server.
Because of the way Spring JMS works, it expects an intermediary delegate for the ConnectionFactory to exist to do this caching/pooling. Otherwise, when Spring JMS wants to connect to the broker, it will attempt to open a new connection and session (!) every time you want to do something with the broker.
To solve this, Spring provides a few options. The simplest is the CachingConnectionFactory, which caches a single Connection and allows many Sessions to be opened on that Connection. A simple way to add this to your @Configuration above would be something like:
@Bean
public ConnectionFactory connectionFactory(AmazonSQSClient amazonSQSclient) {
    SQSConnectionFactory sqsConnectionFactory = new SQSConnectionFactory(new ProviderConfiguration(), amazonSQSclient);
    // Doing the following is key!
    CachingConnectionFactory cachingConnectionFactory = new CachingConnectionFactory();
    cachingConnectionFactory.setTargetConnectionFactory(sqsConnectionFactory);
    // Set the cachingConnectionFactory properties to your liking here...
    return cachingConnectionFactory;
}
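The effect of such a caching delegate can be illustrated without any JMS classes. The following is a rough sketch using only the JDK; CachingSupplier and CachingDemo are hypothetical names, and the counter merely stands in for opening a socket. It is not Spring's implementation, only the same idea in miniature.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical stand-in for a caching connection factory; not Spring's class. */
final class CachingSupplier<T> implements Supplier<T> {
    private final Supplier<T> target; // the expensive factory being wrapped
    private T cached;                 // the single shared instance, created lazily

    CachingSupplier(Supplier<T> target) {
        this.target = target;
    }

    @Override
    public synchronized T get() {
        if (cached == null) {
            cached = target.get(); // only the first caller pays for creation
        }
        return cached;
    }
}

final class CachingDemo {
    /** Asks the factory for a connection `calls` times and reports how many were actually opened. */
    static int connectionsOpened(boolean useCache, int calls) {
        AtomicInteger opened = new AtomicInteger();
        Supplier<Object> raw = () -> {
            opened.incrementAndGet(); // stands in for opening a new socket
            return new Object();
        };
        Supplier<Object> factory = useCache ? new CachingSupplier<>(raw) : raw;
        for (int i = 0; i < calls; i++) {
            factory.get();
        }
        return opened.get();
    }
}
```

Without the wrapper, every call opens a new "connection", which matches the socket exhaustion described in the question; with it, a thousand calls share one.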
If you want something more fancy as a JMS pooling solution (which will pool Connections and MessageProducers for you in addition to multiple Sessions), you can use the reasonably new PooledJMS project's JmsPoolConnectionFactory, or the like, from their library.

Play framework resource starvation after a few days

I am experiencing an issue in Play 2.5.8 (Java) where database-related service endpoints start timing out after a few days, even though the server CPU & memory usage seems fine. Endpoints that do not access the DB continue to work perfectly.
The application runs on a t2.medium EC2 instance with a t2.medium MySQL RDS, both in the same availability zone. Most HTTP calls do lookups/updates to the database with around 8-12 requests per second, and there are also ±800 WebSocket connections/actors with ±8 requests/second (90% of the WebSocket messages do not access the database). DB operations are mostly simple lookups & updates taking around 100ms.
When using only the default thread pool it took about 2 days to reach the deadlock, and after moving the database requests to a separate thread pool as per https://www.playframework.com/documentation/2.5.x/ThreadPools#highly-synchronous, it improved but only to about 4 days.
This is my current thread config in application.conf:
akka {
actor {
guardian-supervisor-strategy = "actors.RootSupervisionStrategy"
}
loggers = ["akka.event.Logging$DefaultLogger",
"akka.event.slf4j.Slf4jLogger"]
loglevel = WARNING
## This pool handles all HTTP & WebSocket requests
default-dispatcher {
executor = "thread-pool-executor"
throughput = 1
thread-pool-executor {
fixed-pool-size = 64
}
}
db-dispatcher {
type = Dispatcher
executor = "thread-pool-executor"
throughput = 1
thread-pool-executor {
fixed-pool-size = 210
}
}
}
Database configuration:
play.db.pool="default"
play.db.prototype.hikaricp.maximumPoolSize=200
db.default.driver=com.mysql.jdbc.Driver
I have played around with the number of connections in the DB pool and adjusted the sizes of the default and db-dispatcher pools, but it doesn't seem to make any difference. It feels like I'm missing something fundamental about Play's thread pools & configuration, as I don't think the load on the server should be an issue for Play to handle.
After more investigation I found that the issue is not related to thread pool configuration at all, but rather TCP connections that build up due to WS reconnections until the server (or Play framework) cannot accept any more connections. When this happens, only established TCP connections are serviced which mostly includes the established WebSocket connections.
I could not yet determine why the connections are not managed/closed properly.
My issue relates to this question:
Play 2.5 WebSocket Connection Build

Hazelcast 3.5 client fails to reconnect after all server nodes went down

I would like to ask for your help. I have a problem with client reconnection to a HZ cluster when the whole cluster goes down and then comes back up.
If the HZ cluster keeps at least one node alive, there is no problem at all. But when all nodes go down, the client never reconnects.
Here's my client initialization code:
public void init() {
List<String> members = (List<String>)Arrays.asList("node1,node2,node3".split(","));
ClientConfig clientConfig = new ClientConfig();
clientConfig.getGroupConfig().
setName( "dev" ).
setPassword( "dev" );
clientConfig.getNetworkConfig().
setConnectionAttemptLimit( 10 ).
setConnectionAttemptPeriod( 100 ).
setConnectionAttemptPeriod( 1000 ).
setRedoOperation(true).
setSmartRouting(true).
setAddresses(members);
hzClient = HazelcastClient.newHazelcastClient(clientConfig);
hzSessionMap = hzClient.getMap("mapsession");
}
public IMap getSessionMap() {
return hzSessionMap;
}
My client is created as a Spring bean, which offers IMap objects.
The exception thrown when reconnecting after the whole cluster is restarted is:
com.hazelcast.nio.serialization.HazelcastSerializationException:
com.hazelcast.core.HazelcastInstanceNotActiveException: Hazelcast instance is not active!
at com.hazelcast.nio.serialization.SerializationServiceImpl.handleException(SerializationServiceImpl.java:380) ~[hazelcast-3.5.jar:3.5]
at com.hazelcast.nio.serialization.SerializationServiceImpl.toData(SerializationServiceImpl.java:235) ~[hazelcast-3.5.jar:3.5]
at com.hazelcast.nio.serialization.SerializationServiceImpl.toData(SerializationServiceImpl.java:207) ~[hazelcast-3.5.jar:3.5]
at com.hazelcast.client.spi.ClientProxy.toData(ClientProxy.java:169) ~[hazelcast-client-3.5.jar:3.5]
at com.hazelcast.client.proxy.ClientMapProxy.put(ClientMapProxy.java:362) ~[hazelcast-client-3.5.jar:3.5]
...
Regards,
Josef

How to balance the load when building cache from App Engine?

I currently have the following situation, which has bothered me for a couple of months now.
The case
I have built a Java (FX) application which serves as a cash register for my shop. The application contains a lot of classes (such as Customer, Transaction etc.), which are shared with the server API. The server API is hosted on Google App Engine.
Because we also have an online shop, I have chosen to build the cache of the entire database on startup of the application. To do this I call the GET of my Data API for each class/table:
protected QueryBuilder performGet(HttpServletRequest req, HttpServletResponse res)
throws ServletException, IOException, ApiException, JSONException {
Connection conn = connectToCloudSQL();
log.info("Parameters: "+Functions.parameterMapToString(req.getParameterMap()));
String tableName = this.getTableName(req);
log.info("TableName: "+tableName);
GetQueryBuilder queryBuilder = DataManager.executeGet(conn, req.getParameterMap(), tableName, null);
//Get the correct method to create the objects
String camelTableName = Functions.snakeToCamelCase(tableName);
String parsedTableName = Character.toUpperCase(camelTableName.charAt(0)) + camelTableName.substring(1);
List<Object> objects = new ArrayList<>();
try {
log.info("Parsed Table Name: "+parsedTableName);
Method creationMethod = ObjectManager.class.getDeclaredMethod("create"+parsedTableName, ResultSet.class, boolean.class);
while (queryBuilder.getResultSet().next()) {
//Create new objects with the ObjectManager
objects.add(creationMethod.invoke(null, queryBuilder.getResultSet(), false));
}
log.info("List of objects created");
creationMethod = null;
}
catch (Exception e) {
camelTableName = null;
parsedTableName = null;
objects = null;
throw new ApiException(e, "Something went wrong while iterating through ResultSet.", ErrorStatus.NOT_VALID);
}
Functions.listOfObjectsToJson(objects, res.getOutputStream());
log.info("GET Request succeeded");
//Clean up objects
camelTableName = null;
parsedTableName = null;
objects = null;
closeConnection(conn);
return queryBuilder;
}
It simply gets every row from the requested table in my Cloud SQL database. Then it creates the objects, using the class that is shared with the client application. Lastly, it converts these objects to JSON using GSON. Some of my tables have 10.000+ rows, and then it takes approx. 5-10 sec to do this.
At the client, I convert this JSON back to a list of objects by using the same shared class. First I load the essential classes sequentially (because else the application won't start), and after that I load the rest of the classes in the background with separate threads.
The problem
Every time I load up the cache, there is a chance (1 in 4) that the server responds with a DeadlineExceededException on some of the bigger tables. I believe this has something to do with Google App Engine not being able to fire up a new instance in time, so that the computation time exceeds the limit.
I know it has something to do with loading the objects in background threads, because these all start at the same time. When I delay the start of these threads with 3 seconds, the error occurs a lot less, but is still present. Because the application loads 15 classes in the background, delaying them is not ideal because the application will only work partly until it is done. It is also not an option to load everything before starting, because this will take more than 2 minutes.
Does anyone know how to set up some load balancing on Google App Engine for this? I would like to solve this server side.
You clearly have an issue with warm-up requests and a query that takes quite long. You have the usual options:
Do some profiling and reduce the cost of your method invocations.
Use caching (memcache) to cache some of the results.
If those options don't work for you, you should parallelize your computations. One thing that comes to mind is that you could reliably reduce request times if you simply split your request into multiple parallel requests, like so:
Let's say your table contains 5k rows.
Then you create 50 requests, each handling 100 rows.
Aggregate the results on the server or client side and respond.
It'll be quite tough to do this on just the server side, but it should be possible if your now (much) smaller tasks return within a couple of seconds.
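The splitting and aggregating steps above can be sketched on the client side roughly as follows. This is only an illustration under assumptions: PagedLoader is a hypothetical helper, and the fetchPage callback stands in for an HTTP call to the Data API that would need to accept some form of offset and limit, which the original API is not known to support.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.BiFunction;

/** Hypothetical helper that loads a table in parallel pages and stitches them together. */
final class PagedLoader {

    /**
     * Loads totalRows rows in pages of pageSize, fetching pages in parallel
     * and concatenating them in page order.
     *
     * @param parallelism maximum number of pages fetched at the same time
     * @param fetchPage   assumed callback: (offset, limit) -> rows of that page
     */
    static <T> List<T> loadAll(int totalRows, int pageSize, int parallelism,
                               BiFunction<Integer, Integer, List<T>> fetchPage) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<CompletableFuture<List<T>>> pages = new ArrayList<>();
            for (int offset = 0; offset < totalRows; offset += pageSize) {
                final int from = offset;
                final int limit = Math.min(pageSize, totalRows - offset);
                pages.add(CompletableFuture.supplyAsync(() -> fetchPage.apply(from, limit), pool));
            }
            List<T> all = new ArrayList<>(totalRows);
            for (CompletableFuture<List<T>> page : pages) {
                all.addAll(page.join()); // joining in submission order keeps the rows ordered
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }
}
```

For the 5k-row table from the example, loadAll(5000, 100, 8, fetchPage) would issue 50 small requests, at most 8 at a time. Capping the parallelism is a design choice here: it avoids hitting the server with all requests at once, which is the situation the question describes.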
Alternatively you could return a job id at once and make the client poll for the result in a couple of seconds. This would however require a small change on the client side. It's the better option though imho, especially if you want to use a task queue for creating your response.
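The job id approach could look roughly like the sketch below. JobRegistry is a hypothetical in-memory class used only to show the submit-then-poll flow; on App Engine the job state would more likely live in a task queue and a datastore, which this sketch does not model.

```java
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical in-memory job registry illustrating "return a job id, then poll". */
final class JobRegistry<T> {
    private final Map<String, CompletableFuture<T>> jobs = new ConcurrentHashMap<>();

    /** Starts the work in the background and immediately returns a job id. */
    String submit(Supplier<T> work) {
        String jobId = UUID.randomUUID().toString();
        jobs.put(jobId, CompletableFuture.supplyAsync(work));
        return jobId;
    }

    /**
     * Returns the result if the job has finished, or empty if the client should poll again.
     * The work is assumed to return a non-null value.
     */
    Optional<T> poll(String jobId) {
        CompletableFuture<T> job = jobs.get(jobId);
        if (job == null) {
            throw new IllegalArgumentException("Unknown job id: " + jobId);
        }
        return job.isDone() ? Optional.of(job.join()) : Optional.empty();
    }
}
```

The first request only calls submit and returns the id, so it completes well within any request deadline; the client then calls poll every few seconds until the result is present.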
