Problem creating an EMS application that supports failover/fault tolerance

I am starting to study how to implement an application that supports failover/fault tolerance on top of JMS, more precisely TIBCO EMS.
I configured two EMS servers, both with fault tolerance enabled.
For the EMS server running on server1 I have
in tibemsd.conf
ft_active = tcp://server2:7232
in factories.conf
[GenericConnectionFactory]
type = generic
url = tcp://server1:7232
[FTTopicConnectionFactory]
type = topic
url = tcp://server1:7232,tcp://server2:7232
[FTQueueConnectionFactory]
type = queue
url = tcp://server1:7232,tcp://server2:7232
And for the EMS server running on server2 I have
in tibemsd.conf
ft_active = tcp://server1:7232
in factories.conf
[GenericConnectionFactory]
type = generic
url = tcp://server2:7232
[FTTopicConnectionFactory]
type = topic
url = tcp://server2:7232,tcp://server1:7232
[FTQueueConnectionFactory]
type = queue
url = tcp://server2:7232,tcp://server1:7232
I am not a TIBCO EMS expert, but my configuration seems to be good. When I start EMS on server1 I get:
$ tibemsd -config tibemsd.conf
...
2022-07-20 23:04:58.566 Server is active.
2022-07-20 23:05:18.563 Standby server 'SERVERNAME@server1' has connected.
then if I start EMS on server2, I get
$ tibemsd -config tibemsd.conf
...
2022-07-20 23:05:18.564 Accepting connections on tcp://server2:7232.
2022-07-20 23:05:18.564 Server is in standby state for 'tcp://server1:7232'
Moreover, if I kill active EMS on server1, I immediately get the following message on server2:
2022-07-20 23:21:52.891 Connection to active server 'tcp://server1:7232' has been lost.
2022-07-20 23:21:52.891 Server activating on failure of 'tcp://server1:7232'.
...
2022-07-20 23:21:52.924 Server is now active.
Up to here, everything looks OK: the active/standby EMS servers seem to be correctly configured.
Things get more complicated when I write a piece of code that is supposed to connect to these EMS servers and periodically publish messages. Let's try with the following code sample:
@Test
public void testEmsFailover() throws JMSException, InterruptedException {
    int NB = 1000;
    TibjmsConnectionFactory factory = new TibjmsConnectionFactory();
    factory.setServerUrl("tcp://server1:7232,tcp://server2:7232");
    Connection connection = factory.createConnection();
    Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
    connection.start();
    for (int i = 0; i < NB; i++) {
        LOG.info("sending message");
        Queue queue = session.createQueue(QUEUE__CLIENT_TO_FRONTDOOR__CONNECTION_REQUEST);
        MessageProducer producer = session.createProducer(queue);
        MapMessage mapMessage = session.createMapMessage();
        mapMessage.setStringProperty(PROPERTY__CLIENT_KIND, USER.toString());
        mapMessage.setStringProperty(PROPERTY__CLIENT_NAME, "name");
        producer.send(mapMessage);
        LOG.info("done!");
        Thread.sleep(1000);
    }
}
If I run this code while both active and standby servers are up, everything looks good
23:26:32.431 [main] INFO JmsEndpointTest - sending message
23:26:32.458 [main] INFO JmsEndpointTest - done!
23:26:33.458 [main] INFO JmsEndpointTest - sending message
23:26:33.482 [main] INFO JmsEndpointTest - done!
Now, if I kill the active EMS server, I would expect that the standby server would immediately become the active one, and that my code would continue to publish as if nothing had happened.
However, in my code I get the following error:
javax.jms.JMSException: Connection is closed
at com.tibco.tibjms.TibjmsxLink.sendRequest(TibjmsxLink.java:307)
at com.tibco.tibjms.TibjmsxLink.sendRequestMsg(TibjmsxLink.java:261)
at com.tibco.tibjms.TibjmsxSessionImp._createProducer(TibjmsxSessionImp.java:1004)
at com.tibco.tibjms.TibjmsxSessionImp.createProducer(TibjmsxSessionImp.java:4854)
at JmsEndpointTest.testEmsFailover(JmsEndpointTest.java:103)
...
and in the logs of the server (the previous standby server supposed to be now the active one) I get
2022-07-20 23:32:44.447 [anonymous@cersei]: connect failed: server not in active state
2022-07-20 23:33:02.969 Connection to active server 'tcp://server2:7232' has been lost.
2022-07-20 23:33:02.969 Server activating on failure of 'tcp://server2:7232'.
2022-07-20 23:33:02.969 Server rereading configuration.
2022-07-20 23:33:02.971 Recovering state, please wait.
2022-07-20 23:33:02.980 Recovered 46 messages.
2022-07-20 23:33:02.980 Server is now active.
2022-07-20 23:33:03.545 [anonymous@cersei]: reconnect failed: connection unknown for id=8
2022-07-20 23:33:04.187 [anonymous@cersei]: reconnect failed: connection unknown for id=8
2022-07-20 23:33:04.855 [anonymous@cersei]: reconnect failed: connection unknown for id=8
2022-07-20 23:33:05.531 [anonymous@cersei]: reconnect failed: connection unknown for id=8
I would appreciate any help to enhance my code
Thank you

I think I found the origin of my problem.
According to the page Tibco-Ems Failover Issue, the error message
reconnect failed: connection unknown for id=8
means: "the store (ems db) wasn't shared between the active and the standby node, so when the active ems failed, the new active ems wasn't able to recover connections and messages."
I realized that configuring a shared store is painful. To avoid it, I configured two tibemsd instances on the same host, following the page Step By Step How to Setup TIBCO EMS In Fault Tolerant Mode:
two tibemsd.conf configuration files
configure a different listen port in each file
configure ft_active with the URL of the other server
configure factories.conf
By doing so, I can rerun my test and it works as expected.

Related

Customize the automatic reconnection settings to IBM MQ

I have written code to connect to IBM MQ, and I am using ConnectionNameList, which automatically reconnects to IBM MQ.
I want to customize this implicit reconnection. I have read many articles on the internet, but I am not able to figure it out.
This is my Queue Manager Config:
@Configuration
public class QM1Config {
    public String queueManager;
    public String queue;
    public String channel;
    public String connName;
    public String user;
    public String password;
    private static final int RECONNECT_TIMEOUT = 10;

    @Autowired
    MQService config;

    @Bean
    public MQConnectionFactory mqQueueConnectionFactory() {
        this.channel = config.getHosts().get(0).getChannel();
        this.user = config.getHosts().get(0).getUser();
        this.password = config.getHosts().get(0).getPassword();
        this.queueManager = config.getHosts().get(0).getQueueManager();
        this.queue = config.getHosts().get(0).getQueue();
        this.connName = config.getHosts().get(0).getConnName();
        System.out.println(channel + " " + connName + " " + queueManager + " " + user);
        MQConnectionFactory mqQueueConnectionFactory = new MQConnectionFactory();
        try {
            mqQueueConnectionFactory.setTransportType(WMQConstants.WMQ_CM_CLIENT);
            mqQueueConnectionFactory.setBooleanProperty(WMQConstants.USER_AUTHENTICATION_MQCSP, false);
            mqQueueConnectionFactory.setCCSID(1208);
            mqQueueConnectionFactory.setChannel(channel);
            mqQueueConnectionFactory.setStringProperty(WMQConstants.USERID, user);
            mqQueueConnectionFactory.setStringProperty(WMQConstants.PASSWORD, password);
            mqQueueConnectionFactory.setQueueManager(queueManager);
            mqQueueConnectionFactory.setConnectionNameList(connName);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return mqQueueConnectionFactory;
    }

    @Bean
    public JmsListenerContainerFactory<?> qm1JmsListenerContainerFactory(
            @Qualifier("mqQueueConnectionFactory") MQConnectionFactory mqQueueConnectionFactory,
            DefaultJmsListenerContainerFactoryConfigurer configurer) throws InterruptedException {
        DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
        this.queue = config.getHosts().get(0).getQueue();
        configurer.configure(factory, mqQueueConnectionFactory);
        return factory;
    }

    @Bean("jmsTemplate1")
    public JmsTemplate jmsTemplate(
            @Qualifier("mqQueueConnectionFactory") MQConnectionFactory mqQueueConnectionFactory) {
        JmsTemplate jmsTemplate1 = new JmsTemplate(mqQueueConnectionFactory);
        return jmsTemplate1;
    }
}
When I stop the queue manager, I get the following exceptions every 5 seconds:
2022-04-24 01:17:43.194 WARN 6644 --- [enerContainer-1] o.s.j.l.DefaultMessageListenerContainer : Setup of JMS message listener invoker failed for destination 'Q5' - trying to recover. Cause: JMSWMQ2002: Failed to get a message from destination 'Q5'.; nested exception is com.ibm.mq.MQException: JMSCMQ0001: IBM MQ call failed with compcode '2' ('MQCC_FAILED') reason '2009' ('MQRC_CONNECTION_BROKEN').
2022-04-24 01:17:43.232 ERROR 6644 --- [enerContainer-1] o.s.j.l.DefaultMessageListenerContainer : Could not refresh JMS Connection for destination 'Q5' - retrying using FixedBackOff{interval=5000, currentAttempts=0, maxAttempts=unlimited}. Cause: JMSWMQ0018: Failed to connect to queue manager 'QM5' with connection mode 'Client' and host name 'Client'.; nested exception is com.ibm.mq.MQException: JMSCMQ0001: IBM MQ call failed with compcode '2' ('MQCC_FAILED') reason '2059' ('MQRC_Q_MGR_NOT_AVAILABLE').
2022-04-24 01:17:48.243 ERROR 6644 --- [enerContainer-1] o.s.j.l.DefaultMessageListenerContainer : Could not refresh JMS Connection for destination 'Q5' - retrying using FixedBackOff{interval=5000, currentAttempts=1, maxAttempts=unlimited}. Cause: JMSWMQ0018: Failed to connect to queue manager 'QM5' with connection mode 'Client' and host name 'Client'.; nested exception is com.ibm.mq.MQException: JMSCMQ0001: IBM MQ call failed with compcode '2' ('MQCC_FAILED') reason '2538' ('MQRC_HOST_NOT_AVAILABLE').
2022-04-24 01:17:53.245 ERROR 6644 --- [enerContainer-1] o.s.j.l.DefaultMessageListenerContainer : Could not refresh JMS Connection for destination 'Q5' - retrying using FixedBackOff{interval=5000, currentAttempts=2, maxAttempts=unlimited}. Cause: JMSWMQ0018: Failed to connect to queue manager 'QM5' with connection mode 'Client' and host name 'Client'.; nested exception is com.ibm.mq.MQException: JMSCMQ0001: IBM MQ call failed with compcode '2' ('MQCC_FAILED') reason '2538' ('MQRC_HOST_NOT_AVAILABLE').
2022-04-24 01:17:58.250 ERROR 6644 --- [enerContainer-1] o.s.j.l.DefaultMessageListenerContainer : Could not refresh JMS Connection for destination 'Q5' - retrying using FixedBackOff{interval=5000, currentAttempts=3, maxAttempts=unlimited}. Cause: JMSWMQ0018: Failed to connect to queue manager 'QM5' with connection mode 'Client' and host name 'Client'.; nested exception is com.ibm.mq.MQException: JMSCMQ0001: IBM MQ call failed with compcode '2' ('MQCC_FAILED') reason '2538' ('MQRC_HOST_NOT_AVAILABLE').
So I want the first 3 reconnection attempts to be logged as warnings instead of errors (unlike the logs above), and from the 4th attempt onwards I want an error message. I also want a reconnection attempt every 10 to 15 seconds.
How do I configure these reconnection settings?
Any help would be greatly appreciated! Thanks!
EDIT: I have added an exception listener as follows:
public class MQExceptionListener implements ExceptionListener {
    private static final Logger LOGGER = LoggerFactory.getLogger(MQExceptionListener.class);
    int count = -1;

    @Override
    public void onException(JMSException ex) {
        if (count > 2) {
            System.out.println("COUNT - " + count);
            count++;
            LOGGER.error("***********************************************");
            LOGGER.error(ex.toString() + " THIS IS EX TO STRING");
            if (ex.getLinkedException() != null) {
                LOGGER.error(ex.getLinkedException().toString() + " THIS IS getLinkedException TO STRING");
            }
            LOGGER.error("================================================");
        } else {
            System.out.println("COUNT - " + count);
            count++;
            LOGGER.warn("***********************************************");
            LOGGER.warn(ex.toString() + " THIS IS EX TO STRING");
            if (ex.getLinkedException() != null) {
                LOGGER.warn(ex.getLinkedException().toString() + " THIS IS getLinkedException TO STRING");
            }
            LOGGER.warn("================================================");
        }
    }
}
Now my logs are as follows:
COUNT - 1
2022-04-24 14:41:04.905 WARN 9268 --- [enerContainer-1] com.mq.sslMQ.MQExceptionListener : ***********************************************
2022-04-24 14:41:04.905 WARN 9268 --- [enerContainer-1] com.mq.sslMQ.MQExceptionListener : com.ibm.msg.client.jms.DetailedIllegalStateException: JMSWMQ0018: Failed to connect to queue manager 'QM5' with connection mode 'Client' and host name 'Client'.
Check the queue manager is started and if running in client mode, check there is a listener running. Please see the linked exception for more information. THIS IS EX TO STRING
2022-04-24 14:41:04.905 WARN 9268 --- [enerContainer-1] com.mq.sslMQ.MQExceptionListener : com.ibm.mq.MQException: JMSCMQ0001: IBM MQ call failed with compcode '2' ('MQCC_FAILED') reason '2538' ('MQRC_HOST_NOT_AVAILABLE'). THIS IS getLinkedException TO STRING
2022-04-24 14:41:04.905 WARN 9268 --- [enerContainer-1] com.mq.sslMQ.MQExceptionListener : ================================================
2022-04-24 14:41:04.905 ERROR 9268 --- [enerContainer-1] o.s.j.l.DefaultMessageListenerContainer : Could not refresh JMS Connection for destination 'Q5' - retrying using FixedBackOff{interval=5000, currentAttempts=1, maxAttempts=unlimited}. Cause: JMSWMQ0018: Failed to connect to queue manager 'QM5' with connection mode 'Client' and host name 'Client'.; nested exception is com.ibm.mq.MQException: JMSCMQ0001: IBM MQ call failed with compcode '2' ('MQCC_FAILED') reason '2538' ('MQRC_HOST_NOT_AVAILABLE').
I don't want the DefaultMessageListenerContainer log to be printed to the console. How do we achieve that?

The IBM Docs say:
By default, the reconnection attempts happen at the following intervals:
The first attempt is made after an initial delay of 1 second, plus a random element up to 250 milliseconds.
The second attempt is made 2 seconds, plus a random interval of up to 500 milliseconds, after the first attempt fails.
The third attempt is made 4 seconds, plus a random interval of up to 1 second, after the second attempt fails.
The fourth attempt is made 8 seconds, plus a random interval of up to 2 seconds, after the third attempt fails.
The fifth attempt is made 16 seconds, plus a random interval of up to 4 seconds, after the fourth attempt fails.
The sixth attempt, and all subsequent attempts are made 25 seconds, plus a random interval of up to 6 seconds and 250 milliseconds after the previous attempt fails.
The reconnection attempts are delayed by intervals that are partly fixed and partly random. This is to prevent all of the IBM MQ classes for JMS applications that were connected to a queue manager that is no longer available from reconnecting simultaneously.
If you need to increase the default values, to more accurately reflect the amount of time that is required for a queue manager to recover, or a standby queue manager to become active, modify the ReconDelay attribute in the Channel stanza of the client configuration file.
You can read more about this attribute here.
Sounds like you need to put the following into your mqclient.ini file.
CHANNELS:
ReconDelay=(10000,5000)
That is requesting a delay of 10 seconds plus a random element up to 5 seconds, which is my interpretation of your request for 10/15 seconds. You haven't asked for any of the reconnection attempts to be different in timing, although you can do this if you need to.
Note, it is not possible to change the WARN/ERROR status of the messages.
Remember that you can always turn off automatic reconnect and implement whatever you need yourself by catching the connection failures in your application. Automatic reconnect was designed for applications that were unable (or unwilling) to catch the connection failures.
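To see what these settings add up to, here is a minimal sketch that tallies the earliest and latest start time of each reconnection attempt from a list of (delay, random) pairs. The class and method names are made up for illustration. The sketch assumes the last pair repeats for later attempts (as the quoted default schedule does from the sixth attempt onwards), and it ignores the time each failed attempt itself takes.

```java
public class ReconnectSchedule {

    /** Default schedule quoted above: {fixedDelayMillis, maxRandomMillis} per attempt. */
    public static final long[][] DEFAULT_DELAYS = {
        {1000, 250}, {2000, 500}, {4000, 1000}, {8000, 2000}, {16000, 4000}, {25000, 6250}
    };

    /**
     * Returns {earliestMillis, latestMillis} after the failure at which the given
     * attempt (1-based) starts. Assumption: the last pair repeats for later attempts.
     */
    public static long[] windowForAttempt(long[][] delays, int attempt) {
        long earliest = 0;
        long latest = 0;
        for (int i = 0; i < attempt; i++) {
            long[] pair = delays[Math.min(i, delays.length - 1)];
            earliest += pair[0];          // fixed part only
            latest += pair[0] + pair[1];  // fixed part plus the full random part
        }
        return new long[] {earliest, latest};
    }

    public static void main(String[] args) {
        long[][] suggested = {{10000, 5000}}; // ReconDelay=(10000,5000)
        for (int attempt = 1; attempt <= 6; attempt++) {
            long[] d = windowForAttempt(DEFAULT_DELAYS, attempt);
            long[] s = windowForAttempt(suggested, attempt);
            System.out.println("attempt " + attempt
                    + ": default " + d[0] + ".." + d[1] + " ms"
                    + ", suggested " + s[0] + ".." + s[1] + " ms");
        }
    }
}
```

Under these assumptions, the sixth attempt of the default schedule starts between 56 and 70 seconds after the failure, while with ReconDelay=(10000,5000) the third attempt starts between 30 and 45 seconds after it.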

I would add that the reconnect attempts at 5-second intervals come from the DefaultMessageListenerContainer trying to reconnect. Its default recovery interval is 5 seconds (DEFAULT_RECOVERY_INTERVAL), so I don't think this involves the MQ reconnect mechanism.
With the exception handler listed above in place, you could programmatically call setRecoveryInterval() on the DefaultMessageListenerContainer, or use setBackOff() to control the backoff interval.
As to disabling the logging, setting the log level for the DefaultMessageListenerContainer to FATAL should do it.
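For the WARN-then-ERROR part of the question, one option is to keep the counting in a small helper and call it from your own listener or reconnect loop. Below is a minimal sketch of that idea. EscalatingFailureLogger is a hypothetical class, and it uses java.util.logging (WARNING/SEVERE) instead of the SLF4J logger from the question only to stay self-contained. It does not change what DefaultMessageListenerContainer itself logs.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper: logs the first few consecutive failures at WARNING, later ones at SEVERE. */
public class EscalatingFailureLogger {
    private static final Logger LOGGER =
            Logger.getLogger(EscalatingFailureLogger.class.getName());

    private final int warnAttempts;
    private final AtomicInteger consecutiveFailures = new AtomicInteger();

    public EscalatingFailureLogger(int warnAttempts) {
        this.warnAttempts = warnAttempts;
    }

    /** Records one failure, logs it, and returns the level that was used. */
    public Level recordFailure(Exception ex) {
        int n = consecutiveFailures.incrementAndGet();
        Level level = (n <= warnAttempts) ? Level.WARNING : Level.SEVERE;
        LOGGER.log(level, "Connection failure #" + n + ": " + ex.getMessage());
        return level;
    }

    /** Call once the connection is healthy again so the next outage starts at WARNING. */
    public void reset() {
        consecutiveFailures.set(0);
    }

    public static void main(String[] args) {
        EscalatingFailureLogger failureLogger = new EscalatingFailureLogger(3);
        for (int i = 0; i < 5; i++) {
            failureLogger.recordFailure(new IllegalStateException("queue manager not available"));
        }
    }
}
```

Counting from 1 and comparing against the threshold in a single place avoids the off-by-one that is easy to introduce when the counter starts at -1 and is incremented in both branches.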

java.net.ConnectException: Connection refused timeout

I saw a lot of "java.net.ConnectException: Connection refused" questions, but none about a timeout for this error. My problem is that I have to connect to a server that, in some cases, is blocked (another piece of software is connected to the same port). So I'm looping with a maximum number of retries to try to connect.
My current code (it depends on a lot of configuration in my software, but it works fine):
public TCPConnector(TCPDefinition tcpDefinition) throws IAException {
    ivTcpDefinition = tcpDefinition;
    // Initialize the socket
    boolean retry = false;
    int counter = 1;
    do {
        try {
            ivSocket = new Socket();
            ivSocket.connect(new InetSocketAddress(tcpDefinition.getHostname(), tcpDefinition.getPort()), tcpDefinition.getConnectTimeOut());
            ivSocket.setSoTimeout(tcpDefinition.getAckTimeOut());
            retry = false;
        }
        catch (UnknownHostException uhe) {
            throw new IAException(null, new StringBuffer("Can't find host: ").append(tcpDefinition.getHostname()).toString(), uhe);
        }
        catch (SocketException see) {
            StringBuilder sb = new StringBuilder("Connection refused to host ").append(tcpDefinition.getHostname()).
                append(" port ").append(tcpDefinition.getPort()).append(". Connection Attempt Nr. ").append(counter);
            logger.error(sb.toString(), see);
            retry = true;
            if (counter++ > tcpDefinition.getConnectRetries())
                throw new IAException(null, sb.toString(), see);
            else
                logger.error("will retry to connect");
        }
        catch (IOException ioe) {
            StringBuilder sb = new StringBuilder("I/O error while connecting to host ").append(tcpDefinition.getHostname()).
                append(" port ").append(tcpDefinition.getPort()).append(". Connection Attempt Nr. ").append(counter);
            logger.error(sb.toString(), ioe);
            retry = true;
            if (counter++ > tcpDefinition.getConnectRetries())
                throw new IAException(null, sb.toString(), ioe);
            else
                logger.error("will retry to connect");
        }
    }
    while (retry);
}
Well, the problem is this:
On Windows, the SocketException is thrown every second instead of the IOException, even though I have configured a timeout of 5000 msec for ivSocket.connect.
On Linux, it is thrown every millisecond!
Windows:
2019-12-05 12:40:47,609 ERROR DefaultQuartzScheduler_Worker-1 TCPConnector - Connection refused to host localhost port 13002. Connection Attempt Nr. 1
java.net.ConnectException: Connection refused: connect
at java.net.PlainSocketImpl.socketConnect(Native Method)
2019-12-05 12:40:48,703 ERROR DefaultQuartzScheduler_Worker-1 TCPConnector - Connection refused to host localhost port 13002. Connection Attempt Nr. 2
java.net.ConnectException: Connection refused: connect
Linux:
2019-12-05 12:45:47,609 ERROR DefaultQuartzScheduler_Worker-1 TCPConnector - Connection refused to host localhost port 13002. Connection Attempt Nr. 1
java.net.ConnectException: Connection refused: connect
at java.net.PlainSocketImpl.socketConnect(Native Method)
2019-12-05 12:45:47,610 ERROR DefaultQuartzScheduler_Worker-1 TCPConnector - Connection refused to host localhost port 13002. Connection Attempt Nr. 2
java.net.ConnectException: Connection refused: connect
Why is the timeout not applied? Well, that is not exactly right: if I configure a timeout of less than 1 second on Windows, then the timeout does fire. With 500 msec:
2019-12-05 11:47:07,375 ERROR DefaultQuartzScheduler_Worker-1 TCPConnector - I/O error while connecting to host localhost port 13002. Connection Attempt Nr. 1
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
2019-12-05 11:47:07,875 ERROR DefaultQuartzScheduler_Worker-1 TCPConnector - I/O error while connecting to host localhost port 13002. Connection Attempt Nr. 2
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
Is it possible to configure a "connection refused" timeout?

There is no such thing as a "connection refused timeout".
"Connection refused" happens when the server sees the connection request, but there is no service listening for connections on the IP + port that the request is directed to. The server then "refuses" the connection. This typically happens instantly, so no timeout is triggered.
"Connection timed out" happens (typically) when something stops the connection request from reaching the server (see footnotes 1 and 2). So the client side will wait for the response from the server, and then resend / wait a few times. And eventually the time allotted for establishing a connection will expire ... and the connection times out.
As you can see these are different scenarios. And they are reported back to the Java client-side differently.
So the reason you are not getting timeouts is that the "connection refused" responses are coming back quick enough that your configured timeout is not exceeded.
That might also explain why setting the connect timeout small might have changed the behavior. There may also be issues with the granularity of the timeout that the OS allows Java to set.
To investigate this further, I think we would need a minimal reproducible example. For example, we need to see how you have implemented the code that manages the server-socket and accepts connections on the server side.
1 - The blockage could be on the server's reply packets.
2 - There are various possible causes for this kind of thing. The most likely are a firewall blocking traffic somewhere, a network routing problem, or using a private IP address on the wrong network.
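Since a refusal comes back almost immediately, the connect timeout cannot pace the retries; the pause has to come from the retry loop itself. A minimal sketch of that, assuming a fixed delay between attempts is acceptable (PacedConnect, connectWithRetry and the 200 ms delay are illustrative, not taken from the question's code):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class PacedConnect {

    /**
     * Tries to connect up to maxAttempts times, sleeping retryDelayMillis after
     * each failed attempt except the last. Returns the connected socket, or
     * null if every attempt failed.
     */
    public static Socket connectWithRetry(String host, int port, int connectTimeoutMillis,
                                          int maxAttempts, long retryDelayMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Socket socket = new Socket();
            try {
                socket.connect(new InetSocketAddress(host, port), connectTimeoutMillis);
                return socket;
            } catch (IOException e) {
                // "Connection refused" and "connect timed out" both end up here.
                try {
                    socket.close();
                } catch (IOException ignored) {
                    // nothing useful to do if closing a failed socket fails
                }
                System.out.println("attempt " + attempt + " failed: " + e);
                if (attempt < maxAttempts) {
                    Thread.sleep(retryDelayMillis); // explicit pacing between attempts
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Find a port that is very likely closed: bind a free one, then release it.
        int port;
        try (ServerSocket probe = new ServerSocket(0)) {
            port = probe.getLocalPort();
        }
        long start = System.nanoTime();
        Socket socket = connectWithRetry("127.0.0.1", port, 5000, 3, 200);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("connected=" + (socket != null) + " elapsedMillis=" + elapsedMillis);
        if (socket != null) {
            socket.close();
        }
    }
}
```

With this shape the 5000 ms connect timeout still bounds the "no answer at all" case, while the sleep bounds how fast refused attempts can repeat.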

Trouble with Glassfish Server and ActiveMQ: peer did not send his wire format

I'm getting this error while trying to set up a JMSPublisher and JMSSubscriber
jndi.properties
java.naming.factory.initial = org.apache.activemq.jndi.ActiveMQInitialContextFactory
java.naming.provider.url = tcp://localhost:4848?wireFormat.maxInactivityDurationInitalDelay=30000
topic.topic/flightStatus = flightStatus
Glassfish server is running on: http://localhost:4848
Publisher:
JmsPublisher publisher = new JmsPublisher("ConnectionFactory", "topic/flightStatus");
...
public JmsPublisher(String factoryName, String topicName) throws JMSException, NamingException {
    Context jndiContext = new InitialContext();
    TopicConnectionFactory factory = (TopicConnectionFactory) jndiContext.lookup(factoryName);
    Topic topic = (Topic) jndiContext.lookup(topicName);
    this.connect = factory.createTopicConnection();
    this.session = connect.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
    this.publisher = session.createPublisher(topic);
}
Exception:
Exception in thread "main" javax.jms.JMSException: Wire format negotiation timeout: peer did not send his wire format.
at org.apache.activemq.util.JMSExceptionSupport.create(JMSExceptionSupport.java:62)
at org.apache.activemq.ActiveMQConnection.syncSendPacket(ActiveMQConnection.java:1395)
at org.apache.activemq.ActiveMQConnection.ensureConnectionInfoSent(ActiveMQConnection.java:1481)
at org.apache.activemq.ActiveMQConnection.createSession(ActiveMQConnection.java:323)
at org.apache.activemq.ActiveMQConnection.createTopicSession(ActiveMQConnection.java:1112)
at com.mycompany.testejms.JmsPublisher.<init>(JmsPublisher.java:34)
at com.mycompany.testejms.JmsPublisher.main(JmsPublisher.java:51)
Caused by: java.io.IOException: Wire format negotiation timeout: peer did not send his wire format.
at org.apache.activemq.transport.WireFormatNegotiator.oneway(WireFormatNegotiator.java:98)
at org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:68)
at org.apache.activemq.transport.ResponseCorrelator.asyncRequest(ResponseCorrelator.java:81)
at org.apache.activemq.transport.ResponseCorrelator.request(ResponseCorrelator.java:86)
at org.apache.activemq.ActiveMQConnection.syncSendPacket(ActiveMQConnection.java:1366)
... 5 more

The error indicates that the ActiveMQ client is not actually communicating with an ActiveMQ broker. Glassfish may be listening on http://localhost:4848, but apparently that's not where the ActiveMQ broker is listening for connections. From what I understand, port 4848 is where the Glassfish web admin console listens for connections. Note the http in the URL you provided. By default, ActiveMQ listens on port 61616.
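As a sketch of the corrected client setup, the JNDI environment from jndi.properties can also be built in code with the provider URL pointing at the broker port. This assumes the broker runs on localhost with the default port 61616 mentioned above. BrokerUrlCheck and its guard against port 4848 are illustrative only, and the sketch stops short of creating the InitialContext, which would need the ActiveMQ client on the classpath.

```java
import java.net.URI;
import java.util.Hashtable;
import javax.naming.Context;

public class BrokerUrlCheck {
    public static final int GLASSFISH_ADMIN_PORT = 4848;   // admin console port from the question
    public static final int ACTIVEMQ_DEFAULT_PORT = 61616; // default broker port named in the answer

    /** Builds the JNDI environment, rejecting a URL that points at the admin console port. */
    public static Hashtable<String, String> jndiEnvironment(String brokerUrl) {
        int port = URI.create(brokerUrl).getPort();
        if (port == GLASSFISH_ADMIN_PORT) {
            throw new IllegalArgumentException(
                    "Port " + port + " is the Glassfish admin console, not the ActiveMQ broker");
        }
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY,
                "org.apache.activemq.jndi.ActiveMQInitialContextFactory");
        env.put(Context.PROVIDER_URL, brokerUrl);
        env.put("topic.topic/flightStatus", "flightStatus"); // same topic mapping as jndi.properties
        return env;
    }

    public static void main(String[] args) {
        // Assumption: the broker listens on its default port on this machine.
        Hashtable<String, String> env = jndiEnvironment("tcp://localhost:" + ACTIVEMQ_DEFAULT_PORT);
        System.out.println(env.get(Context.PROVIDER_URL));
    }
}
```

The resulting table could be passed to new InitialContext(env) in place of the properties file once the broker address is confirmed.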

Tibjms javax.jms.JMSException: Connection unknown by server

I am using the Tibjms jar for JMS connections, and it works fine in the normal case, but I have a problem when the connection to the JMS provider is lost and then comes back. To reproduce the issue, I performed the following steps:
Connect to the intranet and start the server. Works fine.
Disconnect from the intranet. It starts trying to reconnect to the server. Fine.
Connect to the intranet again. It throws an unknown exception and never connects again. Problem.
So, my problem is "javax.jms.JMSException: Connection unknown by server", which does not tell me much. It appears at the end of the logs.
You can see it in the following logs:
2017-10-13 15:40:52,333 [ http-nio-8080-exec-2] INFO org.springframework.web.servlet.DispatcherServlet - FrameworkServlet 'dispatcherServlet': initialization completed in 37 ms
2017-10-13 15:41:29,293 [k Reader (Server-3285015)] ERROR com.example.jms.PaxJmsClient - Exception received from jms
javax.jms.JMSException: Disconnected from ssl://10.10.10.10:5071, will attempt to reconnect
at com.tibco.tibjms.TibjmsConnection._invokeOnExceptionCallback(TibjmsConnection.java:2132)
at com.tibco.tibjms.TibjmsConnection._reconnect(TibjmsConnection.java:1912)
at com.tibco.tibjms.TibjmsConnection$ServerLinkEventHandler.onEventReconnect(TibjmsConnection.java:387)
at com.tibco.tibjms.TibjmsxLinkTcp._doReconnect(TibjmsxLinkTcp.java:598)
at com.tibco.tibjms.TibjmsxLinkTcp$LinkReader.work(TibjmsxLinkTcp.java:317)
at com.tibco.tibjms.TibjmsxLinkTcp$LinkReader.run(TibjmsxLinkTcp.java:259)
2017-10-13 15:42:29,334 [k Reader (Server-3285015)] ERROR com.example.jms.PaxJmsClient - Exception received from jms
javax.jms.JMSException: Reconnecting to ssl://11.11.11.11:5071, attempt 1 out of 100
at com.tibco.tibjms.TibjmsConnection._invokeOnExceptionCallback(TibjmsConnection.java:2132)
at com.tibco.tibjms.TibjmsConnection._reconnect(TibjmsConnection.java:1975)
at com.tibco.tibjms.TibjmsConnection$ServerLinkEventHandler.onEventReconnect(TibjmsConnection.java:387)
at com.tibco.tibjms.TibjmsxLinkTcp._doReconnect(TibjmsxLinkTcp.java:598)
at com.tibco.tibjms.TibjmsxLinkTcp$LinkReader.work(TibjmsxLinkTcp.java:317)
at com.tibco.tibjms.TibjmsxLinkTcp$LinkReader.run(TibjmsxLinkTcp.java:259)
2017-10-13 15:42:32,335 [k Reader (Server-3285015)] ERROR com.example.jms.PaxJmsClient - Exception received from jms
javax.jms.JMSException: Reconnecting to ssl://10.10.10.10:5071, attempt 1 out of 100
at com.tibco.tibjms.TibjmsConnection._invokeOnExceptionCallback(TibjmsConnection.java:2132)
at com.tibco.tibjms.TibjmsConnection._reconnect(TibjmsConnection.java:1975)
at com.tibco.tibjms.TibjmsConnection$ServerLinkEventHandler.onEventReconnect(TibjmsConnection.java:387)
at com.tibco.tibjms.TibjmsxLinkTcp._doReconnect(TibjmsxLinkTcp.java:598)
at com.tibco.tibjms.TibjmsxLinkTcp$LinkReader.work(TibjmsxLinkTcp.java:317)
at com.tibco.tibjms.TibjmsxLinkTcp$LinkReader.run(TibjmsxLinkTcp.java:259)
2017-10-13 15:43:35,358 [k Reader (Server-3285015)] ERROR com.example.jms.PaxJmsClient - Exception received from jms
javax.jms.JMSException: Reconnecting to ssl://11.11.11.11:5071, attempt 2 out of 100
at com.tibco.tibjms.TibjmsConnection._invokeOnExceptionCallback(TibjmsConnection.java:2132)
at com.tibco.tibjms.TibjmsConnection._reconnect(TibjmsConnection.java:1975)
at com.tibco.tibjms.TibjmsConnection$ServerLinkEventHandler.onEventReconnect(TibjmsConnection.java:387)
at com.tibco.tibjms.TibjmsxLinkTcp._doReconnect(TibjmsxLinkTcp.java:598)
at com.tibco.tibjms.TibjmsxLinkTcp$LinkReader.work(TibjmsxLinkTcp.java:317)
at com.tibco.tibjms.TibjmsxLinkTcp$LinkReader.run(TibjmsxLinkTcp.java:259)
2017-10-13 15:43:38,359 [k Reader (Server-3285015)] ERROR com.example.jms.PaxJmsClient - Exception received from jms
javax.jms.JMSException: Reconnecting to ssl://10.10.10.10:5071, attempt 2 out of 100
at com.tibco.tibjms.TibjmsConnection._invokeOnExceptionCallback(TibjmsConnection.java:2132)
at com.tibco.tibjms.TibjmsConnection._reconnect(TibjmsConnection.java:1975)
at com.tibco.tibjms.TibjmsConnection$ServerLinkEventHandler.onEventReconnect(TibjmsConnection.java:387)
at com.tibco.tibjms.TibjmsxLinkTcp._doReconnect(TibjmsxLinkTcp.java:598)
at com.tibco.tibjms.TibjmsxLinkTcp$LinkReader.work(TibjmsxLinkTcp.java:317)
at com.tibco.tibjms.TibjmsxLinkTcp$LinkReader.run(TibjmsxLinkTcp.java:259)
2017-10-13 15:44:41,368 [k Reader (Server-3285015)] ERROR com.example.jms.PaxJmsClient - Exception received from jms
javax.jms.JMSException: Reconnecting to ssl://11.11.11.11:5071, attempt 3 out of 100
at com.tibco.tibjms.TibjmsConnection._invokeOnExceptionCallback(TibjmsConnection.java:2132)
at com.tibco.tibjms.TibjmsConnection._reconnect(TibjmsConnection.java:1975)
at com.tibco.tibjms.TibjmsConnection$ServerLinkEventHandler.onEventReconnect(TibjmsConnection.java:387)
at com.tibco.tibjms.TibjmsxLinkTcp._doReconnect(TibjmsxLinkTcp.java:598)
at com.tibco.tibjms.TibjmsxLinkTcp$LinkReader.work(TibjmsxLinkTcp.java:317)
at com.tibco.tibjms.TibjmsxLinkTcp$LinkReader.run(TibjmsxLinkTcp.java:259)
2017-10-13 15:44:45,951 [k Reader (Server-3285015)] ERROR com.example.jms.PaxJmsClient - Exception received from jms
javax.jms.JMSException: Reconnecting to ssl://10.10.10.10:5071, attempt 3 out of 100
at com.tibco.tibjms.TibjmsConnection._invokeOnExceptionCallback(TibjmsConnection.java:2132)
at com.tibco.tibjms.TibjmsConnection._reconnect(TibjmsConnection.java:1975)
at com.tibco.tibjms.TibjmsConnection$ServerLinkEventHandler.onEventReconnect(TibjmsConnection.java:387)
at com.tibco.tibjms.TibjmsxLinkTcp._doReconnect(TibjmsxLinkTcp.java:598)
at com.tibco.tibjms.TibjmsxLinkTcp$LinkReader.work(TibjmsxLinkTcp.java:317)
at com.tibco.tibjms.TibjmsxLinkTcp$LinkReader.run(TibjmsxLinkTcp.java:259)
2017-10-13 15:44:50,525 [k Reader (Server-3285015)] ERROR com.example.jms.PaxJmsClient - Exception received from jms
javax.jms.JMSException: Connection unknown by server
at com.tibco.tibjms.Tibjmsx.buildException(Tibjmsx.java:659)
at com.tibco.tibjms.TibjmsConnection._invokeOnExceptionCallback(TibjmsConnection.java:2114)
at com.tibco.tibjms.TibjmsConnection._onDisconnected(TibjmsConnection.java:2487)
at com.tibco.tibjms.TibjmsConnection$ServerLinkEventHandler.onEventDisconnected(TibjmsConnection.java:367)
at com.tibco.tibjms.TibjmsxLinkTcp$LinkReader.work(TibjmsxLinkTcp.java:328)
at com.tibco.tibjms.TibjmsxLinkTcp$LinkReader.run(TibjmsxLinkTcp.java:259)
My code -
@PostConstruct
public void configurePaxJmsClient() {
    try {
        // create Topic Connection Factory
        TibjmsTopicConnectionFactory cf = new TibjmsTopicConnectionFactory(serverUrl);
        cf.setSSLTrustedCertificate(sslCertificatePath);
        cf.setSSLEnableVerifyHostName(false);
        cf.setUserName(username);
        cf.setUserPassword(password);
        cf.setReconnAttemptCount(100);
        cf.setReconnAttemptDelay(60000);
        cf.setReconnAttemptTimeout(10000);
        cf.setConnAttemptCount(100);
        cf.setConnAttemptDelay(60000);
        cf.setConnAttemptTimeout(10000);
        Tibjms.setExceptionOnFTEvents(true);
        Tibjms.setExceptionOnFTSwitch(true);
        // create the connection and install an exception handler
        connection = cf.createTopicConnection(username, password);
        connection.setExceptionListener(this);
        // You might also use CLIENT_ACKNOWLEDGE here
        session = connection.createTopicSession(false, javax.jms.Session.AUTO_ACKNOWLEDGE);
        Topic topic = session.createTopic(topicName);
        // Create the subscriber and install the listener
        TopicSubscriber ts;
        /*if (dsName == null || dsName.length() == 0) {
            ts = session.createSubscriber(topic);
        } else {
            ts = session.createDurableSubscriber(topic, dsName);
        }*/
        if (dsName == null || dsName.length() == 0) {
            ts = session.createSubscriber(topic, messageSelector, false);
        } else {
            ts = session.createDurableSubscriber(topic, dsName, messageSelector, false);
        }
        //
        ts.setMessageListener(this);
        connection.start();
    } catch (JMSException e) {
        LOGGER.error("Failed to connect with message:" + e.getMessage(), e);
        releaseResources();
    }
}

@Override
public void onException(JMSException e) {
    LOGGER.error("Exception received from jms", e);
}
Can you tell me what the problem is here, or point me in the right direction?
Also, is it fine to initialize the JMS connection in the @PostConstruct method of a Spring bean?

Why does EMS report “reconnect failed: connection unknown for id=xxxxx”?
This message indicates that the EMS server does not have, or no longer has, the client's connection information when the client attempts to reconnect.
There are two possible reasons:
The “ft_reconnect_timeout” parameter is not high enough: by the time the client reconnects to the server, the server has already purged the connection.
This can be resolved by setting a higher value for “ft_reconnect_timeout” in tibemsd.conf. The default value is 60 seconds.
“ft_reconnect_timeout” is the amount of time, in seconds, that a backup server waits for clients to reconnect
after it assumes the role of primary server in a failover; in other words, it specifies how long the server keeps pending connections.
If a client does not reconnect within this period, the server removes its state from the shared state files.
If the client then tries to reconnect after “ft_reconnect_timeout” has elapsed, the server no longer has the client's connection information and prints the "reconnect failed: connection unknown" message.
So I suggest setting the value according to your environment and testing it.
Note also that if ft_reconnect_timeout is set high, many connections and connection-related objects are kept in memory for a long time, which may cause memory issues. And if the connection uses a clientID, you may run into a “clientID already exists” error.
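To make the timing concrete: the client in the question calls setReconnAttemptDelay(60000), which is the same length as the default ft_reconnect_timeout of 60 seconds. The sketch below is only a rough model. It assumes (this is an assumption for illustration, not documented EMS behavior) that reconnect attempt k starts about k times the delay after the disconnect, and counts how many attempts start before the server discards the client's state. The class and method names are hypothetical.

```java
public class ReconnectWindow {
    /**
     * Counts reconnect attempts that start strictly before the server
     * discards the client's state. Assumption (hypothetical model): attempt k
     * (k = 1, 2, ...) starts k * delayMillis after the disconnect.
     */
    static int attemptsInsideWindow(int attemptCount, long delayMillis, long ftReconnectTimeoutSeconds) {
        long windowMillis = ftReconnectTimeoutSeconds * 1000L;
        int inside = 0;
        for (int k = 1; k <= attemptCount; k++) {
            if (k * delayMillis < windowMillis) {
                inside++;
            }
        }
        return inside;
    }

    public static void main(String[] args) {
        // Values from the question: 100 attempts, 60000 ms apart; server default of 60 s.
        System.out.println(attemptsInsideWindow(100, 60_000L, 60L));   // prints 0
        // Same client settings with a hypothetical ft_reconnect_timeout of 300 s.
        System.out.println(attemptsInsideWindow(100, 60_000L, 300L));  // prints 4
        // A hypothetical shorter client delay of 5000 ms against the default 60 s.
        System.out.println(attemptsInsideWindow(100, 5_000L, 60L));    // prints 11
    }
}
```

Under this model, either raising ft_reconnect_timeout on the server or lowering the client's reconnect delay gives the client attempts that fall inside the window; which one fits depends on the memory and clientID trade-offs mentioned above.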

Cannot programmatically submit Spark application (with Cassandra connector) to cluster from remote client

I'm running a standalone Spark cluster on EC2, and I'm writing an application that uses the Spark-Cassandra connector driver and tries to submit a job to the Spark cluster programmatically.
The job itself is simple:
public static void main(String[] args) {
    SparkConf conf;
    JavaSparkContext sc;
    conf = new SparkConf()
            .set("spark.cassandra.connection.host", host);
    conf.set("spark.driver.host", "[my_public_ip]");
    conf.set("spark.driver.port", "15000");
    sc = new JavaSparkContext("spark://[spark_master_host]", "test", conf);
    CassandraJavaRDD<CassandraRow> rdd = javaFunctions(sc).cassandraTable(
            "keyspace", "table");
    System.out.println(rdd.first().toString());
    sc.stop();
}
This runs fine when I run it on the Spark master node of my EC2 cluster.
I'm now trying to run it from a remote Windows client.
The problem comes from these two lines:
conf.set("spark.driver.host", "[my_public_ip]");
conf.set("spark.driver.port", "15000");
First, if I comment out these two lines, the application does not throw an exception, but the executor never runs, and I get the following log:
14/12/06 22:40:03 INFO client.AppClient$ClientActor: Executor updated: app-20141207033931-0021/3 is now LOADING
14/12/06 22:40:03 INFO client.AppClient$ClientActor: Executor updated: app-20141207033931-0021/0 is now EXITED (Command exited with code 1)
14/12/06 22:40:03 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141207033931-0021/0 removed: Command exited with code 1
This never ends. When I checked the worker node log, I found:
14/12/06 22:40:21 ERROR security.UserGroupInformation: PriviledgedActionException as:[username] cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
... 7 more
I have no idea what that's about. My guess is that the worker node could not connect to the driver, whose address was probably initially set as:
14/12/06 22:39:30 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@[some_host_name]:52660]
14/12/06 22:39:30 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@[some_host_name]:52660]
Obviously, no DNS is going to resolve my host name...
Since I can't set the deploy mode to "client" or "cluster" except via the ./spark-submit script (which I think is absurd...), I tried adding a host resolution entry "XX.XXX.XXX.XX [host-name]" to /etc/hosts on all Spark master and worker nodes.
No luck of course...
That leads me to the second case: un-commenting those two lines.
This gives me:
14/12/06 22:59:41 INFO Remoting: Starting remoting
14/12/06 22:59:41 ERROR Remoting: Remoting error: [Startup failed] [
akka.remote.RemoteTransportException: Startup failed
at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129)
at akka.remote.Remoting.start(Remoting.scala:194)
...
Cause:
Caused by: org.jboss.netty.channel.ChannelException: Failed to bind to: /[my_public_ip]:15000
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:391)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:388)
I double-checked my firewall and router settings and confirmed that my firewall is disabled. I ran netstat -an to confirm that port 15000 is not in use (in fact I tried several other available ports, with no luck), and I pinged my public IP from both another machine and a machine in my cluster, with no problem.
Now I'm utterly stuck and have run out of ideas for fixing this. Any suggestions? Any help is appreciated!

Please check whether port 15000 is open in your security group.
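Note also that "Failed to bind to: /[my_public_ip]:15000" is raised on the driver machine itself, when it tries to open a listening socket. One possible cause (an assumption about this setup, since the client sits behind a router) is that the public IP is not assigned to any local network interface. A minimal sketch for checking this locally, using only the JDK; the class name BindCheck and the default values are hypothetical:

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class BindCheck {
    /** Returns true if this machine can open a listening socket on host:port. */
    static boolean canBind(String host, int port) {
        try (ServerSocket socket = new ServerSocket()) {
            socket.bind(new InetSocketAddress(InetAddress.getByName(host), port));
            return true;
        } catch (IOException e) {
            // Typically a BindException: the address is not assigned to a local
            // interface, or the port is already in use.
            System.err.println("Cannot bind to " + host + ":" + port + " - " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Usage: java BindCheck <address> <port>; the defaults are placeholders.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 15000;
        System.out.println(host + ":" + port + " bindable: " + canBind(host, port));
    }
}
```

If the check fails for the public IP but succeeds for the machine's private address, the driver cannot listen on the public IP directly, and opening the port in the security group alone would not change that.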
