Finagle service discovery issue - java

Here I wanted to register two endpoints and send requests to both of them, as you can see in the code below. I call one env1 and the other env2.
import com.twitter.finagle.Http
import com.twitter.finagle.transport.Transport

val client = Http.client
  .configured(Transport.Options(noDelay = false, reuseAddr = false))
  .newService("gexampleapi-env1.localhost.net:8081,gexampleapi-env2.localhost.net:8081")
So far everything is normal. But at some point the env1 instance went down (a few hours of maintenance, etc.; we're not sure why). Under normal circumstances, our expectation is that the client keeps sending requests through the env2 instance. But this didn't happen: we could not send requests to either server. Normally this setup worked correctly, but that day it didn't, for a reason we don't know.
Since the event took place months ago, I only have the following log.
2022-02-15 12:09:40,181 [finagle/netty4-1-3] INFO com.twitter.finagle
FailureAccrualFactory marking connection to "gExampleAPI" as dead.
Remote Address:
Inet(gexampleapi-env1.localhost.net/10.0.0.1:8081,Map())
To solve the problem, we removed the gexampleapi-env1.localhost.net:8081 host from the config file, and after a restart it continued to process requests. If you have any ideas about why we may have experienced this problem and how to avoid it next time, I would appreciate it if you could share them.
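For later readers, a minimal sketch of where I would look first (an assumption, not a confirmed fix): the log above shows FailureAccrualFactory marking the host dead, and Finagle's FailFast session qualifier can additionally mark a host unavailable after connection failures. Both can be tuned or disabled on the client so the load balancer keeps routing to the remaining healthy host. A hypothetical Java equivalent of the client above:

import com.twitter.finagle.Http;
import com.twitter.finagle.Service;
import com.twitter.finagle.http.Request;
import com.twitter.finagle.http.Response;

// Sketch only: same two-host destination as above, with FailFast disabled
// so a down host keeps being retried by the load balancer instead of
// being marked unavailable.
Service<Request, Response> client = Http.client()
        .withSessionQualifier().noFailFast()
        .newService("gexampleapi-env1.localhost.net:8081,gexampleapi-env2.localhost.net:8081");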

Related

akka.pattern.AskTimeoutException while running Lagom HelloWorld example

I have a problem while trying my hand at the Hello World example explained here.
Kindly note that I have only modified the HelloEntity.java file to return something other than "Hello, World!". Most certainly my changes are taking time, and hence I am getting the timeout error below.
I am currently doing a PoC on a single node to understand the Lagom framework and do not have the liberty to deploy multiple nodes.
I have also tried modifying the default lagom.circuit-breaker in application.conf, setting "call-timeout = 100s"; however, this does not seem to have helped.
Following is the exact error message for your reference:
{"name":"akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://hello-impl-application/system/sharding/HelloEntity#1074448247]] after [5000 ms]. Sender[null] sent message of type \"com.lightbend.lagom.javadsl.persistence.CommandEnvelope\".","detail":"akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://hello-impl-application/system/sharding/HelloEntity#1074448247]] after [5000 ms]. Sender[null] sent message of type \"com.lightbend.lagom.javadsl.persistence.CommandEnvelope\".\n\tat akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:595)\n\tat akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:605)\n\tat akka.actor.Scheduler$$anon$4.run(Scheduler.scala:140)\n\tat scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:866)\n\tat scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109)\n\tat scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)\n\tat scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:864)\n\tat akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)\n\tat akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)\n\tat java.lang.Thread.run(Thread.java:748)\n"}
Question: Is there a way to increase the Akka timeout by modifying application.conf or any of the Java source files in the Hello World project? Can you please help me with the exact details?
Thanks in advance for your time and help.
The call timeout is the timeout for circuit breakers, which is configured using lagom.circuit-breaker.default.call-timeout. But that's not what is timing out above; the thing that is timing out is the request to your HelloEntity, and that timeout is configured using lagom.persistence.ask-timeout. The reason there's a timeout on requests to entities is that in a multi-node environment your entities are sharded across nodes, so an ask on them may go to another node, which is why a timeout is needed in case that node is not responding.
All that said, I don't think changing the ask-timeout will solve your problem. If you have a single node, then your entities should respond instantly if everything is working ok.
Is that the only error you're seeing in the logs?
Are you seeing this in dev mode (i.e., using the runAll command), or are you running the Lagom service some other way?
Is your database responding?
Thanks James for the help/pointer.
Adding the following lines to resources/application.conf did the trick for me:
lagom.persistence.ask-timeout = 30s

hello {
  ..
  ..
  call-timeout = 30s
  call-timeout = ${?CIRCUIT_BREAKER_CALL_TIMEOUT}
  ..
}
A Call is service-to-service communication: a ServiceClient communicating with a remote server. It uses a circuit breaker. It is an extra-service call.
An ask (in the context of lagom.persistence) is sending a command to a persistent entity. That happens across the nodes inside your Lagom service. It does not use circuit breaking. It is an intra-service call.
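To make the distinction concrete, here is a minimal sketch, assuming the HelloService, HelloEntity and HelloCommand classes from the Hello World example (the names and the "Alice" id are illustrative, not from the original question):

import java.util.concurrent.CompletionStage;
import com.lightbend.lagom.javadsl.persistence.PersistentEntityRef;
import com.lightbend.lagom.javadsl.persistence.PersistentEntityRegistry;

class TimeoutExamples {
    private final HelloService helloService;         // injected service client
    private final PersistentEntityRegistry registry; // injected registry

    TimeoutExamples(HelloService helloService, PersistentEntityRegistry registry) {
        this.helloService = helloService;
        this.registry = registry;
    }

    // Extra-service: goes through a ServiceClient and a circuit breaker,
    // so lagom.circuit-breaker.*.call-timeout applies.
    CompletionStage<String> serviceCall() {
        return helloService.hello("Alice").invoke();
    }

    // Intra-service: sends a command to a sharded persistent entity,
    // so lagom.persistence.ask-timeout applies; no circuit breaker involved.
    CompletionStage<String> entityAsk() {
        PersistentEntityRef<HelloCommand> ref = registry.refFor(HelloEntity.class, "Alice");
        return ref.ask(new HelloCommand.Hello("Alice"));
    }
}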

What causes ch.ethz.ssh2.Connection to have a hang time?

I am currently using ch.ethz.ssh2.Connection to connect to my servers in Java. Sometimes it hangs on one server (maybe 10-15 seconds). I want to know what causes this hang time and how to avoid it.
Connection sample
conn = new ch.ethz.ssh2.Connection(serverName);
conn.connect();
boolean isAuthenticated = conn.authenticateWithPassword(user, pass);
logger.info("Connecting to " + serverName);
if (!isAuthenticated) {
    logger.info(serverName + " Please check credentials");
}
sess = conn.openSession();
// I am connecting to over 200 servers and closing them. What would be the
// best practice to loop through all these servers in minimal time?
// Some servers connect quickly, while others take some time.
Why does this happen?
The main question is: is it a code problem, a network problem, or a server problem?
A code problem can be debugged; unfortunately, ch.ethz.ssh2.Connection does not have any logging facility to reveal what is going on inside.
Maybe you should think about switching the SSH library (or use another one for some tests against the problematic servers). From my experience, sshj is very useful.
If it is a network problem or a server problem, you can check what is going on via Wireshark. If network packets are sent but the response is delayed, the problem is not in the client-side code.
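Whatever the cause turns out to be, you can at least bound the hang on the client side: Ganymed's connect() has an overload taking explicit connect and key-exchange timeouts (a sketch; the 5-second values are illustrative, and serverName is the variable from the question):

// Sketch: fail after 5 seconds instead of hanging, for both the TCP
// connect and the SSH key exchange.
ch.ethz.ssh2.Connection conn = new ch.ethz.ssh2.Connection(serverName);
conn.connect(null, 5000, 5000); // (hostKeyVerifier, connectTimeout ms, kexTimeout ms)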
My psychic debugging powers tell me that the server is doing a DNS lookup on the IP address of each client that connects. These DNS lookups are either taking a long time to complete, or they're failing entirely. The DNS lookup blocks the authentication process until it finishes, successfully or not.
If the server is the OpenSSH server, this behavior is controlled by the sshd config "UseDNS" option (set UseDNS no in sshd_config to disable the lookup).

Failover for pooled LDAP connections

I am working on a Spring LDAP application for which I did none of the setup of the LDAP internals, but now I need to add a failover feature.
We supply our ContextSource with two space-separated URLs:
String theseUrls = primaryLdapUrl + " " + secondaryLdapUrl;
environment.put("java.naming.provider.url", theseUrls);
ilc = new InitialLdapContext(environment, null);
If the primary URL is functional, it connects to that; if not, it connects to the secondary just fine. The connections are then pooled, but I am having trouble figuring out the exact mechanics. As it is, because of the pooling, if the established connection goes down, the whole application falls over.
Is there a way to disable pooling, or to set a short timeout for it? I have done some research but can't find exact mechanics that work for me (including trying to call setPooled(false)). Ideally, the secondary server is only queried if the first is down; when the first is restored, it goes back to that.
NOTE: This URL (http://forum.spring.io/forum/spring-projects/data/ldap/34643-switching-ldap-contexts-for-failover) has given me a lot of ideas, but I can't get anything to work.
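One avenue worth sketching for later readers (assuming the JDK's built-in LDAP provider; the com.sun.jndi.ldap.* keys are the JDK's standard LDAP environment properties, and the timeout values are illustrative):

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.ldap.InitialLdapContext;

// Sketch: disable the JDK LDAP connection pool for this context and bound
// connect/read times so a dead primary fails over to the secondary quickly.
Hashtable<String, Object> environment = new Hashtable<>();
environment.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
environment.put(Context.PROVIDER_URL, primaryLdapUrl + " " + secondaryLdapUrl);
environment.put("com.sun.jndi.ldap.connect.pool", "false");   // no pooling
environment.put("com.sun.jndi.ldap.connect.timeout", "2000"); // ms to establish TCP
environment.put("com.sun.jndi.ldap.read.timeout", "5000");    // ms to wait for a reply
InitialLdapContext ilc = new InitialLdapContext(environment, null);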

Axis2 1.5.1 connections management

HTTP connections were not being used efficiently by our code in an Axis2 1.5.1 project. After setting a limit on max connections per host and stress-testing the application, responsiveness was not as good as I expected given the intended limits, and sometimes connections got stuck indefinitely, so fewer connections were available each time, until the point that no request was served by the application at all.
Configuration:
MultiThreadedHttpConnectionManager connManager = new MultiThreadedHttpConnectionManager();
HttpConnectionManagerParams connectionManagerParams = connManager.getParams();
connectionManagerParams.setMaxTotalConnections(httpMaxConnections);
connectionManagerParams.setDefaultMaxConnectionsPerHost(httpMaxConnectionsPerHost);

HttpClient httpClient = new HttpClient(connManager);

ConfigurationContext axisContext;
try {
    axisContext = ConfigurationContextFactory.createDefaultConfigurationContext();
} catch (Exception e) {
    throw new AxisFault(e.getMessage());
}
axisContext.setProperty(HTTPConstants.CACHED_HTTP_CLIENT, httpClient);

service = new MyStub(axisContext, url);
ServiceClient serviceClient = service._getServiceClient();
serviceClient.getOptions().setProperty(HTTPConstants.CONNECTION_TIMEOUT, httpConnectionTimeout);
serviceClient.getOptions().setProperty(HTTPConstants.SO_TIMEOUT, httpReadTimeout);
serviceClient.getOptions().setProperty(HTTPConstants.REUSE_HTTP_CLIENT, Constants.VALUE_TRUE);
So, as you can see, we're defining max. connections and timeouts.
I have a workaround I will share, hoping to help somebody in as much of a hurry as I was. I'll mark my answer as the accepted one in a few days if there isn't a better answer from the experts.
1) PoolTimeout, to prevent connections getting stuck (for any reason)
The following line helped us prevent Axis2 from losing connections that got stuck forever:
httpClient.getParams().setParameter(HttpClientParams.CONNECTION_MANAGER_TIMEOUT, 1000L);
Let's call it PoolTimeout in this entry. Make sure it's a Long, since an Integer (or int) would raise a ClassCastException that prevents your service from even being triggered from outside your client.
The system you're developing with Axis could in turn be a client of another system, and that other system will surely have a specific ConnectionTimeout, so I suggest:
PoolTimeout <= ConnectionTimeout
Example:
serviceClient.getOptions().setProperty(HTTPConstants.CONNECTION_TIMEOUT, httpConnectionTimeout);
httpClient.getParams().setParameter(HttpClientParams.CONNECTION_MANAGER_TIMEOUT, Long.valueOf(httpConnectionTimeout) );
2) Connections release
I was using Amila's suggestion for connection management, but the connections were actually not being released as fast as I expected (I had deliberately arranged the delay times with which the mocked external system responded to match the limits of my tuning configuration).
So I found that the following lines, placed in the method org.apache.axis2.client.OperationClient.executeImpl(boolean), helped mark the connection in the pool as available as soon as it has been used:
HttpMethod method = (HttpMethod) getOperationContext()
        .getMessageContext(WSDLConstants.MESSAGE_LABEL_OUT_VALUE)
        .getProperty(HTTPConstants.HTTP_METHOD);
method.releaseConnection();
That's what Axis tries to do when calling serviceClient.cleanupTransport(), but it seems the context is not the right one there.
Now performance tuning works in a predictable way, so it's in the hands of our integrators to select the tuning configuration that best suits production needs.
A better answer will be highly appreciated.

Zookeeper Network Ensemble does not start appropriately

I've been working with ZooKeeper lately to fill a reliability requirement in distributed applications. I'm working with three computers, and I followed this tutorial:
http://sanjivblogs.blogspot.ie/2011/04/deploying-zookeeper-ensemble.html
I followed it step by step to make sure I did it right, but now when I start my ZooKeepers with
./zkServer.sh start
I'm getting these exceptions on all my computers:
2013-04-05 21:46:58,995 [myid:2] - WARN [SendWorker:1:QuorumCnxManager$SendWorker#679] - Interrupted while waiting for message on queue
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1961)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2038)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:831)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:62)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:667)
2013-04-05 21:46:58,995 [myid:2] - WARN [SendWorker:1:QuorumCnxManager$SendWorker#688] - Send worker leaving thread
2013-04-05 21:47:58,363 [myid:2] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker#762] - Connection broken for id 3, my id = 2, error =
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:747)
But I don't know what I am doing wrong to get this. My objective is to synchronize my ZooKeepers on different machines so that the service is always available. I went to the zookeeper.apache.org web page and looked for information on how to configure and start ZooKeeper, but it gives the same steps I followed before.
If somebody could help me, I would really appreciate it. Thanks in advance.
I needed to follow some strict steps to achieve this, but I finally did it. If somebody is facing the same issue, remember the following to make the ZooKeeper ensemble work:
You need 3 ZooKeeper servers running (locally or over the network); this is the minimum number to achieve synchronization. On each server you need to create a file called "myid" (inside that server's dataDir); the content of each myid file must be a sequential number. For instance, I have three ZooKeeper servers (folders), so I have one myid with content 1, another with content 2, and another with content 3.
Then in zoo.cfg it is necessary to establish the required parameters:
tickTime=2000
#dataDir=/var/lib/zookeeper
dataDir=/home/mtataje/var/zookeeper1
clientPort=2184
initLimit=10
syncLimit=20
server.1=192.168.3.41:2888:3888
server.2=192.168.3.41:2889:3889
server.3=192.168.3.41:2995:2999
The zoo.cfg varies from one server to another; in my case, because I was testing locally, I needed to change the clientPort and the dataDir.
After that, execute:
./zkServer.sh start
Some exceptions may still appear at first; that is because at least two ZooKeepers must be up and synchronized to form a quorum. Once you have started at least two of them, the exceptions should be gone.
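If you want to verify the ensemble from code, here is a small sketch using the ZooKeeper Java client (2184 is the clientPort from the zoo.cfg above; adjust it per server):

import org.apache.zookeeper.ZooKeeper;

// Sketch: connect to one server and watch the session state move from
// CONNECTING to CONNECTED once a quorum (2 of 3 servers) is up.
public class EnsembleCheck {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2184", 3000, event -> {});
        for (int i = 0; i < 5; i++) {
            System.out.println("state = " + zk.getState());
            Thread.sleep(1000);
        }
        zk.close();
    }
}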
Best regards.
