Jenkins Slaves Randomly Disconnect from the Master - java

I'm currently setting up a VM cluster to run Jenkins for use with a large project.
I have Jenkins set up, all of the VMs are running Windows 7 64-bit with plenty of RAM and disk space, and the slave agents are deployed (running as a Windows service). I keep getting the following error after a few minutes:
Connection was broken
java.nio.channels.AsynchronousCloseException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(Unknown Source)
at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at hudson.remoting.SocketChannelStream$1.read(SocketChannelStream.java:33)
at sun.nio.ch.ChannelInputStream.read(Unknown Source)
at sun.nio.ch.ChannelInputStream.read(Unknown Source)
at sun.nio.ch.ChannelInputStream.read(Unknown Source)
at java.io.InputStream.read(Unknown Source)
at sun.nio.ch.ChannelInputStream.read(Unknown Source)
at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
at java.io.ObjectInputStream$PeekInputStream.peek(Unknown Source)
at java.io.ObjectInputStream$BlockDataInputStream.peek(Unknown Source)
at java.io.ObjectInputStream$BlockDataInputStream.peekByte(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.readObject(Unknown Source)
at hudson.remoting.Command.readFrom(Command.java:92)
at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:70)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
The slaves disconnect for 30 seconds to 2 minutes and then reconnect, and the response time also varies wildly, between 400 ms and 5 seconds.
The cluster is on its own switch and a ping from any machine returns a < 1ms time.
Any help?

There are a lot of bugs in the new NIO support in the Jenkins slave. We experienced similar serious instability issues up until the latest release as of this writing.
For example:
https://issues.jenkins-ci.org/browse/JENKINS-22758
This post deserves an update (August 2017): we are running 2.46 now and the slaves are a lot more stable.

Related

"Connection reset" when using SOAP

I have a SOAP service and some clients get a "Connection reset" error. But SOAP is stateless, so why doesn't the client simply reconnect and resend the request? It actually sends multiple messages in my use case, but the very first one fails, and that one just fetches some config data from the server. Is this something I need to configure? Should the client programmatically try to resend the message? Some users tried multiple times with the same result.
It never happened in the last years but now I get some reports of this problem.
The client uses an implementation of javax.xml.ws.Service, not just a raw socket. But even though I use JAX-WS I get the low-level error. It is wrapped in a WebServiceException, but that doesn't really help me fix this problem.
The clients all use Java 8. It's either Update 66 or Update 74.
I am not able to reproduce the problem myself, I only have log files from users.
Here's the complete stack trace:
javax.xml.ws.WebServiceException: java.net.SocketException: Connection reset
at com.sun.xml.internal.ws.transport.http.client.HttpClientTransport.readResponseCodeAndMessage(Unknown Source)
at com.sun.xml.internal.ws.transport.http.client.HttpTransportPipe.createResponsePacket(Unknown Source)
at com.sun.xml.internal.ws.transport.http.client.HttpTransportPipe.process(Unknown Source)
at com.sun.xml.internal.ws.transport.http.client.HttpTransportPipe.processRequest(Unknown Source)
at com.sun.xml.internal.ws.transport.DeferredTransportPipe.processRequest(Unknown Source)
at com.sun.xml.internal.ws.api.pipe.Fiber.__doRun(Unknown Source)
at com.sun.xml.internal.ws.api.pipe.Fiber._doRun(Unknown Source)
at com.sun.xml.internal.ws.api.pipe.Fiber.doRun(Unknown Source)
at com.sun.xml.internal.ws.api.pipe.Fiber.runSync(Unknown Source)
at com.sun.xml.internal.ws.client.Stub.process(Unknown Source)
at com.sun.xml.internal.ws.client.sei.SEIStub.doProcess(Unknown Source)
at com.sun.xml.internal.ws.client.sei.SyncMethodHandler.invoke(Unknown Source)
at com.sun.xml.internal.ws.client.sei.SyncMethodHandler.invoke(Unknown Source)
at com.sun.xml.internal.ws.client.sei.SEIStub.invoke(Unknown Source)
at com.sun.proxy.$Proxy31.getLimits(Unknown Source)
at xxxxxxxxxxxxx.SOAPServerAdapter.connect(Unknown Source)
at xxxxxxxxxxxxxxxxxxxx(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at sun.security.ssl.InputRecord.readFully(Unknown Source)
at sun.security.ssl.InputRecord.read(Unknown Source)
at sun.security.ssl.SSLSocketImpl.readRecord(Unknown Source)
at sun.security.ssl.SSLSocketImpl.readDataRecord(Unknown Source)
at sun.security.ssl.AppInputStream.read(Unknown Source)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTPHeader(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTP(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTP(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at java.net.HttpURLConnection.getResponseCode(Unknown Source)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(Unknown Source)
... 18 more
It turned out it was about IPv4 and IPv6. I don't have enough knowledge to give a perfect answer but I can post here what they told me. Maybe this helps other devs/users who have the same problem.
So some clients have unexpected connection resets and it's not about server load as it usually is.
If the ISP of a client tries to move away from IPv4, they give each user a unique IPv6 address (note that the ISP might do this gradually). They no longer really have an IPv4 address per client, other than the IPv4 addresses used locally, since most LANs still use something like 192.168.0.0/24.
Instead of classic IPv4 they use a transition mechanism (e.g. Dual-Stack Lite). Those clients do not have direct access to the IPv4 internet. So if your server only supports IPv4, they might experience problems similar to the ones you see when clients go through a proxy. They encapsulate IPv4 packets within IPv6 packets for some parts of the communication. From Wikipedia: "The original IPv4 packet is recovered and NAT is performed upon the IPv4 packet and is routed to the public IPv4 Internet."
I don't really know what exactly goes wrong here. Maybe the NAT runs out of addresses/ports or something like that. Or the process takes too long and the connection is reset by some node involved in the communication.
So there are two things to do:
Inform the ISP about those problems. They probably will help you trace the exact problem and help their clients so they can use your service. For that you need to know the ISP of the users that have the "connection reset" problem. Send them to https://www.whoismyisp.org/ or similar site.
Upgrade to IPv6 as soon as possible. Your server can use both versions of the protocol at the same time.
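If you want to check what a given client actually resolves for your service host, a quick diagnostic like the following can help confirm whether IPv6 (or only IPv4) is in play; the host name is a placeholder:

import java.net.Inet6Address;
import java.net.InetAddress;

public class ResolveCheck {
    public static void main(String[] args) throws Exception {
        // Prints every address resolved for the host, which shows whether
        // an AAAA (IPv6) and/or A (IPv4) record is being used by this client.
        for (InetAddress addr : InetAddress.getAllByName("service.example.com")) {
            String family = (addr instanceof Inet6Address) ? "IPv6" : "IPv4";
            System.out.println(family + ": " + addr.getHostAddress());
        }
    }
}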
Check the load on your server. It looks like the server is closing connections because of load - exception on web-service call
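As for the question of whether the client should programmatically resend the request: retrying only helps when the reset is transient (some users here retried with the same result), but if you want to try it, a minimal sketch around the JAX-WS call could look like this. The callWithRetry helper, the backoff and the attempt count are made up for illustration:

import java.util.concurrent.Callable;
import javax.xml.ws.WebServiceException;

public class RetryingCall {
    // Retries a JAX-WS call a fixed number of times with a simple linear backoff.
    static <T> T callWithRetry(Callable<T> call, int maxAttempts) throws Exception {
        WebServiceException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (WebServiceException e) {
                last = e; // e.g. the wrapped "Connection reset" from the trace above
                Thread.sleep(1000L * attempt);
            }
        }
        throw last;
    }
}

Usage would be something like callWithRetry(() -> port.getLimits(), 3), where port is the proxy obtained from the javax.xml.ws.Service instance; getLimits is taken from the stack trace above, so treat it as a placeholder.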

Apache Tomcat Connection Refused error while load testing with 1000 users using JMeter

I have deployed a Java EE application on Linux with Apache Tomcat 7.0.42.
Everything works fine when I load test with 100 users using JMeter (100 concurrent thread requests),
but as soon as I change the number of users (threads) to 1000, the server chokes and gives a "Connection refused" error for all requests after roughly the first 600.
I have done all the fine tuning in the application and it is more or less a static web page now; even then it comes back with the error.
Server Configuration: Ubuntu, 8 vCPU / 32 GB RAM / 960 GB HD
PS: The same application works well in AWS (Amazon Web Services), so you can rule out any problem with the machine running JMeter (the client).
org.apache.http.conn.HttpHostConnectException: Connection to http://a.b.c.d:8080 refused
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:645)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:480)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl.sample(HTTPHC4Impl.java:286)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerProxy.sample(HTTPSamplerProxy.java:62)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1088)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1077)
at org.apache.jmeter.threads.JMeterThread.process_sampler(JMeterThread.java:428)
at org.apache.jmeter.threads.JMeterThread.run(JMeterThread.java:256)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.ConnectException: Connection timed out: connect
at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
at java.net.DualStackPlainSocketImpl.socketConnect(Unknown Source)
at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
at java.net.PlainSocketImpl.connect(Unknown Source)
at java.net.SocksSocketImpl.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:127)
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
... 12 more
Try adjusting the maxThreads and acceptCount attributes of the http connector in server.xml:
Each incoming request requires a thread for the duration of that request. If more simultaneous requests are received than can be handled by the currently available request processing threads, additional threads will be created up to the configured maximum (the value of the maxThreads attribute). If still more simultaneous requests are received, they are stacked up inside the server socket created by the Connector, up to the configured maximum (the value of the acceptCount attribute). Any further simultaneous requests will receive "connection refused" errors, until resources are available to process them.
Reference: http://tomcat.apache.org/tomcat-7.0-doc/config/http.html
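For completeness: if you ever run Tomcat embedded instead of from server.xml, the same two attributes can be set programmatically on the connector. A rough sketch, with arbitrary example values and the actual webapp deployment omitted:

import org.apache.catalina.connector.Connector;
import org.apache.catalina.startup.Tomcat;

public class TunedTomcat {
    public static void main(String[] args) throws Exception {
        Tomcat tomcat = new Tomcat();

        // Same attributes as on the Connector element in server.xml.
        Connector connector = new Connector("HTTP/1.1");
        connector.setPort(8080);
        connector.setProperty("maxThreads", "400");   // max request-processing threads
        connector.setProperty("acceptCount", "200");  // backlog once all threads are busy
        tomcat.getService().addConnector(connector);

        tomcat.start();
        tomcat.getServer().await();
    }
}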
Thank you all!!
The problem was actually with the network: when we tested using different IP addresses (IP spoofing), all requests were successful. The network thought it was a DoS attack.
Thanks all. I had already tried maxThreads and acceptCount and did a lot of tuning in Linux.
So the lesson is: conduct the performance test from a machine located in the same zone.
Possibly 1000 concurrent requests (in one second) is unrealistic. A better test would be to distribute the 1000 requests over an interval of time.
E.g. the referenced image showed 100 requests executed over a period of 60 seconds, i.e. almost two requests per second.

Diagnose intermittent connection timeout?

I have a Java client that spawns a thread to hit a servlet, retrieve the last few lines of the logs on the server, and show the retrieved log lines on the client. Every once in a while, the log thread times out.
The application server is Tomcat, but the error is intermittently reproducible on both Tomcat and WebSphere, with the client on Windows and the server on Windows. With the client on Windows and the server on AIX, the problem has not occurred so far. I should mention that the code was stable for quite a few iterations and then suddenly started giving these problems.
What I have tried so far
1. The log-reading client invokes the thread every 0.1 s (using a sleep). I tried increasing the sleep time to 5 s in the code, but it did not help.
2. When creating the URLConnection object, I set properties like connectTimeout and readTimeout. I don't think readTimeout can be the cause, because that would have thrown a socket exception.
3. I tried working with the Tomcat configuration:
<Connector port="9962" protocol="HTTP/1.1" connectionTimeout="200000" redirectPort="8445" acceptCount="30" />
4. The URL connection is disconnect()ed after use.
5. The stack trace seems to imply that the request never reached the application server. Could this be because of some OS-level limit on connections? But in that case there would have been an entry in the Windows Event Viewer.
java.net.ConnectException: Connection timed out: connect
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(Unknown Source)
at java.net.PlainSocketImpl.connectToAddress(Unknown Source)
at java.net.PlainSocketImpl.connect(Unknown Source)
at java.net.SocksSocketImpl.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at sun.net.NetworkClient.doConnect(Unknown Source)
at sun.net.www.http.HttpClient.openServer(Unknown Source)
at sun.net.www.http.HttpClient.openServer(Unknown Source)
at sun.net.www.http.HttpClient.<init>(Unknown Source)
at sun.net.www.http.HttpClient.New(Unknown Source)
at sun.net.www.http.HttpClient.New(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.connect(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(Unknown Source)
How would you go about diagnosing this problem? The server logs don't show anything suspicious. To the best of my knowledge there are no other networking devices between the client and server, so no proxy should be involved, and the firewall is switched off.
I have not used keep alive thus far.
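For reference, the client-side pattern described in points 2 and 4 above looks roughly like this; host, port, path and timeout values are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class LogTailClient {
    static String fetchTail() throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://appserver:9962/logviewer/tail").openConnection();
        conn.setConnectTimeout(5000); // fail fast if the TCP connect hangs
        conn.setReadTimeout(10000);   // fail if the servlet stops responding mid-read
        try {
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
            in.close();
            return sb.toString();
        } finally {
            conn.disconnect(); // release the underlying connection after each poll
        }
    }
}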
It is difficult to predict what is causing this. However, your next step should be to try running a packet sniffer on the client and/or server to see whether the TCP connection requests are making it to the Windows machine.
If the problem occurs with both Tomcat and WebSphere, that would imply the cause is at a lower level, i.e. in the OS TCP/IP stacks, the firewall, or the network. (And if the server is running in a virtual machine, it could be a drop-out in the virtual networking.)

Can we disable JVM HeartBeat or prevent it from killing my applet

My applet is getting unexpectedly terminated. In the log, I can see:
JVM heartbeat .. dead, send ts: 654648165466, now ts: 654658163729, dT 9998263
I added a shutdown hook to see what is killing the applet, and it printed the following:
stack Trace .........Thread[Java Plug-In Heartbeat Thread,5,main]
java.lang.Object.wait(Native Method)
java.lang.Thread.join(Unknown Source)
java.lang.Thread.join(Unknown Source)
java.lang.ApplicationShutdownHooks.runHooks(Unknown Source)
java.lang.ApplicationShutdownHooks$1.run(Unknown Source)
java.lang.Shutdown.runHooks(Unknown Source)
java.lang.Shutdown.sequence(Unknown Source)
java.lang.Shutdown.exit(Unknown Source)
java.lang.Runtime.exit(Unknown Source)
java.lang.System.exit(Unknown Source)
sun.plugin2.main.client.PluginMain.exit(Unknown Source)
sun.plugin2.main.client.PluginMain.access$1300(Unknown Source)
sun.plugin2.main.client.PluginMain$HeartbeatThread.run(Unknown Source)
I don't understand why the heartbeat thread is doing anything here. I don't have multiple JVMs, and it is a single applet without any socket communication. The only data transfer to the server is over HTTP.
From the source code of PluginMain, it looks like we can prevent the HeartbeatThread from starting by setting the environment variable JPI_PLUGIN2_NO_HEARTBEAT to some value:
http://www.javasourcecode.org/html/open-source/jdk/jdk-6u23/sun/plugin2/main/client/PluginMain.java.html

Tomcat Java Server Application Does Not Recover From Multiple Dependent java.net.SocketTimeoutExceptions

I have a Java/JSP web application served by Tomcat that makes web service calls out to a partner web service to retrieve data. The technologies used in the partner service are unknown. The partner web service has frequent extended outages during which calls fail with a SocketTimeoutException:
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(Unknown Source)
at java.net.PlainSocketImpl.connectToAddress(Unknown Source)
at java.net.PlainSocketImpl.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at sun.net.NetworkClient.doConnect(Unknown Source)
at sun.net.www.http.HttpClient.openServer(Unknown Source)
at sun.net.www.http.HttpClient.openServer(Unknown Source)
at sun.net.www.protocol.https.HttpsClient.<init>(Unknown Source)
at sun.net.www.protocol.https.HttpsClient.New(Unknown Source)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
If the partner web service has a brief outage, then recovers quickly, my application handles everything nicely.
If the partner web service has an extended outage of over an hour, and my application has made hundreds of calls to the service that have all timed out, at some point my application reaches a state from which it does not recover. The partner service comes back, but my application's calls to it still result in exactly the same SocketTimeoutException.
If I restart Tomcat at that point, everything works fine afterwards.
I am not using HTTP keep-alives. My code is meticulous about cleaning up all object instances regardless of whether exceptions occur. It seems like the Tomcat Java process is "using up" some resource (sockets?), throwing one away with each error, until there are none left to use. Has anyone seen this before, and is there an apparent solution? I have done a lot of searching on the matter and have not found anyone with an identical problem.
Thanks in advance!
John
I had a situation in the past where I was running out of slots in the TCP/IP stack for connections that were stuck in the TIME_WAIT state; there are hard limits compiled into the operating system that you can bump up against. The way to find out what the limit is is to use a tool like netstat; if you are running on Windows Server you can also use some of the tools from Sysinternals.
The solution to your problem might be a design pattern called Circuit Breaker, which is explained in the book Release It!: http://pragprog.com/book/mnee/release-it
With the circuit breaker pattern, your calls to the remote web service flow through the breaker, which opens when too many calls to the remote service fail. While the breaker is in the open state, calls to the remote service fail right away inside the breaker code; usually you can program the breaker to periodically let a trial call through to see whether it can close again. Anyway, the book has a better explanation than the brief one I just gave you.
https://bitbucket.org/asaikali/circuitbreaker/ has an open source sample implementation of the CircuitBreaker pattern.
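For illustration, a bare-bones version of the pattern (not taken from the linked sample; thresholds and naming are made up) could look like this:

import java.util.concurrent.Callable;

public class CircuitBreaker {
    private enum State { CLOSED, OPEN }

    private final int failureThreshold;
    private final long openTimeoutMillis;

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    public CircuitBreaker(int failureThreshold, long openTimeoutMillis) {
        this.failureThreshold = failureThreshold;
        this.openTimeoutMillis = openTimeoutMillis;
    }

    // Synchronized for simplicity; a production breaker would not hold the lock
    // for the duration of the remote call.
    public synchronized <T> T call(Callable<T> remoteCall) throws Exception {
        if (state == State.OPEN
                && System.currentTimeMillis() - openedAt < openTimeoutMillis) {
            // Fail fast instead of tying up another socket on a service known to be down.
            throw new IllegalStateException("Circuit open: remote service unavailable");
        }
        // Either the breaker is closed, or the open timeout elapsed and one trial call is let through.
        try {
            T result = remoteCall.call();
            consecutiveFailures = 0;
            state = State.CLOSED;
            return result;
        } catch (Exception e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = System.currentTimeMillis();
            }
            throw e;
        }
    }
}

Wrapping each partner-service call in something like breaker.call(...) means that during a long outage most requests fail immediately in your own code rather than each one waiting out a connect timeout and holding a socket, and the periodic trial call lets the application recover on its own once the partner service is back.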
