I have a client that needs to disconnect from one server and connect to another. It's taking about 16 seconds. I still haven't debugged the connection logic, but I can see that shutting down the channel takes 5 seconds. Is this expected behavior, or should I be looking for thread starvation in my code?
LOG.debug("==============SHUTTING DOWN MANAGED CHANNEL");
long startTime=System.currentTimeMillis();
channel.shutdown().awaitTermination(20, SECONDS);
long endTime=System.currentTimeMillis();
LOG.debug("Time to shutdown channel ms = {}",endTime-startTime);
LOG.debug("==============RETURN FROM SHUTTING DOWN MANAGED CHANNEL");
From the log:
2018-07-09 14:41:23,143 DEBUG [com.ticomgeo.ftc.client.FTCClient] (EE-ManagedExecutorService-singleThreaded-Thread-1) ==============SHUTTING DOWN MANAGED CHANNEL
2018-07-09 14:41:28,151 INFO [io.grpc.internal.ManagedChannelImpl] (grpc-default-worker-ELG-1-1) [io.grpc.internal.ManagedChannelImpl-1] Terminated
2018-07-09 14:41:28,152 DEBUG [com.ticomgeo.ftc.client.FTCClient] (EE-ManagedExecutorService-singleThreaded-Thread-1) Time to shutdown channel ms = 5009
2018-07-09 14:41:28,152 DEBUG [com.ticomgeo.ftc.client.FTCClient] (EE-ManagedExecutorService-singleThreaded-Thread-1) ==============RETURN FROM SHUTTING DOWN MANAGED CHANNEL
There are two shutdown methods, shutdown and shutdownNow. Is there any chance you have calls in flight that are blocking shutdown? You may be better served by shutdownNow.
shutdown
Initiates an orderly shutdown in which preexisting calls continue but new calls are rejected.
shutdownNow
Initiates a forceful shutdown in which preexisting and new calls are rejected. Although forceful, the shutdown process is still not instantaneous; isTerminated() will likely return false immediately after this method returns.
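If an in-flight call is what is holding the orderly shutdown open, a fallback pattern like the following may help (a minimal sketch; the method name and timeout values are illustrative, not part of your code):
import io.grpc.ManagedChannel;
import java.util.concurrent.TimeUnit;

// Sketch: try an orderly shutdown briefly, then force it with shutdownNow().
void closeChannel(ManagedChannel channel) throws InterruptedException {
    channel.shutdown();                                   // reject new calls, let existing calls finish
    if (!channel.awaitTermination(1, TimeUnit.SECONDS)) { // short window for the orderly path
        channel.shutdownNow();                            // cancel calls that are still in flight
        channel.awaitTermination(5, TimeUnit.SECONDS);    // wait for the forced shutdown to complete
    }
}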
Related
I have deployed a Spring Boot application with a database-based job queue on App Service.
Yesterday I performed a few scale-out and scale-in operations while the application was running, to see how it would behave.
At some point (not necessarily related to the scaling operations) the application started to throw Hikari errors.
com.zaxxer.hikari.pool.PoolBase : HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection@1ae66f34 (This connection has been closed.). Possibly consider using a shorter maxLifetime value.
com.zaxxer.hikari.pool.ProxyConnection : HikariPool-1 - Connection org.postgresql.jdbc.PgConnection@1ef85079 marked as broken because of SQLSTATE(08006), ErrorCode(0)
The following are stack traces from my scheduled job in Spring and other information:
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
Caused by: javax.net.ssl.SSLException: Connection reset by peer (Write failed)
Suppressed: java.net.SocketException: Broken pipe (Write failed)
Caused by: java.net.SocketException: Connection reset by peer (Write failed)
Next came the following errors:
WARN 1 --- [ scheduling-1] com.zaxxer.hikari.pool.PoolBase : HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection@48d0d6da (This connection has been closed.). Possibly consider using a shorter maxLifetime value.
org.springframework.jdbc.support.MetaDataAccessException: Error while extracting DatabaseMetaData; nested exception is java.sql.SQLException: Connection is closed
Caused by: java.sql.SQLException: Connection is closed
The code which is invoked periodically (every 500 milliseconds) is here:
@Scheduled(fixedDelayString = "${worker.delay}")
@Transactional
public void execute() {
    jobManager.next(jobClass).ifPresent(this::handleJob);
}
Update.
The above code does nothing almost all of the time, since there was no traffic on the website.
Update 2. I've checked the Postgres logs and found this:
2020-07-11 22:48:09 UTC-5f0866f0.f0-LOG: checkpoint starting: immediate force wait
2020-07-11 22:48:10 UTC-5f0866f0.f0-LOG: checkpoint complete (240): wrote 30 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.046 s, sync=0.046 s, total=0.437 s; sync files=13, longest=0.009 s, average=0.003 s; distance=163 kB, estimate=13180 kB
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: received immediate shutdown request
2020-07-11 22:48:10 UTC-5f0a3f41.8914-WARNING: terminating connection because of crash of another server process
2020-07-11 22:48:10 UTC-5f0a3f41.8914-DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
// Same text about 10 times
2020-07-11 22:48:10 UTC-5f0866f2.7c-HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: src/port/kill.c(84): Process (272) exited OOB of pgkill.
2020-07-11 22:48:10 UTC-5f0866f1.fc-WARNING: terminating connection because of crash of another server process
2020-07-11 22:48:10 UTC-5f0866f1.fc-DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-07-11 22:48:10 UTC-5f0866f1.fc-HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: archiver process (PID 256) exited with exit code 1
2020-07-11 22:48:11 UTC-5f0866ee.68-LOG: database system is shut down
It looks like it is a problem with the Azure PostgreSQL server and it shut itself down. Am I reading this right?
As mentioned in your logs, have you tried setting the maxLifetime property for HikariCP? I think the issue should be resolved after setting that property.
Based on the HikariCP documentation (https://github.com/brettwooldridge/HikariCP):
maxLifetime
This property controls the maximum lifetime of a connection in the pool. An in-use connection will never be retired, only when it is closed will it then be removed. On a connection-by-connection basis, minor negative attenuation is applied to avoid mass-extinction in the pool. We strongly recommend setting this value, and it should be several seconds shorter than any database or infrastructure imposed connection time limit. A value of 0 indicates no maximum lifetime (infinite lifetime), subject of course to the idleTimeout setting. The minimum allowed value is 30000ms (30 seconds). Default: 1800000 (30 minutes)
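If it helps, a minimal sketch of setting it programmatically is below (the JDBC URL and the 10-minute value are placeholders, not your settings; with Spring Boot the equivalent property should be spring.datasource.hikari.max-lifetime):
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import javax.sql.DataSource;

// Sketch only: URL and timings are illustrative; pick a maxLifetime a bit
// shorter than whatever connection limit Azure enforces on the server side.
public class PoolConfig {
    public static DataSource createDataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://example.postgres.database.azure.com:5432/mydb"); // placeholder
        config.setMaxLifetime(600_000);      // 10 minutes
        config.setConnectionTimeout(30_000); // fail fast if no connection can be obtained
        return new HikariDataSource(config);
    }
}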
I am making a blocking service call in one of my worker verticles, which logged a warning. This was "addressed" by increasing the time limit, but I am more curious about how to read the thread name in the log line: vert.x-worker-thread-3,5,main. The full log was this:
io.vertx.core.impl.BlockedThreadChecker
WARNING: Thread Thread[vert.x-worker-thread-3,5,main] has been blocked for 64134 ms, time limit is 60000
io.vertx.core.VertxException: Thread blocked
What does the 3,5,main indicate? Is it some kind of trace from the main verticle? Thanks.
That is the standard java.lang.Thread.toString() format, Thread[name,priority,threadGroupName]:
Thread name: vert.x-worker-thread-3
Thread priority: 5
Thread group: main
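A trivial sketch to see that format for yourself (nothing Vert.x-specific, it just prints the current thread):
// Prints something like: Thread[main,5,main]
// i.e. Thread[threadName,priority,threadGroupName]
public class ThreadNameDemo {
    public static void main(String[] args) {
        System.out.println(Thread.currentThread());
    }
}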
I have an application that is using a Future for asynchronous execution.
I set the timeout parameter on the get method, so that the thread should get killed after 10 seconds when it does not get a response:
Future<RecordMetadata> meta = producer.send(record, new ProducerCallBack());
RecordMetadata data = meta.get(10, TimeUnit.SECONDS);
But the thread gets killed after 60 seconds:
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
at org.apache.kafka.clients.producer.KafkaProducer$FutureFailure.<init>(KafkaProducer.java:1124)
at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:823)
at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:760)
at io.khinkali.KafkaProducerClient.main(KafkaProducerClient.java:49)
Caused by: org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
What am I doing wrong?
From the docs:
The threshold for time to block is determined by max.block.ms after
which it throws a TimeoutException.
Check the Kafka appender config in logback.xml and look for:
<producerConfig>max.block.ms=60000</producerConfig>
I set the timeout parameter on the get method, so that the thread should get killed after 10 seconds when it does not get a response:
If we are talking about Future.get(...) there is nothing about it that "kills" the thread at all. To quote from the javadocs, the Future.get(...) method:
Waits if necessary for at most the given time for the computation to complete, and then retrieves its result, if available.
If the get(...) method times out then it will throw TimeoutException, but your thread is free to continue running. If you want to stop the thread then you'll need to catch TimeoutException and call meta.cancel(true), but even that doesn't guarantee that the thread will be "killed"; it causes the thread to be interrupted, which means that certain methods will throw InterruptedException, or the thread needs to check Thread.currentThread().isInterrupted().
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
Right, that timeout has nothing to do with the Future.get(...) timeout; it comes from the producer's own max.block.ms setting.
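To make the two knobs explicit, here is a rough sketch (broker address, topic, and timeout values are placeholders) that caps the producer's own blocking with max.block.ms and handles the Future.get(...) timeout on the caller side:
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

// Sketch only: bootstrap server, topic and timeouts are illustrative values.
public class ProducerTimeoutDemo {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 10000); // cap metadata/buffer blocking at 10 s

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            Future<RecordMetadata> meta = producer.send(new ProducerRecord<>("demo-topic", "hello"));
            try {
                RecordMetadata data = meta.get(10, TimeUnit.SECONDS); // caller-side wait; kills nothing
                System.out.println("offset = " + data.offset());
            } catch (TimeoutException te) {
                meta.cancel(true); // best effort; the send may still complete in the background
            }
        }
    }
}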
I've been working with annotated WebSockets lately, using the Jetty API (9.4.5 release), and built a chat with it.
However, I have an issue: after 5 minutes (which I believe is the default timeout), the session is closed (it is not due to an error).
The only solution I've found so far is to notify my socket on the close event and reopen the connection in a new socket.
However, I've read on Stack Overflow that by setting the idle timeout in the WebSocketPolicy I could avoid the issue.
I've tried setting it to 3600000, for instance, but the behavior does not change at all.
I also tried to set it to -1, but I get the following error: IdleTimeout [-1] must be a greater than or equal to 0
private ServletContextHandler setupWebsocketContext() {
    ServletContextHandler websocketContext = new AmosContextHandler(ServletContextHandler.SESSIONS | ServletContextHandler.SECURITY);

    WebSocketHandler socketCreator = new WebSocketHandler() {
        @Override
        public void configure(WebSocketServletFactory factory) {
            factory.getPolicy().setIdleTimeout(-1);
            factory.getPolicy().setMaxTextMessageBufferSize(MAX_MESSAGE_SIZE);
            factory.getPolicy().setMaxBinaryMessageBufferSize(MAX_MESSAGE_SIZE);
            factory.getPolicy().setMaxTextMessageSize(MAX_MESSAGE_SIZE);
            factory.getPolicy().setMaxBinaryMessageSize(MAX_MESSAGE_SIZE);
            factory.setCreator(new UpgradedSocketCreator());
        }
    };

    ServletHolder sh = new ServletHolder(new WebsocketChatServlet());
    websocketContext.addServlet(sh, "/*");
    websocketContext.setContextPath("/Chat");
    websocketContext.setHandler(socketCreator);
    websocketContext.getSessionHandler().setMaxInactiveInterval(0);
    return websocketContext;
}
I've also tried to change the policy directly in the OnConnect event by calling session.getPolicy().setIdleTimeout(), but I haven't noticed any change.
Is this expected behavior, or am I missing something? Thanks for your help.
EDIT:
Log on the closure:
Client Side:
2017-07-03T12:48:00.552 DEBUG HttpClient#179313750-scheduler Ignored idle endpoint SocketChannelEndPoint#2fb4b627{localhost/127.0.0.1:5080<->/127.0.0.1:53835,OPEN,fill=-,flush=-,to=1/300000}{io=0/0,kio=0,kro=1}->WebSocketClientConnection#e0198ece[ios=IOState#3ac0ec79[CLOSING,in,!out,close=CloseInfo[code=1000,reason=null],clean=false,closeSource=LOCAL],f=Flusher[queueSize=0,aggregateSize=0,failure=null],g=Generator[CLIENT,validating],p=Parser#65c4d838[ExtensionStack,s=START,c=0,len=187,f=null]]
Server side:
2017-07-03T12:48:00.595 DEBUG Idle pool thread onClose WebSocketServerConnection#e0033d54[ios=IOState#10d40dca[CLOSED,!in,!out,finalClose=CloseInfo[code=1000,reason=null],clean=true,closeSource=REMOTE],f=Flusher[queueSize=0,aggregateSize=0,failure=null],g=Generator[SERVER,validating],p=Parser#317213f3[ExtensionStack,s=START,c=0,len=2,f=CLOSE[len=2,fin=true,rsv=...,masked=true]]]<-SocketChannelEndPoint#690dfbfb'{'/127.0.0.1:53835<->/127.0.0.1:5080,CLOSED,fill=-,flush=-,to=1/360000000}'{'io=0/0,kio=-1,kro=-1}->WebSocketServerConnection#e0033d54[ios=IOState#10d40dca[CLOSED,!in,!out,finalClose=CloseInfo[code=1000,reason=null],clean=true,closeSource=REMOTE],f=Flusher[queueSize=0,aggregateSize=0,failure=null],g=Generator[SERVER,validating],p=Parser#317213f3[ExtensionStack,s=START,c=0,len=2,f=CLOSE[len=2,fin=true,rsv=...,masked=true]]]
2017-07-03T12:48:00.595 DEBUG Idle pool thread org.eclipse.jetty.util.thread.Invocable$InvocableExecutor#4f13dee2 invoked org.eclipse.jetty.io.ManagedSelector$$Lambda$193/682154970#551e133a
2017-07-03T12:48:00.595 DEBUG Idle pool thread EatWhatYouKill#6ba355e4/org.eclipse.jetty.io.ManagedSelector$SelectorProducer#7b1559f1/PRODUCING/0/1 produce exit
2017-07-03T12:48:00.595 DEBUG Idle pool thread ran EatWhatYouKill#6ba355e4/org.eclipse.jetty.io.ManagedSelector$SelectorProducer#7b1559f1/PRODUCING/0/1
2017-07-03T12:48:00.595 DEBUG Idle pool thread run EatWhatYouKill#6ba355e4/org.eclipse.jetty.io.ManagedSelector$SelectorProducer#7b1559f1/PRODUCING/0/1
2017-07-03T12:48:00.595 DEBUG Idle pool thread EatWhatYouKill#6ba355e4/org.eclipse.jetty.io.ManagedSelector$SelectorProducer#7b1559f1/PRODUCING/0/1 run
2017-07-03T12:48:00.597 DEBUG Idle pool thread 127.0.0.1 has disconnected !
2017-07-03T12:48:00.597 DEBUG Idle pool thread Disconnected: 127.0.0.1 (127.0.0.1) (statusCode= 1,000 , reason=null)
Annotated WebSockets have their own timeout setting in the annotation:
@WebSocket(maxIdleTime=30000)
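For instance, a bare-bones annotated endpoint might look like this (a sketch only; the class name and handler bodies are illustrative, and the one-hour value is just an example):
import org.eclipse.jetty.websocket.api.Session;
import org.eclipse.jetty.websocket.api.annotations.OnWebSocketClose;
import org.eclipse.jetty.websocket.api.annotations.OnWebSocketConnect;
import org.eclipse.jetty.websocket.api.annotations.OnWebSocketMessage;
import org.eclipse.jetty.websocket.api.annotations.WebSocket;

// Sketch: one-hour idle timeout declared on the annotation itself.
@WebSocket(maxIdleTime = 3600000)
public class ChatSocket {

    @OnWebSocketConnect
    public void onConnect(Session session) {
        // the session policy now carries the 1 h idle timeout
    }

    @OnWebSocketMessage
    public void onMessage(Session session, String message) {
        // handle an incoming chat message
    }

    @OnWebSocketClose
    public void onClose(int statusCode, String reason) {
        // cleanup
    }
}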
The @WebSocket annotation has the option:
int maxIdleTime() default -2;
In fact, it's not clear what that default means.
If you check implementation, you can find:
if (anno.maxIdleTime() > 0)
{
    this.policy.setIdleTimeout(anno.maxIdleTime());
}
The method implementation:
/**
 * The time in ms (milliseconds) that a websocket may be idle before closing.
 *
 * @param ms
 *            the timeout in milliseconds
 */
public void setIdleTimeout(long ms)
{
    assertGreaterThan("IdleTimeout", ms, 0);
    this.idleTimeout = ms;
}
and finally:
/**
* The time in ms (milliseconds) that a websocket may be idle before closing.
* <p>
* Default: 300000 (ms)
*/
private long idleTimeout = 300000;
Conclusion: a negative value falls back to the default behavior (300000 ms). You need to configure the idle timeout according to your own requirements.
PS: I solved my case with:
@WebSocket(maxIdleTime = Integer.MAX_VALUE)
I have a gobbler that reads the output from a Process.
There is a case where we kill the process programmatically using its PID and the external Windows taskkill command.
It is a 16-bit DOS process
We use taskkill because it is a 16-bit DOS process and Process.destroyForcibly() does not work with it, since it resides in the ntvdm subsystem. The best approach is to get the PID and use 'taskkill /T /F', which does indeed kill it and any children.
Normally, we have no problem with our DOS 16-bit (or 32-bit) processes. This one has some file locks in place. It is especially important that we ensure it is dead so that the OS releases the locks.
We close all streams before and after the kill
Prior to calling taskkill, we attempt to flush and close all streams (in, out, err) in an executor. After calling taskkill, we verify that all streams are closed by re-closing them.
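To illustrate that step, here is a rough sketch of closing a single stream on a pool thread with a timeout (closeQuietlyWithTimeout, the executor, and the timeout value are illustrative names, not the actual code); it would be called in turn for the process's stdin, stdout, and stderr:
import java.io.Closeable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the "close with a timeout" step described above.
void closeQuietlyWithTimeout(ExecutorService executor, Closeable stream, long timeoutMs) {
    Future<?> closeTask = executor.submit(() -> {
        try {
            stream.close();
        } catch (Exception e) {
            // ignore: we only care that the close attempt was made
        }
    });
    try {
        closeTask.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException te) {
        closeTask.cancel(true); // corresponds to the "Timeout waiting for 'Close ...'" warnings in the log
    } catch (Exception e) {
        // interrupted or failed: nothing more to do here
    }
}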
We call Thread.interrupt() on all gobblers after the kill
Now, after the kill succeeds, which is confirmed in the OS as well, the gobbler is still running and does not respond to Thread.interrupt().
We even do a last-ditch Thread.stop (gasp!)
Furthermore, we have invoked Thread.stop() on it and it still stays waiting on the read.
So, it seems, we are unable to stop the std-out and std-err gobblers on our Process's streams.
We know Thread.stop() is deprecated. To be somewhat safe, we catch ThreadDeath, clean up any monitors, and then rethrow ThreadDeath. However, ThreadDeath never in fact gets thrown and the thread just keeps on waiting on inputStream.read, so Thread.stop() being deprecated is a moot point in this case: it does not do anything.
Just so no one flames me, and so that I have a clean conscience, we have removed Thread.stop() from our production code.
I am not surprised that the thread does not interrupt, since that only happens on some InputStreams and not all reads are interruptible. But I am surprised that the thread will not stop when Thread.stop() is invoked.
Thread trace shows
A thread trace shows that both main-in and main-err (the gobblers for the two output streams of the process) are still running even after the streams are closed, the thread is interrupted, and the last-ditch Thread.stop() is called.
The task is dead, so why care about idle blocked gobblers?
It is not that we care that the gobblers won't quit, but we hate threads that just pile up and clog the system. This particular process is called by a webserver, and it could amount to several hundred idle threads blocked on dead processes...
We have tried launching the process two ways with no difference ...
run(working, "cmd", "/c", "start", "/B", "/W", "/SEPARATE", "C:\\workspace\\dotest.exe");
run(working, "cmd", "/c", "C:\\workspace\\dotest.exe");
The gobbler is in a read like this:
try (final InputStream is = inputStream instanceof BufferedInputStream
? inputStream : new BufferedInputStream(inputStream, 1024 * 64);
final BufferedReader br = new BufferedReader(new InputStreamReader(is, charset))) {
String line;
while ((line = br.readLine()) != null) {
lineCount++;
lines.add(line);
if (Thread.interrupted()) {
Thread.currentThread().interrupt();
throw new InterruptedException();
}
}
eofFound = true;
}
Our destroyer calls this on the gobbler thread after the taskkill:
int timeLimit = 500;
t.interrupt();
try {
    t.join(timeLimit);
    if (t.isAlive()) {
        t.stop();
        // We know it's deprecated, but we catch ThreadDeath,
        // clean any monitors and then rethrow ThreadDeath.
        // However, ThreadDeath never in fact gets thrown and the thread
        // just keeps on waiting on inputStream.read ...
        logger.warn("Thread stopped because it did not interrupt within {}ms: {}", timeLimit, t);
        if (t.isAlive()) {
            logger.warn("But thread is still alive! {}", t);
        }
    }
} catch (InterruptedException ie) {
    logger.info("Interrupted exception while waiting on join({}) with {}", timeLimit, t, ie);
}
This is a snippet of the log output:
59.841 [main] INFO Destroying process '5952'
04.863 [main] WARN Timeout waiting for 'Close java.io.BufferedInputStream@193932a' to finish
09.865 [main] WARN Timeout waiting for 'Close java.io.FileInputStream@159f197' to finish
09.941 [main] DEBUG Executing [taskkill, /F, /PID, 5952].
10.243 [Thread-1] DEBUG SUCCESS: The process with PID 5952 has been terminated.
10.249 [main] DEBUG java.lang.ProcessImpl@620197 stopped with exit code 0
10.638 [main] INFO Destroyed WindowsProcess(5952) forcefully in 738 ms.
11.188 [main] WARN Thread stop called because it did not interrupt within 500ms: Thread[main-in,5,main]
11.188 [main] WARN But thread is still alive! Thread[main-in,5,main]
11.689 [main] WARN Thread stop because it did not interrupt within 500ms: Thread[main-err,5,main]
11.689 [main] WARN But thread is still alive! Thread[main-err,5,main]
Note: prior to calling taskkill, the Process std-out and std-err will not close, but they are closed manually after the taskkill (not shown in the log because those closes succeed).