I have set up a replica set on three machines (192.168.122.21, 192.168.122.147 and 192.168.122.148) and I am interacting with the MongoDB cluster using the Java driver:
ArrayList<ServerAddress> addrs = new ArrayList<ServerAddress>();
addrs.add(new ServerAddress("192.168.122.21", 27017));
addrs.add(new ServerAddress("192.168.122.147", 27017));
addrs.add(new ServerAddress("192.168.122.148", 27017));
this.mongoClient = new MongoClient(addrs);
this.db = this.mongoClient.getDB(this.db_name);
this.collection = this.db.getCollection(this.collection_name);
After the connection is established I do multiple inserts of a simple test document:
for (int i = 0; i < this.inserts; i++) {
    try {
        this.collection.insert(new BasicDBObject(String.valueOf(i), "test"));
    } catch (Exception e) {
        System.out.println("Error on inserting element: " + i);
        e.printStackTrace();
    }
}
When simulating a node crash of the master server (power-off), the MongoDB cluster does a successful failover:
19:08:03.907+0100 [rsHealthPoll] replSet info 192.168.122.21:27017 is down (or slow to respond):
19:08:03.907+0100 [rsHealthPoll] replSet member 192.168.122.21:27017 is now in state DOWN
19:08:04.153+0100 [rsMgr] replSet info electSelf 1
19:08:04.154+0100 [rsMgr] replSet couldn't elect self, only received -9999 votes
19:08:05.648+0100 [conn15] replSet info voting yea for 192.168.122.148:27017 (2)
19:08:10.681+0100 [rsMgr] replSet not trying to elect self as responded yea to someone else recently
19:08:10.910+0100 [rsHealthPoll] replset info 192.168.122.21:27017 heartbeat failed, retrying
19:08:16.394+0100 [rsMgr] replSet not trying to elect self as responded yea to someone else recently
19:08:22.912+0100 [rsHealthPoll] replset info 192.168.122.21:27017 heartbeat failed, retrying
19:08:23.623+0100 [SyncSourceFeedbackThread] replset setting syncSourceFeedback to 192.168.122.148:27017
19:08:23.917+0100 [rsHealthPoll] replSet member 192.168.122.148:27017 is now in state PRIMARY
This is also recognized by the MongoDB Driver on the Client Side:
Dec 01, 2014 7:08:16 PM com.mongodb.ConnectionStatus$UpdatableNode update
WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException - message: Read timed out
WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException - message: couldn't connect to [/192.168.122.21:27017] bc:java.net.SocketTimeoutException: connect timed out
Dec 01, 2014 7:08:36 PM com.mongodb.DBTCPConnector setMasterAddress
WARNING: Primary switching from /192.168.122.21:27017 to /192.168.122.148:27017
But it still keeps trying to connect to the old node (forever):
Dec 01, 2014 7:08:50 PM com.mongodb.ConnectionStatus$UpdatableNode update
WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException - message: couldn't connect to [/192.168.122.21:27017] bc:java.net.NoRouteToHostException: No route to host
.....
Dec 01, 2014 7:10:43 PM com.mongodb.ConnectionStatus$UpdatableNode update
WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException -message: couldn't connect to [/192.168.122.21:27017] bc:java.net.NoRouteToHostException: No route to host
The document count in the database stays the same from the moment the primary fails, even after a secondary becomes primary. Here is the output from that node during the process:
"rs0":SECONDARY> db.test_collection.find().count() 12260161
"rs0":PRIMARY> db.test_collection.find().count() 12260161
Update:
With WriteConcern Unacknowledged it works as designed: insert operations are also performed on the new primary, and all operations issued during the election process are lost.
With WriteConcern Acknowledged it seems that the operation waits indefinitely for an ACK from the crashed primary. This could explain why the program continues once the crashed server boots up again and rejoins the cluster as a secondary. But in my case I don't want the driver to wait forever; it should raise an error after a certain time.
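If the goal is for an acknowledged write to fail instead of blocking indefinitely, one thing worth trying is a write concern with a wtimeout passed per insert. A rough sketch against the 2.x driver API used above; the 3000 ms value is illustrative, and whether it actually covers the power-off scenario depends on the driver's failover handling:
// w = 1: wait for acknowledgement from the primary;
// wtimeout = 3000 ms: give up with an error instead of waiting forever.
WriteConcern ackWithTimeout = new WriteConcern(1, 3000);
this.collection.insert(new BasicDBObject(String.valueOf(i), "test"), ackWithTimeout);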
Update:
WriteConcern Acknowledged also works as expected when killing the mongod process on the primary. In this case the failover only takes about 3 seconds. During this time no inserts are done, and after the new primary is elected the insert operations continue.
So I only get the problem when simulating a node failure (power off/network down). In this case the operation hangs until the failed node starts up again.
Does your app still work? Since that server is still in your seed list, the driver will try to connect to it as far as I know. Your app should still work so long as any of the other servers in your seed list can gain primary status.
Explicitly specifying a connection timeout value solved the error. See also: http://api.mongodb.org/java/2.7.0/com/mongodb/MongoOptions.html
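With the MongoClient API used in the question, the corresponding knobs live on MongoClientOptions; a minimal sketch with illustrative timeout values:
MongoClientOptions opts = MongoClientOptions.builder()
    .connectTimeout(5000)     // give up on TCP connect attempts after 5 s
    .socketTimeout(10000)     // give up on blocked socket reads/writes after 10 s
    .build();
this.mongoClient = new MongoClient(addrs, opts);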
Related
I'm working with AWSIotMqttClient to subscribe and publish to topics. After I have received all messages I want to stop my application, but it runs indefinitely. Here is a simple piece of code that runs forever, even without any subscriptions:
SampleUtil.KeyStorePasswordPair keyStorePasswordPair =
        SampleUtil.getKeyStorePasswordPair("certificate.pem", "privateKey.pem");
AWSIotMqttClient client = new AWSIotMqttClient(
        "mqtt-broker.address",
        "deviceID",
        keyStorePasswordPair.keyStore,
        keyStorePasswordPair.keyPassword);
client.connect();
client.disconnect();
Here are my logs:
> Task :Application.main()
Cert file:certificate.pem Private key: privateKey.pem
Nov. 20, 2019 11:56:45 AM com.amazonaws.services.iot.client.core.AwsIotConnection onConnectionSuccess
INFO: Connection successfully established
Nov. 20, 2019 11:56:45 AM com.amazonaws.services.iot.client.core.AbstractAwsIotClient onConnectionSuccess
INFO: Client connection active: 5dcd561596c30e0001e4b5d5
Nov. 20, 2019 11:56:53 AM com.amazonaws.services.iot.client.core.AwsIotConnection onConnectionClosed
INFO: Connection permanently closed
Nov. 20, 2019 11:56:53 AM com.amazonaws.services.iot.client.core.AbstractAwsIotClient onConnectionClosed
INFO: Client connection closed: 5dcd561596c30e0001e4b5d5
I can stop my application with System.exit(0), but that is undesirable. I expect the application to exit on its own once all the code has run.
Found here. Updating aws-iot-device-sdk to version 1.3.4
implementation 'com.amazonaws:aws-iot-device-sdk-java:1.3.4'
implementation 'com.amazonaws:aws-iot-device-sdk-java-samples:1.3.4'
resolved the problem.
Below is a description of a problem we faced in production. Please note that I could not reproduce the issue in a test or local environment and therefore cannot provide you with test code.
We have a Hazelcast cluster with two members, M1 and M2, and three clients, C1, C2 and C3. The Hazelcast version is 3.9.
Clients use the IMap.tryLock() method with a timeout of 10 seconds. After getting the lock, critical and long-running operations are performed, and finally the lock is released using the IMap.unlock() method.
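For reference, the locking pattern described above looks roughly like this; the map name, key and client variable are illustrative placeholders:
IMap<String, String> map = hazelcastClient.getMap("myMap");  // hazelcastClient: the client HazelcastInstance
if (map.tryLock(key, 10, TimeUnit.SECONDS)) {                // may throw InterruptedException
    try {
        // critical, long-running operations on the entry
    } finally {
        map.unlock(key);
    }
}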
The problem that occurred in production is as follows:
At some time instant t, we first saw a heartbeat failure to M2 at client C2. Afterwards there were errors fetching the cluster partition table, caused by com.hazelcast.spi.exception.TargetDisconnectedException:
[hz.client_0.internal-2 ] WARN [] HeartbeatManager - hz.client_0 [mygroup] [3.9] HeartbeatManager failed to connection: .....
[hz.client_0.internal-3 ] WARN [] ClientPartitionService - hz.client_0 [mygroup] [3.9] Error while fetching cluster partition table!
java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.TargetDisconnectedException: Heartbeat timed out to owner connection ClientConnection{alive=true, connectionId=1, ......
Around 250 ms after the initial heartbeat failure, the client gets disconnected and then reconnects within 20 ms.
[hz.client_0.cluster- ] INFO [] LifecycleService - hz.client_0 [mygroup] [3.9] HazelcastClient 3.9 (20171023 - b29f549) is CLIENT_DISCONNETED
[hz.client_0.cluster- ] INFO [] LifecycleService - hz.client_0 [mygroup] [3.9] HazelcastClient 3.9 (20171023 - b29f549) is CLIENT_CONNECTED
The problem we are having is that, for some keys previously acquired by C2, C1 and C3 cannot acquire the lock even though it appears to have been released by C2. C2 can still get the lock, but this introduces unacceptable delays in the application. All clients should be able to acquire the lock once it has been released.
We were notified of the problem after receiving complaints, and then restarted the client application C2.
As documented in http://docs.hazelcast.org/docs/latest-development/manual/html/Distributed_Data_Structures/Lock.html, locks acquired by the restarted member (C2 in our case) appear to be removed after the restart.
Currently the issue seems to go away, but we are not sure if it will recur.
Do you have any suggestions about the probable cause and more importantly do you have any recommendations?
Would enabling redo-operation in client help for this problem case?
As I tried to explain, the client seems to recover from the problem, but the keys remain locked in the cluster, and this is fatal for our application.
Thanks
It looks like the client lost ownership of the lock because of its disconnection from the cluster. You can use the IMap#forceUnlock API in cases such as the one you faced. It releases the lock regardless of the lock owner; it always unlocks successfully, never blocks, and returns immediately.
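A minimal sketch of that escape hatch, run against whichever key remained locked (variable names are illustrative):
IMap<String, String> map = hazelcastInstance.getMap("myMap");  // the same map the clients lock on
if (map.isLocked(stuckKey)) {
    map.forceUnlock(stuckKey);  // releases the lock regardless of the current owner; never blocks
}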
Update
I dug deeper into dcm4che's source code and found that an IncompatibleConnectionException is thrown if either
a connection is "not installed"
or the types of protocols are not set or don't match.
I don't know what it means for a connection to be "installed", but this flag can be set manually, so I set it to true for both the local and remote connections (I even checked with getInstalled() that they are now "installed"; previously this property was null).
As for the protocols, they weren't specified, so I set them to DICOM for both connections.
Results: I still get the same Exception.
I'd like to establish a DICOM association between dcm4chee (2.18.3) and my Java application using the dcm4che (5.12.0) toolkit.
The problem is that there doesn't seem to be any documentation available on how to use dcm4che in a Java application, so all I can do is read dcm4che's source code and try to figure out what its classes and methods are for, but I'm stuck. If someone already has a working example, it would be very helpful.
So far I have:
import org.dcm4che3.net.ApplicationEntity;
import org.dcm4che3.net.Association;
import org.dcm4che3.net.Connection;
import org.dcm4che3.net.Device;
import org.dcm4che3.net.pdu.AAssociateRQ;
import org.dcm4che3.net.pdu.PresentationContext;
...
ApplicationEntity locAE = new ApplicationEntity();
locAE.setAETitle("THIS_JAVA_APP");
Connection localConn = new Connection();
localConn.setCommonName("loc_conn");
localConn.setHostname("localhost");
localConn.setPort(11112);
localConn.setProtocol(Connection.Protocol.DICOM);
localConn.setInstalled(true);
locAE.addConnection(localConn);
ApplicationEntity remAE = new ApplicationEntity();
remAE.setAETitle("DCM4CHEE");
Connection remoteConn = new Connection();
remoteConn.setCommonName("rem_conn");
remoteConn.setHostname("localhost");
remoteConn.setPort(11112);
remoteConn.setProtocol(Connection.Protocol.DICOM);
remoteConn.setInstalled(true);
remAE.addConnection(remoteConn);
AAssociateRQ assocReq = new AAssociateRQ();
assocReq.setCalledAET(remAE.getAETitle());
assocReq.setCallingAET(locAE.getAETitle());
assocReq.setApplicationContext("1.2.840.10008.3.1.1.1");
assocReq.setImplClassUID("1.2.40.0.13.1.3");
assocReq.setImplVersionName("dcm4che-5.12.0");
assocReq.setMaxPDULength(16384);
assocReq.setMaxOpsInvoked(0);
assocReq.setMaxOpsPerformed(0);
assocReq.addPresentationContext(new PresentationContext(
1, "1.2.840.10008.1.1", "1.2.840.10008.1.2"));
Device device = new Device("device");
device.addConnection(localConn);
device.addApplicationEntity(locAE);
Association assoc = locAE.connect(remAE, assocReq);
but I don't know whether I'm on the right path doing it.
The error I get:
org.dcm4che3.net.IncompatibleConnectionException: No compatible connection to DCM4CHEE available on THIS_JAVA_APP
at org.dcm4che3.net.ApplicationEntity.findCompatibelConnection(ApplicationEntity.java:646)
at org.dcm4che3.net.ApplicationEntity.connect(ApplicationEntity.java:651)
Could it be that you are missing a Device instance in your setup? It seems that you need a Device to which you attach both the ApplicationEntity and the Connection.
Looking at the FindSCU.java source from the dcm4che sources:
private final Device device = new Device("findscu");
private final ApplicationEntity ae = new ApplicationEntity("FINDSCU");
private final Connection conn = new Connection();
public FindSCU() throws IOException {
    device.addConnection(conn);
    device.addApplicationEntity(ae);
    ae.addConnection(conn);
}
I also think that maybe the local Connection object can be instantiated without any parameters, as the FindSCU example demonstrates. Maybe the parameters are confusing it somehow, especially considering that you have both the local and remote connections pointing to localhost:11112.
But yes, one has to agree that the documentation for the dcm4che3 API is totally inadequate.
Here is the working code: (I don't know if it's the minimal solution, feel free to experiment with it...)
ApplicationEntity locAE = new ApplicationEntity();
locAE.setAETitle("THIS_JAVA_APP");
locAE.setInstalled(true);
Connection localConn = new Connection();
localConn.setCommonName("loc_conn");
localConn.setHostname("localhost");
localConn.setPort(11112);
localConn.setProtocol(Connection.Protocol.DICOM);
localConn.setInstalled(true);
locAE.addConnection(localConn);
ApplicationEntity remAE = new ApplicationEntity();
remAE.setAETitle("DCM4CHEE");
remAE.setInstalled(true);
Connection remoteConn = new Connection();
remoteConn.setCommonName("rem_conn");
remoteConn.setHostname("localhost");
remoteConn.setPort(11112);
remoteConn.setProtocol(Connection.Protocol.DICOM);
remoteConn.setInstalled(true);
remAE.addConnection(remoteConn);
AAssociateRQ assocReq = new AAssociateRQ();
assocReq.setCalledAET(remAE.getAETitle());
assocReq.setCallingAET(locAE.getAETitle());
assocReq.setApplicationContext("1.2.840.10008.3.1.1.1");
assocReq.setImplClassUID("1.2.40.0.13.1.3");
assocReq.setImplVersionName("dcm4che-5.12.0");
assocReq.setMaxPDULength(16384);
assocReq.setMaxOpsInvoked(0);
assocReq.setMaxOpsPerformed(0);
assocReq.addPresentationContext(new PresentationContext(
1, "1.2.840.10008.1.1", "1.2.840.10008.1.2"));
Device device = new Device("device");
device.addConnection(localConn);
device.addApplicationEntity(locAE);
Executor exec = (Runnable command) -> {};
device.setExecutor(exec);
Association assoc = locAE.connect(localConn, remoteConn, assocReq);
And the relevant dcm4chee log:
2018-03-02 23:21:42,832 INFO THIS_JAVA_APP->DCM4CHEE (TCPServer-1) [org.dcm4cheri.net.FsmImpl] received AAssociateRQ
appCtxName: 1.2.840.10008.3.1.1.1/DICOM Application Context Name
implClass: 1.2.40.0.13.1.3
implVersion: dcm4che-5.12.0
calledAET: DCM4CHEE
callingAET: THIS_JAVA_APP
maxPDULen: 16378
asyncOpsWindow:
pc-1: as=1.2.840.10008.1.1/Verification SOP Class
ts=1.2.840.10008.1.2/Implicit VR Little Endian
2018-03-02 23:21:42,843 INFO THIS_JAVA_APP->DCM4CHEE (TCPServer-1) [org.dcm4cheri.net.FsmImpl] sending AAssociateAC
appCtxName: 1.2.840.10008.3.1.1.1/DICOM Application Context Name
implClass: 1.2.40.0.13.1.1.1
implVersion: dcm4che-1.4.34
calledAET: DCM4CHEE
callingAET: THIS_JAVA_APP
maxPDULen: 16352
asyncOpsWindow:
pc-1: 0 - acceptance
ts=1.2.840.10008.1.2/Implicit VR Little Endian
After you have the association, see this other post for how to perform a C-FIND.
Edit
Apparently, I solved the problem. Changing the executor from
Executor exec = (Runnable command) -> {};
device.setExecutor(exec);
to
ExecutorService executorService = Executors.newSingleThreadExecutor();
ScheduledExecutorService scheduledExecutorService = Executors.newSingleThreadScheduledExecutor();
device.setExecutor(executorService);
device.setScheduledExecutor(scheduledExecutorService);
made it so my application correctly received the association response from the server. This might serve as reference for someone else.
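With the executors in place, and since the only presentation context negotiated in the code above is the Verification SOP Class, a quick way to exercise the association is a C-ECHO followed by a release. A minimal sketch, assuming Association#cecho() and Association#release() behave as they do in dcm4che's bundled echoscu tool (requires import org.dcm4che3.net.DimseRSP):
DimseRSP rsp = assoc.cecho();   // send a C-ECHO-RQ on the Verification SOP Class
rsp.next();                     // block until the C-ECHO-RSP arrives
assoc.release();                // release the association gracefully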
Thank you for sharing your code. It was really helpful to me.
Original Post
I am unable to establish the connection with code similar to the solution you proposed. I am trying to request an association with a dcm4chee-arc-light server using dcm4che (both 5.14.1), and my code is as follows:
Device device = new Device(deviceName);
ApplicationEntity locAE = new ApplicationEntity(localAE);
Connection conn = new Connection();
Connection remote = new Connection();
AAssociateRQ rq = new AAssociateRQ();
device.addConnection(conn);
device.addApplicationEntity(locAE);
locAE.addConnection(conn);
ApplicationEntity remAE = new ApplicationEntity();
remAE.setAETitle(remoteAE);
remote.setCommonName("rem_conn");
remote.setHostname(remoteIP);
remote.setPort(remotePort);
remote.setProtocol(Connection.Protocol.DICOM);
remAE.addConnection(remote);
rq.setCalledAET(remAE.getAETitle());
rq.setCallingAET(locAE.getAETitle());
rq.setApplicationContext("1.2.840.10008.3.1.1.1");
rq.setImplClassUID("1.2.40.0.13.1.3");
rq.setImplVersionName("dcm4che-5.14.1");
rq.setMaxPDULength(16384);
rq.setMaxOpsInvoked(0);
rq.setMaxOpsPerformed(0);
rq.addPresentationContext(new PresentationContext(
1, "1.2.840.10008.5.1.4.1.2.2.1", "1.2.840.10008.1.2"));
Executor exec = (Runnable command) -> {};
device.setExecutor(exec);
//Opens association and connects to remote server
Association as = locAE.connect(conn, remote, rq);
But when trying to connect to the remote AET, my application doesn't seem to receive the A-ASSOCIATE response from the remote AET. My Java application hangs in Sta5 (waiting for the association response) while the server sits in Sta6 (ready for data transfer).
Java log:
[main] INFO org.dcm4che3.net.Connection - Initiate connection from 0.0.0.0/0.0.0.0:0 to localhost:11112
[main] INFO org.dcm4che3.net.Connection - Established connection Socket[addr=localhost/127.0.0.1,port=11112,localport=50101]
[main] DEBUG org.dcm4che3.net.Association - /127.0.0.1:50101>localhost/127.0.0.1:11112(1): enter state: Sta4 - Awaiting transport connection opening to complete
[main] INFO org.dcm4che3.net.Association - DEVICEAE->DCMQRSCP(1) << A-ASSOCIATE-RQ
[main] DEBUG org.dcm4che3.net.Association - A-ASSOCIATE-RQ[
calledAET: DCMQRSCP
callingAET: DEVICEAE
applicationContext: 1.2.840.10008.3.1.1.1 - DICOM Application Context Name
implClassUID: 1.2.40.0.13.1.3
implVersionName: dcm4che-5.14.1
maxPDULength: 16378
maxOpsInvoked/maxOpsPerformed: 1/1
PresentationContext[id: 1
as: 1.2.840.10008.5.1.4.1.2.2.1 - Study Root Query/Retrieve Information Model - FIND
ts: 1.2.840.10008.1.2 - Implicit VR Little Endian
]
]
[main] DEBUG org.dcm4che3.net.Association - DEVICEAE->DCMQRSCP(1): enter state: Sta5 - Awaiting A-ASSOCIATE-AC or A-ASSOCIATE-RJ PDU
Server log:
19:11:29,397 INFO - Accept connection Socket[addr=/127.0.0.1,port=50101,localport=11112]
19:11:29,397 DEBUG - /127.0.0.1:11112<-/127.0.0.1:50101(3): enter state: Sta2 - Transport connection open
19:11:29,416 INFO - DCMQRSCP<-DEVICEAE(3) >> A-ASSOCIATE-RQ
19:11:29,416 DEBUG - A-ASSOCIATE-RQ[
calledAET: DCMQRSCP
callingAET: DEVICEAE
applicationContext: 1.2.840.10008.3.1.1.1 - DICOM Application Context Name
implClassUID: 1.2.40.0.13.1.3
implVersionName: dcm4che-5.14.1
maxPDULength: 16378
maxOpsInvoked/maxOpsPerformed: 1/1
PresentationContext[id: 1
as: 1.2.840.10008.5.1.4.1.2.2.1 - Study Root Query/Retrieve Information Model - FIND
ts: 1.2.840.10008.1.2 - Implicit VR Little Endian
]
]
19:11:29,419 DEBUG - DCMQRSCP<-DEVICEAE(3): enter state: Sta3 - Awaiting local A-ASSOCIATE response primitive
19:11:29,419 INFO - DCMQRSCP<-DEVICEAE(3) << A-ASSOCIATE-AC
19:11:29,419 DEBUG - A-ASSOCIATE-AC[
calledAET: DCMQRSCP
callingAET: DEVICEAE
applicationContext: 1.2.840.10008.3.1.1.1 - DICOM Application Context Name
implClassUID: 1.2.40.0.13.1.3
implVersionName: dcm4che-5.14.1
maxPDULength: 16378
maxOpsInvoked/maxOpsPerformed: 1/1
PresentationContext[id: 1
result: 0 - acceptance
ts: 1.2.840.10008.1.2 - Implicit VR Little Endian
]
]
19:11:29,427 DEBUG - DCMQRSCP<-DEVICEAE(3): enter state: Sta6 - Association established and ready for data transfer
I feel like I am missing something, but I cannot find the source of the problem. Any help is appreciated, as I am still new to dcm4che and DICOM protocol.
Thank you.
I am trying Tranquility with Druid 0.11 and Kafka. When Tranquility receives new data it throws the following exception:
2018-01-12 18:27:34,010 [Curator-ServiceCache-0] INFO c.m.c.s.net.finagle.DiscoResolver - Updating instances for service[firehose:druid:overlord:flow-018-0000-0000] to Set(ServiceInstance{name='firehose:druid:overlord:flow-018-0000-0000', id='ea85b248-0c53-4ec1-94a6-517525f72e31', address='druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local', port=8100, sslPort=-1, payload=null, registrationTimeUTC=1515781653895, serviceType=DYNAMIC, uriSpec=null})
Jan 12, 2018 6:27:37 PM com.twitter.finagle.netty3.channel.ChannelStatsHandler exceptionCaught
WARNING: ChannelStatsHandler caught an exception
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:101)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:622)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:108)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779)
at org.jboss.netty.channel.SimpleChannelHandler.connectRequested(SimpleChannelHandler.java:306)
The worker was created by the MiddleManager:
2018-01-12T18:27:25,704 INFO [WorkerTaskMonitor] io.druid.indexing.worker.WorkerTaskMonitor - Submitting runnable for task[index_realtime_flow_2018-01-12T18:00:00.000Z_0_0]
2018-01-12T18:27:25,719 INFO [WorkerTaskMonitor] io.druid.indexing.worker.WorkerTaskMonitor - Affirmative. Running task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0]
And Tranquility seems to talk to the overlord fine, judging by the following logs:
2018-01-12T18:27:25,268 INFO [qtp271944754-62] io.druid.indexing.overlord.TaskLockbox - Adding task[index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] to activeTasks
2018-01-12T18:27:25,272 INFO [TaskQueue-Manager] io.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_realtime_flow_2018-01-12T18:00:00.000Z_0_0
2018-01-12T18:27:25,272 INFO [TaskQueue-Manager] io.druid.indexing.overlord.RemoteTaskRunner - Added pending task index_realtime_flow_2018-01-12T18:00:00.000Z_0_0
2018-01-12T18:27:25,279 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - No worker selection strategy set. Using default of [EqualDistributionWorkerSelectStrategy]
2018-01-12T18:27:25,294 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - Coordinator asking Worker[druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local:8091] to add task[index_realtime_flow_2018-01-12T18:00:00.000Z_0_0]
2018-01-12T18:27:25,334 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.RemoteTaskRunner - Task index_realtime_flow_2018-01-12T18:00:00.000Z_0_0 switched from pending to running (on [druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local:8091])
2018-01-12T18:27:25,336 INFO [rtr-pending-tasks-runner-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] status changed to [RUNNING].
2018-01-12T18:27:25,747 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local:8091] wrote RUNNING status for task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] on [TaskLocation{host='null', port=-1, tlsPort=-1}]
2018-01-12T18:27:25,829 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.RemoteTaskRunner - Worker[druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local:8091] wrote RUNNING status for task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] on [TaskLocation{host='druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local', port=8100, tlsPort=-1}]
2018-01-12T18:27:25,829 INFO [Curator-PathChildrenCache-1] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_flow_2018-01-12T18:00:00.000Z_0_0] location changed to [TaskLocation{host='druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local', port=8100, tlsPort=-1}].
What's wrong? I tried a thousand things and nothing solves it ...
Thanks a lot
UnresolvedAddressException being hit by Druid broker
You have to have all the Druid cluster information set on the servers running Tranquility.
This is because you only get the DNS names of your Druid cluster from ZooKeeper, not the IPs.
For example, on a Linux server, save your cluster information in /etc/hosts.
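A sketch of such an /etc/hosts entry on the machine running Tranquility, mapping the pod hostname from the logs above to a reachable IP (the 10.0.0.12 address is purely illustrative):
# /etc/hosts on the Tranquility host; replace 10.0.0.12 with the real address
10.0.0.12  druid-md-deployment-7877777bf7-tmmvh.druid-md-hs.default.svc.cluster.local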
I found out that when I attach a debugger to the application and start debugging, the connection to the Terracotta server is lost (?), and the following messages appear in the Terracotta server logs:
2012-03-30 13:45:06,758 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1
2012-03-30 13:45:27,761 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1
2012-03-30 13:45:31,761 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 2
...
2012-03-30 13:46:37,768 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 10. But its too long. No more retries
2012-03-30 13:46:38,768 [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 is DEAD
2012-03-30 13:46:38,768 [HealthChecker] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: DSO Server - Declared connection dead ConnectionID(1.0b1994ac80f14b7191080bdc3f38582a) idle time 45317ms
2012-03-30 13:46:38,768 [L2_L1:TCWorkerComm # 0_R] WARN com.tc.net.protocol.transport.ServerMessageTransport - ConnectionID(1.0b1994ac80f14b7191080bdc3f38582a): CLOSE EVENT : com.tc.net.core.TCConnectionJDK14#5158277: connected: false, closed: true local=127.0.0.1:9510 remote=127.0.0.1:55112 connect=[Fri Mar 30 13:34:22 BST 2012] idle=2001ms [207584 read, 229735 write]. STATUS : DISCONNECTED
...
2012-03-30 13:46:38,799 [L2_L1:TCWorkerComm # 0_R] INFO com.tc.objectserver.persistence.sleepycat.SleepycatPersistor - Deleted client state for ChannelID=[1]
2012-03-30 13:46:38,801 [WorkerThread(channel_life_cycle_stage, 0)] INFO com.tc.objectserver.handler.ChannelLifeCycleHandler - : Received transport disconnect. Shutting down client ClientID[1]
2012-03-30 13:46:38,801 [WorkerThread(channel_life_cycle_stage, 0)] INFO com.tc.objectserver.persistence.impl.TransactionStoreImpl - shutdownClient() : Removing txns from DB : 0
After this happens, any cache operation, like getWithLoader, simply does not respond until the Terracotta server is restarted.
Question: how can this be fixed or reconfigured? I assume it can also happen in production (and actually sometimes does) if, for any reason, the application hangs or stalls.
This is just to get you started.
TC connections between server and client are considered dead when the applicable HealthCheck fails. The default values for the HealthCheck assume a very stable and performant network. I recommend you familiarize yourself with the details and the calculations at
http://www.terracotta.org/documentation/3.5.2/terracotta-server-array/high-availability#85916
So typically you begin with:
a) making sure your network doesn't hiccup occasionally
b) setting the TC HealthCheck values a bit higher, as sketched below
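For (b), the HealthCheck settings are tc.properties entries that can also be declared in tc-config.xml. A rough sketch with illustrative values only; verify the property names and how idletime, interval and probes combine into the overall timeout against the documentation linked above before relying on them:
<tc-properties>
  <!-- Illustrative values; see the linked HealthCheck docs for the exact timeout arithmetic. -->
  <property name="l2.healthcheck.l1.ping.idletime" value="15000"/>
  <property name="l2.healthcheck.l1.ping.interval" value="5000"/>
  <property name="l2.healthcheck.l1.ping.probes" value="6"/>
</tc-properties>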
If the problem persists, I'd recommend posting directly on the TC forums (they'll help you even if you only use the open-source edition; it may take a few days to get a reply, though).