I am trying to create a "directory" in Zookeeper like this:
curatorFramework = CuratorFrameworkFactory.newClient(
"ip-111-11-111-1.us-west-2.compute.internal/111.11.111.1:2181",
zkInfo.getSessionTimeoutMs(),
zkInfo.getConnectionTimeoutMs(),
new RetryNTimes(zkInfo.getRetryAttempts(),
zkInfo.getRetryIntervalMs())
);
curatorFramework.start();
byte[] byteArray = new byte[1];
byteArray[0] = (byte) 7;
curatorFramework.create()
.withMode(CreateMode.PERSISTENT)
.withACL(ZooDefs.Ids.OPEN_ACL_UNSAFE)
.forPath("/my_node", byteArray);
Perplexingly, it is giving me a "NoNodeException" on the very node I'm trying to create.
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /my_node
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) ~[stormjar.jar:?]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[stormjar.jar:?]
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783) ~[stormjar.jar:?]
at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1176) ~[stormjar.jar:?]
at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156) ~[stormjar.jar:?]
at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) ~[stormjar.jar:?]
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) ~[stormjar.jar:?]
at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153) ~[stormjar.jar:?]
at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:607) ~[stormjar.jar:?]
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:597) ~[stormjar.jar:?]
at org.apache.curator.framework.imps.CreateBuilderImpl$3.forPath(CreateBuilderImpl.java:362) ~[stormjar.jar:?]
at org.apache.curator.framework.imps.CreateBuilderImpl$3.forPath(CreateBuilderImpl.java:310) ~[stormjar.jar:?]
Note that I am able to connect to Zookeeper:
Socket connection established to ip-111-11-111-1.us-west-2.compute.internal/111.11.111.1:2181, initiating session
Session establishment complete on server ip-111-11-111-1.us-west-2.compute.internal/111.11.111.1:2181, sessionid = 0x100000363b13354, negotiated timeout = 20000
Please note that the Zookeeper server is on a remote machine and the ip ("111.11.111.1") has been changed in this post.
So I tried connecting with zkCli and found that it was my connectString that was the problem:
/opt/zookeeper-3.6.1/bin/zkCli.sh -server ip-111-11-111-1.us-west-2.compute.internal/111.11.111.1:2181
Welcome to ZooKeeper!
[myid:ip-111-11-111-1.us-west-2.compute.internal:2181] - INFO [main-SendThread(ip-111-11-111-1.us-west-2.compute.internal:2181):ClientCnxn$SendThread#1154] - Opening socket connection to server ip-111-11-111-1.us-west-2.compute.internal/111.11.111.1:2181.
[myid:ip-111-11-111-1.us-west-2.compute.internal:2181] - INFO [main-SendThread(ip-111-11-111-1.us-west-2.compute.internal:2181):ClientCnxn$SendThread#986] - Socket connection established, initiating session, client: /111.11.111.2:43736, server: ip-111-11-111-1.us-west-2.compute.internal/111.11.111.1:2181
[myid:ip-111-11-111-1.us-west-2.compute.internal:2181] - INFO [main-SendThread(ip-111-11-111-1.us-west-2.compute.internal:2181):ClientCnxn$SendThread#1420] - Session establishment complete on server ip-111-11-111-1.us-west-2.compute.internal/111.11.111.1:2181, session id = 0x100000363b1d2e8, negotiated timeout = 30000
[zk: ip-111-11-111-1.us-west-2.compute.internal/111.11.111.1:2181(CONNECTED) 0] ls /
Node does not exist: /
[zk: ip-111-11-111-1.us-west-2.compute.internal/111.11.111.1:2181(CONNECTED) 1] create /my_node
Node does not exist: /my_node
As you can see, trying to create a node gives you a NoNode error in zkCli.
It turns out that ip-111-11-111-1.us-west-2.compute.internal/111.11.111.1:2181 was not a correct connect string, but I was confused because zkCli allowed me to connect to it.
Seems related to ACLs, just to be sure, you could manually create it.
Locate your local zk binaries (doesn't need to be on the remote host) and launch the client (zkCli) pointing at your server. Once connected, create the new znode:
bin/zkCli.sh -server 111.11.111.1:2181
[zkshell:x] create /my_node
>>Created /mynode
The shell should output the last sentence in order to guarantee the node has been created. Once done, launch again the Curator process.
Take a look here for more detailed info about the zk client.
Related
I am starting to study how can I implement an application supporting Failover/FaultTolerance on top of JMS, more precisely EMS
I configured two EMS servers working both with FaultTolerance enabled:
For EMS running on server on server1 I have
in tibemsd.conf
ft_active = tcp://server2:7232
in factories.conf
[GenericConnectionFactory]
type = generic
url = tcp://server1:7232
[FTTopicConnectionFactory]
type = topic
url = tcp://server1:7232,tcp://server2:7232
[FTQueueConnectionFactory]
type = queue
url = tcp://server1:7232,tcp://server2:7232
And for EMS running on server on server2 I have
in tibemsd.conf
ft_active = tcp://server1:7232
in factories.conf
[GenericConnectionFactory]
type = generic
url = tcp://server2:7232
[FTTopicConnectionFactory]
type = topic
url = tcp://server2:7232,tcp://server1:7232
[FTQueueConnectionFactory]
type = queue
url = tcp://server2:7232,tcp://server1:7232
I am not a TIBCO EMS expert but my config seems to be good: When I start EMS on server1 I get:
$ tibemsd -config tibemsd.conf
...
2022-07-20 23:04:58.566 Server is active.
2022-07-20 23:05:18.563 Standby server 'SERVERNAME#server1' has connected.
then if I start EMS on server2, I get
$ tibemsd -config tibemsd.conf
...
2022-07-20 23:05:18.564 Accepting connections on tcp://server2:7232.
2022-07-20 23:05:18.564 Server is in standby state for 'tcp://server1:7232'
Moreover, if I kill active EMS on server1, I immediately get the following message on server2:
2022-07-20 23:21:52.891 Connection to active server 'tcp://server1:7232' has been lost.
2022-07-20 23:21:52.891 Server activating on failure of 'tcp://server1:7232'.
...
2022-07-20 23:21:52.924 Server is now active.
Until here, everything looks OK, active/standby EMS servers seems to be correctly configured
Things get more complicated when I write a piece of code how is supposed to connect to these EMS servers and to periodically publish messages. Let's try with the following code sample:
#Test
public void testEmsFailover() throws JMSException, InterruptedException {
int NB = 1000;
TibjmsConnectionFactory factory = new TibjmsConnectionFactory();
factory.setServerUrl("tcp://server1:7232,tcp://server2:7232");
Connection connection = factory.createConnection();
Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
connection.start();
for (int i = 0; i < NB; i++) {
LOG.info("sending message");
Queue queue = session.createQueue(QUEUE__CLIENT_TO_FRONTDOOR__CONNECTION_REQUEST);
MessageProducer producer = session.createProducer(queue);
MapMessage mapMessage = session.createMapMessage();
mapMessage.setStringProperty(PROPERTY__CLIENT_KIND, USER.toString());
mapMessage.setStringProperty(PROPERTY__CLIENT_NAME, "name");
producer.send(mapMessage);
LOG.info("done!");
Thread.sleep(1000);
}
}
If I run this code while both active and standby servers are up, everything looks good
23:26:32.431 [main] INFO JmsEndpointTest - sending message
23:26:32.458 [main] INFO JmsEndpointTest - done!
23:26:33.458 [main] INFO JmsEndpointTest - sending message
23:26:33.482 [main] INFO JmsEndpointTest - done!
Now If I kill the active EMS server, I would expect that
the standby server would instantaneously become the active one
my code would continue to publish such as if nothing had happened
However, in my code I get the following error:
javax.jms.JMSException: Connection is closed
at com.tibco.tibjms.TibjmsxLink.sendRequest(TibjmsxLink.java:307)
at com.tibco.tibjms.TibjmsxLink.sendRequestMsg(TibjmsxLink.java:261)
at com.tibco.tibjms.TibjmsxSessionImp._createProducer(TibjmsxSessionImp.java:1004)
at com.tibco.tibjms.TibjmsxSessionImp.createProducer(TibjmsxSessionImp.java:4854)
at JmsEndpointTest.testEmsFailover(JmsEndpointTest.java:103)
...
and in the logs of the server (the previous standby server supposed to be now the active one) I get
2022-07-20 23:32:44.447 [anonymous#cersei]: connect failed: server not in active state
2022-07-20 23:33:02.969 Connection to active server 'tcp://server2:7232' has been lost.
2022-07-20 23:33:02.969 Server activating on failure of 'tcp://server2:7232'.
2022-07-20 23:33:02.969 Server rereading configuration.
2022-07-20 23:33:02.971 Recovering state, please wait.
2022-07-20 23:33:02.980 Recovered 46 messages.
2022-07-20 23:33:02.980 Server is now active.
2022-07-20 23:33:03.545 [anonymous#cersei]: reconnect failed: connection unknown for id=8
2022-07-20 23:33:04.187 [anonymous#cersei]: reconnect failed: connection unknown for id=8
2022-07-20 23:33:04.855 [anonymous#cersei]: reconnect failed: connection unknown for id=8
2022-07-20 23:33:05.531 [anonymous#cersei]: reconnect failed: connection unknown for id=8
I would appreciate any help to enhance my code
Thank you
I think I found the origin of my problem:
according to the page Tibco-Ems Failover Issue, the error message
reconnect failed: connection unknown for id=8
means: "the store (ems db) was'nt share between the active and the standby node, so when the active ems failed, the new active ems was'nt able to recover connections and messages."
I realized that it is painful to configure a shared store. To avoid it, I configured two tibems on the same host, by following the page Step By Step How to Setup TIBCO EMS In Fault Tolerant Mode:
two tibemsd.conf configuration files
configure a different listen port in each file
configure ft_active with url of other server
configure factories.conf
By doing so, I can replay my test and it works as expected
I'm trying to establish Oracle DB connection from the application which runs on an EC2 instance of AWS. Oracle DB is in on-prem server. Firewall has been opened and I'm able to telnet to the SCAN and VIP hosts of that DB from my EC2 instance. But, still I'm getting the below exception:
Caused by: oracle.net.ns.NetException: The Network Adapter could not establish the connection
at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:392) ~[ojdbc6-11.2.0.3.jar:11.2.0.3.0]
at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:434) ~[ojdbc6-11.2.0.3.jar:11.2.0.3.0]
at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:687) ~[ojdbc6-11.2.0.3.jar:11.2.0.3.0]
at oracle.net.ns.NSProtocol.connect(NSProtocol.java:343) ~[ojdbc6-11.2.0.3.jar:11.2.0.3.0]
at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:1102) ~[ojdbc6-11.2.0.3.jar:11.2.0.3.0]
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:320) ~[ojdbc6-11.2.0.3.jar:11.2.0.3.0]
... 239 common frames omitted
Caused by: java.net.UnknownHostException: <<hostname>>: unknown error
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) ~[na:1.8.0_40-internal]
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:907) ~[na:1.8.0_40-internal]
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1302) ~[na:1.8.0_40-internal]
at java.net.InetAddress.getAllByName0(InetAddress.java:1255) ~[na:1.8.0_40-internal]
at java.net.InetAddress.getAllByName(InetAddress.java:1171) ~[na:1.8.0_40-internal]
at java.net.InetAddress.getAllByName(InetAddress.java:1105) ~[na:1.8.0_40-internal]
at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:117) ~[ojdbc6-11.2.0.3.jar:11.2.0.3.0]
at oracle.net.nt.ConnOption.connect(ConnOption.java:133) ~[ojdbc6-11.2.0.3.jar:11.2.0.3.0]
at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:370) ~[ojdbc6-11.2.0.3.jar:11.2.0.3.0]
JDBC URL:
jdbc:oracle:thin:#(DESCRIPTION =(ADDRESS_LIST =(ADDRESS = (PROTOCOL = TCP)
(HOST = <<IP address of the host>>)(PORT = 1590)))(LOAD_BALANCE = yes)
(CONNECT_DATA =(SERVICE_NAME = <<example.service.com>>)(FAILOVER_MODE =(TYPE = SELECT)
(METHOD = BASIC))))`
Are you using java 8 ? If so then probably it's related to this bug
Java 7 or 9 would probably give you a more useful error message instead of "unknown error" (eg possibly "Name or service not known")
Apart from that, did you try a tnsping from the host you're trying to connect from ?
Also as per the below in the Oracle driver documentation when using a tnsnames entry in the jdbc URL it should be as below, using OCI driver :
Note that you can also specify the database by a TNSNAMES entry. You can find the available TNSNAMES entries listed in the file tnsnames.ora on the client computer from which you are connecting. For example, if you want to connect to the database on host myhost as user scott with password tiger that has a TNSNAMES entry of MyHostString, enter:
Connection conn = DriverManager.getConnection
("jdbc:oracle:oci8:#MyHostString","scott","tiger");
We have a cluster of Hazelcast nodes all running on one remote system (single physical system with many nodes). We would like to connect to this cluster from an external client - a Java application which uses code as below to connect to Hazelcast:
ClientConfig clientConfig = new ClientConfig();
clientConfig.addAddress(config.getHost() + ":" + config.getPort());
client = HazelcastClient.newHazelcastClient(clientConfig);
where, host is the IP of remote and port is 5701.
This still connects to the local host (127.0.0.1). What am I missing?
Edit:
If the java client is the only hazelcast app running on the local system, it fails to connect and throws the exception: java.lang.IllegalStateException: Cannot get initial partitions!
From the logs:
14:58:26.717 [main] INFO c.m.b.p.s.s.HazelcastCacheClient - creating
new Hazelcast instance
14:58:26.748 [main] INFO com.hazelcast.core.LifecycleService -
HazelcastClient[hz.client_0_dev][3.2.1] is STARTING
14:58:27.029 [main] INFO com.hazelcast.core.LifecycleService -
HazelcastClient[hz.client_0_dev][3.2.1] is STARTED
14:58:27.061 [hz.client_0_dev.cluster-listener] INFO
com.hazelcast.core.LifecycleService -
HazelcastClient[hz.client_0_dev][3.2.1] is CLIENT_CONNECTED
14:58:27.061 [hz.client_0_dev.cluster-listener] INFO
c.h.client.spi.ClientClusterService -
Members [5] { Member [127.0.0.1]:5701 Member [127.0.0.1]:5702
Member [127.0.0.1]:5703 Member [127.0.0.1]:5704 Member
[127.0.0.1]:5705 }
14:58:47.278 [main] ERROR c.h.c.spi.ClientPartitionService - Error
while fetching cluster partition table!
com.hazelcast.spi.exception.RetryableIOException:
java.util.concurrent.ExecutionException:
com.hazelcast.core.HazelcastException: java.net.ConnectException:
Connection refused: no further information ... Caused by:
java.util.concurrent.ExecutionException:
com.hazelcast.core.HazelcastException: java.net.ConnectException:
Connection refused: no further information
at java.util.concurrent.FutureTask.report(Unknown Source)
~[na:1.8.0_31]
at java.util.concurrent.FutureTask.get(Unknown Source) ~[na:1.8.0_31]
at
com.hazelcast.client.connection.nio.ClientConnectionManagerImpl.getOrConnect(ClientConnectionManagerImpl.java:282)
~[BRBASE-service-manager-1.0.0-jar-with-dependencies.jar:na]
... 14 common frames omitted
Caused by: com.hazelcast.core.HazelcastException:
java.net.ConnectException: Connection refused: no further information
at com.hazelcast.util.ExceptionUtil.rethrow(ExceptionUtil.java:45)
~[BRBASE-service-manager-1.0.0-jar-with-dependencies.jar:na] ...
To connect to the remote cluster, make sure the cluster uses the external IP and not 127.0.0.1. In our case we have a single physical system, with multiple nodes, with tcp-ip mode enabled. The hazelcast.xml has the configuration:
<tcp-ip enabled="true">
<!-- This should be external IP -->
<interface>172.x.x.x</interface>
</tcp-ip>
Can you try:
ClientConfig config = new ClientConfig();
config.getNetworkConfig().addAddress(host + ":" + port);
HazelcastInstance instance = HazelcastClient.newHazelcastClient(config);
If you want to connect to multiple ip's running Hazelcast as cluster add below to your client config and then instantiate client.
//configure client properties
ClientConfig config = new ClientConfig();
String[] addresses = {"172.20.250.118" + ":" + "5701","172.20.250.49" + ":" + "5701"};
config.getNetworkConfig().addAddress(addresses);
//start Hazelcast client
HazelcastInstance hazelcastInstance = HazelcastClient.newHazelcastClient(config);
I am using mobicents CAP implementation to build a CAP application for CAMEL charging; my application is working fine and I can send and receive messages between SSF and SCF.
What I am looking for is: how to detect it the SCTP link between client and server has been broken? Because when I deliberately stop the server and keep the client running, on sending CAPDialog doesn't raise any error.
When I stop the server, on console I see following exception:
2015-04-14 13:15:29,669 [Thread-0 ] ERROR org.mobicents.protocols.sctp.SelectorThread - Exception while finishing connection for Association=clientAsscoiation
java.net.ConnectException: Connection refused: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at org.mobicents.protocols.sctp.SelectorThread.finishConnectionTcp(SelectorThread.java:407)
at org.mobicents.protocols.sctp.SelectorThread.finishConnection(SelectorThread.java:368)
at org.mobicents.protocols.sctp.SelectorThread.run(SelectorThread.java:151)
at java.lang.Thread.run(Unknown Source)
And during this if I try to send a CAPDialog e.g. send an oAnswer event, it simple doesn't tell if the request was successful or not (which ideally should return a failure in my case)
OAnswerSpecificInfo oAnswerSpecificInfo = this.getCapProvider().getCAPParameterFactory().createOAnswerSpecificInfo(null,
false, false, null, null, null);
ReceivingSideID legID = this.getCapProvider().getCAPParameterFactory().createReceivingSideID(LegType.leg2);
MiscCallInfo miscCallInfo = this.getCapProvider().getINAPParameterFactory().createMiscCallInfo(MiscCallInfoMessageType.notification, null);
EventSpecificInformationBCSM eventSpecificInformationBCSM = this.getCapProvider()
.getCAPParameterFactory()
.createEventSpecificInformationBCSM(oAnswerSpecificInfo);
CAPDialogCircuitSwitchedCall capDialog = (CAPDialogCircuitSwitchedCall) getCapProvider().getCAPDialog(localDialogId);
capDialog.addEventReportBCSMRequest(EventTypeBCSM.oAnswer, eventSpecificInformationBCSM, legID, miscCallInfo, null);
capDialog.setUserObject(getParamAsString("referenceId"));
capDialog.setReturnMessageOnError(true);
capDialog.send();
"capDialog.send();" code does not return back an error even when any underlying SS7 link is down.
If you have access to SCTP stack you can check if SCTP connection is up or down:
Association association = ...;
association.isConnected();
I'm running a standalone Spark cluster on EC2, and I'm writing a application using Spark-Cassandra connector driver and try to submit job to Spark cluster programmatically.
The job itself is simple:
public static void main(String[] args) {
SparkConf conf;
JavaSparkContext sc;
conf = new SparkConf()
.set("spark.cassandra.connection.host", host);
conf.set("spark.driver.host", "[my_public_ip]");
conf.set("spark.driver.port", "15000");
sc = new JavaSparkContext("spark://[spark_master_host]","test",conf);
CassandraJavaRDD<CassandraRow> rdd = javaFunctions(sc).cassandraTable(
"keyspace", "table");
System.out.println(rdd.first().toString());
sc.stop();
}
Which runs fine when I run that in the Spark Master node of my EC2 cluster.
I'm trying to running this in a remote Windows client.
The problem was from these two lines:
conf.set("spark.driver.host", "[my_public_ip]");
conf.set("spark.driver.port", "15000");
First, if i comment out these 2 lines, application would not throw a exception, but the Executor is not running, with following log:
14/12/06 22:40:03 INFO client.AppClient$ClientActor: Executor updated: app-20141207033931-0021/3 is now LOADING
14/12/06 22:40:03 INFO client.AppClient$ClientActor: Executor updated: app-20141207033931-0021/0 is now EXITED (Command exited with code 1)
14/12/06 22:40:03 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141207033931-0021/0 removed: Command exited with code 1
Which never ends, when I check the worker node log, I found:
14/12/06 22:40:21 ERROR security.UserGroupInformation: PriviledgedActionException as:[username] cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
... 7 more
I've no idea what that's about, my guess is that probably worker node could not connect to driver, which probably initially set as:
14/12/06 22:39:30 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#[some_host_name]:52660]
14/12/06 22:39:30 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver#[some_host_name]:52660]
Obviously, no DNS is going to resolve my host name...
Since I can't set deploy mode to "client" or "cluster", if not via ./spark-submit script.(Which I think that's absurd...). I try to add a host resolution "XX.XXX.XXX.XX [host-name]" in /etc/hosts of all Spark Master Worker nodes.
No luck of course...
That leads me to the second, un-comment that two line;
Which gives me:
14/12/06 22:59:41 INFO Remoting: Starting remoting
14/12/06 22:59:41 ERROR Remoting: Remoting error: [Startup failed] [
akka.remote.RemoteTransportException: Startup failed
at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129)
at akka.remote.Remoting.start(Remoting.scala:194)
...
Cause:
Caused by: org.jboss.netty.channel.ChannelException: Failed to bind to: /[my_public_ip]:15000
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:391)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:388)
I double checked my firewall setting and router setting, confirm that my firewall is diabled; and netstat -an to confirm port 15000 is not in use (in fact I tried to change to several available port, no luck); and I ping my public ip from both other machine and machine from my cluster, no problem.
Now I'm utterly screw up, I just run out of idea try to fix this. Any suggestions? Any help is appreciated!
Please check if 15000 is in your security group.