I see that Akka Actor Systems can be distributed across multiple JVMs that might not even be running on the same piece of hardware. If I understand this correctly, it seems you could have a distributed actor system where one group of actors runs on myapp01, another group runs on myapp02 (say, 2 vSphere VMs in your local data center), and a third group runs on AWS. So first, if anything about what I just said isn't true/accurate, please start by correcting me!
If everything I've stated up until this point is more or less accurate, then I'm wondering how to actually "glue" all these distributed actors "groups" (not sure what the right term is: JVM, Actor System, Actor Pool, Actor Cluster, etc.) together such that work can be farmed out to any of them, and a FizzActor living on the AWS node can then send a message to a BuzzActor living on myapp02, etc.
For instance, sticking with the example above (2 vSphere VMs and an AWS machine) how could I deploy an actor group/system/pool/cluster to each of these such that they all know about each other and distribute the work between them?
My guess is that Akka allows you to configure the hosts/ports of all the different "nodes" in the Actor System (see the configuration sketch after this question);
My next guess is that this configuration is limited in the sense that you have to update each node's configuration every time you add/remove/modify another node in the cluster (otherwise how could the Akka nodes "know" about a new one, or "know" that we just shut down the AWS machine?);
My final guess is that this limitation can be avoided by bringing something like Apache ZooKeeper into the mix, somehow treating each node as a separate peer in the distributed system, and then using ZooKeeper to coordinate/connect/link/load balance between all the peers/nodes.
Am I on track or way off base?
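To make the first guess concrete, here is roughly the per-node setup I have in mind, sketched from the Akka Cluster documentation (the system name, host names, and ports are hypothetical, and this assumes classic remoting on the Akka 2.3/2.4 line):

    import akka.actor.ActorSystem;
    import com.typesafe.config.Config;
    import com.typesafe.config.ConfigFactory;

    public class NodeMain {
        public static void main(String[] args) {
            // Hypothetical: myapp01/myapp02 act as well-known "seed nodes"; a new
            // node (e.g. the AWS machine) only needs to know the seeds, not every peer.
            Config config = ConfigFactory.parseString(
                "akka.actor.provider = \"akka.cluster.ClusterActorRefProvider\"\n"
              + "akka.remote.netty.tcp.hostname = \"" + args[0] + "\"\n"
              + "akka.remote.netty.tcp.port = " + args[1] + "\n"
              + "akka.cluster.seed-nodes = [\n"
              + "  \"akka.tcp://MySystem@myapp01:2551\",\n"
              + "  \"akka.tcp://MySystem@myapp02:2551\"]\n")
                .withFallback(ConfigFactory.load());

            // Creating the system joins this node to the cluster via the seed nodes;
            // membership from then on is maintained by gossip rather than by hand.
            ActorSystem.create("MySystem", config);
        }
    }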
Related
I’m working on an application that often queries a very large number of actors and hence sends/receives a very large number of messages. When the application is run on a single machine this is not an issue, because the messages are sent within the boundaries of a single JVM, which is quite fast. However, when I run the application on multiple nodes (using akka-cluster), each node hosts part of these actors and the messages go over the network, which becomes extremely slow.
One solution that I came up with is to have a ManagerActor on each node where the application is run. This would greatly minimize the number of messages exchanged (i.e. instead of sending thousands of messages to each of the actors, if we run the application on 3 nodes we send 3 messages, one for each ManagerActor, which then sends messages within its own JVM to the other (thousands of) actors, which is very fast). However, I’m fairly new to Akka and I’m not quite sure such a solution makes sense. Do you see any drawbacks to it? Any other options which are better / more native to Akka?
You could use Akka's Distributed Publish-Subscribe to achieve that. That way you simply start a manager actor on each node the usual way, have them subscribe to a topic, and then publish messages to them using that topic. There is a simple example of this in the docs linked above.
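For illustration, a minimal sketch of such a manager, assuming Akka 2.5's Java actor API and the DistributedPubSub extension from akka-cluster-tools (the topic name and the WorkBatch message type are made up for this example):

    import akka.actor.AbstractActor;
    import akka.actor.ActorRef;
    import akka.cluster.pubsub.DistributedPubSub;
    import akka.cluster.pubsub.DistributedPubSubMediator;
    import java.io.Serializable;
    import java.util.List;

    public class ManagerActor extends AbstractActor {

        // Hypothetical message type carrying work for one node's local actors.
        public static class WorkBatch implements Serializable {
            public final List<String> items;
            public WorkBatch(List<String> items) { this.items = items; }
        }

        private final ActorRef mediator =
            DistributedPubSub.get(getContext().getSystem()).mediator();

        public ManagerActor() {
            // Subscribe this node's manager to a shared topic on startup.
            mediator.tell(new DistributedPubSubMediator.Subscribe("work", getSelf()), getSelf());
        }

        @Override
        public Receive createReceive() {
            return receiveBuilder()
                .match(DistributedPubSubMediator.SubscribeAck.class, ack -> { /* subscribed */ })
                // Fan the batch out to the thousands of local actors here, so only
                // one message per node ever crosses the network.
                .match(WorkBatch.class, batch -> { /* local fan-out */ })
                .build();
        }
    }

Publishing one message to every node's manager is then a single call: mediator.tell(new DistributedPubSubMediator.Publish("work", new ManagerActor.WorkBatch(items)), getSelf()).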
I am looking for a Java solution besides BigMemory and Hazelcast. Since we are using Hadoop/Spark, we should have access to ZooKeeper.
So I just want to know whether there is a solution satisfying our needs, or whether we need to build something ourselves.
What I need are reliable objects that are in-memory, replicated and synchronized. For manipulation I would like lock support and atomic actions spanning an object.
I also need object references and List/Set/Map support.
The rest we can build ourselves.
The idea is simply to have a self-organizing network that configures itself based on its environment, and that is best done with replicated, synchronized objects that one can listen to.
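Since we already have ZooKeeper, I know at least the lock part is covered by Apache Curator's recipes; a minimal sketch of the kind of lock support I mean (the connect string and lock path are placeholders):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.locks.InterProcessMutex;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class DistributedLockSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connect string for the ZooKeeper ensemble.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            // One mutex per shared object: acquire before mutating, release after.
            InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-object");
            lock.acquire();
            try {
                // atomic manipulation of the replicated object would go here
            } finally {
                lock.release();
            }
            client.close();
        }
    }

The replicated in-memory List/Set/Map part is what I have not found a recipe for.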
Hazelcast has a split-brain detector in place. When a split-brain happens, Hazelcast will continue to accept updates, and when the cluster is merged back it gives you the ability to merge the updates the way you prefer.
We are implementing a cluster quorum feature, which will hopefully be available in the next minor version (3.5). With cluster quorum you can define a minimum threshold, or a custom function of your own, to decide whether the cluster should continue to operate or not in a partitioned network.
For example, if you define a quorum size of 3 and there are fewer than 3 members in the cluster, the cluster will stop operating.
Currently Hazelcast behaves like an AP solution, but when cluster quorum is available you will be able to tune Hazelcast to behave like a CP solution.
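A sketch of how the quorum configuration is expected to look (the feature is not released yet, so treat these API names as indicative of the plan rather than final):

    import com.hazelcast.config.Config;
    import com.hazelcast.config.MapConfig;
    import com.hazelcast.config.QuorumConfig;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class QuorumSketch {
        public static void main(String[] args) {
            // Quorum rule: refuse to operate with fewer than 3 live members.
            QuorumConfig quorumConfig = new QuorumConfig("atLeastThree", true, 3);

            // Guard a particular map with that quorum rule.
            MapConfig mapConfig = new MapConfig("important-data");
            mapConfig.setQuorumName("atLeastThree");

            Config config = new Config();
            config.addQuorumConfig(quorumConfig);
            config.addMapConfig(mapConfig);

            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
            // Operations on hz.getMap("important-data") would then fail on the
            // minority side of a partition instead of silently diverging.
        }
    }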
I am new to Akka and want to achieve the following:
I want to deploy a few stateful actors on fixed machines (which will always be on) and stateless actors (processing actors/workers) on Amazon EC2 Spot Instances.
To handle failover of the stateful actors, I am planning to use Akka Persistence.
To distribute jobs to the stateless workers, I am planning to use a RoundRobinPool with remotely deployed routees, and I want messages to be passed to the least utilized machine (by CPU & memory). I am using a Pool so that I can use withSupervisorStrategy() to handle actor failure.
I am going through the example for remotely deployed routees and referring to this code: http://www.typesafe.com/activator/template/akka-sample-cluster-java and https://github.com/akka/akka/blob/cb05725c1ec8a09e9bfd57dd093911dd41c7b288/akka-samples/akka-sample-cluster-java/src/main/java/sample/cluster/stats/StatsSampleOneMasterMain.java.
In StatsSampleClient it picks a node at random and sends the message there. I want to send it to the least utilized machine, as mentioned above. I want to know whether Akka supports this, or whether I will have to write code to find out the utilization and send the message to that machine accordingly.
Kindly suggest whether a better approach can be used for what I have mentioned above.
Thanks!
-Devendra
Did you have a look at the Adaptive Load Balancing Router?
It performs load balancing of messages to cluster nodes based on cluster metrics data, and which metric it balances on is chosen through configuration.
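For example, a cluster-aware pool using that router can also be created programmatically; a trimmed sketch based on the samples in the Akka docs (this assumes the akka-cluster-metrics module from Akka 2.4+ together with the 2.5 Java actor API, and the Worker class and "compute" role are invented for the example):

    import akka.actor.AbstractActor;
    import akka.actor.ActorRef;
    import akka.actor.Props;
    import akka.cluster.metrics.AdaptiveLoadBalancingPool;
    import akka.cluster.metrics.SystemLoadAverageMetricsSelector;
    import akka.cluster.routing.ClusterRouterPool;
    import akka.cluster.routing.ClusterRouterPoolSettings;

    public class Master extends AbstractActor {

        // Hypothetical worker; in your case this would be a processing actor.
        public static class Worker extends AbstractActor {
            @Override
            public Receive createReceive() {
                return receiveBuilder().matchAny(msg -> { /* do the work */ }).build();
            }
        }

        // Up to 100 routees, max 3 per node, deployed only on "compute"-role nodes.
        // The metrics selector routes each message to the node with the lowest
        // system load average instead of round-robin.
        private final ActorRef workerRouter = getContext().actorOf(
            new ClusterRouterPool(
                new AdaptiveLoadBalancingPool(
                    SystemLoadAverageMetricsSelector.getInstance(), 0),
                new ClusterRouterPoolSettings(100, 3, false, "compute"))
                .props(Props.create(Worker.class)),
            "workerRouter");

        @Override
        public Receive createReceive() {
            return receiveBuilder()
                .matchAny(job -> workerRouter.forward(job, getContext()))
                .build();
        }
    }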
Hope it helps.
I am new to Hadoop/ZooKeeper. I cannot understand the purpose of using ZooKeeper with Hadoop. Is ZooKeeper writing data in Hadoop? If not, then why do we use ZooKeeper with Hadoop?
Hadoop 1.x does not use ZooKeeper. HBase does use ZooKeeper even in Hadoop 1.x installations.
Hadoop adopted Zookeeper as well starting with version 2.0.
The purpose of Zookeeper is cluster management. This fits with the general philosophy of *nix of using smaller specialized components - so components of Hadoop that want clustering capabilities rely on Zookeeper for that rather than develop their own.
Zookeeper is a distributed storage that provides the following guarantees (copied from the Zookeeper overview page):

- Sequential Consistency - Updates from a client will be applied in the order that they were sent.
- Atomicity - Updates either succeed or fail. No partial results.
- Single System Image - A client will see the same view of the service regardless of the server that it connects to.
- Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
- Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.
You can use these to implement different "recipes" that are required for cluster management like locks, leader election etc.
If you're going to use ZooKeeper yourself, I recommend you take a look at Curator from Netflix, which makes it easier to use (e.g. they implement a few recipes out of the box).
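For instance, the leader-election recipe comes down to a few lines with Curator (the connect string and path here are arbitrary):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class LeaderElectionSketch {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",           // placeholder ensemble
                new ExponentialBackoffRetry(1000, 3));
            client.start();

            // Every participant opens a latch on the same path; ZooKeeper's
            // guarantees ensure exactly one of them holds leadership at a time.
            LeaderLatch latch = new LeaderLatch(client, "/my-app/leader");
            latch.start();
            latch.await();                               // blocks until elected
            System.out.println("This process is now the leader");

            latch.close();
            client.close();
        }
    }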
Zookeeper solves the problem of reliable distributed coordination, and Hadoop is a distributed system, right?
There's an excellent paper on the Paxos algorithm that you can read on this subject.
From zookeeper documentation page:
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.
Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
From hadoop documentation page:
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
Regarding your query:
Why do we need ZooKeeper in Hadoop Stack?
The binding factor is distributed processing and high availability.
e.g. the Hadoop NameNode failover process.
Hadoop high availability is designed around an active NameNode and a standby NameNode for the failover process. At any point in time, you should not have two masters (two active NameNodes).
From Apache documentation link on HDFSHighAvailabilityWithQJM:
It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called “split-brain scenario,” the JournalNodes will only ever allow a single NameNode to be a writer at a time.
During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active to safely proceed with failover.
ZooKeeper is used to avoid the split-brain scenario. You can find the role of ZooKeeper in the question below:
How does Hadoop Namenode failover process works?
I need to simulate a system in Java where there is a master and a number of workers. Each worker may process its data locally but needs to communicate with the master to read data from other nodes. The workers should run concurrently.
How can I simulate this system? Do I need to start a new thread for every running worker and a master thread? Is there another way?
If you want to do it on a single machine then I see two options:
Create a master and a worker application (make sure that you can run multiple instances of those). Run one master application and multiple instances of the worker application.
Create a single application in which you have a single instance of your Master class and multiple instances of your Worker class. Let the Master run in a separate thread and let each Worker run in its own thread too.
So the first option is to run each "node" (master or worker) as a separate process, while the second option is to run each "node" as a separate thread.
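A minimal sketch of the second option, with the master and workers talking over in-memory queues (all class and message names are invented):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class Simulation {

        // A worker's request to the master, with a private queue for the reply.
        static class Request {
            final int workerId;
            final BlockingQueue<String> replyTo = new ArrayBlockingQueue<>(1);
            Request(int workerId) { this.workerId = workerId; }
        }

        public static void main(String[] args) {
            BlockingQueue<Request> toMaster = new ArrayBlockingQueue<>(64);

            // Master thread: serves "remote" data to whichever worker asks.
            Thread master = new Thread(() -> {
                try {
                    while (true) {
                        Request r = toMaster.take();
                        r.replyTo.put("data-for-worker-" + r.workerId);
                    }
                } catch (InterruptedException ignored) { }
            });
            master.setDaemon(true);
            master.start();

            // Worker threads: process local data, then ask the master for the rest.
            for (int i = 0; i < 4; i++) {
                final int id = i;
                new Thread(() -> {
                    try {
                        Request req = new Request(id);
                        toMaster.put(req);
                        String remote = req.replyTo.take();
                        System.out.println("worker " + id + " got " + remote);
                    } catch (InterruptedException ignored) { }
                }).start();
            }
        }
    }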
This is a pretty generic question which is open to many architectural solutions. I'd like to present the one I've used in the past. I used RMI here for ease of remote calls.
All master and slave processes are RMI services. Both master and slaves are spawned using the RMI daemon (rmid), which has the bonus feature of "upping" the services again in case one goes down due to a JVM crash or any other "abnormal" reason. RMI services in general work based on an interface which defines the contract between the client and the server. Let's say, for example, that I have to write a service which solves an equation.
We start off by creating two services: master and slave. Both these services implement/expose the same interface to the client. The only difference is that the "master" service is solely responsible for "forking" work across the different slave agents, getting the responses back (re-arranging them if required) and returning the result to the client. The master is a simple RMI service which accepts the list of "equations" and splits them across the different slaves. Naturally, the master holds an RMI handle to each of the slaves it governs (i.e. communication between master and slaves is again an RMI invocation).
Here again, there are a lot of possibilities for configuring how the master "looks up" the slaves, but I'm sure you can work that out quite easily. This architecture has the advantage of a grid-based solution: you are not limited to a single process doing all the work, and hence you gain resiliency and freedom from monolithic heap sizes for your JVM process.
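To make the contract concrete, the shared interface could look like this (names are illustrative; both the master and each slave implement it, the master by delegating to its slaves):

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.util.List;

    // The contract exposed to clients: the master's implementation splits the
    // list across its slaves, while a slave's implementation solves its share.
    public interface EquationSolver extends Remote {
        List<String> solve(List<String> equations) throws RemoteException;
    }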
I really haven't used them, but Rio and Jini are something you should look into if you want to build distributed systems in Java.