Why do we need ZooKeeper in the Hadoop stack? - java

I am new to Hadoop/ZooKeeper. I cannot understand the purpose of using ZooKeeper with Hadoop. Is ZooKeeper writing data in Hadoop? If not, then why do we use ZooKeeper with Hadoop?

Hadoop 1.x does not use ZooKeeper, although HBase does use ZooKeeper even in Hadoop 1.x installations.
Hadoop adopted ZooKeeper as well, starting with version 2.0.
The purpose of ZooKeeper is cluster management. This fits the general *nix philosophy of composing small, specialized components: components of Hadoop that need clustering capabilities rely on ZooKeeper for that rather than developing their own.
ZooKeeper is a distributed store that provides the following guarantees (copied from the ZooKeeper overview page):
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.
You can use these guarantees to implement the different "recipes" required for cluster management, such as locks, leader election, etc.
If you're going to use ZooKeeper yourself, I recommend you take a look at Apache Curator (originally from Netflix), which makes it easier to use (e.g. it implements a few recipes out of the box).
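To give a feel for what Curator provides, here is a minimal sketch of its InterProcessMutex lock recipe; the ensemble address and lock path are placeholders:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

import java.util.concurrent.TimeUnit;

public class CuratorLockExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (address is a placeholder).
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        // InterProcessMutex is Curator's out-of-the-box distributed lock recipe;
        // it implements the ephemeral-sequential-znode pattern for you.
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-resource");
        if (lock.acquire(10, TimeUnit.SECONDS)) {
            try {
                // Critical section: only one process cluster-wide gets here at a time.
            } finally {
                lock.release();
            }
        }
        client.close();
    }
}
```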

ZooKeeper solves the problem of reliable distributed coordination, and Hadoop is a distributed system, right?
There's an excellent paper on the Paxos algorithm that you can read on this subject.

From zookeeper documentation page:
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.
Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
From hadoop documentation page:
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
Regarding your query:
Why do we need ZooKeeper in the Hadoop stack?
The binding factor is distributed processing and high availability.
For example, consider the Hadoop NameNode failover process.
Hadoop high availability is designed around an active NameNode and a standby NameNode for the failover process. At any point in time, you should not have two masters (two active NameNodes).
From the Apache documentation on HDFSHighAvailabilityWithQJM:
It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called “split-brain scenario,” the JournalNodes will only ever allow a single NameNode to be a writer at a time.
During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active to safely proceed with failover.
ZooKeeper is used to avoid this split-brain scenario. You can find the role of ZooKeeper in the question below:
How does the Hadoop NameNode failover process work?
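To make the coordination concrete, here is a minimal sketch of leader election using Curator's LeaderLatch recipe. This is conceptually similar to, but not the actual implementation of, what the ZKFailoverController does; the ensemble address, election path, and node id are placeholders:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",   // placeholder ensemble address
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        // All candidate nodes race to create an ephemeral znode under this path;
        // exactly one becomes leader. If the leader's ZooKeeper session dies,
        // its znode is removed and another candidate takes over automatically.
        LeaderLatch latch = new LeaderLatch(client, "/election/namenode", "node-1");
        latch.start();
        latch.await();   // blocks until this node is elected leader
        System.out.println("I am now the active node");
    }
}
```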

Related

What's the purpose of using Zookeeper rather than just databases for managing distributed systems?

I'm learning ZooKeeper and so far I don't understand what it offers for distributed systems that databases cannot provide.
The use cases I’ve read are implementing a lock, barrier, etc for distributed systems by having Zookeeper clients read/write to Zookeeper servers. Can’t the same be achieved by read/write to databases?
For example my book describes the way to implement a lock with Zookeeper is to have Zookeeper clients who want to acquire the lock create an ephemeral znode with a sequential flag set under the lock-znode. Then the lock is owned by the client whose child znode has the lowest sequence number.
All other Zookeeper examples in the book are again just using it to store/retrieve values.
It seems the only thing that differentiates ZooKeeper from a database or any other storage is the "watcher" concept. But that could be built using something else.
I know my simplified view of Zookeeper is a misunderstanding. So can someone tell me what Zookeeper truly provides that a database/custom watcher can’t?
Can’t the same be achieved by read/write to databases?
In theory, yes, it is possible, but it is usually not a good idea to use databases for demanding distributed-coordination use cases. I have seen microservices using relational databases for managing distributed locks, with very bad consequences (e.g. thousands of deadlocks in the databases), which in turn resulted in poor DBA-developer relations :-)
Zookeeper has some key characteristics which make it a good candidate for managing application metadata:
The ability to scale horizontally by adding new nodes to the ensemble
Data is guaranteed to be eventually consistent within a certain time bound; strict consistency is possible at a higher cost if clients ask for it (Zookeeper is a CP system in CAP terms)
An ordering guarantee: all clients are guaranteed to be able to read data in the order in which it was written
All of the above could be achieved with databases, but only with significant effort from application clients. Watches and ephemeral nodes could also be approximated with databases using techniques such as triggers and timeouts, but these are often considered inefficient or antipatterns.
Relational databases offer strong transactional guarantees which usually come at a cost but are often not required for managing application metadata. So it makes sense to look for a more specialized solution such as Zookeeper or Chubby.
Also, Zookeeper stores all its data in memory (which limits its use cases), resulting in highly performant reads. This is usually not the case with most databases.
I think you're asking the wrong question when you try to figure out the purpose of Zookeeper. Instead of asking what Zookeeper can do that "databases" cannot do (by the way, Zookeeper is also a database), ask what Zookeeper is better at than the other available databases. If you start asking yourself that question, you will hopefully understand why people decide to use Zookeeper in their distributed services.
Take ephemeral nodes, for example. Their big benefit is not that they make a much better lock than some other mechanism; the benefit is that they are automatically removed if the client loses its connection to Zookeeper.
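For illustration, here is a minimal sketch of creating such a node with the raw ZooKeeper API; the connection string is a placeholder, and the /locks parent node is assumed to already exist:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralLockNode {
    public static void main(String[] args) throws Exception {
        // Placeholder address; the watcher argument is null for brevity.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, null);

        // EPHEMERAL_SEQUENTIAL: the server appends a monotonically increasing
        // suffix (e.g. /locks/lock-0000000042) and deletes the node automatically
        // when this client's session expires: no stale locks after a crash.
        String path = zk.create("/locks/lock-",
                new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Created " + path);
    }
}
```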
Then we can have a look at the CAP theorem, where Zookeeper most closely resembles a CP system. And you must once again decide if this is what you want out of your database.
tl;dr: Zookeeper is better in some aspects and worse in others compared to other databases.
Late to the party. Just to provide another thought:
Yes, it's quite common to use a SQL database for server coordination in production. However, you will likely be asked to build a highly available (HA) system, right? So your SQL DB will have to be HA. That means you will need a leader-follower architecture (a follower SQL DB); the follower will need to be promoted to leader if the leader dies (MHA nodes + manager), and when the previous leader comes back to life it must know that it is no longer the leader. These problems have solutions, but setting them up costs engineering effort. So ZooKeeper was invented.
I sometimes think of Zookeeper as a simplified version of an HA SQL cluster with a subset of its functionality.
It is similar to why people choose NoSQL over SQL. With proper partitioning, SQL can also scale well, right? So why NoSQL? One motivation is to reduce the effort of handling node failures. When a NoSQL node dies, the system can automatically fail over to another node and even trigger the data migration. But if one of your SQL partition leaders dies, it usually requires manual intervention. This is like SQL vs. Zookeeper: someone has coded up the HA and failover logic for you, so you can, hopefully, relax in the face of inevitable node failures.
ZooKeeper writes are linearizable.
Linearizable means that all operations are totally ordered: for every pair of operations a and b, either a happened before b, or b happened before a.
Linearizability is the strongest consistency level.
Most databases give up on linearizability because it hurts performance, and offer weaker consistency guarantees instead, e.g. causality (causal order).
ZooKeeper uses this to implement an atomic broadcast algorithm, which is equivalent to consensus.

Hazelcast (Java) and ETCD (golang) differences/similarities?

We are building a realtime analytics system and it should be highly distributed. We plan to use distributed locks and counters to ensure data consistency, and we need some kind of distributed map to know which client is connected to which server.
I have no prior experience in distributed systems before, but I think we have two options:
Java+Hazelcast
Golang+ETCD
But what are the pros/cons of each in this context?
Hazelcast and etcd are two very different systems. The reason is the CAP theorem.
The CAP theorem states that no distributed system can provide all of Consistency, Availability, and Partition tolerance at once. Distributed systems normally fall closer to AP or CP. Hazelcast is an AP system, and etcd (being a Raft implementation) is CP. So, your choice is between consistency and availability/performance.
In general, Hazelcast will be much more performant and able to handle more failures than Raft and etcd, but at the cost of potential data loss or consistency issues. The way Hazelcast works is that it partitions data and stores pieces of the data on different nodes. So, in a 5-node cluster, the key "foo" may be stored on nodes 1 and 2, and "bar" may be stored on nodes 3 and 4. You can control the number of nodes to which Hazelcast replicates data via the Hazelcast and map configuration. However, during a network or other failure, there is some risk that you'll see old data or even lose data in Hazelcast.
Alternatively, Raft and etcd form a single-leader, highly consistent system that stores data on all nodes. This means it's not ideal for storing large amounts of state. But even during a network failure, etcd can guarantee that your data will remain consistent. In other words, you'll never see old/stale data. But this comes at a cost: CP systems require that a majority of the cluster be alive to operate normally.
The consistency issue may or may not be relevant for basic key-value storage, but it can be extremely relevant to locks. If you're expecting your locks to be consistent across the entire cluster - meaning only one node can hold a lock even during a network or other failure - do not use Hazelcast. Because Hazelcast sacrifices consistency in favor of availability (again, see the CAP theorem), it's entirely possible that a network failure can lead two nodes to believe a lock is free to be acquired.
Alternatively, Raft guarantees that during a network failure only one node will remain the leader of the etcd cluster, and therefore all decisions are made through that one node. This means that etcd can guarantee it has a consistent view of the cluster state at all times and can ensure that something like a lock can only be obtained by a single process.
Really, you need to consider what you are looking for in your database and go seek it out. The use cases for CP and AP data stores are vastly different. If you want consistency for storing small amounts of state, consistent locks, leader elections, and other coordination tools, use a CP system like ZooKeeper or Consul. If you want high availability and performance at the potential cost of consistency, use Hazelcast or Cassandra or Riak.
Source: I am the author of a Raft implementation
Although this question is now over 3 years old, I'd like to inform subsequent readers that Hazelcast, as of 3.12, has a CP subsystem (based on Raft) for its atomics and concurrency APIs. There are plans to roll CP out to more Hazelcast data structures in the near future. This gives Hazelcast users a true choice between AP and CP concerns and allows Hazelcast to be applied to new use cases previously handled by systems such as etcd and Zookeeper.
You can read more here:
https://hazelcast.com/blog/hazelcast-imdg-3-12-beta-is-released/
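For illustration, here is a minimal sketch of acquiring a Raft-backed FencedLock through the CP subsystem, assuming the 3.12 API; the lock name and member count are placeholders:

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.lock.FencedLock;

public class CpLockSketch {
    public static void main(String[] args) {
        // The CP subsystem needs an explicit member count (>= 3)
        // to run in true CP mode rather than the default unsafe mode.
        Config config = new Config();
        config.getCPSubsystemConfig().setCPMemberCount(3);
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);

        // FencedLock is backed by Raft, so it stays consistent across
        // network partitions, unlike the classic AP lock.
        FencedLock lock = hz.getCPSubsystem().getLock("my-lock");
        lock.lock();
        try {
            // critical section
        } finally {
            lock.unlock();
        }
    }
}
```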

Distributed Transactional Memory with Zookeeper / Hadoop?

I am looking for a Java solution besides BigMemory and Hazelcast. Since we are using Hadoop/Spark, we should have access to ZooKeeper.
So I just want to know if there is a solution satisfying our needs, or whether we need to build something ourselves.
What I need are reliable objects that are in-memory, replicated, and synchronized. For manipulation I would like lock support and atomic actions spanning an object.
I also need support for object references and List/Set/Map structures.
The rest we can build ourselves.
The idea is simply to have a self-organizing network that configures itself based on the environment, and that is best done with synchronized objects that are replicated and can be listened to.
Hazelcast has a split-brain detector in place; when a split-brain happens, Hazelcast will continue to accept updates, and when the cluster is merged back, it will give you the ability to merge the updates the way you prefer.
We are implementing a cluster quorum feature, which will hopefully be available in the next minor version (3.5). With cluster quorum you can define a minimum threshold, or a custom function of your own, to decide whether the cluster should continue to operate in a partitioned network.
For example, if you define a quorum size of 3 and there are fewer than 3 members in the cluster, the cluster will stop operating.
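For illustration, here is a minimal sketch of what such a quorum configuration might look like, assuming the QuorumConfig API as it shipped in 3.5; the quorum and map names are placeholders:

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.QuorumConfig;
import com.hazelcast.core.Hazelcast;

public class QuorumSketch {
    public static void main(String[] args) {
        // Define a quorum: operations fail unless at least 3 members are present.
        QuorumConfig quorumConfig = new QuorumConfig("atLeastThree", true, 3);

        // Attach the quorum to a map, so its operations are rejected
        // on the minority side of a network partition.
        MapConfig mapConfig = new MapConfig("replicated-objects");
        mapConfig.setQuorumName("atLeastThree");

        Config config = new Config();
        config.addQuorumConfig(quorumConfig);
        config.addMapConfig(mapConfig);
        Hazelcast.newHazelcastInstance(config);
    }
}
```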
Currently Hazelcast behaves like an AP solution, but when the cluster quorum feature is available you can tune Hazelcast to behave like a CP solution.

How to "link" distributed Akka actor systems?

I see that Akka Actor Systems can be distributed across multiple JVMs that might not even be running on the same piece of hardware. If I understand this correctly, then it seems that you could have a distributed Actor system where 1 group of actors is on myapp01, another group is on myapp02 (say, 2 vSphere VMs running on your local data center), and yet a 3rd group of actors running on AWS. So first, if anything about what I just said isn't true/accurate, please start by correcting me!
If everything I've stated up until this point is more or less accurate, then I'm wondering how to actually "glue" all these distributed actors "groups" (not sure what the right term is: JVM, Actor System, Actor Pool, Actor Cluster, etc.) together such that work can be farmed out to any of them, and a FizzActor living on the AWS node can then send a message to a BuzzActor living on myapp02, etc.
For instance, sticking with the example above (2 vSphere VMs and an AWS machine) how could I deploy an actor group/system/pool/cluster to each of these such that they all know about each other and distribute the work between them?
My guess is that Akka allows you to configure the hosts/ports of all the different "nodes" in the Actor System;
My next guess is that this configuration is limited in the sense that you have to update each node's configuration every time you add/remove/modify another node in the cluster (otherwise how could the Akka nodes "know" about a new one, or "know" that we just shut down the AWS machine?);
My final guess is that this limitation can be averted by bringing something like Apache ZooKeeper into the mix, somehow treating each node as a separate peer in the distributed system, then using ZooKeeper to coordinate/connect/link/load balance between all the peers/nodes.
Am I on track or way off base?

What ways exist to distribute asynchronous batch tasks?

I am currently investigating what Java compatible solutions exist to address my requirements as follows:
Timer based / Schedulable tasks to batch process
Distributed, and by that providing the ability to scale horizontally
Resilience, i.e. no single points of failure (SPOFs), please
The nature of these tasks (heavy XML generation, and the delivery to web based receiving nodes) means running them on a single server using something like Quartz isn't viable.
I have heard of technologies like Hadoop and JavaSpaces which address the scaling and resilience ends of the problem effectively. Not knowing whether these are quite suited to my requirements, it's hard to know what other technologies might fit well.
I was really wondering what options people in this space felt were available, and how each plays to its strengths or suits certain problems better than others.
NB: It's worth noting that schedulability is perhaps a hangover from how we do things presently. Yes, there are tasks which ought to run at certain times. Scheduling has also been used to throttle throughput at times when no mandate for set times exists.
Asynchronous always brings JMS to mind for me. Send the request message to a queue; a MessageListener is plucked out of the pool to handle it.
This can scale, because the queue and listener can be on a remote server. The size of the listener thread pool can be configured. You can have different listeners for different tasks.
UPDATE: You can avoid having a single point of failure by clustering and load balancing.
You can get JMS at no cost using ActiveMQ (open source), JBoss (an open source version is available), or any Java EE app server, so budget isn't a consideration.
And there's no lock-in, because you're using JMS (beyond the fact that you're using Java).
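To make the pattern concrete, here is a minimal sketch using the javax.jms API with ActiveMQ; the broker URL and queue name are placeholders:

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.ActiveMQConnectionFactory;

public class BatchTaskQueue {
    public static void main(String[] args) throws Exception {
        // Broker URL is a placeholder; any JMS provider works the same way.
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://broker:61616");
        Connection connection = factory.createConnection();
        connection.start();

        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("batch.tasks");

        // Producer side: enqueue a task description.
        MessageProducer producer = session.createProducer(queue);
        producer.send(session.createTextMessage("<task id='42'/>"));

        // Consumer side: listeners can live on any number of remote JVMs,
        // which is what gives horizontal scaling and, with a clustered
        // broker, no single point of failure.
        MessageConsumer consumer = session.createConsumer(queue);
        consumer.setMessageListener((Message m) -> {
            try {
                System.out.println("Processing " + ((TextMessage) m).getText());
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
    }
}
```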
I'd recommend doing it with Spring message-driven POJOs. The community edition is open source, of course.
If that doesn't do it for you, have a look at Spring Batch and Spring Integration. Both of those might be useful, and the community editions are open source.
Have you looked into GridGain? I am pretty sure it won't solve the scheduling problem, but you can scale it, and it works like "magic": the code to be executed is sent to a node and executed there. It works fine as long as you don't need to send a database connection (or anything else that is not serializable).
