I am looking for a Java solution besides BigMemory and Hazelcast. Since we are using Hadoop/Spark, we should have access to ZooKeeper.
So I just want to know whether there is a solution satisfying our needs or whether we need to build something ourselves.
What I need are reliable objects that are in-memory, replicated and synchronized. For manipulation I would like to have lock support and atomic actions spanning an object.
I also need object references and List/Set/Map support.
The rest we can build ourselves.
The idea is simply to have a self-organizing network that configures itself based on the environment, and that is best done with replicated, synchronized objects that one can listen to.
Hazelcast has a split-brain detector in place. When a split-brain happens, Hazelcast will continue to accept updates, and when the cluster is merged back it gives you the ability to merge the updates the way you prefer.
We are implementing a cluster quorum feature, which will hopefully be available in the next minor (3.5) version. With cluster quorum you can define a minimum threshold, or a custom function of your own, to decide whether the cluster should continue to operate in a partitioned network.
For example, if you define a quorum size of 3 and there are fewer than 3 members in the cluster, the cluster will stop operating.
Currently Hazelcast behaves like an AP solution, but once cluster quorum is available you can tune Hazelcast to behave more like a CP solution.
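To give an idea of what that could look like, here is a rough Java configuration sketch based on the plans described above; the class and method names (QuorumConfig, setQuorumName) and the map/quorum names are assumptions until 3.5 actually ships:

```java
// Rough sketch of the planned cluster-quorum configuration (names are assumptions).
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.QuorumConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class QuorumSketch {
    public static void main(String[] args) {
        Config config = new Config();

        // Require at least 3 live members before operations are allowed.
        QuorumConfig quorumConfig = new QuorumConfig();
        quorumConfig.setName("quorumOf3");
        quorumConfig.setEnabled(true);
        quorumConfig.setSize(3);
        config.addQuorumConfig(quorumConfig);

        // Attach the quorum rule to a map; operations on it fail without quorum.
        MapConfig mapConfig = new MapConfig("important-map");
        mapConfig.setQuorumName("quorumOf3");
        config.addMapConfig(mapConfig);

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}
```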
We are building a Java application which will use embedded Neo4j for graph traversal. Below are the reasons why we want to use the embedded version instead of a centralized server:
This app is not a data owner. Data will be ingested into it through another app. Keeping data locally will help us do quick calculations and hence improve our API SLA.
Since the data footprint is small, we don't want to maintain a centralized server, which would incur additional cost and maintenance.
No need for an additional cache.
Now this architecture brings two challenges. First, how to update data in all instances of the embedded Neo4j application at the same time. Second, how to make sure that all instances are in sync, i.e. using the same version of the data.
We thought of using Kafka to solve the first problem. The idea is to have a Kafka listener with a different group ID (to ensure all get the updates) in every instance. Whenever there is an update, an event will be posted to Kafka. All instances will listen for the event and perform the update operation.
However, we still don't have any solid design to solve the second problem. For various reasons one of the instances can miss an event (e.g. its consumer is down). One way is to keep checking the latest version by calling an API of the data-owner app; if the version is behind, replay the events. But this brings the additional complexity of maintaining event logs of all updates. Can this be done in a better and simpler way?
Kafka consumers are extremely consistent and reliable once you have them configured properly, so there shouldn't be any reason for them to miss messages unless there's an infrastructure problem, in which case any solution you architect will have problems. If the Kafka cluster is healthy (e.g. at least one of the copies of the data is available, and at least a quorum of ZooKeepers is up and running), then your consumers should receive every single message from the topics they're subscribed to. The consumer will handle the retries/reconnecting itself, as long as your timeout/retry configurations are sane. The default configs in the latest Kafka versions are adequate 99% of the time.
Additionally, you can add a separate thread, for example, that is constantly checking what the latest offset is per topic/partition, comparing it to what the consumer has last received, and maybe issuing an alert/warning if there is a discrepancy. In my experience, and with Kafka's reliability, it should be unnecessary, but it can give you peace of mind, and it shouldn't be too difficult to add.
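If you do want that extra check, here is a minimal sketch of the idea; the lag threshold is a placeholder, and running it from the polling thread (rather than a truly separate thread) is my own simplification, since KafkaConsumer is not thread-safe:

```java
// Minimal sketch: warn if the consumer has fallen behind the log-end offsets.
// Call this from the same thread that polls; the threshold is a placeholder.
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Set;

public class LagCheck {

    static void warnIfLagging(KafkaConsumer<?, ?> consumer, long maxLag) {
        Set<TopicPartition> assignment = consumer.assignment();
        Map<TopicPartition, Long> endOffsets = consumer.endOffsets(assignment);
        for (TopicPartition tp : assignment) {
            // Lag = newest offset in the log minus the consumer's current position.
            long lag = endOffsets.get(tp) - consumer.position(tp);
            if (lag > maxLag) {
                System.err.println("Consumer lagging on " + tp + " by " + lag + " records");
            }
        }
    }
}
```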
I'm learning ZooKeeper, and so far I don't understand what it provides for distributed systems that databases can't.
The use cases I've read about are implementing a lock, barrier, etc. for distributed systems by having ZooKeeper clients read/write to ZooKeeper servers. Can't the same be achieved by reading/writing to databases?
For example, my book describes implementing a lock with ZooKeeper by having clients that want to acquire the lock create an ephemeral znode with the sequential flag set under the lock znode. The lock is then owned by the client whose child znode has the lowest sequence number.
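As far as I can tell, that recipe boils down to something like this untested sketch (the paths and names are just examples of my understanding, not code from the book):

```java
// Rough, untested sketch of the book's lock recipe as I understand it.
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

import java.util.Collections;
import java.util.List;

public class LockRecipeSketch {
    static boolean tryAcquire(ZooKeeper zk, String lockPath) throws Exception {
        // Each contender creates an ephemeral sequential child under the lock znode.
        String myNode = zk.create(lockPath + "/lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The client whose child has the lowest sequence number owns the lock.
        List<String> children = zk.getChildren(lockPath, false);
        Collections.sort(children);
        return myNode.endsWith(children.get(0));
    }
}
```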
All other Zookeeper examples in the book are again just using it to store/retrieve values.
It seems the only thing that differentiates ZooKeeper from a database or any other storage is the “watcher” concept. But that could be built using something else.
I know my simplified view of Zookeeper is a misunderstanding. So can someone tell me what Zookeeper truly provides that a database/custom watcher can’t?
Can’t the same be achieved by read/write to databases?
In theory, yes, it is possible, but it is usually not a good idea to use databases for demanding distributed-coordination use cases. I have seen microservices using relational databases for managing distributed locks with very bad consequences (e.g. thousands of deadlocks in the databases), which in turn resulted in poor DBA-developer relations :-)
Zookeeper has some key characteristics which make it a good candidate for managing application metadata:
Possibility to scale horizontally by adding new nodes to the ensemble
Data is guaranteed to be eventually consistent within a certain time bound. It is possible to have strict consistency at a higher cost if clients desire it (Zookeeper is a CP system in CAP terms)
Ordering guarantee -- all clients are guaranteed to be able to read data in the order in which it was written
All of the above could be achieved with databases, but only with significant effort from application clients. Also, watches and ephemeral nodes could be emulated with databases using techniques such as triggers, timeouts, etc., but these are often considered inefficient or antipatterns.
Relational databases offer strong transactional guarantees which usually come at a cost but are often not required for managing application metadata. So it makes sense to look for a more specialized solution such as Zookeeper or Chubby.
Also, Zookeeper stores all its data in memory (which limits its use cases), resulting in highly performant reads. This is usually not the case with most databases.
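To illustrate the watch point above, here is a minimal sketch with the plain ZooKeeper client; the znode path is only an example:

```java
// Minimal sketch of a watch: get notified when a znode's data changes.
// The znode path is only an example.
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class WatchSketch {
    static byte[] readAndWatch(ZooKeeper zk, String path) throws Exception {
        // The watcher fires once on the next change; re-register after each event.
        Watcher watcher = (WatchedEvent event) ->
                System.out.println("znode changed: " + event.getPath() + " " + event.getType());
        return zk.getData(path, watcher, null);
    }
}
```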
I think you're asking yourself the wrong question when you try to figure out the purpose of Zookeeper. Instead of asking what Zookeeper can do that "databases" cannot (by the way, Zookeeper is also a database), ask what Zookeeper is better at than other available databases. If you start to ask yourself that question, you will hopefully understand why people decide to use Zookeeper in their distributed services.
Take ephemeral nodes, for example. The huge benefit of using them is not that they make a much better lock than some other approach; the benefit is that they will automatically be removed if the client loses its connection to Zookeeper.
And then we can have a look at the CAP theorem, where Zookeeper most closely resembles a CP system. You must once again decide whether this is what you want out of your database.
tldr: Zookeeper is better in some aspects and worse in others compared to other databases.
Late to the party. Just to provide another thought:
Yes, it's quite common to use a SQL database for server coordination in production. However, you will likely be asked to build an HA (high availability) system, right? So your SQL DB will have to be HA. That means you will need a leader-follower architecture (a follower SQL DB); the follower will need to be promoted to leader if the leader dies (MHA nodes + manager), and when the previous leader comes back to life it must know that it is no longer the leader. These questions have answers, but they cost engineering effort to set up. That is why Zookeeper was invented.
I sometimes consider Zookeeper as a simplified version of HA SQL cluster with a subset of functionalities.
Similarly, consider why people choose NoSQL vs. SQL. With proper partitioning, SQL can also scale well, right? So why NoSQL? One motivation is to reduce the effort of handling node failures. When a NoSQL node dies, the system can automatically fail over to another node and even trigger data migration. But if one of your SQL partition leaders dies, it usually requires manual treatment. It is the same with SQL vs. Zookeeper: someone has coded up the HA + failover logic for you, so you can, hopefully, sit back when the inevitable node failures happen.
ZooKeeper writes are linearizable.
Linearizable means that all operations are totally ordered.
Total order means that for every operation a and b,
Either a happened before b, or b happened before a.
Linearizability is the strongest consistency level.
Most databases give up on linearizability because it affects performance, and offer weaker consistency guarantees instead - e.g. causality (causal order).
ZooKeeper uses this to implement an atomic broadcast algorithm, which is equivalent to consensus.
We are now building a real-time analytics system, and it should be highly distributed. We plan to use distributed locks and counters to ensure data consistency, and we need some kind of distributed map to know which client is connected to which server.
I have no prior experience with distributed systems, but I think we have two options:
Java+Hazelcast
Golang+ETCD
But what are the pros/cons of each in this context?
Hazelcast and etcd are two very different systems. The reason is the CAP theorem.
The CAP theorem states that no distributed system can simultaneously provide Consistency, Availability, and Partition tolerance. Distributed systems normally fall closer to AP or CP. Hazelcast is an AP system, and etcd (being a Raft implementation) is CP. So, your choice is between consistency and availability/performance.
In general, Hazelcast will be much more performant and able to handle more failures than Raft and etcd, but at the cost of potential data loss or consistency issues. The way Hazelcast works is that it partitions data and stores pieces of the data on different nodes. So, in a 5-node cluster, the key "foo" may be stored on nodes 1 and 2, and "bar" may be stored on nodes 3 and 4. You can control the number of nodes to which Hazelcast replicates data via the Hazelcast and map configuration. However, during a network or other failure, there is some risk that you'll see old data or even lose data in Hazelcast.
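For illustration, a minimal sketch of controlling that replication through the map configuration; the map name and backup counts are placeholders:

```java
// Sketch: control how many copies of each map entry Hazelcast keeps.
// Map name and counts are placeholders.
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;

public class BackupConfigSketch {
    static Config configure() {
        Config config = new Config();
        MapConfig mapConfig = new MapConfig("sessions");
        mapConfig.setBackupCount(2);       // synchronous backups on 2 other nodes
        mapConfig.setAsyncBackupCount(1);  // plus one asynchronous backup
        config.addMapConfig(mapConfig);
        return config;
    }
}
```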
Alternatively, etcd (a Raft implementation) is a single-leader, highly consistent system that stores data on all nodes. This means it's not ideal for storing large amounts of state. But even during a network failure, etcd can guarantee that your data will remain consistent. In other words, you'll never see old/stale data. But this comes at a cost: CP systems require that a majority of the cluster be alive to operate normally.
The consistency issue may or may not be relevant for basic key-value storage, but it can be extremely relevant to locks. If you're expecting your locks to be consistent across the entire cluster - meaning only one node can hold a lock even during a network or other failure - do not use Hazelcast. Because Hazelcast sacrifices consistency in favor of availability (again, see the CAP theorem), it's entirely possible that a network failure can lead two nodes to believe a lock is free to be acquired.
Alternatively, Raft guarantees that during a network failure only one node will remain the leader of the etcd cluster, and therefore all decisions are made through that one node. This means that etcd can guarantee it has a consistent view of the cluster state at all times and can ensure that something like a lock can only be obtained by a single process.
Really, you need to consider what you are looking for in your database and go seek it out. The use cases for CP and AP data stores are vastly different. If you want consistency for storing small amounts of state, consistent locks, leader elections, and other coordination tools, use a CP system like ZooKeeper or Consul. If you want high availability and performance at the potential cost of consistency, use Hazelcast or Cassandra or Riak.
Source: I am the author of a Raft implementation
Although this question is now over 3 years old, I'd like to inform subsequent readers that Hazelcast as of 3.12 has a CP subsystem (based on Raft) for its Atomics and Concurrency APIs. There are plans to roll out CP to more Hazelcast data structures in the near future, giving Hazelcast users a true choice between AP and CP concerns and allowing them to apply Hazelcast to new use cases previously handled by systems such as etcd and Zookeeper.
You can read more here...
https://hazelcast.com/blog/hazelcast-imdg-3-12-beta-is-released/
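For illustration, a rough sketch of how the CP subsystem is used; I am writing this from memory, so double-check the package names against the docs for your version (some classes moved between 3.12 and 4.x):

```java
// Sketch: using the Raft-backed CP subsystem introduced in Hazelcast 3.12.
// Structure names ("requests", "leader-lock") are placeholders.
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IAtomicLong;          // moved to com.hazelcast.cp in 4.x
import com.hazelcast.cp.lock.FencedLock;

public class CpSketch {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Linearizable counter and lock, replicated via Raft.
        IAtomicLong counter = hz.getCPSubsystem().getAtomicLong("requests");
        counter.incrementAndGet();

        FencedLock lock = hz.getCPSubsystem().getLock("leader-lock");
        lock.lock();
        try {
            // critical section
        } finally {
            lock.unlock();
        }
    }
}
```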
We use EclipseLink and WebLogic.
We have two WebSphere clusters, with 2 servers in each.
Right now an app in one cluster uses RMI to do cache coordination to keep those 2 servers in sync.
When we add a new app in the new cluster to the mix, we will have to sync the cache across the 2 clusters.
How do I achieve this?
Can I still use JPA cache coordination? Using RMI? JMS?
Should I look into using Coherence as the L2 cache?
I don't need highly scalable grid configurations. All I need to make sure is that the cache has no stale data.
Nothing is a sure thing to prevent stale data, so I hope you are using a form of optimistic locking where needed. You will have to evaluate what is the better solution for your 4-server architecture, but RMI, JMS, and even just turning off the second-level cache where stale data cannot be tolerated are valid options and would work. I recommend setting up simple tests that match your use cases and expected load, and evaluating whether the network traffic and overhead of having to merge and maintain changes in the second-level caches outweighs the cost of removing the second-level cache. For highly volatile entities, that tipping point might come sooner, in which case you might benefit more from disabling the shared cache for that entity.
In my experience, JMS has been easier to configure for cache coordination, as it is a central point all servers can connect to, whereas RMI requires each server to maintain connections to every other server.
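For illustration, here is a minimal sketch of switching coordination to JMS via persistence-unit properties; the persistence-unit name and the JNDI names are placeholders for your environment:

```java
// Sketch: enabling EclipseLink cache coordination over JMS.
// The persistence-unit name and JNDI names are placeholders.
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;
import java.util.HashMap;
import java.util.Map;

public class CacheCoordinationSketch {
    static EntityManagerFactory create() {
        Map<String, String> props = new HashMap<>();
        props.put("eclipselink.cache.coordination.protocol", "jms");
        props.put("eclipselink.cache.coordination.jms.topic", "jms/EclipseLinkTopic");
        props.put("eclipselink.cache.coordination.jms.factory", "jms/EclipseLinkTopicConnectionFactory");
        return Persistence.createEntityManagerFactory("myPU", props);
    }
}
```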
I am new to Hadoop/ZooKeeper. I cannot understand the purpose of using ZooKeeper with Hadoop. Is ZooKeeper writing data into Hadoop? If not, then why do we use ZooKeeper with Hadoop?
Hadoop 1.x does not use Zookeeper. HBase does use zookeeper even in Hadoop 1.x installations.
Hadoop adopted Zookeeper as well starting with version 2.0.
The purpose of Zookeeper is cluster management. This fits with the general philosophy of *nix of using smaller specialized components - so components of Hadoop that want clustering capabilities rely on Zookeeper for that rather than develop their own.
Zookeeper is a distributed storage that provides the following guarantees (copied from the Zookeeper overview page):
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.
You can use these to implement different "recipes" that are required for cluster management like locks, leader election etc.
If you're going to use ZooKeeper yourself, I recommend you take a look at Curator from Netflix which makes it easier to use (e.g. they implement a few recipes out of the box)
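For example, Curator's InterProcessMutex recipe hides the ephemeral-sequential bookkeeping behind a lock-like API; the connection string and lock path below are placeholders:

```java
// Sketch: a distributed lock using Curator's built-in recipe.
// The connection string and lock path are placeholders.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorLockSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-resource");
        lock.acquire();
        try {
            // critical section
        } finally {
            lock.release();
        }
        client.close();
    }
}
```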
Zookeeper solves the problem of reliable distributed coordination, and hadoop is a distributed system, right?
There's an excellent paper on the Paxos algorithm that you can read on this subject.
From the ZooKeeper documentation page:
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.
Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
From the Hadoop documentation page:
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
Regarding your query:
Why do we need ZooKeeper in Hadoop Stack?
The binding factor is distributed processing and high availability.
e.g. the Hadoop NameNode failover process.
Hadoop high availability is designed around an Active NameNode and a Standby NameNode for the failover process. At any point in time, you should not have two masters (active NameNodes).
From the Apache documentation on HDFSHighAvailabilityWithQJM:
It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called “split-brain scenario,” the JournalNodes will only ever allow a single NameNode to be a writer at a time.
During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active to safely proceed with failover.
ZooKeeper is used to avoid the split-brain scenario. You can find the role of ZooKeeper in the question below:
How does Hadoop Namenode failover process works?