We are building a real-time analytics system and it should be highly distributed. We plan to use distributed locks and counters to ensure data consistency, and we need some kind of distributed map to know which client is connected to which server.
I have no prior experience with distributed systems, but I think we have two options:
Java+Hazelcast
Golang+ETCD
What are the pros and cons of each in this context?
Hazelcast and etcd are two very different systems. The reason is the CAP theorem.
The CAP theorem states that no distributed system can simultaneously guarantee Consistency, Availability, and Partition tolerance; in practice, distributed systems fall closer to AP or CP. Hazelcast is an AP system, and etcd (being a Raft implementation) is CP. So your choice is between consistency and availability/performance.
In general, Hazelcast will be much more performant and able to tolerate more failures than Raft and etcd, but at the cost of potential data loss or consistency issues. The way Hazelcast works is that it partitions data and stores pieces of it on different nodes. So, in a 5-node cluster, the key "foo" may be stored on nodes 1 and 2, and "bar" may be stored on nodes 3 and 4. You can control the number of nodes to which Hazelcast replicates data via the Hazelcast and per-map configuration. However, during a network or other failure, there is some risk that you'll see old data or even lose data in Hazelcast.
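For illustration, here is a minimal sketch of how the backup count might be set programmatically, assuming the Hazelcast Java member API (the map name "foo" and the counts are placeholders, and package names differ slightly between Hazelcast versions):

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class BackupConfigSketch {
        public static void main(String[] args) {
            Config config = new Config();
            // Keep one synchronous and one asynchronous backup copy of each
            // partition of the "foo" map on other cluster members.
            config.getMapConfig("foo")
                  .setBackupCount(1)
                  .setAsyncBackupCount(1);

            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
            hz.getMap("foo").put("key", "value");
        }
    }

More backups reduce the risk of losing data when a member dies, at the cost of extra memory and write latency.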
Alternatively, Raft and etcd form a single-leader, highly consistent system that stores data on all nodes. This means it's not ideal for storing large amounts of state. But even during a network failure, etcd can guarantee that your data will remain consistent; in other words, you'll never see old/stale data. This comes at a cost: CP systems require that a majority of the cluster be alive to operate normally.
The consistency issue may or may not be relevant for basic key-value storage, but it can be extremely relevant to locks. If you're expecting your locks to be consistent across the entire cluster - meaning only one node can hold a lock even during a network or other failure - do not use Hazelcast. Because Hazelcast sacrifices consistency in favor of availability (again, see the CAP theorem), it's entirely possible that a network failure can lead two nodes to believe a lock is free to be acquired.
Alternatively, Raft guarantees that during a network failure only one node will remain the leader of the etcd cluster, and therefore all decisions are made through that one node. This means that etcd can guarantee it has a consistent view of the cluster state at all times and can ensure that something like a lock can only be obtained by a single process.
Really, you need to consider what you are looking for in your database and go seek it out. The use cases for CP and AP data stores are vastly different. If you want consistency for storing small amounts of state, consistent locks, leader elections, and other coordination tools, use a CP system like ZooKeeper or Consul. If you want high availability and performance at the potential cost of consistency, use Hazelcast or Cassandra or Riak.
Source: I am the author of a Raft implementation
Although this question is now over 3 years old, I'd like to inform subsequent readers that Hazelcast, as of 3.12, has a CP subsystem (based on Raft) for its atomics and concurrency APIs. There are plans to roll out CP to more Hazelcast data structures in the near future, giving Hazelcast users a true choice between AP and CP concerns and allowing them to apply Hazelcast to new use cases previously handled by systems such as etcd and Zookeeper.
You can read more here...
https://hazelcast.com/blog/hazelcast-imdg-3-12-beta-is-released/
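As a rough illustration of the CP-based concurrency API, here is a sketch assuming Hazelcast 3.12+ with the CP subsystem enabled (the lock name is arbitrary; with fewer than three CP members the subsystem runs in an "unsafe" mode without the Raft guarantees):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.cp.lock.FencedLock;

    public class CpLockSketch {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            // FencedLock is backed by the Raft-based CP subsystem, so at most
            // one caller can hold it, even during a network partition.
            FencedLock lock = hz.getCPSubsystem().getLock("my-lock");
            lock.lock();
            try {
                // critical section
            } finally {
                lock.unlock();
            }
        }
    }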
Is it considered bad practice to use a separate, local cache for each node in a distributed microservice application? I've heard that in a monolithic application it's OK to use local EHCache as the 2nd-level cache provider for Hibernate, but in a distributed environment it's common practice to use distributed caches such as Memcached, Redis, or Hazelcast. What are the consequences of using a separate cache for each node?
"There are only two hard problems in Computer Science:cache invalidation and naming things."-- Phil Karlton
The main problem with a local cache in the app server is that it makes cache invalidation much harder than it was before.
Each time a resource changes, it has to be invalidated (and updated) in all the local caches. This requires a system that knows about all the cache servers running at any point in time, and that is informed about every update so it can coordinate the invalidation on all servers. It also has to take care of retries, handling failed servers, etc.
If your application server has its own local cache, you will have to solve these problems yourself, either with a separate system or in the application code. A distributed caching system solves those problems for you: you can make an update call and, on success, have a guarantee of data consistency (or eventual consistency).
It is a separation of concerns. With a separate cache cluster, the caching logic and the associated problems are handled in one place. The same cluster can easily be reused for multiple applications, rather than redoing the same work for each application you develop.
Another minor disadvantage is that you have to warm up the cache each time you spawn a new server if you don't want performance degradation, which leads to longer server start-up times.
One more thing we can do here is use a message broker for cache invalidation.
Use Kafka or any other queue to deliver invalidation messages and evict the affected entries on every node.
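A rough sketch of that idea, assuming the standard Kafka Java client (the topic name, cache representation, and consumer properties are placeholders; each node would use its own consumer group id so that every node sees every invalidation event):

    import java.time.Duration;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class CacheInvalidationSketch {
        private final Map<String, Object> localCache = new ConcurrentHashMap<>();

        // Called by whichever node updates the underlying resource.
        void publishInvalidation(KafkaProducer<String, String> producer, String cacheKey) {
            producer.send(new ProducerRecord<>("cache-invalidation", cacheKey, "invalidate"));
        }

        // Runs on every node: evict whatever keys other nodes have invalidated.
        void consumeInvalidations(Properties consumerProps) {
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
                consumer.subscribe(List.of("cache-invalidation"));
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                        localCache.remove(record.key());
                    }
                }
            }
        }
    }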
Let's say I have an array of memcache servers; the memcache client will make sure a cache entry lives on only a single memcache server, and all clients will always ask that server for the entry... right?
Now consider two scenarios:
[1] The web servers are getting lots of different requests (different URLs); the cache entries are then distributed among the memcache servers and requests fan out across the memcache cluster.
In this case the memcache strategy of keeping a single cache entry on a single server works.
[2] The web servers are getting lots of requests for the same resource; all requests from the web servers then land on a single memcache server, which is not desired.
What I am looking for is a distributed cache in which:
[1] Each web server can specify which cache node to use to cache stuff.
[2] If any web server invalidates a cache entry, the cache server should invalidate it on all caching nodes.
Can memcache fulfill this use case?
PS: I don't have a ton of resources to cache, but I have a small number of resources with a lot of traffic asking for a single resource at once.
Memcache is a great distributed cache. To understand where the value is stored, it's a good idea to think of the memcache cluster as a hashmap, with each memcached process being precisely one pigeon hole in the hashmap (of course each memcached is also an 'inner' hashmap, but that's not important for this point). For example, the memcache client determines the memcache node using this pseudocode:
index = hash(key) mod len(servers)
value = servers[index].get(key)
This is how the client can always find the correct server. It also highlights how important the hash function is, and how keys are generated - a bad hash function might not uniformly distribute keys over the different servers. The default hash function should work well in almost any practical situation, though.
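In Java, that lookup might look roughly like the following sketch (real clients such as spymemcached normally use consistent hashing rather than a plain modulo, so that adding or removing a server does not remap every key):

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import java.util.zip.CRC32;

    public class NaiveKeyRouter {
        private final List<String> servers; // e.g. "cache1:11211", "cache2:11211"

        NaiveKeyRouter(List<String> servers) {
            this.servers = servers;
        }

        // Every client with the same server list picks the same server for a key.
        String serverFor(String key) {
            CRC32 crc = new CRC32();
            crc.update(key.getBytes(StandardCharsets.UTF_8));
            int index = (int) (crc.getValue() % servers.size());
            return servers.get(index);
        }
    }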
Now, in issue [2], you bring up the condition where requests for resources are non-random, specifically favouring one or a few servers. If this is the case, it is true that the respective nodes will probably get a lot more requests, but this is relative. In my experience, memcache can handle a vastly higher number of requests per second than your web server; it easily handles hundreds of thousands of requests per second on old hardware. So, unless you have 10-100x more web servers than memcache servers, you are unlikely to have issues. Even then, you could probably resolve the issue by upgrading the individual nodes to have more, or more powerful, CPUs.
But let us assume the worst case - you can still achieve this with memcache as follows (see the sketch after this list):
Install each memcache as a single server (i.e. not as a distributed cache)
In your web server, you are now responsible for managing the connections to each of these servers
You are also responsible for determining which memcached process to pass each key/value to, achieving goal 1
If a web server detects a cache invalidation, it should loop over the servers invalidating the cache on each, thereby achieving goal 2
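A sketch of those four steps, assuming the spymemcached client (server addresses, the expiry, and the key-to-node mapping are placeholders):

    import java.net.InetSocketAddress;
    import java.util.ArrayList;
    import java.util.List;

    import net.spy.memcached.MemcachedClient;

    public class PinnedCacheSketch {
        private final List<MemcachedClient> clients = new ArrayList<>();

        // Steps 1 and 2: one independent client per memcached process.
        PinnedCacheSketch(List<InetSocketAddress> servers) throws Exception {
            for (InetSocketAddress server : servers) {
                clients.add(new MemcachedClient(server));
            }
        }

        // Step 3: the application decides which node stores a given key.
        void put(int nodeIndex, String key, Object value) {
            clients.get(nodeIndex).set(key, 3600, value);
        }

        // Step 4: invalidate the key on every node.
        void invalidate(String key) {
            for (MemcachedClient client : clients) {
                client.delete(key);
            }
        }
    }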
I personally have reservations about this - you are, by specification, disabling the distributed aspect of your cache, and the distribution is a key feature and benefit of the service. Also, your application code would start to need to know about the individual cache servers to be able to treat each differently which is undesirable architecturally and introduces a large number of new configuration points.
The idea of any distributed cache is to remove ownership of the location(*) from the client. Because of this, distributed caches and DBs do not allow the client to specify the server where the data is written.
In summary, unless your system is expecting hundreds of thousands of requests per second or more, it's doubtful that you will hit this specific problem in practice. If you do, scale the hardware. If that doesn't work, then you're going to be writing your own distribution logic, duplication, flushing, and management layer over memcache, and I'd only do that if really, really necessary. There's an old saying in software development:
There are only two hard things in Computer Science: cache invalidation
and naming things.
--Phil Karlton
(*) Some distributed caches duplicate entries to improve performance and (additionally) resilience if a server fails, so data may be on multiple servers at the same time
I’m learning Zookeeper and so far I don't understand what it provides for distributed systems that databases can't already solve.
The use cases I’ve read are implementing a lock, barrier, etc for distributed systems by having Zookeeper clients read/write to Zookeeper servers. Can’t the same be achieved by read/write to databases?
For example, my book describes implementing a lock with Zookeeper by having the clients that want to acquire the lock create an ephemeral znode with the sequential flag set under the lock znode. The lock is then owned by the client whose child znode has the lowest sequence number.
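A rough sketch of that recipe with the plain ZooKeeper client (paths and timeouts are placeholders; a production lock would also watch the next-lowest znode and handle retries, which is what libraries like Apache Curator do for you):

    import java.util.Collections;
    import java.util.List;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkLockSketch {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });

            // Each contender creates an ephemeral sequential child under the
            // (pre-existing) lock znode; it disappears if the client dies.
            String myNode = zk.create("/locks/my-lock/node-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

            // The client whose child has the lowest sequence number holds the lock.
            List<String> children = zk.getChildren("/locks/my-lock", false);
            Collections.sort(children);
            boolean iHoldTheLock = myNode.endsWith(children.get(0));
            System.out.println("lock held: " + iHoldTheLock);
        }
    }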
All other Zookeeper examples in the book are again just using it to store/retrieve values.
It seems the only thing that differentiates Zookeeper from a database or any other storage is the "watcher" concept. But that could be built with something else.
I know my simplified view of Zookeeper is a misunderstanding. So can someone tell me what Zookeeper truly provides that a database/custom watcher can’t?
Can’t the same be achieved by read/write to databases?
In theory, yes, it is possible, but it is usually not a good idea to use databases for demanding distributed-coordination use cases. I have seen microservices using relational databases to manage distributed locks, with very bad consequences (e.g. thousands of deadlocks in the databases), which in turn resulted in poor DBA-developer relations :-)
Zookeeper has some key characteristics which make it a good candidate for managing application metadata:
Possibility to scale horizontally by adding new nodes to the ensemble
Data is guaranteed to be eventually consistent within a certain time bound; it is possible to have strict consistency at a higher cost if clients desire it (Zookeeper is a CP system in CAP terms)
Ordering guarantee - all clients are guaranteed to be able to read data in the order in which it was written
All of the above could be achieved with databases, but only with significant effort from application clients. Watches and ephemeral nodes could also be emulated in databases using techniques such as triggers and timeouts, but these are often considered inefficient or antipatterns.
Relational databases offer strong transactional guarantees which usually come at a cost but are often not required for managing application metadata. So it makes sense to look for a more specialized solution such as Zookeeper or Chubby.
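For comparison, registering a watch in ZooKeeper is essentially one call (a sketch; the znode path is a placeholder, and watches are one-shot, so a real client re-registers inside the callback):

    import org.apache.zookeeper.ZooKeeper;

    public class ZkWatchSketch {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });

            // The watcher fires once when /config/feature-flags changes or is deleted.
            byte[] data = zk.getData("/config/feature-flags",
                    event -> System.out.println("node changed: " + event.getPath()),
                    null);
            System.out.println(new String(data));
        }
    }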
Also, Zookeeper stores all its data in memory (which limits its usecases), resulting in highly performant reads. This is usually not the case with most databases.
I think you're asking yourself the wrong question when you try to figure out the purpose of Zookeeper. Instead of asking what Zookeeper can do that "databases" cannot (by the way, Zookeeper is also a database), ask what Zookeeper is better at than other available databases. If you start asking yourself that question, you will hopefully understand why people decide to use Zookeeper in their distributed services.
Take ephemeral nodes, for example: the huge benefit of using them is not that they make a much better lock than some other approach. The benefit is that they are automatically removed if the client loses its connection to Zookeeper.
And then we can have a look at the CAP theorem, where Zookeeper most closely resembles a CP system. And you must once again decide if this is what you want out of your database.
tldr: Zookeeper is better in some aspects and worse in others compared to other databases.
Late to the party. Just to provide another thought:
Yes, it's quite common to use a SQL database for server coordination in production. However, you will likely be asked to build an HA (high availability) system, right? So your SQL DB will have to be HA too. That means you will need a leader-follower architecture (a follower SQL DB); the follower will need to be promoted to leader if the leader dies (e.g. MHA nodes + manager), and when the previous leader comes back to life it must know that it's no longer the leader. These questions have answers, but they cost engineering effort to set up. So Zookeeper was invented.
I sometimes consider Zookeeper as a simplified version of an HA SQL cluster with a subset of functionalities.
Similarly, consider why people choose NoSQL over SQL. With proper partitioning, SQL can also scale well, right? So why NoSQL? One motivation is to reduce the effort needed to handle node failures. When a NoSQL node dies, the system can automatically fail over to another node and even trigger data migration. But if one of your SQL partition leaders dies, it usually requires manual intervention. This is like SQL vs. Zookeeper: someone has coded up the HA + failover logic for you, so you can, hopefully, relax in the face of inevitable node failures.
ZooKeeper writes are linearizable.
Linearizable means that all operations are totally ordered.
Total order means that for every pair of operations a and b, either a happened before b, or b happened before a.
Linearizability is the strongest consistency level.
Most databases give up linearizability because it hurts performance, and offer weaker consistency guarantees instead - e.g. causality (causal order).
ZooKeeper implements this with an atomic broadcast protocol (Zab), which is equivalent to consensus.
I am looking for a Java solution besides BigMemory and Hazelcast. Since we are using Hadoop/Spark, we should have access to Zookeeper.
So I just want to know if there is a solution satisfying our needs, or whether we need to build something ourselves.
What I need are reliable objects that are in-memory, replicated, and synchronized. For manipulation I would like lock support and atomic actions spanning an object.
I also need support for object references and for List/Set/Map types.
The rest we can build ourselves.
The idea is simply to have a self-organizing network that configures itself based on the environment, and that is best done with synchronized, replicated objects that one can listen to.
Hazelcast has a split-brain detector in place; when a split-brain happens, Hazelcast will continue to accept updates, and when the cluster is merged back it gives you the ability to merge the updates according to the policy you prefer.
We are implementing a cluster quorum feature, which will hopefully be available in the next minor version (3.5). With cluster quorum you can define a minimum threshold or a custom function of your own to decide whether the cluster should continue to operate in a partitioned network.
For example, if you define a quorum size of 3 and there are fewer than 3 members in the cluster, the cluster will stop operating.
Currently Hazelcast behaves like an AP solution, but when cluster quorum is available you will be able to tune Hazelcast to behave more like a CP solution.
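A sketch of what such a quorum configuration might look like in the 3.x Java API (names are placeholders; later Hazelcast versions renamed this feature to split-brain protection):

    import com.hazelcast.config.Config;
    import com.hazelcast.config.QuorumConfig;
    import com.hazelcast.core.Hazelcast;

    public class QuorumSketch {
        public static void main(String[] args) {
            QuorumConfig quorum = new QuorumConfig();
            quorum.setName("at-least-three");
            quorum.setEnabled(true);
            // Refuse operations on guarded structures when fewer than 3 members are present.
            quorum.setSize(3);

            Config config = new Config();
            config.addQuorumConfig(quorum);
            config.getMapConfig("analytics").setQuorumName("at-least-three");

            Hazelcast.newHazelcastInstance(config);
        }
    }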
What are the basic principles of how two separate computers connected within the same network, running the same Java application, maintain the same state by syncing their heaps with each other?
I believe Terracotta does this, but I have no idea what pseudocode describing its core functions would look like.
I'm just looking for understanding of this technology.
Terracotta DSO works by manipulating the byte code of your classes (and the JDK's classes etc). The instructions on how and when to do this are part of the Terracotta configuration file.
The bytecode modification looks for certain bytecodes, such as a field read or write, or a monitor enter or exit. Whenever those instructions occur, code is added around that location to perform the appropriate action in the distributed store. For example, when a monitor is obtained due to synchronization, a distributed lock is obtained as well (whether it is a read or write lock depends on the configuration). If a field in a shared object is written, the distributed system verifies that a write lock is held and then sends the data value to the clustered server, which stores it on disk or shares it over the network as appropriate.
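Conceptually, the effect of the instrumentation on a synchronized field write is roughly the following (ClusterLockManager, ClusterStore, and SharedCounter are invented stand-ins for illustration, not actual Terracotta APIs):

    // Hypothetical stand-ins for Terracotta's internal services.
    interface ClusterLockManager {
        void acquireWriteLock(Object objectId);
        void releaseWriteLock(Object objectId);
    }

    interface ClusterStore {
        void sendFieldDelta(Object objectId, String fieldName, Object newValue);
    }

    class SharedCounter {
        int value;
    }

    public class InstrumentationSketch {
        // Roughly what the instrumented version of
        //     synchronized (counter) { counter.value = 42; }
        // ends up doing, in addition to the local write.
        static void instrumentedWrite(SharedCounter counter, Object counterId,
                                      ClusterLockManager locks, ClusterStore store) {
            locks.acquireWriteLock(counterId);
            try {
                counter.value = 42;                           // the original field write
                store.sendFieldDelta(counterId, "value", 42); // only the delta goes over the wire
            } finally {
                locks.releaseWriteLock(counterId);
            }
        }
    }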
Note that Terracotta does not share the entire heap, only the graph of objects indicated by the configuration. In general, there would be little point in sharing an entire heap. It is better instead for the application to describe the domain objects needed across the distributed application.
There are many optimizations employed to make the operations above efficient: only field deltas are sent over the wire and in a form much more efficient than Java serialization, many deltas can be bundled and sent in batches, locks are actually "checked out" to a particular client so that if the application data is partitioned across clients, most distributed locks are actually a local operation not involving a network call, etc.
Terracotta can indeed handle that if you tell it to - see the description of its DSO - Distributed Shared Objects.
It sounds cool, but I would prefer something like EHCache (which can in turn be backed by Terracotta) since it operates at a somewhat higher level.
One emerging technology that somewhat tackles this problem is Distributed Software Transactional Memory. You get strong data consistency guarantees (i.e. 1-copy serializability) and a powerful concurrency control mechanism: transactions.
AFAIK, there is no mature solution out there, but it is promising.
I would recommend that you investigate http://www.jboss.org/infinispan and see if it will fulfill your needs.