Sorry, wall of text; there's a summary at the bottom.
I am prototyping a Java application that will run on multiple servers. Every instance has an embedded Infinispan cache and the caches are configured to form a cluster in replication mode. The cache entries are loaded from an external system only - there is no need for actively adding entries using cache.put(key, value).
For that purpose, I implemented a custom CacheLoader. Loading values on demand is working, but these entries are not replicated to the other active cluster nodes. For testing purposes, I tried adding entries with 'put' - these are replicated immediately.
The user guide pointed me to properties that affect cluster behavior when nodes are joining/leaving the cluster, or during writes, like fetchPersistentState, shared and fetchInMemoryState. The latter is useful in my case, since new nodes joining the cluster should receive the current state. And this initial synchronization during startup even fetches the entries loaded by the cache loader.
fetchPersistentState caused errors because my cache loader does not implement AdvancedCacheLoader - but since the advanced methods do not seem to be called after an invocation of 'load', I do not think that correctly implementing that interface would solve my problem.
I have also read about the ClusterLoader implementation that "consults other members in the cluster for values", but a roundtrip to the other nodes would increase response times while processing requests.
The rationale behind trying to load the value exactly once is that calls to the shared external system are considered to be rather expensive, so that the increased cluster-overhead created by replication messages should still be less problematic than loading the values on every node.
In order to have some kind of isolated test and code samples for this I forked the infinispan-quickstart/clustered-cache example on Github and adapted it to my needs:
https://github.com/flpa/infinispan-quickstart/tree/master/clustered-cache
The cache is backed by a CacheLoader now and nodes are periodically fetching/putting values to demonstrate how 'put' values are replicated but values fetched from the loader are not.
To sum this up:
Is it possible to configure an Infinispan cluster in replication mode that is populated entirely by cache loader lookups and replicates the results of those lookups on all nodes?
EDIT: I accidentally deleted the Github fork, so I recreated it from scratch including only the relevant 'clustered-cache' folder and adapted the link above.
I am afraid that out-of-the-box configuration options are not available. However, you can register a listener on each node for the @CacheEntryLoaded event and in this listener execute putAsync(). Be sure to include the flags SKIP_CACHE_LOAD, SKIP_CACHE_STORE and IGNORE_RETURN_VALUES to get the best performance.
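A sketch of such a listener, assuming Infinispan's standard notification API (@Listener, @CacheEntryLoaded and the Flag enum from infinispan-core; the String generics are illustrative only):

```java
import org.infinispan.Cache;
import org.infinispan.context.Flag;
import org.infinispan.notifications.Listener;
import org.infinispan.notifications.cachelistener.annotation.CacheEntryLoaded;
import org.infinispan.notifications.cachelistener.event.CacheEntryLoadedEvent;

@Listener
public class ReplicateOnLoadListener {

    @CacheEntryLoaded
    public void entryLoaded(CacheEntryLoadedEvent<String, String> event) {
        if (event.isPre()) {
            return; // react only after the value has actually been loaded
        }
        Cache<String, String> cache = event.getCache();
        // Re-put the loaded value so it gets replicated to the other nodes.
        // The flags skip the loader/store and the return-value lookup.
        cache.getAdvancedCache()
             .withFlags(Flag.SKIP_CACHE_LOAD, Flag.SKIP_CACHE_STORE, Flag.IGNORE_RETURN_VALUES)
             .putAsync(event.getKey(), event.getValue());
    }
}
```

It would be registered once per node with cache.addListener(new ReplicateOnLoadListener()).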
Related
One downside to distributed caching is that every cache query (hit or miss) is a network request which will obviously never be as fast as a local in memory cache. Often this is a worthy tradeoff to avoid cache duplication, data inconsistencies, and cache size constraints. In my particular case, it's only data inconsistency that I'm concerned about. The size of the cached data is fairly small and the number of application servers is small enough that the additional load on the database to populate the duplicated caches wouldn't be a big deal. I'd really like to have the speed (and lower complexity) of a local cache, but my data set does get updated by the same application around 50 times per day. That means that each of these application servers would need to know to invalidate their local caches when these writes occurred.
A simple approach would be to have a database table with a column for a cache key and a timestamp for when the data was last updated. The application could query this table to determine if it needs to expire its local cache. Yes, this is a network request as well, but it would be much faster than transporting the entire cached data set over the network. Before I go and build something custom, is there an existing caching product for Java/Spring that can be configured to work this way? Is there a gotcha I'm not thinking about? Note that this isn't data that has to be transactionally consistent. If the application servers were out of sync by a few seconds, it wouldn't be a problem.
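If you did build that yourself, the core of it is small. A minimal sketch, with the "last updated" DB query and the loader replaced by plain functions (all names here are hypothetical):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Local cache that reloads an entry when an external "last updated"
 *  timestamp (normally a small DB table) is newer than the cached copy. */
public class TimestampCheckedCache<K, V> {

    private static final class Entry<V> {
        final V value;
        final long loadedAt;
        Entry(V value, long loadedAt) { this.value = value; this.loadedAt = loadedAt; }
    }

    private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, Long> lastUpdatedLookup; // stands in for the DB query
    private final Function<K, V> loader;               // loads the real data

    public TimestampCheckedCache(Function<K, Long> lastUpdatedLookup, Function<K, V> loader) {
        this.lastUpdatedLookup = lastUpdatedLookup;
        this.loader = loader;
    }

    public V get(K key) {
        Entry<V> cached = cache.get(key);
        // Cheap check against the timestamp table; reload only when stale.
        long lastUpdated = lastUpdatedLookup.apply(key);
        if (cached == null || cached.loadedAt < lastUpdated) {
            cached = new Entry<>(loader.apply(key), System.currentTimeMillis());
            cache.put(key, cached);
        }
        return cached.value;
    }
}
```

Note that this does one timestamp query per read; in practice you would probably check the timestamp at most every few seconds, which matches the consistency window described above.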
I don't know of any implementation that queries the database in the way you describe. What does exist are solutions where changes in local caches are distributed among the members of a group. JBossCache is an example, where you also have the option to distribute only the invalidation of objects. This might be the closest to what you are after.
https://access.redhat.com/documentation/en-us/jboss_enterprise_application_platform/4.3/html/cache_frequently_asked_questions/tree_cache#a19
JBossCache is not a Spring component as such, but creating and using a cache as a Spring bean should not be a problem.
Let's say I have an array of memcache servers. The memcache client will make sure a cache entry is stored on only a single memcache server, and all clients will always ask that server for the entry... right?
Now consider two scenarios:
[1] The web servers are getting lots of different requests (different URLs), so the cache entries are distributed among the memcache servers and requests fan out across the memcache cluster.
In this case the memcache strategy of keeping a single cache entry on a single server works well.
[2] The web servers are getting lots of requests for the same resource, so all requests from the web servers land on a single memcache server, which is not desired.
What I am looking for is a distributed cache in which:
[1] Each web server can specify which cache node to use for caching.
[2] If any web server invalidates a cache entry, the cache should invalidate it on all caching nodes.
Can memcache fulfill this use case?
PS: I don't have a ton of resources to cache, but I have a small number of resources with lots of traffic asking for a single resource at once.
Memcache is a great distributed cache. To understand where the value is stored, it's a good idea to think of the memcache cluster as a hashmap, with each memcached process being precisely one pigeon hole in the hashmap (of course each memcached is also an 'inner' hashmap, but that's not important for this point). For example, the memcache client determines the memcache node using this pseudocode:
index = hash(key) mod len(servers)
value = servers[index].get(key)
This is how the client can always find the correct server. It also highlights how important the hash function is, and how keys are generated - a bad hash function might not distribute keys uniformly over the different servers. The default hash function should work well in almost any practical situation, though.
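The pseudocode above translates almost directly into Java. A minimal sketch (using String.hashCode() as the hash, which is not what real memcache clients use, but it shows the mechanism):

```java
import java.util.List;

public class ModuloServerSelector {

    /** Mirrors: index = hash(key) mod len(servers).
     *  Math.floorMod avoids negative indices when hashCode() is negative. */
    public static int selectIndex(String key, int serverCount) {
        return Math.floorMod(key.hashCode(), serverCount);
    }

    public static String selectServer(String key, List<String> servers) {
        return servers.get(selectIndex(key, servers.size()));
    }
}
```

As long as the server list is stable, the same key always maps to the same server. Note that adding or removing a server reshuffles most keys under this scheme, which is why production clients typically use consistent hashing instead.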
Now you bring up in issue [2] the condition where the requests for resources are non-random, specifically favouring one or a few servers. If this is the case, it is true that the respective nodes are probably going to get a lot more requests, but this is relative. In my experience, memcache will be able to handle a vastly higher number of requests per second than your web server. It easily handles hundreds of thousands of requests per second on old hardware. So, unless you have 10-100x more web servers than memcache servers, you are unlikely to have issues. Even then, you could probably resolve the issue by upgrading the individual nodes to have more, or more powerful, CPUs.
But let us assume the worst case - you can still achieve this with memcache by:
Install each memcache as a single server (i.e. not as a distributed cache)
In your web server, you are now responsible for managing the connections to each of these servers
You are also responsible for determining which memcached process to pass each key/value to, achieving goal 1
If a web server detects a cache invalidation, it should loop over the servers invalidating the cache on each, thereby achieving goal 2
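A sketch of those steps, with the actual memcached connection hidden behind a hypothetical CacheNode interface (none of these names come from a real client library):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Minimal stand-in for a connection to one standalone memcached process. */
interface CacheNode {
    void set(String key, String value);
    String get(String key);
    void delete(String key);
}

/** Web-server-side wrapper that pins keys to caller-chosen nodes (goal 1)
 *  and broadcasts invalidations to every node (goal 2). */
class PinnedCacheClient {
    private final List<CacheNode> nodes;
    private final Map<String, Integer> pinnedNode = new ConcurrentHashMap<>();

    PinnedCacheClient(List<CacheNode> nodes) { this.nodes = nodes; }

    /** Goal 1: the caller decides which node stores the entry. */
    void put(String key, String value, int nodeIndex) {
        pinnedNode.put(key, nodeIndex);
        nodes.get(nodeIndex).set(key, value);
    }

    String get(String key) {
        Integer idx = pinnedNode.get(key);
        return idx == null ? null : nodes.get(idx).get(key);
    }

    /** Goal 2: invalidation loops over every node. */
    void invalidate(String key) {
        pinnedNode.remove(key);
        for (CacheNode node : nodes) {
            node.delete(key);
        }
    }
}
```

This makes the downside below concrete: the web server now owns the node list, the pinning table, and the invalidation fan-out, all of which the distributed client would normally handle for you.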
I personally have reservations about this - you are, by specification, disabling the distributed aspect of your cache, and the distribution is a key feature and benefit of the service. Also, your application code would start to need to know about the individual cache servers to be able to treat each differently which is undesirable architecturally and introduces a large number of new configuration points.
The idea of any distributed cache is to remove the ownership of the location(*) from the client. Because of this, distributed caches and DB do not allow the client to specify the server where the data is written.
In summary, unless your system is expecting 100,000 or more requests per second, it's doubtful that you will hit this specific problem in practice. If you do, scale the hardware. If that doesn't work, then you're going to be writing your own distribution logic, duplication, flushing and management layer over memcache. And I'd only do that if really, really necessary. There's an old saying in software development:
There are only two hard things in Computer Science: cache invalidation
and naming things.
--Phil Karlton
(*) Some distributed caches duplicate entries to improve performance and (additionally) resilience if a server fails, so data may be on multiple servers at the same time
The application I'm developing uses simple HashMaps as a cache for certain objects that come from the DB. It's far from ideal, but the amount of data in these cached lists is really small (fewer than 100 entries) and does not change often. This solution provides minimal overhead. When an item in one of these cached lists changes, its value is replaced in the HashMap.
We're nearing the production launch date for this application. To provide a reasonably scalable solution, we've come up with a load-balancing setup. The balancer switches between several WildFly nodes, each of which holds the entire application, except for the DB.
The issue now is that when a cached item changes, it's only updated in one of the nodes. The change is not applied to the caches in the other nodes. Possible solutions are:
Disable the caching. Not an option.
Use a cache server like Ehcache Server. In this way there would be one cache for all nodes. The problem however would be too much overhead due to REST calls.
An additional web service in every node. This web service would keep track of all load-balanced nodes. When a cached value changes in a node, the node would signal the other nodes to evict their caches.
An off-the-shelf solution like Ehcache with signalling features. Does this exist?
My question is: Are there products that offer the last solution (free and with open license, commercially usable)? If not, I would implement the third solution. Are there any risks/mistakes I would have to look out for?
Risks/mistakes: Of course one major thing is data consistency. When caching data from a database I'll usually make sure I make use of transactions when updating. Usually I use a pattern like this:
begin transaction
invalidate cache entries in the transaction
update database
commit transaction
If a cache miss happens during the update, the read needs to wait until the transaction is committed.
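To make the ordering concrete, here is a minimal sketch of that pattern. The database is just a map and a read/write lock stands in for the transaction boundary, so a concurrent cache miss blocks until the update is "committed" (a real implementation would use actual DB transactions):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Sketch of invalidate-inside-the-transaction; not production code. */
class TransactionalCache {
    private final Map<String, String> database = new ConcurrentHashMap<>();
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    void update(String key, String value) {
        lock.writeLock().lock();        // begin transaction
        try {
            cache.remove(key);          // invalidate cache entry in the transaction
            database.put(key, value);   // update database
        } finally {
            lock.writeLock().unlock();  // commit transaction
        }
    }

    String get(String key) {
        lock.readLock().lock();         // a miss during an update waits here
        try {
            // On a miss, repopulate the cache from the database.
            return cache.computeIfAbsent(key, database::get);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Because the invalidation happens before the database write and both sit inside the same critical section, a reader can never repopulate the cache with the old value mid-update.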
For your use case the typical choice is a clustered or distributed cache, like Hazelcast, Infinispan or Apache Ignite. However, this seems really too heavyweight for your use case.
An alternative is to implement your own mechanism to publish invalidation events to all nodes. Still, this is no easy task, since you may want to make sure that every node receives the message, but also be fault tolerant if a node goes down at the same time. So you probably want to use a proper library for that, e.g. JGroups or one of the various MQ products.
I implemented it without JGroups or other signaling libraries. Each node has a REST endpoint to evict the cache. When a node starts up, it registers itself in a DB table with its IP, domain and a token. When it shuts down it removes its record.
When an object is updated in a node, the node evicts its cache and starts several threads that send a REST call (with its token and an object type) to all other nodes using Unirest, which in turn check the token and evict their caches. When an error is thrown, the called node is removed from the list.
It should be improved in terms of security and fault tolerance. The removal of nodes is really pessimistic now. Only after several failed attempts the node should be removed. For now, this simple solution does the job!
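The broadcast part of that design can be sketched as follows. The actual Unirest REST call is hidden behind a hypothetical NodeNotifier interface, and a node is dropped on the first failed call, matching the pessimistic removal described above (all names are illustrative):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch of the eviction broadcast; not the actual implementation. */
class EvictionBroadcaster {

    interface NodeNotifier {
        /** Sends the REST eviction call; throws on failure. */
        void evict(String nodeUrl, String token, String objectType) throws Exception;
    }

    private final List<String> nodeUrls = new CopyOnWriteArrayList<>();
    private final NodeNotifier notifier;
    private final String token;
    private final ExecutorService pool = Executors.newCachedThreadPool();

    EvictionBroadcaster(List<String> nodeUrls, String token, NodeNotifier notifier) {
        this.nodeUrls.addAll(nodeUrls);
        this.token = token;
        this.notifier = notifier;
    }

    /** Notify every other node in parallel; drop nodes whose call fails. */
    void broadcastEviction(String objectType) {
        for (String url : nodeUrls) {
            pool.submit(() -> {
                try {
                    notifier.evict(url, token, objectType);
                } catch (Exception e) {
                    nodeUrls.remove(url); // pessimistic: remove on first failure
                }
            });
        }
    }

    List<String> nodes() { return nodeUrls; }

    void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

A retry counter per node, as suggested above, would replace the immediate nodeUrls.remove(url) with a "remove after N consecutive failures" check.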
I am looking for a Java solution besides BigMemory and Hazelcast. Since we are using Hadoop/Spark we should have access to ZooKeeper.
So I just want to know if there is a solution satisfying our needs, or whether we need to build something ourselves.
What I need are reliable objects that are in-memory, replicated and synchronized. For manipulation I would like to have lock support and atomic actions spanning an object.
I also need object references and List/Set/Map support.
The rest we can build ourselves.
The idea is simply to have a self-organizing network that configures itself based on the environment, and that is best done with synchronized objects that are replicated and can be listened to.
Hazelcast has a split-brain detector in place. When a split-brain happens, Hazelcast will continue to accept updates, and when the cluster is merged back it will give you the ability to merge the updates the way you prefer.
We are implementing a cluster quorum feature, which will hopefully be available in the next minor version (3.5). With cluster quorum you can define a minimum threshold, or a custom function of your own, to decide whether the cluster should continue to operate in a partitioned network.
For example, if you define a quorum size of 3 and there are fewer than 3 members in the cluster, the cluster will stop operating.
Currently Hazelcast behaves like an AP solution, but once cluster quorum is available you can tune Hazelcast to behave like a CP solution.
Please note: if the cache systems mentioned in this question work so completely differently from one another that an answer to this question is nearly-impossible, then I would simplify this question down to anything that is just JCache (JSR107) compliant.
The major players in the distributed cache game, for Java at least, are EhCache, Hazelcast and Infinispan.
First of all, my understanding of a distributed cache is that it is a cache that lives inside a running JVM process, but that is constantly synchronizing its in-memory contents across other multiple JVM processes running elsewhere. Hence Process 1 (P1) is running on Machine 1 (M1), P2 is running on M2 and P3 is running on M3. An instance of the same distributed cache is running on all 3 processes, but they somehow all know about each other and are able to keep their caches synchronized with one another.
I believe EhCache accomplishes this inter-process synchrony via JGroups. Not sure what the others are using.
Furthermore, my understanding is that these configurations are limiting because, for each node/instance/process, you have to configure it and tell it about the other nodes/instances/processes in the system, so they can all sync their caches with one another. Something like this:
<cacheConfig>
<peers>
<instance uri="myapp01:12345" />
<instance uri="myapp02:12345" />
<instance uri="myapp03:12345" />
</peers>
</cacheConfig>
So to begin with, if anything I have stated is incorrect or misguided, please begin by correcting me!
Assuming I'm more or less on track, then I'm confused how distributed caches could possibly work in an elastic/cloud environment where nodes are regulated by auto-scalers. One minute, load is peaking and there are 50 VMs serving your app. Hence, you would need 50 "peer instances" defined in your config. Then the next minute, load dwindles to a crawl and you only need 2 or 3 load balanced nodes. Since the number of "peer instances" is always changing, there's no way to configure your system properly in a static config file.
So I ask: How do distributed caches work on the cloud if there are never a static number of processes/instances running?
One way to handle that problem is to have an external (almost static) caching cluster which holds the data; your application (or the frontend servers) then uses clients to connect to that cluster. You can still scale the caching cluster up and down to your needs, but most of the time you'll need fewer nodes in the caching cluster than frontend servers.