I am evaluating Apache Ignite to check whether it fits our company's needs. So far so good. Now I am trying to understand how the near cache feature works in terms of consistency.
We currently have several micro-services, each with one Ignite node configured in client mode. All these instances are connected to several Ignite servers in a cluster. For some use cases (reads >>> writes) it seems reasonable to use a near cache in front of the cache servers. I have checked, and it seems to automatically invalidate "stale data" in all instances in case of a write, which is good.
My question: is there any documentation besides this one that explains how it works? In particular, I would like to understand whether any subsequent read request (after a write) to any other instance will get the updated data (i.e. no eventual consistency).
Thanks!
With FULL_SYNC write synchronization mode, all copies are always consistent; there is no eventual consistency. The near cache functions as a sort of additional backup copy.
I don't think there is a design document on how it works though.
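For reference, here is a minimal sketch of how such a setup could look, assuming a cache named "myCache" (the cache name and key/value types are illustrative): a client node fronts a FULL_SYNC partitioned cache with a near cache, so writes are propagated synchronously and stale near-cache entries are invalidated.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.CacheWriteSynchronizationMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.NearCacheConfiguration;

public class NearCacheExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration().setClientMode(true);
        try (Ignite client = Ignition.start(cfg)) {
            CacheConfiguration<Integer, String> cacheCfg =
                new CacheConfiguration<Integer, String>("myCache")
                    .setCacheMode(CacheMode.PARTITIONED)
                    // All copies are updated before the write call returns.
                    .setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC);

            // The near cache keeps a local copy on the client; Ignite invalidates it on writes.
            IgniteCache<Integer, String> cache =
                client.getOrCreateCache(cacheCfg, new NearCacheConfiguration<Integer, String>());

            cache.put(1, "value");
            System.out.println(cache.get(1)); // served from the near cache after the first read
        }
    }
}
```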
Related
I have a usual Spring Boot application which executes tons of DB calls, and for those I want to implement some Spring caching with the normal @Cacheable / @CacheEvict and other annotations (by default I use CaffeineCache). There are several AKS nodes, and one instance of my application runs on each of them. What I want to achieve:
A local (in-memory) Spring cache. A distributed solution such as a Redis-based one is not suitable.
The cache should be invalidated on all running instances of the app after an update on one of them.
I have a global Kafka service, which registers every write/update request to my Cassandra DB
Now my question: is it possible to have a local, plain Spring cache with such invalidation through Kafka, resulting of course in a synchronized cache state on all instances?
I would say it is possible in principle. You could build a naive solution, where
Read operations use @Cacheable.
Write operations put a message on the Kafka bus, and each node has a listener that uses @CachePut to write it into the local cache (a minimal sketch follows below).
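A rough sketch of such a listener, assuming a topic named "cache-invalidation" and a cache named "entities" (both names and the plain String key/value format are illustrative only):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.cache.Cache;
import org.springframework.cache.CacheManager;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class CacheUpdateListener {

    private final CacheManager cacheManager;

    public CacheUpdateListener(CacheManager cacheManager) {
        this.cacheManager = cacheManager;
    }

    // Every instance consumes the topic (each in its own consumer group) and
    // applies the change to its local Caffeine-backed cache.
    @KafkaListener(topics = "cache-invalidation")
    public void onUpdate(ConsumerRecord<String, String> record) {
        Cache cache = cacheManager.getCache("entities");
        if (cache != null) {
            cache.put(record.key(), record.value()); // or cache.evict(record.key()) to force a re-read
        }
    }
}
```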
But such a naive solution will not have any strict synchronisation guarantees; it is only eventually consistent. It takes time to propagate updates to the other nodes, and in the meantime other nodes could still read the old value. You would also have to think about error conditions where an update could get lost.
If you want stricter guarantees, you need a multi-phase commit or a consensus protocol. Unless it is a research project, I would strongly discourage you from writing one yourself. These are not trivial because the problem is not trivial. Instead, you should use existing implementations.
So in summary: if you don't need consistency, you could do it as you suggest. If you need any level of consistency guarantee, you should use an existing distributed cache, which can still be integrated with @Cacheable.
What I want is two different cache implementations (let's say Redis and EhCache) on one method, meaning a @Cacheable method should cache to both Redis and EhCache.
Is it even possible?
Option 1:
Stack the caches. Configure the local cache in Spring, then wire in the distributed cache via a CacheLoader/CacheWriter (see the sketch below). Consistency needs to be carefully evaluated. E.g. if an update goes to the distributed cache, how do you invalidate the local caches? That is not so trivial. Maybe it is easy or even unnecessary for your data; maybe it is next to impossible.
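A rough sketch of that stacking, assuming Caffeine as the local layer and a plain Jedis client for Redis (class, host, and key names are illustrative): the local cache loads misses from Redis via a CacheLoader, and writes go through to both layers.

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import redis.clients.jedis.Jedis;

public class StackedCache {

    private final Jedis redis = new Jedis("localhost", 6379);

    // Local layer: loads misses from the distributed layer (read-through).
    private final LoadingCache<String, String> local = Caffeine.newBuilder()
            .maximumSize(10_000)
            .build(key -> redis.get(key)); // CacheLoader delegating to Redis

    public String get(String key) {
        return local.get(key);
    }

    public void put(String key, String value) {
        redis.set(key, value);   // write-through to the distributed cache
        local.put(key, value);   // update the local layer on this node only
        // Note: other nodes' local caches are NOT invalidated here; that is
        // exactly the consistency problem mentioned above.
    }
}
```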
Option 2:
Go with a distributed cache that provides a so-called near cache. That is effectively the combination you want to build yourself, but packaged in one product. I know that Hazelcast and Infinispan offer a near cache. However, your mileage may vary regarding consistency and resilience; I know that Hazelcast just recently enhanced its near cache so that it is consistent.
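For illustration, a minimal near-cache configuration on a Hazelcast client might look like this (assuming Hazelcast 4.x or later and a map named "entities"; check the API of your version):

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.config.InMemoryFormat;
import com.hazelcast.config.NearCacheConfig;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class NearCacheClient {
    public static void main(String[] args) {
        NearCacheConfig nearCacheConfig = new NearCacheConfig("entities")
                .setInMemoryFormat(InMemoryFormat.OBJECT)
                .setInvalidateOnChange(true); // invalidate local copies on remote updates

        ClientConfig clientConfig = new ClientConfig();
        clientConfig.addNearCacheConfig(nearCacheConfig);

        HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);
        IMap<String, String> map = client.getMap("entities");

        map.put("k", "v");
        System.out.println(map.get("k")); // subsequent reads are served from the near cache
    }
}
```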
Interesting use case and actually a common problem. Further thoughts and discussion highly appreciated.
I'm trying to find the best indexing solution for implementing a search engine in my clustered webapp, and I cannot find a clear answer to my questions in the official documentation.
My Java/Java EE backend will be deployed across several load-balanced instances. The search engine will require near-real-time availability of indexed data (i.e. less than 5 seconds between indexing and retrievability).
Hibernate Search can work in a clustered environment with JGroups, but the documentation also says, about near-real-time, that as a tradeoff it requires a non-clustered and non-shared index.
Does that mean that NRTIndexManager cannot be used in a JGroups master/slave setup, i.e. that it can only be used with a single node?
Does that mean that with such a setup, the availability of indexed data depends only on the refresh period (the period of index copy to slave nodes)?
With the standard IndexManager, you only see the latest changes when they are written to the disk and you reopen your IndexSearcher.
By default, Hibernate Search writes to disk and opens a new IndexSearcher for each query so you're sure your searches are always in sync with your database.
The NRTIndexManager is different from the standard one because it allows you to search on the latest changes indexed without an explicit write on disk. It's typically used when you need a high throughput and you can't write everything on the disk right away. So it's not really correlated to the fact that you will see your changes right away or not: it's an optimization when you can allow some index data loss - the latest changes might be lost.
As mentioned in the documentation here http://docs.jboss.org/hibernate/search/5.5/reference/en-US/html_single/#jgroups-backend , you can have a synchronous JGroups backend, with Hibernate Search blocking until all the indexes are in sync (a configuration sketch follows below). So it can work for your case.
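As a rough sketch (property names taken from the Hibernate Search 5.x reference; double-check them against your exact version), the relevant settings could be assembled like this and passed to the persistence unit:

```java
import java.util.HashMap;
import java.util.Map;

public class SearchClusterConfig {

    // Properties for a synchronous JGroups backend in Hibernate Search 5.x.
    public static Map<String, String> jgroupsSyncProperties() {
        Map<String, String> props = new HashMap<>();
        // JGroups backend; nodes elect a master (alternatives: jgroupsMaster / jgroupsSlave)
        props.put("hibernate.search.default.worker.backend", "jgroups");
        // Synchronous execution: the write blocks until the indexing work has been processed
        props.put("hibernate.search.default.worker.execution", "sync");
        return props;
    }
}
```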
Note that for 5.6 we are currently working on an Elasticsearch backend, which might be of interest to you as it's typically designed for your case. It's still in beta but it's already in pretty good shape. You might want to take a look at it: http://docs.jboss.org/hibernate/search/5.6/reference/en-US/html/ch11.html .
[Background]
- There are two java applications (A and B), and they can only communicate via Oracle DB
- A and B share the same database table
- A and B store the data in a cache
[Problem]
If A performs a simple transaction (insert/update/delete), the cache in A is updated. The cache in B should then be updated automatically as well!
[Current Status]
Two solutions I found and tried
- Solution1) Using DatabaseChangeListener
- Solution2) Using Socket Programming
[Question]
The solution will be used in my company, and I would like to know if there is anything I can improve in my solutions.
1) What could be the disadvantages if I use DatabaseChangeListener?
2) What could be the disadvantages if I use socket programming? (Maybe it's too low-level for developers to maintain under company policy?)
3) I heard there are 3rd-party caches that also support synchronization. Am I correct?
Please let me know if you need more information!
Thank you very much in advance!
[EDIT]
It would be much appreciated if you could leave a comment when you down-vote this. I would like to know how I can improve this question with your feedback! Thank you.
Your question appears every now and then with slightly different aspects. One useful answer to that is here: Guava Cache, how to block access while doing removal
About using the DatabaseChangeListener:
Although you are fine with Oracle, I would discourage the use of vendor-specific interfaces. For me, it would be okay to use one as a performance optimization, but I would never use vendor-specific interfaces for basic functionality.
Second, the usage of the change listener may still lead to dirty reads.
About "distributed caches" as veritas suggested:
There is a difference between distributed caches and clustered caches. Distributed caches spread (i.e. distribute) the cached data across different nodes; clustered caches are caches for clustered applications that keep track of data consistency within the cluster. A distributed cache is usually a clustered cache, but not the other way around. For a general idea on the topic I recommend the Infinispan documentation on clustering as an intro: http://infinispan.org/docs/7.0.x/user_guide/user_guide.html#_clustering
Wrap up:
A clustered cache implementation is the thing you need. However, if you want data consistency, you still need to carefully design your transaction handling.
You can, of course, also do the socket communication yourself and send simple object-invalidation messages to the other applications. The challenging part is the error handling. When was the invalidation successful? Is there a timeout for the other nodes to acknowledge? When do you drop a node, and do you maintain cluster state at all?
I would suggest a 3rd-party cache if you have many similar use cases or many tables that need to be updated.
Please read about the Terracotta distributed cache.
It gives exactly what you want.
You can also look at Hazelcast or Memcached.
I have been reading about Neo4j for the last few days. I am quite confused about whether I need to use the REST API or whether I can go with the Java API.
My need is to create millions of nodes which will have some connections among them. I want to add indexes on a few of the node attributes for searching. Initially I started with the embedded GraphDB using the Java API, but I soon hit OutOfMemory while indexing a few nodes, so I thought it would be better if Neo4j ran as a service and I connected to it through the REST API; it would then do all the memory management by itself, swapping data in and out of the underlying files. Is my assumption right?
Further, I plan to scale my solution to billions of nodes, which I believe won't be possible with a single machine's Neo4j installation. I also believe Neo4j has the capability of running in distributed mode. For this reason too, I thought continuing with the REST API implementation was the best idea.
However, I couldn't find any good documentation about how to run Neo4j in a distributed environment.
Can I do things like batch insertion etc. using the REST API as well, as I do with the Java API when the graph DB is running in embedded mode?
Do you know why you are getting your OutOfMemory exception? It sounds like you are creating all these nodes in the same transaction, which causes them to live in memory. Try committing small chunks at a time so that Neo4j can write them to disk (see the sketch below). You don't have to manage Neo4j's memory aside from things like the cache.
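A minimal sketch of such chunked commits, assuming the Neo4j 3.x embedded API (the label, property, and batch size are illustrative):

```java
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

public class BatchedInsert {
    private static final int BATCH_SIZE = 10_000;

    public static void insert(GraphDatabaseService db, int total) {
        for (int start = 0; start < total; start += BATCH_SIZE) {
            try (Transaction tx = db.beginTx()) {
                for (int i = start; i < Math.min(start + BATCH_SIZE, total); i++) {
                    Node node = db.createNode(Label.label("Person"));
                    node.setProperty("id", i);
                }
                tx.success(); // commit this chunk; memory is released before the next batch
            }
        }
    }
}
```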
Distributed mode is a master/slave architecture, so you'll still have a copy of the entire DB on each system. Neo4j is very efficient with disk storage: a node takes 9 bytes, a relationship takes 33 bytes, and properties are variable.
There is a batch REST API, which groups many calls into the same HTTP call; however, making REST calls is still slower than if this were embedded.
There are some disadvantages to using the REST API that you did not mention, such as transactions. If you are going to do atomic operations, where you need to create several nodes and relationships, change properties, and not commit any of it if any step fails, you cannot do this with the REST API.