Java caching design for 100M+ keys?

I need to cache more than 100 million string keys (~100 characters each) for a standalone Java application.
Standard cache properties required:
- Persistent.
- Fetching a key from the cache should take on the order of tens of milliseconds.
- Allows invalidation and expiry.
- Independent caching server, to allow multi-threaded access.
Preferably I don't want to use an enterprise database, as these 100M keys could grow to 500M, which would consume a lot of memory and system resources and give sluggish throughput.

For a distributed cache you can try Hazelcast.
It can be scaled as needed and has backups and synchronization out of the box. It is also a JSR-107 provider and has many other helpful tools. However, if you want persistence, you will need to handle it yourself or buy their enterprise version.

In the end, I solved this big-data problem with existing cache solutions (Hazelcast, Guava Cache, Ehcache, etc.) by splitting the cache into two levels:
- I grouped ~100K keys into one Java collection and associated them with a common property. In my case the keys carried a timestamp, so the timestamp slot became the key for each second-level block of 100K keys.
- The time-slot key is stored in a persistent Java cache, with the compressed Java collection as its value.
I get good throughput with two-level caching, despite the compression and decompression overhead, because my key searches were range-bound: once a cache match was found, most subsequent searches were served by the in-memory Java collection loaded for the previous search.
To conclude: identify a common attribute in your keys, group by it, and split them into a multilevel cache. Otherwise you would need hefty hardware and an enterprise cache to support this big-data problem.
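A minimal sketch of the two-level idea described in this answer, assuming keys can be bucketed by a timestamp slot. The class name `TimeSlotCache`, its methods, and the `blockLoads` counter are hypothetical; the plain `HashMap` stands in for the persistent cache, and compression is left out. It is not thread-safe and is only meant to show why range-bound lookups mostly hit the block that is already in memory.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical two-level lookup: level 1 maps a time slot to a block of keys,
// level 2 is the most recently loaded block, kept in memory.
class TimeSlotCache {
    private final long slotMillis;
    // Stand-in for the persistent cache (assumption); a real setup would
    // store each block compressed in a persistent store.
    private final Map<Long, Set<String>> blockStore = new HashMap<>();
    private long currentSlot;
    private Set<String> currentBlock; // null means "no block loaded"
    int blockLoads = 0;               // counts level-1 fetches, for illustration

    TimeSlotCache(long slotMillis) {
        this.slotMillis = slotMillis;
    }

    private long slotOf(long timestampMillis) {
        return Math.floorDiv(timestampMillis, slotMillis);
    }

    void put(long timestampMillis, String key) {
        long slot = slotOf(timestampMillis);
        blockStore.computeIfAbsent(slot, s -> new HashSet<>()).add(key);
        if (currentBlock != null && slot == currentSlot) {
            currentBlock = null; // drop the in-memory block so it is reloaded
        }
    }

    boolean contains(long timestampMillis, String key) {
        long slot = slotOf(timestampMillis);
        if (currentBlock == null || slot != currentSlot) {
            // Level-1 fetch: this is where decompression would happen.
            currentBlock = blockStore.getOrDefault(slot, Collections.emptySet());
            currentSlot = slot;
            blockLoads++;
        }
        return currentBlock.contains(key);
    }
}
```

With a 1000 ms slot, two lookups at timestamps 100 and 900 trigger a single block load, and a lookup at 1500 triggers a second one.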

Try Guava Cache. It meets all of your requirements.
Links:
Guava Cache Explained
guava-cache
Persistence: Guava cache
Edit: Another one, which I have not used yet: Ehcache.

Related

Alternative to HashMap as a cache, other than in-memory databases?

I am using a HashMap as a cache to store ids and names, because they are used frequently and live for the lifetime of the application.
For every user of the application, around 5000 or more (depending on the workspace) ids and names get stored in the HashMap. At some point a java.lang.OutOfMemoryError is thrown, since I am storing a lot of (id, name) pairs in the HashMap.
I don't want to clear my HashMap cache values, but I know that to be efficient we have to evict entries using LRU or another approach.
Note: I don't want to use Redis, Memcached, or any in-memory key-value store.
Use case: Slack returns an id in place of the user name in every message.
For example: Hello #john doe is returned as Hello #dxap123.
I don't want an API hit for every message to get the user name.
Can somebody suggest an efficient alternative approach, or correct me if I am doing something wrong?

As others have said, 5000 entries shouldn't cause an out-of-memory error, but if you don't limit the size of the map you will eventually run out of memory. You should cache only the most recently used or most frequently used values to keep the size of the map bounded.
The Google Guava library has cache implementations which I think would fit your use case:
https://github.com/google/guava/wiki/CachesExplained
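If adding a library is not an option, a size-bounded LRU map can be sketched with the JDK alone. This is not the Guava API; it relies on `LinkedHashMap` in access order and its `removeEldestEntry` hook. The class name `LruCache` and the `maxEntries` parameter are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Size-bounded LRU map built on LinkedHashMap (JDK only, not Guava).
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently accessed
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true evicts the least recently used entry.
        return size() > maxEntries;
    }
}
```

This class is not thread-safe; for shared use it could be wrapped with `Collections.synchronizedMap(new LruCache<>(5000))`.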

5000 key-value pairs should not cause an OutOfMemoryError. If they do, you are not managing the HashMap properly. If you have more than 5000 items and want an alternative to HashMap, you can use Ehcache, a widely adopted Java cache with tiered storage options, instead of going with in-memory cache technologies.
The memory areas supported by Ehcache include:
- On-Heap Store: Uses the Java heap memory to store cache entries and shares that memory with the application. The cache is also scanned by the garbage collector. This memory is very fast, but also very limited.
- Off-Heap Store: Uses RAM outside the Java heap to store cache entries. This memory is not subject to garbage collection. It is still quite fast, but slower than on-heap memory, because cache entries have to be moved to the heap before they can be used.
- Disk Store: Uses the hard disk to store cache entries. Much slower than RAM. A dedicated SSD used only for caching is recommended.
You can find the documentation here: http://www.ehcache.org/documentation/
If you are using Spring Boot, you can follow this article to implement it:
https://springframework.guru/using-ehcache-3-in-spring-boot/

If the names are not unique, try calling String.intern() on each name before inserting it into the map. This reduces memory usage because equal names then share a single String instance.
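A small illustration of what `String.intern()` does, as a sketch: two distinct `String` objects with equal contents resolve to the same canonical instance after interning. The class and method names are hypothetical.

```java
// Shows that interning collapses equal strings into one shared instance.
class InternDemo {
    static boolean sameInstanceAfterIntern() {
        // new String(...) forces two separate objects with equal contents
        String a = new String("john doe");
        String b = new String("john doe");
        boolean distinctBefore = (a != b);
        boolean sharedAfter = (a.intern() == b.intern());
        return distinctBefore && sharedAfter;
    }
}
```

The saving only materializes when many map values repeat; for mostly unique names interning adds overhead without reducing memory.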

Ordered persistent cache

I need a persistent cache that holds up to several million 6 character base36 strings and has the following behavior:
- When clients retrieve N strings from the cache, they are retrieved in the order of the base36 value e.g. AAAAAA then AAAAAB etc.
- When strings are retrieved they are also removed from the cache so no other client will receive the same strings.
I am currently using MapDB as my persistent cache (I'd use EHCache but it requires a license for persistent storage).
MapDB gives me a Map that I can put elements into and get elements from, and it handles persisting them to disk.
I have noticed that Java's ConcurrentSkipListMap class would help in my problem since it provides ordering and I can also call the pollFirstEntry method to retrieve/remove elements in order.
I am not sure how I can use this with MapDB though. Does anyone have any advice that can help me achieve the behavior that I have outlined?
Thanks

What you're describing doesn't sound like what most people would consider a cache. A cache is essentially a shared Map, with keys mapping to values, and you'd never remove on a read because you want your cache to contain the most popular items (that's what it's for).
What you're describing (ordered set of items, consumed by clients in a fixed order) is much more like a work queue. Rather than looking at cache solutions try persistent queues like RabbitMQ, Kafka, bigqueue, etc.
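If an in-memory structure is enough, the ordered take-and-remove behavior from the question can be sketched with the JDK's `ConcurrentSkipListSet` (the set counterpart of the `ConcurrentSkipListMap` mentioned there, used here because only the strings themselves are stored). The class name `OrderedPool` is hypothetical, and this sketch does not address persistence or MapDB integration. It assumes fixed-length, uppercase base36 strings, for which natural `String` order matches numeric order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;

// In-memory pool that hands out strings in sorted order, each at most once.
class OrderedPool {
    private final ConcurrentSkipListSet<String> pool = new ConcurrentSkipListSet<>();

    void add(String s) {
        pool.add(s);
    }

    // Removes and returns up to n of the smallest strings.
    List<String> take(int n) {
        List<String> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            String s = pool.pollFirst(); // atomic remove of the lowest element
            if (s == null) {
                break; // pool is empty
            }
            out.add(s);
        }
        return out;
    }

    int size() {
        return pool.size();
    }
}
```

Each `pollFirst()` is atomic, so no two clients receive the same string, but a batch of N is not taken atomically: two concurrent callers may receive interleaved values.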

Using Hazelcast / Redis for DB backed cache requirement

I am developing a distributed Java application that needs to check a blacklist of user ids on each request.
If a request fails some eligibility rules, the system should add the user id (a parameter of the request) to the blacklist.
I am trying to find a proper caching solution for the blacklist implementation. My requirements are:
- querying the blacklist should be very fast
- the blacklist persistence technology should be scalable
- all blacklist data should also be persisted in an RDBMS for failover / reloading purposes.
These are the possible solutions:
Option 1: I can use Redis to store the blacklist data. Whenever a request fails the eligibility rules I can easily add the user id to the Redis cache.
- advantages: extremely fast queries, easy to implement
- disadvantages: relying on Redis persistence. Although it works, Redis is by design a cache solution, not a persistence layer.
Option 2: I can use Redis to store the blacklist data while also maintaining blacklist tables in an RDBMS. Whenever a request fails the eligibility rules I can add the user id to both the Redis cache and the RDBMS table.
- advantages: extremely fast queries, ability to reload the Redis cache from the database
- disadvantages: there is a consistency issue between Redis and the database table.
Option 3: I can use Hazelcast as the Hibernate L2 cache, so that when I add a user id to the blacklist it is added to both the cache and the database.
I have questions about option 3:
Is the Hazelcast L2 cache suitable for holding such a list of blacklisted users?
Does Hibernate manage the consistency between the cache and the database?
When the application is restarted, how is the L2 cache reloaded?
And a last question:
- Do you have any other suggestions for such a use case?
Edit:
There will be 100M records in the blacklist, and I have a couple of similar blacklists.
Read performance is important. I need to check the existence of a key in the blacklist within ~100 ms.

Ygok,
I am still waiting for clarification on the query requirements, but I assume it is a lookup by key (since you mention Redis, and Redis doesn't have a query language, whereas Hazelcast does have a Distributed Query / Predicate API).
Lookup by key is an extremely fast operation with Hazelcast.
In option 2 you need to maintain data consistency between your RDBMS and the Redis cache. Using Hazelcast MapLoader / MapStore you can implement the write-through and read-through cache concepts. All you need to do is put the entry into the cache, and Hazelcast persists it to the RDBMS either immediately or after a configured delay (with batching).
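To make the write-through / read-through concept concrete, here is a minimal JDK-only sketch. It is not the Hazelcast MapStore or MapLoader API: the `BackingStore` interface, `InMemoryStore`, and `WriteThroughCache` are hypothetical names, and the in-memory store merely stands in for an RDBMS table.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical persistence interface; a real one would talk to the RDBMS.
interface BackingStore<K, V> {
    V load(K key);             // returns null when the key is not persisted
    void store(K key, V value);
}

// Stand-in for a database table, for illustration only.
class InMemoryStore<K, V> implements BackingStore<K, V> {
    final Map<K, V> rows = new ConcurrentHashMap<>();

    public V load(K key) {
        return rows.get(key);
    }

    public void store(K key, V value) {
        rows.put(key, value);
    }
}

class WriteThroughCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final BackingStore<K, V> store;

    WriteThroughCache(BackingStore<K, V> store) {
        this.store = store;
    }

    // Write-through: persist first, then update the cache.
    void put(K key, V value) {
        store.store(key, value);
        cache.put(key, value);
    }

    // Read-through: on a miss, load from the store and keep the result.
    // A null result from load leaves no mapping in the cache.
    V get(K key) {
        return cache.computeIfAbsent(key, store::load);
    }
}
```

The application only calls `put` and `get`; keeping the store in step is the cache layer's job, which is the property the answer attributes to MapStore.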
In terms of performance, feel free to familiarize yourself with the recent Hazelcast / Redis benchmark.
Let me know if you have any questions.

I had a similar question before. First of all: how much data do you want to store, and how much memory can you spend? How many queries per second do you need? What does the data structure look like: only the userId as a key?
Hazelcast queries were not very fast in my testing (you can test it yourself), but it can hold a large amount of data in memory. By default Hazelcast uses Java serialization, which costs a lot of memory and I/O.
Hazelcast provides a Hibernate L2 cache; the cached data is stored on Hazelcast (query cache only), so restarting your application does not affect the cache.
Redis provides persistence for in-memory data (RDB snapshots and AOF). A small amount of data may be lost if the server crashes, but it is very fast.
If you do not want to lose any data, store it on multiple MySQL servers (split the data by userId across servers, but consider the problems that arise when you add a new server). At the same time, you can add a local cache (e.g. Ehcache or Google's CacheBuilder) with an expiry time, which can improve performance.

It's possible to maintain consistency between the Redis cache and the RDBMS using the Redisson framework. It provides write-through and read-through strategies for its Map object through MapWriter and MapLoader objects, which is what you need in your case.
Please read this documentation section

Java cache with expiration since first write

I have events that should be accumulated in a persistent key-value store. 24 hours after a key is first inserted, the accumulated record should be processed and removed from the store.
Processing of expired data is distributed among multiple nodes, so using a database involves processing synchronization problems. I don't want to use any SQL database.
The best fit for me is probably a cache with an expiration policy that can be configured for my needs. Is there one? Or can this be solved with a NoSQL database?

It should be possible with products like Infinispan or Hazelcast.
Both are JSR107 compatible.
With a JSR107-compatible cache API, a possible approach is to set your 24-hour expiry via the CreatedExpiryPolicy. Next, you implement and register a CacheEntryExpiredListener to get a call when the entry expires.
The call on the CacheEntryExpiredListener may be lenient and implementation dependent. The event is actually triggered on the "eviction due to expiry". For example, one implementation may do a periodic scan and remove expired entries every 30 minutes. However, I think that "lag time" is adjustable in most implementations, so you will be able to operate within defined bounds.
Also check whether there are resource constraints on the event callbacks that you may run into, such as thread pools.
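The semantics described here (expiry counted from the first write, with a callback fired by a periodic scan) can be sketched with the JDK alone. This is not the JSR107 API: `CreatedExpiryStore`, `expireNow`, and the injected clock are hypothetical names chosen for illustration, and the sketch offers no persistence or distribution.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;
import java.util.function.LongSupplier;

// Entries expire a fixed time after their FIRST write; later writes
// replace the value but keep the original creation time.
class CreatedExpiryStore<K, V> {
    private static final class Stamped<V> {
        final V value;
        final long createdAt;

        Stamped(V value, long createdAt) {
            this.value = value;
            this.createdAt = createdAt;
        }
    }

    private final Map<K, Stamped<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;         // e.g. System::currentTimeMillis
    private final BiConsumer<K, V> onExpired; // plays the role of an expiry listener

    CreatedExpiryStore(long ttlMillis, LongSupplier clock, BiConsumer<K, V> onExpired) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        this.onExpired = onExpired;
    }

    void put(K key, V value) {
        entries.merge(key, new Stamped<>(value, clock.getAsLong()),
                (old, fresh) -> new Stamped<>(fresh.value, old.createdAt));
    }

    V get(K key) {
        Stamped<V> s = entries.get(key);
        return s == null ? null : s.value;
    }

    // One scan pass: removes expired entries, notifies the callback,
    // and returns how many entries were removed.
    int expireNow() {
        long now = clock.getAsLong();
        int removed = 0;
        Iterator<Map.Entry<K, Stamped<V>>> it = entries.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<K, Stamped<V>> e = it.next();
            K key = e.getKey();
            Stamped<V> s = e.getValue();
            if (now - s.createdAt >= ttlMillis) {
                it.remove();
                onExpired.accept(key, s.value);
                removed++;
            }
        }
        return removed;
    }
}
```

How often `expireNow()` runs determines the "lag time" mentioned above: an entry is processed at the first scan after its deadline, not at the deadline itself.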
I mentioned Infinispan and Hazelcast for two reasons:
- You may need the distribution capabilities.
- Since you do long-running processing and store data that is not recoverable, you may need the persistence and fault-tolerance features. So I would say a simple in-memory cache like Google Guava is out of scope.
Good luck!

Using hashmap or H2 database?

I am developing a web application in which I need to store sessions, user messages, etc. I am thinking of using a HashMap or an H2 database.
Please let me know which is the better approach in terms of performance and memory utilization. The website has to support 10,000 users.
Thanks.

As usual with these questions, I would worry about performance as/when you know it's an issue.
10000 users is not a lot of data to hold in memory. I would likely start off with a standard Java collection, and look at performance when you predict it's going to cause you grief.
Abstract out the access to this Java collection so that when you substitute it, the refactoring required is localised (and perhaps make it configurable, so that you can easily run before/after performance tests with your different solutions: H2, Derby, Oracle, etc.).

If your session objects aren't too big (which should be the case), there is no need to persist them in a database.
Using a database for this would add a lot of complexity when you can start with a few lines of code. So don't use a database; simply store them in a light in-memory structure (a HashMap, for example).
You may need to implement a way to clean your HashMap if you don't want to keep sessions in memory long after the user has left. Many solutions are available (the easiest is simply to have a background thread that periodically removes sessions that are too old). Note that it's usually easier to clean a HashMap than a database.
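The background-cleanup idea can be sketched as follows, assuming sessions are tracked by their last-seen timestamp. `SessionStore`, `touch`, `purge`, and `startCleaner` are hypothetical names, and a `ConcurrentHashMap` is used instead of a plain `HashMap` because the cleaner runs on its own thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Tracks when each session was last seen and drops idle ones.
class SessionStore {
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final long maxIdleMillis;

    SessionStore(long maxIdleMillis) {
        this.maxIdleMillis = maxIdleMillis;
    }

    void touch(String sessionId, long nowMillis) {
        lastSeen.put(sessionId, nowMillis);
    }

    boolean contains(String sessionId) {
        return lastSeen.containsKey(sessionId);
    }

    // Removes sessions idle for longer than maxIdleMillis.
    // The returned count is approximate if other threads write concurrently.
    int purge(long nowMillis) {
        int before = lastSeen.size();
        lastSeen.values().removeIf(seen -> nowMillis - seen > maxIdleMillis);
        return before - lastSeen.size();
    }

    // Starts a daemon thread that purges at a fixed period.
    ScheduledExecutorService startCleaner(long periodSeconds) {
        ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "session-cleaner");
            t.setDaemon(true); // do not keep the JVM alive just for cleanup
            return t;
        });
        cleaner.scheduleAtFixedRate(
                () -> purge(System.currentTimeMillis()),
                periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return cleaner;
    }
}
```

Passing the current time into `purge` keeps the eviction rule testable without waiting for the scheduler.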

Both H2 and a HashMap are going to keep the data in memory (so from a space point of view they are almost the same).
If lookups are simple key-value lookups, then looking up in the HashMap will be quicker.
If you have to do comparisons like KEY < 100, use H2.
In fact, 10K users' worth of data is not that much.

If you don't need to save user messages, use the collections. But if the messages must be saved, be sure to use a database, because otherwise you lose all data after a restart.

The problem with using a HashMap for storing objects is that you would run into issues when your site becomes too big for one server and would need to be clustered in order to scale with demand. Then you would face problems with how to synchronise the HashMap instances on different servers.
A possible alternative would be to use a key-value store like Redis, as you won't need the structure of a database or even the distributed cache abilities of something like Ehcache.
