Redis: hashmap with size limit and LRU eviction functionality - java

Let's say I have some keys in a Redis store. I want to keep some key-value pairs in a new hashmap structure, keep a limit on the size of this hashmap, and evict the least recently used key-value pair when the hashmap grows beyond that limit, without touching the rest of the already-present Redis data structures. Does Redis provide any such functionality, where I can do this LRU-style eviction of hashmap entries without touching the rest of the stored keys? Or can one build it on top of what Redis provides in some way? Thanks for the help!

Does Redis provide any such functionality, where I can do this LRU-style eviction of hashmap entries without touching the rest of the stored keys?
No, it doesn't.
Or can one build it on top of what Redis provides in some way?
Yes, one can.
There are three ways one could go about it:
Client-side logic: you can manage the Hash's field-eviction logic in your application. This will require storing additional metadata in the Hash's values (i.e. delimiting/structuring the meta and real data within each value), at the Hash's level (you can use "special" field names, like "_eviction_heap_"), and/or in additional data structures (a Sorted Set per Hash looks useful; see the sketch after this list).
Server-side Lua: to optimize the above, you can package the logic in Lua and execute it with the EVAL command.
Redis modules: this is the advanced stuff, but if you're up to it you can pretty much do anything - including implementing a new "hashmap with size limit and LRU eviction functionality" data structure.
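
As a rough illustration of the first two options, here is a minimal client-side sketch using the Jedis client. It shadows the Hash with a Sorted Set whose scores are last-access timestamps, and trims the oldest fields whenever the Hash outgrows its limit. The key naming, the size limit, and the use of Jedis are all assumptions for the example, and the put/evict sequence is not atomic across clients, which is exactly why packaging the same logic as a Lua script run via EVAL (option 2) is the natural next step.

    import redis.clients.jedis.Jedis;

    public class LruHash {
        private static final int MAX_FIELDS = 1000; // hypothetical size limit

        private final Jedis jedis;
        private final String hashKey; // the Hash holding the real data
        private final String lruKey;  // shadow Sorted Set: field -> last-access time

        public LruHash(Jedis jedis, String hashKey) {
            this.jedis = jedis;
            this.hashKey = hashKey;
            this.lruKey = hashKey + ":lru"; // naming convention is an assumption
        }

        public void put(String field, String value) {
            jedis.hset(hashKey, field, value);
            jedis.zadd(lruKey, System.currentTimeMillis(), field);
            // Evict the least recently used fields once the limit is exceeded.
            long excess = jedis.zcard(lruKey) - MAX_FIELDS;
            if (excess > 0) {
                for (String victim : jedis.zrange(lruKey, 0, excess - 1)) {
                    jedis.hdel(hashKey, victim);
                    jedis.zrem(lruKey, victim);
                }
            }
        }

        public String get(String field) {
            String value = jedis.hget(hashKey, field);
            if (value != null) {
                jedis.zadd(lruKey, System.currentTimeMillis(), field); // refresh recency
            }
            return value;
        }
    }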

Related

Java caching design for 100M+ keys?

I need to cache over 100 million string keys (~100 characters each) for a standalone Java application.
Required cache properties:
Persistent.
Key fetches from the cache in the tens-of-milliseconds range.
Allows invalidation and expiry.
An independent caching server, to allow multi-threaded access.
Preferably not an enterprise database, as these 100M keys can scale to 500M, which would consume a lot of memory and system resources with sluggish throughput.
For a distributed cache you can try Hazelcast.
It can be scaled as you need and has backups and synchronization out of the box. It is also a JSR-107 provider and offers many other helpful tools. However, if you want persistence, you will need to handle it yourself or buy their enterprise version.
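For what it's worth, a minimal embedded-Hazelcast sketch looks roughly like this (the map name and TTL are arbitrary examples; IMap lives in com.hazelcast.core rather than com.hazelcast.map in Hazelcast 3.x):

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.map.IMap;

    import java.util.concurrent.TimeUnit;

    public class HazelcastExample {
        public static void main(String[] args) {
            // Starts an embedded cluster member; additional members joining the
            // same cluster scale the map out and provide the backups mentioned above.
            HazelcastInstance hz = Hazelcast.newHazelcastInstance(new Config());

            IMap<String, String> cache = hz.getMap("keys"); // "keys" is an arbitrary name
            cache.put("someKey", "someValue", 1, TimeUnit.HOURS); // per-entry expiry
            System.out.println(cache.get("someKey"));

            hz.shutdown();
        }
    }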
Finally, here is how I resolved this big-data problem with the existing cache solutions available (Hazelcast, Guava Cache, Ehcache, etc.):
I broke the cache into two levels.
I grouped ~100K keys into one Java collection and associated them with a common property; in my case the keys carried a timestamp, so the timestamp slot became the key for a second-level cache block of 100K entries.
Each time-slot key is stored in a persistent Java cache, with the compressed Java collection as its value.
The reason I managed to get good throughput from two-level caching, despite the compression and decompression overhead, is that my key searches were range-bound: once a cache block matched, most subsequent searches were served by the in-memory Java collection from the previous lookup.
To conclude: identify a common attribute in the keys to group and break them into a multilevel cache; otherwise you would need hefty hardware and an enterprise cache to handle this big-data problem.
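
A hedged sketch of the two-level layout described above, with a plain map standing in for the persistent level-1 cache and GZIP-compressed serialized blocks (the one-hour slot granularity and the types are assumptions):

    import java.io.*;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class TwoLevelCache {
        // Level 1: time slot -> compressed block of ~100K entries. In the real
        // setup this would be the persistent cache (e.g. Ehcache with disk
        // storage); a plain map stands in for it here.
        private final Map<Long, byte[]> slots = new HashMap<>();

        // Level 2: the decompressed block from the previous lookup. Range-bound
        // searches mean most hits land here without touching level 1 again.
        private long currentSlot = -1;
        private Map<String, String> currentBlock = Map.of();

        public String get(long timestamp, String key) throws IOException, ClassNotFoundException {
            long slot = timestamp / 3_600_000L; // hypothetical: one slot per hour
            if (slot != currentSlot) {
                byte[] compressed = slots.get(slot);
                if (compressed == null) return null;
                currentBlock = decompress(compressed);
                currentSlot = slot;
            }
            return currentBlock.get(key);
        }

        public void putBlock(long slot, HashMap<String, String> block) throws IOException {
            slots.put(slot, compress(block));
        }

        private static byte[] compress(HashMap<String, String> block) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(bos))) {
                oos.writeObject(block);
            }
            return bos.toByteArray();
        }

        @SuppressWarnings("unchecked")
        private static Map<String, String> decompress(byte[] data)
                throws IOException, ClassNotFoundException {
            try (ObjectInputStream ois = new ObjectInputStream(
                    new GZIPInputStream(new ByteArrayInputStream(data)))) {
                return (Map<String, String>) ois.readObject();
            }
        }
    }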
Try Guava Cache. It meets all of your requirements.
Links:
Guava Cache Explained
guava-cache
Persistence: Guava cache
Edit: another one, which I have not used yet: eh-cache
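
A minimal Guava sketch covering the size-limit, expiry, and invalidation requirements (the loader and the numbers are placeholders; note that Guava's cache is in-process and not persistent by itself):

    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;

    import java.util.concurrent.TimeUnit;

    public class GuavaCacheExample {
        public static void main(String[] args) throws Exception {
            LoadingCache<String, String> cache = CacheBuilder.newBuilder()
                    .maximumSize(1_000_000)               // hypothetical size bound
                    .expireAfterAccess(1, TimeUnit.HOURS) // expiry requirement
                    .build(new CacheLoader<String, String>() {
                        @Override
                        public String load(String key) {
                            return lookupFromBackingStore(key); // placeholder loader
                        }
                    });

            System.out.println(cache.get("someKey")); // loads on first access
            cache.invalidate("someKey");              // invalidation requirement
        }

        private static String lookupFromBackingStore(String key) {
            return "value-for-" + key; // stand-in for the real source
        }
    }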

Ordered persistent cache

I need a persistent cache that holds up to several million 6-character base36 strings and has the following behavior:
- When clients retrieve N strings from the cache, they are retrieved in order of base36 value, e.g. AAAAAA, then AAAAAB, etc.
- When strings are retrieved, they are also removed from the cache, so no other client will receive the same strings.
I am currently using MapDB as my persistent cache (I'd use EHCache but it requires a license for persistent storage).
MapDB gives me a Map that I can put/get elements from, and it handles persisting them to disk.
I have noticed that Java's ConcurrentSkipListMap class would help with my problem, since it provides ordering and I can also call its pollFirstEntry method to retrieve and remove elements in order.
I am not sure how I can use this with MapDB, though. Does anyone have advice that can help me achieve the behavior I have outlined?
Thanks
What you're describing doesn't sound like what most people would consider a cache. A cache is essentially a shared Map, with keys mapping to values, and you'd never remove on a read because you want your cache to contain the most popular items (that's what it's for).
What you're describing (an ordered set of items, consumed by clients in a fixed order) is much more like a work queue. Rather than looking at cache solutions, try persistent queues like RabbitMQ, Kafka, bigqueue, etc.
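
For reference, here is what the questioner's pollFirstEntry idea looks like with the JDK's ConcurrentSkipListMap. MapDB's BTreeMap implements the same ConcurrentNavigableMap interface, so the pattern should carry over to the persistent map, though I have not verified that:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentNavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    public class OrderedConsumer {
        // Fixed-width base36 strings sort lexicographically in the same order
        // as their base36 values (AAAAAA < AAAAAB < ...).
        private final ConcurrentNavigableMap<String, Boolean> codes =
                new ConcurrentSkipListMap<>();

        public void add(String code) {
            codes.put(code, Boolean.TRUE);
        }

        // Retrieve-and-remove N codes in order; pollFirstEntry is atomic per
        // entry, so concurrent clients never receive the same code.
        public List<String> take(int n) {
            List<String> out = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                Map.Entry<String, Boolean> e = codes.pollFirstEntry();
                if (e == null) break; // map exhausted
                out.add(e.getKey());
            }
            return out;
        }
    }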

How to store mapping tables - lucene or DB?

I want to store large mapping tables between an id and two text attributes.
The dataset will be up to 1 million entries and refreshed on a daily basis.
Would you rather create a Lucene index, indexed by that id? Or a database (Postgres) table with the id as primary key? Or even a different solution?
And why would one prefer either solution?
I only want to lookup by ID, no reverse lookup. The mapping table should be simple as that: put in an id, and get back two string attributes.
What you are looking for appears to be a Key-value store (wikipedia article)
Key-value (KV) stores use the associative array (also known as a map or dictionary) as their fundamental data model. In this model, data is represented as a collection of key-value pairs, such that each possible key appears at most once in the collection.
The key-value model is one of the simplest non-trivial data models, and richer data models are often implemented on top of it. The key-value model can be extended to an ordered model that maintains keys in lexicographic order. This extension is powerful, in that it can efficiently process key ranges.
Key-value stores can use consistency models ranging from eventual consistency to serializability. Some support ordering of keys. Some maintain data in memory (RAM), while others employ solid-state drives or rotating disks.
The article there also gives a rather complete list of available implementations. Unfortunately I cannot suggest a specific one, as I have not used any of them in production, but Google is full of comparisons of key-value stores.
To answer your question, I would not go for Lucene: it is an open-source information-retrieval library, designed for implementing search applications, and what you are going to do will not hit Lucene's sweet spots.
A classic RDBMS comes closer to your requirements, but as stated above, a key-value store would nail it.
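
To make the suggested data model concrete, the table is just id -> (attr1, attr2). A minimal sketch with an in-memory map standing in for whichever KV store is chosen (the names are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    public class MappingTable {
        // The two text attributes behind each id.
        public record Attributes(String first, String second) {}

        // An in-memory map stands in here; the same shape maps directly onto
        // any key-value store: key = id, value = the two strings (e.g.
        // serialized with a delimiter or as JSON).
        private final Map<Long, Attributes> table = new HashMap<>();

        public void put(long id, String first, String second) {
            table.put(id, new Attributes(first, second));
        }

        public Attributes get(long id) {
            return table.get(id); // the only access path needed: lookup by id
        }
    }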

When to hit the database instead of searching an object in your List

Suppose we are using MongoDB (NoSQL) or MySQL (a relational DB) to retrieve objects and we want to search for a specific element (a WHERE clause), but we already have an in-memory list (LinkedList, ArrayList, or whatever) containing some of the objects, either for caching or for some other reason.
Is there an equation or library that can advise when it is cheaper to use the in-memory structure for retrieval instead of querying the database (taking into consideration, for example, the size of the in-memory list)?
It's always cheaper to query the in-memory cache.
The only exception would be if you had constructed the cache so that it was very inefficient to search it (e.g., linear search). But as long as it's a hash-based structure, it'll be doing at worst what the database needs to do, but without the network overhead. Looking something up in a hash table is essentially free.
The bigger question is whether your cache uses so much memory that it starves the rest of the application. To avoid this, you'll want a weak hash map or similar. If the cache is produced by some sort of ORM, it will be weakly referenced already.
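
The lookup pattern being compared is essentially the following (a hedged sketch; queryDatabase and Widget are hypothetical stand-ins). The in-memory branch is an O(1) hash probe, while the fallback pays the network round trip and query cost:

    import java.util.Collections;
    import java.util.Map;
    import java.util.WeakHashMap;

    public class CacheFirstLookup {
        // WeakHashMap lets the GC reclaim entries under memory pressure,
        // addressing the "cache starves the application" concern above.
        private final Map<String, Widget> cache =
                Collections.synchronizedMap(new WeakHashMap<>());

        public Widget find(String id) {
            Widget w = cache.get(id);  // essentially free hash probe
            if (w == null) {
                w = queryDatabase(id); // network + query cost
                if (w != null) {
                    cache.put(id, w);
                }
            }
            return w;
        }

        private Widget queryDatabase(String id) {
            return new Widget(id); // placeholder for the MongoDB/MySQL round trip
        }

        record Widget(String id) {}
    }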

Algorithm to store Item-to-Item-Associations

I need some help storing some data efficiently. I have a large list of objects (about 100,000) and want to store associations between these items with a coefficient. Not all items are associated; in fact, I have about 1 million associations. I need fast access to an association when referencing it by its two items. What I built is a structure like this:
Map<Item, Map<Item, Float>>
I tried this with HashMap and Hashtable. Both work fine and are fast enough. My problem is that all those Maps create a lot of memory overhead: more than 300 MB for the given scenario. Is there a Map implementation with a smaller footprint? Is there perhaps a better algorithm for storing this kind of data?
Here are some ideas:
Store in a Map<Pair<Item,Item>,Float>. If you are worried about allocating a new Pair for each lookup, and your code is synchronized, you can keep a single lookup Pair instance.
Loosen the outer map to be Map<Item, ?>. The value can be a simple {Item,Float} tuple for the first association, then a small tuple array for a few associations, and finally promote to a full-fledged Map.
Use Commons Collections' Flat3Map for the inner maps.
If you are in tight control of the Items, and Item equivalence is referential (i.e. each Item instance is not equal to any other Item instance), you can number each instance. Since you are talking about fewer than 2 billion instances, a single long can represent an Item pair with some bit manipulation. Then the map gets much smaller if you use Trove's TLongObjectHashMap; see the sketch below.
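
Here is a sketch of that fourth idea: pack two int ids into one long and keep the coefficients in a primitive map. TLongFloatHashMap (Trove 3 naming, worth double-checking for your version) also avoids boxing the Float values:

    import gnu.trove.map.hash.TLongFloatHashMap;

    public class AssociationStore {
        // ~1M associations of primitive long -> float: no per-entry objects,
        // so the footprint is a small fraction of nested HashMaps.
        private final TLongFloatHashMap coefficients = new TLongFloatHashMap();

        // Pack two non-negative int ids into a single long key.
        private static long key(int idA, int idB) {
            return ((long) idA << 32) | (idB & 0xFFFFFFFFL);
        }

        public void put(int idA, int idB, float coefficient) {
            coefficients.put(key(idA, idB), coefficient);
        }

        public float get(int idA, int idB) {
            // Returns the map's no-entry value (0.0f by default) when absent.
            return coefficients.get(key(idA, idB));
        }
    }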
You have two options.
1) Reduce what you're storing.
If your data is calculable, using a WeakHashMap will allow the garbage collector to remove members. You will probably want to decorate it with a mechanism that calculates lost or absent key/value pairs on the fly. This is basically a cache.
Another possibility that might trim a relatively tiny amount of RAM is to instruct your JVM to use compressed object pointers. That may save you about 3 MB with your current data size.
2) Expand your capacity.
I'm not sure what your constraint is (run-time memory on a desktop, serialization, etc.), but you can either expand the heap size and deal with it, or you can push the data out of process. With all the "NoSQL" stores out there, one will probably fit your needs. Alternatively, an indexed DB table can be quite fast. If you're looking for a simple key-value store, Voldemort is extremely easy to set up and integrate.
However, I don't know what you're doing with your working set. Can you give more details? Are you performing aggregations, partitioning, cluster analysis, etc.? Where are you running into trouble?
