Ordered persistent cache - java

I need a persistent cache that holds up to several million 6-character base36 strings and has the following behavior:
- When clients retrieve N strings from the cache, they are retrieved in base36 order, e.g. AAAAAA, then AAAAAB, and so on.
- When strings are retrieved they are also removed from the cache, so no other client will receive the same strings.
I am currently using MapDB as my persistent cache (I'd use EHCache but it requires a license for persistent storage).
MapDB gives me a Map that I can put elements into and get them from, and it handles persisting to disk.
I have noticed that Java's ConcurrentSkipListMap class could help here, since it provides ordering and a pollFirstEntry method to retrieve and remove entries in order.
I am not sure how I can use this with MapDB, though. Does anyone have any advice to help me achieve the behavior I have outlined?
Thanks

What you're describing doesn't sound like what most people would consider a cache. A cache is essentially a shared Map, with keys mapping to values, and you'd never remove an entry on read, because you want the cache to keep the most popular items (that's what it's for).
What you're describing (an ordered set of items, consumed by clients in a fixed order) is much more like a work queue. Rather than looking at cache solutions, try persistent queues like RabbitMQ, Kafka, or bigqueue.
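That said, if you do stay with an ordered map, here is a minimal sketch of the poll-in-order idea using the JDK's ConcurrentSkipListMap (the class mentioned in the question; MapDB's tree maps also implement ConcurrentNavigableMap, so the same calls may carry over, but that is worth verifying against its docs):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch: the base36 strings are stored as keys, so they sort in base36 order,
// and pollFirstEntry atomically retrieves-and-removes the lowest one.
public class OrderedConsumer {
    private final ConcurrentNavigableMap<String, Boolean> keys = new ConcurrentSkipListMap<>();

    public void add(String base36) {
        keys.put(base36, Boolean.TRUE);
    }

    // Each poll is atomic, so no two clients can receive the same string.
    public List<String> take(int n) {
        List<String> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            Map.Entry<String, Boolean> e = keys.pollFirstEntry();
            if (e == null) break; // store drained
            out.add(e.getKey());
        }
        return out;
    }
}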

Related

Redis: hashmap with size limit and LRU eviction functionality

Let's say I have some keys in a Redis store. I want to keep some key-value pairs in a new hashmap structure, and I also want to enforce a limit on the size of this hashmap: when it grows beyond the limit, the least recently used key-value pair is evicted, without touching the rest of the already present Redis data structures. Does Redis provide any such functionality, letting me do this LRU-style eviction of hashmap entries without touching the rest of the stored keys? Or can one build it on top of what Redis provides in some way? Thanks for the help!
Does Redis provide any such functionality, letting me do this LRU-style eviction of hashmap entries without touching the rest of the stored keys?
No, it doesn't.
Or can one build it on top of what Redis provides in some way?
Yes, one can.
There are three ways one could go about it:
Client-side logic: you can manage the Hash's field-eviction logic in your application. This requires storing additional (meta) data: inside the Hash's values (i.e. delimiting/structuring the meta and real data in each value), at the Hash's level (you can use "special" field names, like "_eviction_heap_"), and/or in additional data structures (a Sorted Set per Hash looks useful); see the sketch after this list.
Server-side Lua: to optimize the above, you can package the logic in Lua and execute it with the EVAL command.
Redis modules: this is the advanced stuff, but if you're up to it you can do pretty much anything, including implementing a new "hashmap with size limit and LRU eviction functionality" data structure.
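By way of illustration, here is a minimal client-side sketch of the first option in Java with Jedis. The class and key names are made up, and the hash/sorted-set updates are not atomic, which is exactly the gap the Lua option closes. A sorted set "<hash>:lru" scores each field by its last access time:

import redis.clients.jedis.Jedis;

public class LruHash {
    private final Jedis jedis;
    private final String hashKey;
    private final String lruKey;    // sorted set tracking last-access times
    private final long maxFields;

    public LruHash(Jedis jedis, String hashKey, long maxFields) {
        this.jedis = jedis;
        this.hashKey = hashKey;
        this.lruKey = hashKey + ":lru";
        this.maxFields = maxFields;
    }

    public void put(String field, String value) {
        jedis.hset(hashKey, field, value);
        jedis.zadd(lruKey, System.currentTimeMillis(), field); // refresh recency
        long over = jedis.zcard(lruKey) - maxFields;
        if (over > 0) {
            // Evict the least recently used fields from both structures.
            for (String victim : jedis.zrange(lruKey, 0, over - 1)) {
                jedis.hdel(hashKey, victim);
            }
            jedis.zremrangeByRank(lruKey, 0, over - 1);
        }
    }

    public String get(String field) {
        String value = jedis.hget(hashKey, field);
        if (value != null) {
            jedis.zadd(lruKey, System.currentTimeMillis(), field); // touch on read
        }
        return value;
    }
}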

Java caching design for 100M+ keys?

I need to cache over 100 million string keys (~100 chars each) for a standalone Java application.
Required cache properties:
Persistent.
Fetching keys from the cache within tens of milliseconds.
Allows invalidation and expiry.
An independent caching server, to allow multi-threaded access.
Preferably without an enterprise database, as the 100M keys can scale to 500M, which would consume a lot of memory and system resources with sluggish throughput.
For a distributed cache you can try Hazelcast.
It can be scaled as you need and has backups and synchronization out of the box. It is a JSR-107 provider and has many other helpful tools. However, if you want persistence, you will need to handle it yourself or buy their enterprise version.
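For instance, a minimal sketch of going through the standard JSR-107 (JCache) API; the cache name is made up, and this assumes Hazelcast (plus the javax.cache API jar) is the JSR-107 implementation on the classpath:

import javax.cache.Cache;
import javax.cache.CacheManager;
import javax.cache.Caching;
import javax.cache.configuration.MutableConfiguration;
import javax.cache.spi.CachingProvider;

public class JCacheExample {
    public static void main(String[] args) {
        // Resolves the JSR-107 provider found on the classpath.
        CachingProvider provider = Caching.getCachingProvider();
        CacheManager manager = provider.getCacheManager();
        Cache<String, String> cache = manager.createCache(
                "keys", new MutableConfiguration<String, String>());
        cache.put("k1", "v1");
        System.out.println(cache.get("k1")); // v1
    }
}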
Finally, to resolve this big-data problem with the existing cache solutions available (Hazelcast, Guava Cache, Ehcache, etc.), I broke the cache into two levels:
I grouped ~100K keys into one Java collection and associated them with a common property. In my case the keys carried a timestamp, so that timestamp slot became the key for a second-level cache block of 100K keys.
Each time-slot key is stored in the Java persistent cache with the compressed Java collection as its value.
The reason I managed to get good throughput from two-level caching, despite the compression and decompression overhead, is that my key searches were range-bound: once a cache block matched, most subsequent searches were served by the in-memory Java collection from the previous search. A rough sketch of the idea follows.
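A rough illustrative sketch (names are mine; the persistent map stands in for whichever disk-backed cache is used, and GZIP is just one compression choice):

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TwoLevelCache {
    private final Map<Long, byte[]> persistent; // stands in for the disk-backed cache
    private long hotSlot = -1;
    private Set<String> hotBlock = Collections.emptySet();

    public TwoLevelCache(Map<Long, byte[]> persistent) {
        this.persistent = persistent;
    }

    public void putBlock(long timestampSlot, Set<String> keys) throws IOException {
        persistent.put(timestampSlot, compress(keys));
    }

    public boolean contains(long timestampSlot, String key) throws IOException {
        if (timestampSlot != hotSlot) {               // in-memory level missed
            byte[] compressed = persistent.get(timestampSlot);
            if (compressed == null) return false;
            hotBlock = decompress(compressed);        // pay decompression once per block
            hotSlot = timestampSlot;
        }
        return hotBlock.contains(key);                // range-bound searches now hit memory
    }

    private static byte[] compress(Set<String> keys) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(new GZIPOutputStream(bos))) {
            for (String k : keys) {
                w.write(k);
                w.write('\n');
            }
        }
        return bos.toByteArray();
    }

    private static Set<String> decompress(byte[] data) throws IOException {
        Set<String> keys = new HashSet<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(data))))) {
            String line;
            while ((line = r.readLine()) != null) {
                keys.add(line);
            }
        }
        return keys;
    }
}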
To conclude: identify a common attribute in the keys by which to group them, and break them into a multilevel cache; otherwise you would need hefty hardware and an enterprise cache to support this big-data problem.
Try Guava Cache. It meets all of your requirements.
Links:
Guava Cache Explained
guava-cache
Persistence: Guava cache
Edit: another one, which I have not used yet: eh-cache
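For example, a minimal Guava Cache sketch (the numbers are placeholders; note Guava's cache is in-memory, so the persistence link above still matters):

import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class GuavaExample {
    public static void main(String[] args) {
        // Size-capped cache with access-based expiry.
        Cache<String, String> cache = CacheBuilder.newBuilder()
                .maximumSize(100_000_000L)
                .expireAfterAccess(1, TimeUnit.HOURS)
                .build();
        cache.put("someKey", "someValue");
        System.out.println(cache.getIfPresent("someKey")); // someValue
        cache.invalidate("someKey");                       // explicit invalidation
    }
}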

Efficiently update an element in a DelayQueue

I am facing a similar problem to the author of:
DelayQueue with higher speed remove()?
The problem:
I need to process continuously incoming data and check whether the data has been seen within a certain timeframe before. I therefore calculate a unique ID for the incoming data and add the data, indexed by that ID, to a map. At the same time I store the ID and a timeout timestamp in a PriorityQueue, which lets me efficiently check for the next ID to time out. Unfortunately, if the data comes in again before the specified timeout, I need to update the timeout stored in the PriorityQueue. So far I have just removed the old ID and re-added it along with the new timeout. This works well, except for the time-consuming remove method once my PriorityQueue grows beyond 300k elements.
Possible Solution:
I just thought about using a DelayQueue instead, which would make it easier to wait for the first entry to time out. Unfortunately, I have not found an efficient way to update a timeout element stored in such a DelayQueue without facing the same problem as with the PriorityQueue: the remove method!
Any ideas on how to solve this problem in an efficient way even for a huge Queue?
This actually sounds a lot like Guava Cache, a concurrent on-heap cache supporting "expire this entry this long after its most recent lookup." It might be simplest just to reuse that, if you can use third-party libraries.
Failing that, the approach that implementation uses looks something like this: it has a hash table, so entries can be looked up efficiently by their key, but the entries are also kept in a concurrent, custom linked list (you can't do this with the built-in libraries), ordered least recently accessed first. When an entry is accessed, it gets moved to the end of the list. Every so often, you look at the beginning of the list, where all the least recently accessed entries live, and delete the ones that are older than your threshold.
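If you roll your own and don't need concurrency, the JDK's access-ordered LinkedHashMap gives you the same "least recently accessed first" ordering for free. A single-threaded sketch (class name mine):

import java.util.Iterator;
import java.util.LinkedHashMap;

public class TimeoutMap<K, V> {
    private static final class Stamped<V> {
        V value;
        long lastAccess;
        Stamped(V value, long now) { this.value = value; this.lastAccess = now; }
    }

    // accessOrder=true keeps the least recently accessed entry at the head.
    private final LinkedHashMap<K, Stamped<V>> map = new LinkedHashMap<>(16, 0.75f, true);
    private final long timeoutMillis;

    public TimeoutMap(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    public void put(K key, V value) {
        map.put(key, new Stamped<>(value, System.currentTimeMillis()));
    }

    public V get(K key) {
        Stamped<V> s = map.get(key); // get() also moves the entry to the tail
        if (s == null) return null;
        s.lastAccess = System.currentTimeMillis();
        return s.value;
    }

    // Walk from the head and stop at the first live entry:
    // O(number expired) per sweep rather than O(n) per timeout update.
    public void expire() {
        long cutoff = System.currentTimeMillis() - timeoutMillis;
        Iterator<Stamped<V>> it = map.values().iterator();
        while (it.hasNext() && it.next().lastAccess < cutoff) {
            it.remove();
        }
    }
}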

Should I treat Couchbase bucket as table, or more like a schema

I am planning to use Couchbase as the document store in my web application. Looking at the Couchbase client for Java, you need to create a separate CouchbaseClient for each bucket, so treating a Couchbase bucket as I would treat a generic entity is a bit of overkill for the system (though I can reuse the executor service to minimize object-creation and thread-management overhead).
So
Is there a way to reuse an existing CouchbaseClient for multiple buckets (beyond just sharing the ExecutorService)?
Would it not be better, from a performance point of view, to use a single bucket, distinguish objects based on their keys, and rely on views for querying?
You should treat a Couchbase bucket like a database. One bucket per application is enough in most cases, but I prefer to have two buckets: one for common data and one for "temporary" or fast-changing data (caches, user sessions, etc.). For the latter you can even use a plain memcached bucket.
And answering your two questions:
I don't know of such a way and have never seen anyone even try to do it. But remember that the client should implement the singleton pattern, so if you have two buckets for your application, you'll only have two clients (which is hardly overkill).
As I said before, treat a bucket like a database. You don't even need to create a test database: Couchbase has built-in separation of dev and production views, so you can easily test your app on production data with dev views.
About using a bucket as a table or a database, this post explains it pretty well:
http://blog.couchbase.com/10-things-developers-should-know-about-couchbase
Start with everything in one bucket
A bucket is equivalent to a database. You store objects of different characteristics or attributes in the same bucket, so if you are moving from an RDBMS, you should store records from multiple tables in a single bucket.
Remember to create a “type” attribute that will help you differentiate the various objects stored in the bucket and create indexes on them. It is recommended to start with one bucket and grow to more buckets when necessary.
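To illustrate the "type" attribute pattern, a minimal sketch using the current Couchbase Java SDK (which is newer than the CouchbaseClient discussed above; connection details and document names are placeholders):

import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;

public class OneBucketExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.connect("127.0.0.1", "user", "password");
        Collection bucket = cluster.bucket("app").defaultCollection();

        // Objects from what would be different RDBMS tables share the bucket,
        // distinguished by a "type" attribute embedded in each document.
        bucket.upsert("user::42",
                JsonObject.create().put("type", "user").put("name", "alice"));
        bucket.upsert("order::1001",
                JsonObject.create().put("type", "order").put("userId", "user::42"));

        cluster.disconnect();
    }
}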

What is the best way to deal with collections (lists or sets) in key-value storage?

I wonder what an effective way would be to add/remove items from a really large list when your storage is memcached-like? Maybe there is a distributed store with a Java interface that handles this problem well?
Someone may recommend Terracotta. I know about it, but that's not exactly what I need. ;)
Hazelcast 1.6 will have a distributed MultiMap implementation, where a key can be associated with a set of values.
MultiMap<String, String> multimap = Hazelcast.getMultiMap("mymultimap");
multimap.put("1", "a");
multimap.put("1", "b");
multimap.put("1", "c");
multimap.put("2", "x");
multimap.put("2", "y");
Collection<String> values = multimap.get("1"); // contains a, b, c
Hazelcast is an open source transactional, distributed/partitioned implementation of queue, topic, map, set, list, lock, and executor service. It is super easy to work with: just add hazelcast.jar to your classpath and start coding. Almost no configuration is required.
Hazelcast is released under Apache license and enterprise grade support is also available. Code is hosted at Google Code.
Maybe you should also have a look at Scalaris!
You can use a key-value store to model most data structures if you ignore concurrency issues. Your requirements aren't entirely clear, so I'm going to make some assumptions about your use case. Hopefully, if they are incorrect, you can generalize the approach.
You can trivially create a linked list in the store by having a known root node (let's call it 'node_root') which points to a value tuple of {data, prev_key, next_key}. The prev_key and next_key elements are key names, which should follow the convention 'node_foo', where foo is a UUID (ideally you can generate these sequentially; if not, use some other type of UUID). This provides ordered access to your data.
Now, if you need O(1) removal of a key, you can add a second index to the structure, with key 'data' and value 'node_foo' for the right foo. Then you can perform the removal just as you would on an in-memory linked list, removing the index node when you're done. A sketch of both pieces follows.
Keep in mind that concurrent modification of this list is just as bad as concurrent modification of any shared data structure. If you're using something like BDB, you can use its (excellent) transaction support to avoid this. For a store without transactions or concurrency control, you'll want to provide external locking, or serialize all accesses through a single thread.
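Here is the promised sketch, with an in-memory map standing in for the store. In a real key-value backend each node mutation becomes a put, and the multi-key updates are exactly what need the transactions or locking discussed above:

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class KvLinkedList {
    static final class Node {
        String data, prevKey, nextKey;
        Node(String data, String prevKey, String nextKey) {
            this.data = data; this.prevKey = prevKey; this.nextKey = nextKey;
        }
    }

    private static final String ROOT = "node_root";
    private final Map<String, Node> store = new HashMap<>();   // stands in for the KV backend
    private final Map<String, String> index = new HashMap<>(); // data -> node key, for O(1) removal

    public KvLinkedList() {
        store.put(ROOT, new Node(null, ROOT, ROOT)); // empty circular list
    }

    public void append(String data) {
        String key = "node_" + UUID.randomUUID();    // ideally sequential, per the text
        Node root = store.get(ROOT);
        String lastKey = root.prevKey;
        Node last = store.get(lastKey);
        store.put(key, new Node(data, lastKey, ROOT));
        last.nextKey = key;
        root.prevKey = key;
        store.put(lastKey, last);  // explicit re-puts mimic writes to a real store
        store.put(ROOT, root);
        index.put(data, key);
    }

    public void remove(String data) {
        String key = index.remove(data);
        if (key == null) return;
        Node node = store.remove(key);
        Node prev = store.get(node.prevKey);
        Node next = store.get(node.nextKey);
        prev.nextKey = node.nextKey;               // unlink in O(1)
        next.prevKey = node.prevKey;
        store.put(node.prevKey, prev);
        store.put(node.nextKey, next);
    }
}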
