Reliable distributed cache on app engine (Java)


I need to keep some values in memory, as a sort of in-memory DB. In terms of reliability, I am not afraid of system failure; I can live with that. However, I cannot use the memcache service, because its values can be evicted at any time. I need the values to be available on other machines when the application scales. I assume App Engine will not replicate instance memory across machines, or will it (e.g. if I keep a value in an ordinary Java collection)?
What I am trying to achieve here is a "pick a nickname" service. It works in two steps. First, the user reserves a nickname. Then he registers the nickname. Nicknames are stored under one entity group (sic!). Therefore I need to avoid datastore contention.
As far as I understand from https://developers.google.com/appengine/articles/scaling/memcache, I can rely to a certain extent on values in memcache not being evicted for arbitrary reasons. However, I have to expect that this will happen from time to time (e.g. under high memory pressure), and such losses of values are very unpleasant for my users.

Your application shares a single instance of Memcache; it is not local to a "machine" (or rather, to an instance of your application).
So if you are running 2 instances and they both retrieve the same key from memcache, they will both get the same value.
Running an "in memory" database is not feasible in the cloud - what memory is it you were planning to use, the memory in the instance that's about to shut down?
https://developers.google.com/appengine/articles/scaling/memcache
When designing your application, take the time to consider which datasets can be cached for future reuse. These could be commonly viewed pages or often read datastore entities, just to name a few. There may also be some data in your application which you would like to have shared among all instances of your app but does not need to be persisted forever. In such cases, memcache can improve the scalability of your app by providing a fast and efficient distributed storage system for transient data. Adding memcache logic to your server side code is often well worth the few extra lines of code.

You can use App Engine NDB if you use Python 2.7. NDB is a datastore API with automatic caching and much more.
Other machines? You mean shared between instances of the same app.

Related

Local Cache with Distributed Invalidation (Java/Spring)

One downside to distributed caching is that every cache query (hit or miss) is a network request which will obviously never be as fast as a local in memory cache. Often this is a worthy tradeoff to avoid cache duplication, data inconsistencies, and cache size constraints. In my particular case, it's only data inconsistency that I'm concerned about. The size of the cached data is fairly small and the number of application servers is small enough that the additional load on the database to populate the duplicated caches wouldn't be a big deal. I'd really like to have the speed (and lower complexity) of a local cache, but my data set does get updated by the same application around 50 times per day. That means that each of these application servers would need to know to invalidate their local caches when these writes occurred.
A simple approach would be to have a database table with a column for a cache key and a timestamp for when the data was last updated. The application could query this table to determine whether it needs to expire its local cache. Yes, this is a network request as well, but it would be much faster than transporting the entire cached data set over the network. Before I go and build something custom, is there an existing caching product for Java/Spring that can be configured to work this way? Is there a gotcha I'm not thinking about? Note that this isn't data that has to be transactionally consistent. If the application servers were out of sync by a few seconds, it wouldn't be a problem.
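A minimal sketch of the version-check idea described above, under stated assumptions: `VersionCheckedCache`, `versionLookup` and `loader` are hypothetical names, with `versionLookup` standing in for the query against the cache-key/timestamp table and `loader` for the expensive fetch of the full data set. It is not an existing product, only an illustration of the mechanism.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Local cache that reloads an entry when a cheap version lookup reports a change.
 * All names are hypothetical; versionLookup stands in for the timestamp-table query.
 */
final class VersionCheckedCache<K, V> {
    /** A cached value together with the version it was loaded under. */
    private static final class Entry<V> {
        final long version;
        final V value;

        Entry(long version, V value) {
            this.version = version;
            this.value = value;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, Long> versionLookup;
    private final Function<K, V> loader;

    VersionCheckedCache(Function<K, Long> versionLookup, Function<K, V> loader) {
        this.versionLookup = versionLookup;
        this.loader = loader;
    }

    V get(K key) {
        // Small round trip: only the last-updated marker is fetched.
        long current = versionLookup.apply(key);
        Entry<V> cached = entries.get(key);
        if (cached == null || cached.version != current) {
            // Missing or stale: fetch the full data and remember its version.
            cached = new Entry<>(current, loader.apply(key));
            entries.put(key, cached);
        }
        return cached.value;
    }
}
```

Because the version is read before the data, a write that lands between the two reads leaves the entry tagged with the older version, so the next `get` simply reloads once more. That errs on the side of freshness, which fits the "a few seconds out of sync is fine" requirement.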

I don't know of any implementation that queries the database in the way you specify. What does exist are solutions where changes in local caches are distributed among the members in a group. JBossCache is an example where you also have the option to only distribute invalidation of objects. This might be the closest to what you are after.
https://access.redhat.com/documentation/en-us/jboss_enterprise_application_platform/4.3/html/cache_frequently_asked_questions/tree_cache#a19
JBossCache is not a Spring component as such, but creating and using a cache as a Spring bean should not be a problem.

Distributed cache with duplicate cache entries on different host

Let's say I have an array of memcache servers. The memcache client will make sure that a cache entry lives on only a single memcache server, and all clients will always ask that server for the entry... right?
Now consider two scenarios:
[1] The web servers are getting lots of different requests (different URLs). The cache entries will then be distributed among the memcache servers, and requests will fan out across the memcache cluster.
In this case the memcache strategy of keeping a single cache entry on a single server works.
[2] The web servers are getting lots of requests for the same resource. All requests from the web servers will then land on a single memcache server, which is not desired.
What I am looking for is a distributed cache in which:
[1] Each web server can specify which cache node to use to cache stuff.
[2] If any web server invalidates a cache entry, the cache should invalidate it on all caching nodes.
Can memcache fulfill this use case?
PS: I don't have a ton of resources to cache, but I have a small number of resources with lots of traffic asking for a single resource at once.

Memcache is a great distributed cache. To understand where the value is stored, it's a good idea to think of the memcache cluster as a hashmap, with each memcached process being precisely one pigeon hole in the hashmap (of course each memcached is also an 'inner' hashmap, but that's not important for this point). For example, the memcache client determines the memcache node using this pseudocode:
index = hash(key) mod len(servers)
value = servers[index].get(key)
This is how the client can always find the correct server. It also highlights how important the hash function is, and how keys are generated - a bad hash function might not distribute keys uniformly over the different servers. The default hash function should work well in almost any practical situation, though.
Now you bring up in issue [2] the condition where the requests for resources are non-random, specifically favouring one or a few servers. If this is the case, it is true that the respective nodes are probably going to get a lot more requests, but this is relative. In my experience, memcache will be able to handle a vastly higher number of requests per second than your web server. It easily handles hundreds of thousands of requests per second on old hardware. So, unless you have 10-100x more web servers than memcache servers, you are unlikely to have issues. Even then, you could probably resolve the issue by upgrading the individual nodes to have more CPUs or more powerful CPUs.
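The pseudocode above can be sketched in Java as follows. This is an illustration only: `ServerPicker` is a hypothetical name, and `String.hashCode()` is used as a stand-in hash, which is not necessarily what any real memcache client uses.

```java
import java.util.List;

/** Hypothetical helper mirroring: index = hash(key) mod len(servers). */
final class ServerPicker {
    private ServerPicker() {}

    static int indexFor(String key, int serverCount) {
        // floorMod keeps the result in [0, serverCount) even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), serverCount);
    }

    static String serverFor(String key, List<String> servers) {
        return servers.get(indexFor(key, servers.size()));
    }
}
```

Every client that shares the same server list and hash function computes the same index for a given key, which is why no coordination between clients is needed. Note that with plain modulo, adding or removing a server changes the index of most keys.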
But let us assume the worst case - you can still achieve this with memcache by:
Installing each memcached as a standalone server (i.e. not as part of a distributed cache)
In your web server, you are now responsible for managing the connections to each of these servers
You are also responsible for determining which memcached process to pass each key/value to, achieving goal 1
If a web server detects a cache invalidation, it should loop over the servers invalidating the cache on each, thereby achieving goal 2
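The steps above can be sketched as follows. Everything here is hypothetical: `CacheNode` stands in for a connection to one standalone memcached process, `InMemoryNode` is an in-process substitute so the sketch runs without a server, and `ReplicatedHotKeyCache` is the per-web-server wrapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a connection to one standalone memcached process. */
interface CacheNode {
    String get(String key);
    void set(String key, String value);
    void delete(String key);
}

/** In-memory CacheNode so the sketch runs without a memcached server. */
final class InMemoryNode implements CacheNode {
    private final Map<String, String> data = new ConcurrentHashMap<>();

    public String get(String key) { return data.get(key); }
    public void set(String key, String value) { data.put(key, value); }
    public void delete(String key) { data.remove(key); }
}

/** Per-web-server wrapper: reads and writes use one chosen node, invalidation hits all nodes. */
final class ReplicatedHotKeyCache {
    private final List<CacheNode> nodes;
    private final int preferredIndex;

    ReplicatedHotKeyCache(List<CacheNode> nodes, int preferredIndex) {
        this.nodes = new ArrayList<>(nodes);
        this.preferredIndex = preferredIndex;
    }

    /** Goal 1: this web server decides which node it talks to. */
    String get(String key) { return nodes.get(preferredIndex).get(key); }

    void set(String key, String value) { nodes.get(preferredIndex).set(key, value); }

    /** Goal 2: loop over every node so no copy of the entry survives. */
    void invalidate(String key) {
        for (CacheNode node : nodes) {
            node.delete(key);
        }
    }
}
```

The sketch ignores partial failure: if one node is unreachable during `invalidate`, that node keeps serving the old value until it expires or is deleted later.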
I personally have reservations about this - you are, by specification, disabling the distributed aspect of your cache, and the distribution is a key feature and benefit of the service. Also, your application code would start to need to know about the individual cache servers to be able to treat each differently which is undesirable architecturally and introduces a large number of new configuration points.
The idea of any distributed cache is to remove the ownership of the location(*) from the client. Because of this, distributed caches and databases do not allow the client to specify the server where the data is written.
In summary, unless your system is expecting 100,000 or more requests per second, it's doubtful that you will have this specific problem in practice. If you do, scale the hardware. If that doesn't work, then you're going to be writing your own distribution logic, duplication, flushing and management layer over memcache. And I'd only do that if really, really necessary. There's an old saying in software development:
There are only two hard things in Computer Science: cache invalidation
and naming things.
--Phil Karlton
(*) Some distributed caches duplicate entries to improve performance and (additionally) resilience if a server fails, so data may be on multiple servers at the same time

Write-behind caching solution for Java objects, using oracle stored procs for persistence

I'm currently working on a high-throughput, low-latency transaction engine. For audit reasons I need to maintain object state both locally and also persist it to a DB (Oracle).
Our DBAs insist that raw SQL is not allowed, so we use stored procedures to read/write data to the database.
I've looked around, but cannot find any obvious solution.
Is there anything out there that will act as a write-behind cache (for performance) and that will allow me to specify, on a per-class basis, the code that is used to persist/retrieve objects (so I can inject the sproc handling code)?

What I have done in the past in this situation is to write the data to Java Chronicle and have it forwarded to a database in another thread or process. Java Chronicle supports low-latency persisted IPC. You can persist objects at a rate of over one million per second with sub-microsecond latencies. The reading process can pick up those objects/events within 100 nanoseconds. As you have to do the JDBC part yourself, you can do this in any manner you choose.
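To illustrate the general shape of "write locally, forward to the database from another thread", here is a rough sketch that swaps in a plain in-memory `BlockingQueue` for Java Chronicle. It does not use the Chronicle API and, unlike Chronicle, the queue is not persisted, so queued items are lost if the process dies. `WriteBehindQueue` and `persister` are hypothetical names; `persister` is where the stored-procedure call would go.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical write-behind buffer: callers enqueue, a background thread persists. */
final class WriteBehindQueue<T> implements AutoCloseable {
    private final BlockingQueue<T> pending = new LinkedBlockingQueue<>();
    private volatile boolean running = true;
    private final Thread writer;

    WriteBehindQueue(Consumer<T> persister) {
        writer = new Thread(() -> {
            try {
                // Keep draining until close() was called and nothing is left.
                while (running || !pending.isEmpty()) {
                    T item = pending.poll(50, TimeUnit.MILLISECONDS);
                    if (item != null) {
                        persister.accept(item); // e.g. a JDBC CallableStatement (assumption)
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "write-behind");
        writer.start();
    }

    /** Returns immediately; the caller never waits for the database. Call before close(). */
    void submit(T item) {
        pending.add(item);
    }

    /** Stops accepting the drain loop's wait and blocks until queued items are persisted. */
    @Override
    public void close() {
        running = false;
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A single writer thread preserves submission order. The sketch has no error handling: an exception thrown by `persister` would end the writer thread, so a real implementation would need retry or dead-letter logic.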

Need good design pattern for caching database query result set

I'm part of a team architecting a Java web application wherein users will search for results in a relational database and then view them in tabular fashion in a browser. Users will then also have the option to subsequently view the same result set (or a subset of those results) in a separate browser window, using for example a charting tool. In other words, we need to give the user the ability to visualize the same result set records later (up to a limit of 24 hours).
Since searches on the system will be resource-intensive, and simply as a matter of common sense, we would like a clean way to cache each result set so that it can be pulled later from memory (RAM or disk). We are looking for a good approach to this caching. We believe others have done this before, and we prefer to use a best practice or framework rather than building such a thing from scratch. The server will have plenty of RAM, but since there could be hundreds of people using the system, we may need an approach that stores to RAM first and can then overflow to hard disk if RAM is getting full.
I believe it makes most sense to persist as Java objects but I'm open to better advice. We would like a vendor-neutral approach, so that if the database team chooses to switch vendors later we aren't stuck with a proprietary solution. Thanks.

I think what you might be looking for is Terracotta Ehcache. This does everything you mentioned and more. It is a free product that can be used to cache things in memory, overflow to disk, specify max cache sizes by either MB or # of items, and expire based on last access time or entry time.

I've seen http://www.jboss.org/infinispan/ used to do exactly that. It can cache to memory, disk and/or database. I wouldn't say I love it (the configuration is not super easy and the documentation is somewhat lacking), but it most certainly works and is actively maintained.

Being vendor neutral is all about writing an abstraction layer that is native to your application, then plugging in the cache service you would like to use behind this layer, while keeping your layer that exposes these operations to your main code the same.
There are plenty of ways to cache. Look into the various NoSQL solutions, for example:
Redis
Memcached
Most of the time you will serialize your object and persist it to your cache layer.
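The abstraction layer described in this answer might look roughly like the sketch below. All names (`ResultCache`, `InMemoryResultCache`, `SearchService`) are hypothetical. Values are `byte[]` on the assumption that results are serialized before caching, as the answer suggests; a Redis- or Memcached-backed class would implement the same interface.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Application-owned cache contract; the rest of the code depends only on this. */
interface ResultCache {
    void put(String key, byte[] serialized);
    Optional<byte[]> get(String key);
    void evict(String key);
}

/** Simplest possible backend; other backends would implement the same interface. */
final class InMemoryResultCache implements ResultCache {
    private final Map<String, byte[]> store = new ConcurrentHashMap<>();

    public void put(String key, byte[] serialized) { store.put(key, serialized); }
    public Optional<byte[]> get(String key) { return Optional.ofNullable(store.get(key)); }
    public void evict(String key) { store.remove(key); }
}

/** Main code talks to ResultCache and never to a vendor API. */
final class SearchService {
    private final ResultCache cache;

    SearchService(ResultCache cache) {
        this.cache = cache;
    }

    /** Returns the cached result for queryId, running the query only on a miss. */
    byte[] results(String queryId, Supplier<byte[]> runQuery) {
        Optional<byte[]> hit = cache.get(queryId);
        if (hit.isPresent()) {
            return hit.get();
        }
        byte[] fresh = runQuery.get();
        cache.put(queryId, fresh);
        return fresh;
    }
}
```

Switching cache vendors then means writing one new `ResultCache` implementation and changing the object passed to `SearchService`; expiry (such as the 24-hour limit in the question) would be handled inside each implementation.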

Hold most of the objects in cache/memory instead of database?

It just occurred to me: why not load most of the objects into a cache (memory) when an application starts,
if it is not a very large web application? Or have a setting for how much I want to put in the cache/memory.
I guess it would require something like 1 GB of RAM or a lot less.
All of this is to speed up the application even more by not querying the database.
Is it a good idea?

Caching is definitely a good idea and is widely used, but it has to be implemented correctly. There are plenty of pitfalls if done incorrectly. Try looking into one of the big proven systems, like memcached.

Caching is definitely a good idea.
Databases are also not a catch-all solution, though you have to be careful about consistency between runs of your program. What if you change the data but your program crashes before you write the change to the database?
There are also lightweight memory-resident databases that let you keep your current queries for now but serve most of them from memory. Using an ORM tool instead of SQL is particularly effective for this since the switch is almost transparent.

It quickly becomes a not-so-good idea when some other node starts updating the database.
In that case your cache will be holding stale data.

You can maintain a cache of frequently used objects in memory; just don't forget to add methods to refresh the cache when the underlying database state changes.
E.g. if you have a users table and you need user names on many pages, load the entire table into the cache at application startup; just make sure to update the cache when you add new users or modify/delete entries in the users table.
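A small sketch of that user-table example, under stated assumptions: `UserNameCache` and its method names are hypothetical, and the `Map` passed to the constructor stands in for the rows read from the users table at startup.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical cache of user names, loaded once and kept in step with writes. */
final class UserNameCache {
    private final Map<Long, String> namesById = new ConcurrentHashMap<>();

    /** Load the whole table once at application startup. */
    UserNameCache(Map<Long, String> rowsAtStartup) {
        namesById.putAll(rowsAtStartup);
    }

    /** Served from memory; no database query. */
    Optional<String> nameOf(long id) {
        return Optional.ofNullable(namesById.get(id));
    }

    // Call the two methods below from the same code paths that write to the database.

    void onUserSavedOrRenamed(long id, String name) {
        namesById.put(id, name);
    }

    void onUserDeleted(long id) {
        namesById.remove(id);
    }
}
```

This only stays correct while every write to the users table goes through this one application instance. As another answer points out, once a different node updates the database, the in-memory copy becomes stale unless invalidation is distributed.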

You don't persist objects to a database. What you persist is an object's state, so that you can have exactly the same state even after your app stops/closes/restarts. If you want to keep the state of your objects persisted, you have no choice but to use a DB (or anything else that allows you to write data to the file system).

The details are beyond the scope of an answer here, but we have had good experience of using ehCache ( http://ehcache.org/ )
The combination of support for distributed caches, and overflow to disk has allowed us to keep large numbers of computationally heavy, but fairly unchanging pages in the cache for a site being served from multiple tomcats.
Distribution addresses the question of staleness (if you invalidate your items correctly) and the disk overflow allows us to basically cache everything which was just not feasible with an in-memory cache.
Of course the implementation is not trivial for a real world application, but it improved our performance significantly once the caches were bubbling.
