Using a cache as a layer in front of a database - java

I'm working on a back-end service which is asynchronous in nature. That is, we have multiple jobs that are run asynchronously, and the results are written to some record.
This record is basically a class wrapping a HashMap of results (the keys are job_ids).
The thing is, I don't want to calculate or know in advance how many jobs are going to run (if I knew, I could cache.invalidate() the key once all the jobs have completed).
Instead, I'd like to have the following scheme:
Set an expiry for new records (i.e. expireAfterWrite)
On expiry, write (actually upsert) the record to the database
If a cache miss occurs, load() is called to fetch the record from the database (if not found, create a new one)
The problem:
I tried to use a Caffeine cache, but the problem is that records aren't expired at the exact time they are supposed to be. I then read this SO answer about Guava's Cache, and I gather a similar mechanism applies to Caffeine as well.
So the problem is that a record can "wait" in the cache for quite a while, even though it was already completed. Is there a way to overcome this issue? That is, is there a way to "encourage" the cache to invalidate expired items?
That led me to question my solution. Would you consider my approach good practice?
P.S. I'm willing to switch to other caching solutions, if necessary.

You can have a look at Ehcache with write-behind. It is certainly more setup effort, but it works quite well.
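If you'd rather stay with Caffeine, it also exposes a Scheduler that can be used to "encourage" prompt expiration, and a removal listener can do the upsert when an entry expires. Here is a minimal sketch of that idea; the ResultRecord and ResultDao types are placeholders I'm assuming, not anything from the question:

import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import com.github.benmanes.caffeine.cache.RemovalCause;
import com.github.benmanes.caffeine.cache.Scheduler;

public class ResultCache {

    /** Placeholder for the record class wrapping a HashMap of results keyed by job_id. */
    public static class ResultRecord {
        final String recordId;
        final Map<String, Object> resultsByJobId = new ConcurrentHashMap<>();
        ResultRecord(String recordId) { this.recordId = recordId; }
    }

    /** Hypothetical persistence layer; replace with your real DAO. */
    public interface ResultDao {
        ResultRecord findById(String recordId);   // returns null if not found
        void upsert(ResultRecord record);
    }

    private final LoadingCache<String, ResultRecord> cache;

    public ResultCache(ResultDao dao) {
        this.cache = Caffeine.newBuilder()
                .expireAfterWrite(Duration.ofMinutes(5))
                // A scheduler prompts expiration close to the deadline instead of
                // waiting for other cache activity to trigger maintenance.
                .scheduler(Scheduler.systemScheduler())
                .removalListener((String id, ResultRecord record, RemovalCause cause) -> {
                    if (record != null && cause.wasEvicted()) {
                        dao.upsert(record);                // write-behind on expiry
                    }
                })
                .build(id -> {
                    ResultRecord existing = dao.findById(id);   // cache miss: load from DB
                    return existing != null ? existing : new ResultRecord(id);
                });
    }

    public ResultRecord get(String recordId) {
        return cache.get(recordId);
    }
}

Even with a scheduler, expiration remains approximate and removal listeners run asynchronously, so treat the upsert as happening around the expiry time rather than exactly at it.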

Related

Collection processing or database request? Which one is better?

This is my first post on stackoverflow, so please be nice to me :-)
So let me explain the context. I'm developing a web service with standard layers (resources, services, DAO layer...). I use JPA with the Hibernate implementation to map my object model to the database.
For a parent class A and a child class B, most of the time when I want to find an object B in the collection, I use the Stream API to filter the collection based on what I want. My question here is more general: is it better to search for an object by querying the database (from my point of view this will cause a lot of database calls but use less CPU), or to do the opposite and search over the object model, processing the collection in memory (fewer database calls, but more CPU)?
If you consider latency, the database will always be slower.
So you gotta ask yourself some questions:
How far away is the database (latency)?
How big is the dataset?
How do I process it?
Do I have any major runtime issues?
(from my point of view this will cause a lot of database calls but use less CPU), or to do the opposite and search over the object model, processing the collection in memory (fewer database calls, but more CPU)
Your program is probably not written very efficiently. I suggest you check the big-O complexity of your processing if you have any major runtime issues.
Your question is very broad, so it's hard to say which approach would be best for your use case.
Use the database to return the data you need, and use Java to perform any processing on it that would be complicated to do in a JPQL/SQL query.
Databases are designed to perform queries more efficiently than Java (stream or no).
Besides, fetching a lot of data from the database only to keep part of it is not efficient.
The database is usually faster since it is optimized for requesting specific data. Usually one would add indexes to speed up querying on certain fields.
TL;DR: Filter your data in the database and process it in Java.
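To make the trade-off concrete, here is a minimal sketch; the Parent/Child entities and their fields are assumptions for illustration, not taken from the question:

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;
import javax.persistence.ManyToOne;
import javax.persistence.OneToMany;

// Hypothetical entities for illustration only.
@Entity
class Parent {
    @Id Long id;
    @OneToMany(mappedBy = "parent") List<Child> children = new ArrayList<>();
    List<Child> getChildren() { return children; }
}

@Entity
class Child {
    @Id Long id;
    @ManyToOne Parent parent;
    String status;
    String getStatus() { return status; }
}

public class ChildLookup {

    private final EntityManager em;

    public ChildLookup(EntityManager em) {
        this.em = em;
    }

    // Option 1: filter in the database; only the matching rows are transferred.
    public List<Child> findActiveChildren(long parentId) {
        return em.createQuery(
                "select c from Child c where c.parent.id = :pid and c.status = 'ACTIVE'",
                Child.class)
            .setParameter("pid", parentId)
            .getResultList();
    }

    // Option 2: load the parent and filter its collection in memory with the Stream API.
    // Fine for small collections; wasteful if most children are fetched only to be discarded.
    public List<Child> findActiveChildrenInMemory(long parentId) {
        Parent parent = em.find(Parent.class, parentId);
        return parent.getChildren().stream()
                .filter(c -> "ACTIVE".equals(c.getStatus()))
                .collect(Collectors.toList());
    }
}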
This isn't an easy question to answer, since there are many different factors that would influence my decision to go to the db or not. First, I think it's fair to say that, for almost every app I've worked on in the past 20 years, hitting the DB for information is the default strategy. More recently (say past 10 or so years) data access through web service calls has become common as well.
For me, the main question would be something along the lines of, "Are there any situations when I would not hit an external resource (DB, Service, or even file read) for data every time I need it?"
So, I'll outline some of the things I would consider.
Is the data search space very small?
If you are searching a data space of tens of different records, then this information might be a candidate for non-DB storage. On the other hand, once you get past a fairly small set of records, this approach becomes increasingly untenable. Examples of these "small sets" might be something like salutations (Mr., Ms., Dr., Mrs., Lord). I look for small sets of data that rarely change, which I, as a lazy developer, wouldn't mind typing into a configuration file. Once I get past something like 50 different records (US States, for example), I want to pull that info from a DB or service call.
Are the data cacheable?
If you have multiple requests that could legitimately use the exact same data, then leverage caching in your application. Examine the data and expected usage of your service for opportunities to leverage regularities in data and likely requests to cache data whenever possible. Remember to consider cache keys, how long items should be cached, and when cached items should be evicted.
In many web usage scenarios, it's not uncommon that each display includes a fairly large amount of cached information and a small amount of dynamic data. Menu and other navigation items are good candidates for caching. User-specific data, such as contract-specific pricing in an eCommerce app, is often a poor candidate.
Can you pre-load some data into cache?
Some items can be read once and cached for the entire duration of your application. A list of US States and/or Canadian Provinces is a good example here. These almost never change, so once read from the db, you would rarely need to read them again. Consider application components that can load such data on startup, and then hold this data in an appropriate collection.
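As an illustration of that last point, here is a small sketch of a startup-loaded, read-only cache; StateDao is a hypothetical data-access interface standing in for the DB or service call:

import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Reference data such as US States almost never changes, so read it once at
// startup and keep it in an unmodifiable map for the lifetime of the application.
public class StateCache {

    public static final class State {
        private final String code;
        private final String name;

        public State(String code, String name) {
            this.code = code;
            this.name = name;
        }

        public String getCode() { return code; }
        public String getName() { return name; }
    }

    /** Hypothetical DAO; replace with your real database or service call. */
    public interface StateDao {
        List<State> findAll();
    }

    private final Map<String, State> statesByCode;

    public StateCache(StateDao dao) {
        // One read at startup; afterwards every lookup is an in-memory map access.
        this.statesByCode = dao.findAll().stream()
                .collect(Collectors.toUnmodifiableMap(State::getCode, Function.identity()));
    }

    public State byCode(String code) {
        return statesByCode.get(code);
    }
}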

Duplicates because Oracle is too fast or multithread?

I have a problem with duplicate records arriving in our database via a Java web service, and I think it's to do with Oracle processing threads.
Using an iPhone app we built, users add bird observations to a new site they visit on holiday. They create three records at "New Site A" (for example). The iPhone packages each of these three records into separate JSON strings containing the same date and location details.
On Upload, the web service iterates through each JSON string.
Iteration/Observation 1. It checks the database to see if the site exists, and if not, creates a new site and adds the observation into a hanging table.
Iteration/Obs 2. The site should now exist in the database, but it isn't found by the same site check as in iteration 1, and a second new site is created.
Iteration/Obs 3. The check for existing site NOW WORKS, and the third observation is attached to one of the existing sites. So the web service and database code does work.
The web service commits at the end of each iteration.
Is the reason the second iteration doesn't find the new site that there is a delay in Oracle's commit after it is called from the Java code, so that iteration 2 has already started by the time iteration 1 is truly complete? Or is it possible that Oracle is running each iteration on a separate thread?
One solution we thought about was to use Thread.sleep(1000) in the web service, but I'd rather not penalize the iPhone users.
Thanks for any help you can provide.
Iain
Sounds like a race condition to me. Probably observations 1 and 2 are arriving very close to each other, so that 1 is still being processed when 2 arrives. Oracle is ACID-compliant, meaning your transaction for observation 2 cannot see the changes made in transaction one unless transaction one completed before transaction two started.
If you need a check-then-create functionality, you'd best synchronize this at a single point in your back end.
Also, add a constraint in your DB so that duplicates are ruled out at the database level.
It's not an Oracle problem, and Thread.sleep would be a poor solution, especially since you don't know the root cause.
Your description is confusing. Are the three JSON strings sent in one HTTP request? Does the order matter, or does processing any of them first set up the new location for the ones that follow?
What's a "hanging table"?
Is this a parent-child relation between location and observation? So the unit of work is to INSERT a new location into the parent table followed by three observations in the child table that refer back to the parent?
I think it's a problem with your queries and how they're written. I can promise you that Oracle is fast enough for this trivial problem. If it can handle NASDAQ transaction rates, it can handle your site.
I'd write your DAO for Observation this way:
public interface ObservationDao {
    void saveOrUpdate(Observation observation);
}
Keep all the logic inside the DAO. Test it outside the servlet and put it aside. Once you have it working you can concentrate on the web app.
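For the site lookup itself, one way to make the check-then-create step safe against the race described in the question is to lean on a unique constraint and simply re-read after a losing insert. A plain-JDBC sketch; the table and column names (site, name, latitude, longitude) are illustrative, and it assumes a unique constraint over those columns plus an id column:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

// Find-or-create for the site row. If two requests race, the second INSERT hits
// the unique constraint and we simply re-read the row the winner created.
public class SiteFinder {

    public long findOrCreateSite(Connection con, String name, double lat, double lon)
            throws SQLException {
        Long id = findSiteId(con, name, lat, lon);
        if (id != null) {
            return id;
        }
        try (PreparedStatement insert = con.prepareStatement(
                "INSERT INTO site (name, latitude, longitude) VALUES (?, ?, ?)")) {
            insert.setString(1, name);
            insert.setDouble(2, lat);
            insert.setDouble(3, lon);
            insert.executeUpdate();
        } catch (SQLIntegrityConstraintViolationException duplicate) {
            // A concurrent request created the same site first; fall through and re-read it.
        }
        id = findSiteId(con, name, lat, lon);
        if (id == null) {
            throw new SQLException("Site was neither found nor created");
        }
        return id;
    }

    private Long findSiteId(Connection con, String name, double lat, double lon)
            throws SQLException {
        try (PreparedStatement select = con.prepareStatement(
                "SELECT id FROM site WHERE name = ? AND latitude = ? AND longitude = ?")) {
            select.setString(1, name);
            select.setDouble(2, lat);
            select.setDouble(3, lon);
            try (ResultSet rs = select.executeQuery()) {
                return rs.next() ? rs.getLong("id") : null;
            }
        }
    }
}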

Java synchronization options for preventing duplicate orders (file, db locking?)

I have two use cases for placing an order on a website. One is directly submitted from a web front end with a credit card, and the other is a notification of an external payment from a processor like PayPal. In both situations, I need to ensure that the order is only placed one time.
I would like to use the same mechanism for both scenarios if possible, to help with code reuse. In the first use case, the user can submit the order form multiple times, resulting in different threads trying to place an order. I can use Ajax to stop this, but I need a server-side solution for certainty. In the second use case, the notification messages may be sent through in duplicate, so I need to protect against that too.
I want the solution to be scalable across a distributed environment, so a memory lock is out of the question. I was looking at saving a unique token to the database to prevent multiple submissions there, but I really don't want to be messing with the existing database transactions. The real solution, it seems, is to lock on something external like a file in a shared location across JVMs.
All orders have a unique long id, so I could use that to synchronize. What would be the best way of doing this? I could potentially create a file per id, or do something fancier with a region of the file. However I don't have much experience with file locking, so if there is a better option I would love to hear it. Any code samples would help very much.
If you already have a unique long id, nothing beats a simple database table with manually assigned primary keys. Every RDBMS (and also key-value NoSQL databases) will effectively and efficiently detect primary key clashes. It is basically:
1. Start a transaction
2. INSERT INTO orders VALUES (your_unique_id)
3. Commit
Depending on the database, step 2 or step 3 will throw an exception which you can easily catch.
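A minimal JDBC sketch of that pattern; the orders table and its columns are illustrative, and the primary key on the id is what makes the insert idempotent across threads and JVMs:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;
import javax.sql.DataSource;

// The primary key on orders(id) guarantees that only one insert for a given
// order id can ever succeed, no matter how many threads or JVMs try.
public class OrderGuard {

    private final DataSource dataSource;

    public OrderGuard(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    /** Returns true if this call placed the order, false if it was already placed. */
    public boolean placeOnce(long orderId) throws SQLException {
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO orders (id) VALUES (?)")) {
            ps.setLong(1, orderId);
            ps.executeUpdate();
            return true;                  // first successful insert wins
        } catch (SQLIntegrityConstraintViolationException duplicate) {
            return false;                 // an earlier or concurrent request got there first
        }
    }
}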
If you really want to avoid databases (could you elaborate a little bit more why?), you can:
Use file locking (nasty and not scalable), don't go that way.
In-memory locking with clustering (with Terracotta it's like working with a normal boolean that is magically clustered)
Queuing requests and having only a single consumer.
Using JMS and a single-threaded consumer looks promising; however, you still have to detect duplicates (but at least you avoid concurrently placed orders), and it might be terribly slow...

Google App Engine + Memcache how to get all data from cache

Inside my system I have data with a short lifespan; that is, the data stays current only for a short time, but it should still be persisted in the datastore.
Also, this data may change frequently for each user, for instance every minute.
The number of users is potentially large, and I want to speed up the put/get of this data by using memcache with a delayed persist to Bigtable.
There is no problem just putting/getting objects by key. But for some use cases I need to retrieve all the data from the cache that is still alive, and the API only lets me get data by key. Hence I need some key holder that knows all the keys of the data inside memcache... But any object may be evicted, and then I need to remove its key from the global registry of keys (and such an eviction listener doesn't exist in GAE). Storing all these objects in a list or a map is not acceptable for my solution, because each object needs its own time to evict...
Could somebody recommend which way I should go?
It sounds like what you really are attempting to do is have some sort of queue for data that you will be persisting. Memcache is not a good choice for this since as you've said, it is not reliable (nor is it meant to be). Perhaps you would be better off using Task Queues?
Memcache isn't designed for exhaustive access, and if you need it, you're probably using it the wrong way. Memcache is a sharded hashtable, and as such really isn't designed to be enumerated.
It's not clear from your description exactly what you're trying to do, but it sounds like at the least you need to restructure your data so you're aware of the expected keys at the time you want to write it to the datastore.
I am encountering the very same problem. I might solve it by building a decorator function and wrapping the evicting function with it, so that the key to the entity is automatically deleted from the key directory/placeholder in memcache when you call for eviction.
Something like this:
from google.appengine.api import memcache
from google.appengine.ext import db

def decorate_evict_decorator(key_prefix):
    def evict_decorator(evict):
        def wrapper(self, entity_name_or_id):  # use self if the function is bound to a class.
            mem = memcache.Client()
            placeholder = mem.get("placeholder")  # could use gets with cas
            # placeholder looks like {key_prefix + "|" + entity_name_or_id: key_or_id}
            evict(self, entity_name_or_id)
            if placeholder is not None:
                placeholder.pop(key_prefix + "|" + entity_name_or_id, None)
                mem.set("placeholder", placeholder)
        return wrapper
    return evict_decorator

class car(db.Model):
    car_model = db.StringProperty(required=True)
    company = db.StringProperty(required=True)
    color = db.StringProperty(required=True)
    engine = db.StringProperty()

    @classmethod
    @decorate_evict_decorator("car")
    def evict(cls, car_model):
        pass  # delete process goes here

class engine(db.Model):
    model = db.StringProperty(required=True)
    cylinders = db.IntegerProperty(required=True)
    litres = db.FloatProperty(required=True)
    manufacturer = db.StringProperty(required=True)

    @classmethod
    @decorate_evict_decorator("engine")
    def evict(cls, engine_model):
        pass  # delete process goes here
You could improve on this according to your data structure and flow. And for more on decorators.
You might want to add a cron job to keep your datastore in sync with memcache at a regular interval.

caching readonly data for java application

I have a database with around 150K records and a primary key on the table. The data size for each record is less than 1 kB. The processing time for constructing a POJO from a DB record is about 1-2 seconds (there is some business logic that takes too much time). This is read-only data, so I'm planning to cache it. What I'm thinking of doing is: load the data in subsets (200 records each time) and create a thread that constructs the POJOs and keeps them in a hashtable. While the cache is being loaded (when I start the application) the user will see a wait sign. If storing the data in a Hashtable turns out to be an issue, I'll store the processed data in another DB table instead (marshalling the POJOs to XML).
I use a third-party API to load the data from the database. Once I load a record, I have to load the associations for it, and then the associations of those associations found at the top level. It's like loading a family tree.
I can't use Hibernate or any ORM framework, because I'm using a third-party API to load the data that is shipped with the database itself (it's a product). Moreover, I don't think loading the data once is a big issue.
If there were a possibility to fine-tune the business logic, I wouldn't have asked this question here.
Caching the data on demand is an option, but I'm trying to see if I can do anything better.
Please suggest a better idea if you are aware of one. Thank you.
Please suggest a better idea if you are aware of one.
Yes, fix the business logic so that it doesn't take 1 to 2 seconds per record. That's a ridiculously long time.
Before you do that, profile your application to make sure that it is really the business logic that is causing the slow record loading, and not something else. (For example, it could be a pathological data structure, or a database issue.)
Once you've fixed the root cause of the slow record loading, it is still a good idea to cache the read-only records, but you probably don't need to preload the cache. Instead, just load the records on demand.
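If you go the on-demand route and don't want to pull in a caching library straight away, even a ConcurrentHashMap gets you most of the way there. A sketch; the loader function stands in for your third-party API call plus the business logic:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal on-demand cache: each record is built at most once per key and then reused.
public class RecordCache<K, V> {

    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    public RecordCache(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        // computeIfAbsent loads missing entries lazily and is safe for concurrent readers.
        return cache.computeIfAbsent(key, loader);
    }
}

Usage would be something like new RecordCache<>(id -> buildPojoFromThirdPartyApi(id)), where buildPojoFromThirdPartyApi is whatever your current loading code happens to be.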
It sounds like you are reinventing the wheel. I'd be looking to use Hibernate. Apart from simplifying the code to access the database, Hibernate has built-in caching and lazy loading of data, so it only creates objects as you request them. Ergo, a lot of what you describe above is already in place, and you can concentrate on sorting out your business logic. I suspect that once you solve the business logic performance issue, there will be no need for such a complicated caching system, and the Hibernate defaults will be sufficient.
As maximdim said in a comment, preloading the whole thing will take a lot of time. If your system is not very strange, the user won't need all data at once. Just cache on demand instead. I would also recommend using an established caching solution, such as EHCache, which has persistence via DiskStore -- the only issue is that whatever you cache in this case has to be Serializable. Since you can marshall it as XML, I'm betting you can serialize it too, which should be faster.
In a past project, we had to query a very busy, very sluggish service running on an off-site mainframe in order to assemble one of the entities. Average response times from our app were dominated by this query. Since the data we retrieved was mostly read-only, caching with EHCache solved our problems.
jdbm has a nice, persistent map implementation (http://code.google.com/p/jdbm2/) - that may help you do local caching - it would certainly be a lot faster than serializing your POJOs to XML and writing them back into a SQL database.
If your data is truly read-only, then I'd think that the best solution would be to treat the source database as an input queue that feeds your app database. Create a background process (heck, a service would be better), and have it monitor the source database and keep your app database synced.
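A sketch of that background-sync idea using a scheduled executor; the actual sync step (reading changed rows from the source database and upserting them into the app database) is passed in as a Runnable, since its details depend entirely on your schema:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Periodically runs a sync task that copies changes from the source database
// into the app database. A real implementation would also track a watermark
// (e.g. a last-modified timestamp) so each run only processes new changes.
public class SyncService {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start(Runnable syncOnce) {
        // Run an initial sync immediately, then again five minutes after each run completes.
        scheduler.scheduleWithFixedDelay(syncOnce, 0, 5, TimeUnit.MINUTES);
    }

    public void stop() {
        scheduler.shutdown();
    }
}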
