Google App Engine + Memcache: how to get all data from cache - java

Inside my system I have data with a short lifetime: it stays current only for a short while, but it still has to be persisted to the datastore.
This data may also change frequently for each user, for instance every minute.
The number of users can potentially be quite large, and I want to speed up the put/get of this data by using memcache and delaying the persist to Bigtable.
There is no problem with just putting/getting objects by key. But for some use cases I need to retrieve all the data from the cache that is still alive, and the API only lets me get data by key. So I need some key holder that knows all the keys of the data currently in memcache... But any object may be evicted, and then I would need to remove its key from the global key registry (and such an eviction listener doesn't exist in GAE). Storing all these objects in a list or a map is not acceptable for my solution because each object needs its own eviction time...
Could somebody recommend which way I should go?

It sounds like what you are really attempting to do is have some sort of queue for the data you will be persisting. Memcache is not a good choice for this since, as you've said, it is not reliable (nor is it meant to be). Perhaps you would be better off using Task Queues?
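For example, a minimal sketch of deferring the datastore write with a push queue; the /persistEntity handler and the one-minute delay are illustrative assumptions, not something from the question:

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class DelayedPersist {
    // Enqueue a task that persists the cached value later. The /persistEntity
    // servlet is hypothetical: it would read the key parameter, fetch the latest
    // value from memcache and write it to the datastore.
    public static void scheduleWrite(String entityKey) {
        Queue queue = QueueFactory.getDefaultQueue();
        queue.add(TaskOptions.Builder
                .withUrl("/persistEntity")
                .param("entityKey", entityKey)
                .countdownMillis(60 * 1000));   // delay the datastore write by about a minute
    }
}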

Memcache isn't designed for exhaustive access, and if you need it, you're probably using it the wrong way. Memcache is a sharded hashtable, and as such really isn't designed to be enumerated.
It's not clear from your description exactly what you're trying to do, but it sounds like at the least you need to restructure your data so you're aware of the expected keys at the time you want to write it to the datastore.

I am encountering the very same problem. I might solve it by building a decorator function and wrapping it around the evicting function, so that the entity's key is automatically deleted from the key directory/placeholder in memcache when you call for eviction.
Something like this:
from google.appengine.api import memcache
from google.appengine.ext import db

def decorate_evict_decorator(key_prefix):
    def evict_decorator(evict):
        def wrapper(self, entity_name_or_id):  # use self/cls if the function is bound to a class
            mem = memcache.Client()
            placeholder = mem.get("placeholder")  # could use gets() with cas
            # layout: {"placeholder": {key_prefix + "|" + entity_name: key_or_id}}
            evict(self, entity_name_or_id)
            del placeholder[key_prefix + "|" + entity_name_or_id]
            mem.set("placeholder", placeholder)
        return wrapper
    return evict_decorator

class car(db.Model):
    car_model = db.StringProperty(required=True)
    company = db.StringProperty(required=True)
    color = db.StringProperty(required=True)
    engine = db.StringProperty()

    @classmethod
    @decorate_evict_decorator("car")
    def evict(cls, car_model):
        pass  # delete process

class engine(db.Model):
    model = db.StringProperty(required=True)
    cylinders = db.IntegerProperty(required=True)
    litres = db.FloatProperty(required=True)
    manufacturer = db.StringProperty(required=True)

    @classmethod
    @decorate_evict_decorator("engine")
    def evict(cls, engine_model):
        pass  # delete process
You could improve on this according to your data structure and flow; see the Python documentation for more on decorators.
You might also want to add a cron job to keep your datastore in sync with memcache at a regular interval.
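For instance, a rough Java sketch of a handler that a cron entry could invoke every few minutes; the /tasks/syncCache mapping, the "placeholder" key directory and the "CachedItem" kind are assumptions for illustration:

import java.io.IOException;
import java.util.Map;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

// Hypothetical servlet mapped to /tasks/syncCache and triggered from cron.xml.
// It walks the key directory kept in memcache and persists each live value.
public class SyncCacheServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        MemcacheService cache = MemcacheServiceFactory.getMemcacheService();
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

        @SuppressWarnings("unchecked")
        Map<String, String> directory = (Map<String, String>) cache.get("placeholder");
        if (directory == null) {
            return;   // nothing cached, or the directory itself was evicted
        }
        // The directory maps a readable name to the memcache key of the entity.
        for (String memcacheKey : directory.values()) {
            Object value = cache.get(memcacheKey);
            if (value != null) {
                Entity entity = new Entity("CachedItem", memcacheKey);   // kind name is illustrative
                entity.setUnindexedProperty("payload", value.toString());
                datastore.put(entity);
            }
        }
    }
}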

Related

Java : relational database vs static variable

I have a web application in which I maintain many static Maps to store my relevant information. Since the application is deployed on a server, every hit to the server-side Java uses these maps to match the key, get the appropriate result and send it back to the client. My code contains a rank-and-retrieval feature, so I have to read the entire keySet of each of these Maps.
My questions are:
1. Is working with static variables better than storing this data in a local embedded DB like Apache Derby and then using it?
2. The data is used very frequently. Would a database be the faster approach? Since I read the full keySet, a WHERE clause may not help in many operations.
3. How is the server's memory impacted by holding this data in static variables?
The number of maps is fixed, but the size of each Map keeps increasing. Please suggest the better solution.
If you want the data to be saved regularly, an embedded database like H2 makes sense. You then also have snapshots of the data, and development and structural changes are a bit safer.
A real database also has incredible power behind it: concurrency, caching and so on. An embedded (file-based) database less so.
The problem with maps is that data extraction can turn into several layers of indirection. It is more versatile to have SQL queries with joins on the tables.
So SQL is more abstract (it does not prescribe the actual query implementation) and easier to test. SQL, for instance, relieves the developer from programming reports by hand.
So go for a database, IMHO, when you are really doing hard work.
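A minimal sketch of what the embedded H2 route could look like; the file path, table and query are invented for the example, not taken from the question:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmbeddedH2Example {
    public static void main(String[] args) throws Exception {
        // File-based embedded H2 database; ./data/appdb is created on first use.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./data/appdb", "sa", "")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS ranking (term VARCHAR(255) PRIMARY KEY, score INT)");
                st.execute("MERGE INTO ranking KEY(term) VALUES ('example', 42)");
            }
            // A WHERE clause replaces scanning the whole keySet() of a static Map.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT term, score FROM ranking WHERE score > ?")) {
                ps.setInt(1, 10);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("term") + " -> " + rs.getInt("score"));
                    }
                }
            }
        }
    }
}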
What you might want to consider is storing data in the map once it has been searched for.
For instance, if a user searches for something specific, that result is stored in the map so that the next user who searches for it gets the data directly from the map rather than from the database.
There are some downsides, though: you need to make sure that if the data is changed in the database, the hashmap/cache is cleared or updated with the new data, to prevent serving outdated data to the user.
As for the impact on the server's memory, it depends on the size of the data you're storing. It's hard to give a precise answer, but you can test it on your own:
long memoryBefore = Runtime.getRuntime().freeMemory();
// populate your map
long memoryAfter = Runtime.getRuntime().freeMemory();
System.out.println(memoryBefore - memoryAfter);
That should give you the number of bytes used, more or less, depending on the operations you run between memoryBefore and memoryAfter, since you may have instantiated other classes or variables unrelated to the map.

Collection processing or database request ? which one is better

This is my first post on stackoverflow, so please be nice to me :-)
So let me explain the context. I'm developing a web service with standard layers (resources, services, DAO layer...). I use JPA with the Hibernate implementation to map my object model to the database.
For a parent class A and a child class B, most of the time when I want to find a B object in the collection, I use the Stream API to filter the collection for what I want. My question is more general: is it better to search for an object by querying the database (from my point of view this causes a lot of database calls but uses less CPU), or the opposite, to search over the object model and process the collection in memory (fewer database calls, but more CPU)?
If you consider latency, the database will always be slower.
So you gotta ask yourself some questions:
How far away is the database (latency)?
How big is the dataset?
How do I process it?
Do I have any major runtime issues?
As for "a lot of calls to the database but less CPU" versus "fewer database calls but more CPU processing": your program is probably not written very performantly. I suggest you check the big-O complexity of your code if you have any major runtime problems.
Your question is very broad, so it's hard to say which would be best for your use case.
Use the database to return the data you need, and use Java to perform the processing on it that would be complicated to do in a JPQL/SQL query.
Databases are designed to perform queries more efficiently than Java (stream or no stream).
Besides, fetching a lot of data from the database only to keep part of it is not efficient.
The database is usually faster since it is optimized for requesting specific data. Usually one would add indexes to speed up querying on certain fields.
TL;DR: filter your data in the database and process it from Java.
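To illustrate the difference, here is a hedged sketch; the entities and fields are invented for the example, standing in for the question's class A and class B:

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;
import javax.persistence.ManyToOne;
import javax.persistence.OneToMany;

@Entity
class Parent {                                       // the question's class A
    @Id Long id;
    @OneToMany(mappedBy = "parent")
    List<Child> children = new ArrayList<>();
}

@Entity
class Child {                                        // the question's class B
    @Id Long id;
    String name;
    @ManyToOne Parent parent;
}

class ChildLookup {
    // Database-side filtering: only the matching rows are loaded.
    static List<Child> byNameViaJpql(EntityManager em, long parentId, String name) {
        return em.createQuery(
                "SELECT c FROM Child c WHERE c.parent.id = :pid AND c.name = :name", Child.class)
            .setParameter("pid", parentId)
            .setParameter("name", name)
            .getResultList();
    }

    // In-memory filtering: fine when the collection is small and already loaded,
    // wasteful when it forces the whole collection to be fetched first.
    static List<Child> byNameViaStream(Parent parent, String name) {
        return parent.children.stream()
                .filter(c -> name.equals(c.name))
                .collect(Collectors.toList());
    }
}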
This isn't an easy question to answer, since there are many different factors that would influence my decision to go to the db or not. First, I think it's fair to say that, for almost every app I've worked on in the past 20 years, hitting the DB for information is the default strategy. More recently (say past 10 or so years) data access through web service calls has become common as well.
For me, the main question would be something along the lines of, "Are there any situations when I would not hit an external resource (DB, Service, or even file read) for data every time I need it?"
So, I'll outline some of the things I would consider.
Is the data search space very small?
If you are searching a data space of tens of different records, then this information might be a candidate for non-DB storage. On the other hand, once you get past a fairly small set of records, this approach becomes increasingly untenable. Examples of these "small sets" might be something like salutations (Mr., Ms., Dr., Mrs., Lord). I look for small sets of data that rarely change, which I, as a lazy developer, wouldn't mind typing into a configuration file. Once I get past something like 50 different records (US states, for example), I want to pull that info from a DB or service call.
Are the data cacheable?
If you have multiple requests that could legitimately use the exact same data, then leverage caching in your application. Examine the data and expected usage of your service for opportunities to leverage regularities in data and likely requests to cache data whenever possible. Remember to consider cache keys, how long items should be cached, and when cached items should be evicted.
In many web usage scenarios, it's not uncommon for each page to include a fairly large amount of cached information and a small amount of dynamic data. Menus and other navigation items are good candidates for caching. User-specific data, such as contract-specific pricing in an eCommerce app, is often a poor candidate.
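As a small illustration of that kind of caching, a sketch using Guava's CacheBuilder; the ten-minute TTL, the size cap and the menu example are placeholders, not requirements:

import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class MenuCache {
    // Entries expire ten minutes after they are written and the cache is
    // capped at 1,000 entries; both numbers are illustrative and should be tuned.
    private final Cache<String, String> cache = CacheBuilder.newBuilder()
            .maximumSize(1_000)
            .expireAfterWrite(10, TimeUnit.MINUTES)
            .build();

    public String menuHtml(String userRole) {
        String cached = cache.getIfPresent(userRole);
        if (cached == null) {
            cached = renderMenuFromDatabase(userRole);   // hypothetical slow call
            cache.put(userRole, cached);
        }
        return cached;
    }

    private String renderMenuFromDatabase(String userRole) {
        return "<nav>menu for " + userRole + "</nav>";
    }
}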
Can you pre-load some data into cache?
Some items can be read once and cached for the entire duration of your application. A list of US States and/or Canadian Provinces is a good example here. These almost never change, so once read from the db, you would rarely need to read them again. Consider application components that can load such data on startup, and then hold this data in an appropriate collection.
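As an illustration, such a startup component can be as simple as the following sketch; the class name and hard-coded entries are invented, and in practice the entries would be read once from the DB at startup:

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public final class StateCodes {
    // Loaded once at class initialization and never modified afterwards.
    private static final Map<String, String> STATES = loadStates();

    private static Map<String, String> loadStates() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("CA", "California");
        m.put("NY", "New York");
        m.put("TX", "Texas");
        // ... remaining states, or a single SELECT at startup ...
        return Collections.unmodifiableMap(m);
    }

    public static String nameFor(String code) {
        return STATES.get(code);
    }
}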

Using hashmap or H2 database?

I am developing a web application in which I need to store sessions, user messages, etc. I am thinking of using a HashMap or an H2 database.
Please let me know which is the better approach in terms of performance and memory utilization. The web site has to support 10,000 users.
Thanks.
As usual with these questions, I would worry about performance as/when you know it's an issue.
10000 users is not a lot of data to hold in memory. I would likely start off with a standard Java collection, and look at performance when you predict it's going to cause you grief.
Abstract out the access to this Java collection so that when you substitute it, the refactoring required is localised (and perhaps make it configurable, so that you can easily run before/after performance tests with your different solutions: H2, Derby, Oracle, etc.).
If your session objects aren't too big (which should be the case), there is no need to persist them in a database.
Using a database for this would add a lot of complexity to a case you can handle with a few lines of code. So don't use a database; simply store them in a light in-memory structure (a HashMap, for example).
You may need to implement a way to clean your HashMap if you don't want to keep sessions in memory after the user has been gone for a long time. Many solutions are available (the easiest is simply a background thread that removes too-old sessions from time to time, as sketched below). Note that it's usually easier to clean a HashMap than a database.
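A rough sketch of that background-thread approach; the 30-minute idle timeout, the class names and the cleanup interval are all illustrative:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SessionStore {
    private static final long MAX_IDLE_MILLIS = 30 * 60 * 1000L;   // 30 minutes, illustrative

    private static final class Session {
        final String userId;
        volatile long lastAccess = System.currentTimeMillis();
        Session(String userId) { this.userId = userId; }
    }

    private final Map<String, Session> sessions = new ConcurrentHashMap<>();
    private final ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();

    public SessionStore() {
        // Periodically drop sessions that have been idle for too long.
        cleaner.scheduleAtFixedRate(this::evictStale, 1, 1, TimeUnit.MINUTES);
    }

    public void put(String sessionId, String userId) {
        sessions.put(sessionId, new Session(userId));
    }

    public String userFor(String sessionId) {
        Session s = sessions.get(sessionId);
        if (s == null) {
            return null;
        }
        s.lastAccess = System.currentTimeMillis();
        return s.userId;
    }

    private void evictStale() {
        long cutoff = System.currentTimeMillis() - MAX_IDLE_MILLIS;
        sessions.values().removeIf(s -> s.lastAccess < cutoff);
    }
}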
Both H2 and a HashMap are going to keep the data in memory (so from a space point of view they are almost the same).
If lookups are simple key-value gets, then looking up in the HashMap will be quicker.
If you have to do comparisons like KEY < 100, use H2.
In fact, 10K users' worth of info is not that high a number.
If you don't need to save user messages, use the collections. But if the messages should be saved, be sure to use a database, because after a restart you would otherwise lose all the data.
The problem with using a HashMap for storing objects is that you would run into issues when your site becomes too big for one server and would need to be clustered in order to scale with demand. Then you would face problems with how to synchronise the HashMap instances on different servers.
A possible alternative would be to use a key-value store like Redis, since you won't need the structure of a database, or even to use the distributed cache abilities of something like Ehcache.

How to retrieve huge (>2000) amount of entities from GAE datastore in under 1 second?

We have a part of our application that needs to load a large set of data (>2000 entities) and perform a computation on this set. The size of each entity is approximately 5 KB.
In our initial, naïve implementation, the bottleneck seems to be the time required to load all the entities (~40 seconds for 2,000 entities), while the time required to perform the computation itself is very small (<1 second).
We have tried several strategies to speed up entity retrieval:
Splitting the retrieval request into several parallel instances and then merging the result: ~20 seconds for 2000 entities.
Storing the entities at an in-memory cache placed on a resident backend: ~5 seconds for 2000 entities.
The result needs to be computed dynamically, so precomputing it at write time and storing the result does not work in our case.
We are hoping to be able to retrieve ~2000 entities in just under one second. Is this within the capability of GAE/J? Any other strategies that we might be able to implement for this kind of retrieval?
UPDATE: Supplying additional information about our use case and parallelization result:
We have more than 200,000 entities of the same kind in the datastore, and the operation is retrieval-only.
We experimented with 10 parallel worker instances, and a typical result that we obtained could be seen in this pastebin. It seems that the serialization and deserialization required when transferring the entities back to the master instance hampers the performance.
UPDATE 2: Giving an example of what we are trying to do:
Let's say we have a StockDerivative entity that needs to be analyzed to know whether it's a good investment or not.
The analysis requires complex computations based on many factors, both external (e.g. the user's preferences, market conditions) and internal (i.e. the entity's properties), and outputs a single "investment score" value.
The user can request that the derivatives be sorted by their investment score and ask to be presented with the N highest-scored derivatives.
200,000 entities at 5 KB each is about 1 GB. You could keep all of this in memory on the largest backend instance, or use multiple instances. This would be the fastest solution; nothing beats memory.
Do you need the whole 5kb of each entity for computation?
Do you need all 200k entities when querying before computation? Do queries touch all entities?
Also, check out BigQuery. It might suit your needs.
Use Memcache. I cannot guarantee that it will be sufficient, but if it isn't, you probably have to move to another platform.
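If you go that route, batch the cache calls so the 2,000 entities are moved in a handful of round trips; a minimal sketch against the Java Memcache API, where the wrapper class and string keys are assumptions:

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

public class EntityCache {
    private final MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

    // Write a batch of serializable entity snapshots in a single round trip.
    public void cacheBatch(Map<String, ?> entitiesByKey) {
        cache.putAll(new HashMap<Object, Object>(entitiesByKey));
    }

    // Read a whole batch in one round trip; keys that were evicted are simply
    // absent from the result and must be re-read from the datastore.
    public Map<Object, Object> readBatch(Collection<String> keys) {
        return cache.getAll(new ArrayList<Object>(keys));
    }
}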
This is very interesting, but yes, it's possible, and I've seen some mind-boggling results.
I would have done the same: a map-reduce approach.
It would be great if you could provide more metrics on how many parallel instances you use and what the results of each instance are.
Also, does your process include retrieval alone, or retrieval and storing?
How many elements do you have in your datastore? 4,000? 10,000? The reason I ask is that you could cache them from the previous request.
Regards
In the end, it does not appear that we could retrieve >2000 entities from a single instance in under one second, so we are forced to use in-memory caching placed on our backend instance, as described in the original question. If someone comes up with a better answer, or if we found a better strategy/implementation for this problem, I would change or update the accepted answer.
Our solution involves periodically reading entities in a background task and storing the result in a json blob. That way we can quickly return more than 100k rows. All filtering and sorting is done in javascript using SlickGrid's DataView model.
As someone has already commented, MapReduce is the way to go on GAE. Unfortunately the Java library for MapReduce is broken for me, so we're using a non-optimal task to do all the reading, but we're planning to get MapReduce going in the near future (and/or the Pipeline API).
Mind that, last time I checked, the Blobstore wasn't returning gzipped entities > 1 MB, so at the moment we're loading the content from a compressed entity and expanding it into memory; that way the final payload gets gzipped. I don't like that, it introduces latency, and I hope they fix the issues with GZIP soon!

Java synchronization options for preventing duplicate orders (file, db locking?)

I have two use cases for placing an order on a website. One is submitted directly from a web front end with a credit card, and the other is a notification of an external payment from a processor like PayPal. In both situations, I need to ensure that the order is only placed once.
I would like to use the same mechanism for both scenarios if possible, to help with code reuse. In the first use case, the user can submit the order form multiple times, resulting in different threads trying to place an order. I can use AJAX to stop this, but I need a server-side solution for certainty. In the second use case, the notification messages may be sent through in duplicate, so I need to protect against that too.
I want the solution to be scalable across a distributed environment, so a memory lock is out of the question. I was looking at saving a unique token to the database to prevent multiple submissions there, but I really don't want to be messing with the existing database transactions. The real solution, it seems, is to lock on something external, like a file in a shared location across JVMs.
All orders have a unique long id, so I could use that to synchronize. What would be the best way of doing this? I could potentially create a file per id, or do something fancier with a region of the file. However I don't have much experience with file locking, so if there is a better option I would love to hear it. Any code samples would help very much.
If you already have a unique long id, nothing beats a simple database table with manually assigned primary keys. Every RDBMS (and also key-value NoSQL databases) will effectively and efficiently detect primary key clashes. It is basically:
1. Start a transaction
2. INSERT INTO orders VALUES (your_unique_id)
3. Commit
Depending on the database, step 2 or 3 will throw an exception which you can easily catch.
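In JDBC terms, a hedged sketch of that idea; the table and column names are made up, and most drivers report duplicate keys with an SQLState starting with 23:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OrderGate {
    // Returns true if this call placed the order, false if the same id had
    // already been inserted by another request (i.e. a duplicate submission).
    public static boolean tryPlaceOrder(Connection conn, long orderId) throws SQLException {
        // Assumes a table like orders(id BIGINT PRIMARY KEY, ...); the primary key does the work.
        try (PreparedStatement ps = conn.prepareStatement("INSERT INTO orders (id) VALUES (?)")) {
            ps.setLong(1, orderId);
            ps.executeUpdate();
            return true;
        } catch (SQLException e) {
            // SQLState class 23 = integrity constraint violation (duplicate primary key).
            if (e.getSQLState() != null && e.getSQLState().startsWith("23")) {
                return false;
            }
            throw e;
        }
    }
}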
If you really want to avoid databases (could you elaborate a little bit more why?), you can:
Use file locking (nasty and not scalable); don't go that way.
Use in-memory locking with clustering (with Terracotta it's like working with a normal boolean that is magically clustered).
Queue requests and have only a single consumer.
Using JMS and a single-threaded consumer looks promising; however, you still have to detect duplicates (but at least you avoid concurrently placed orders), and it might be terribly slow...
