Google App Engine / Objectify Soft Delete - java

I am using Objectify for my DAO layer on GAE and I want to make most of my entities soft-delete-able. Is it a good idea to make these entities extend a parent with an isActive boolean, should I use an embedded class, or should I just make it an interface, IsSoftDeleteable?
The reason I am asking is that Objectify seems to store entities with the same parent class in the same entity kind (at least from what I see in _ah/admin), and it may slow down queries when everything is under the same entity kind, maybe?
What is the best way, or is there a better way, to do soft delete on GAE?
Please advise, and thanks in advance!

There is no single right answer to this question. The optimal solution chiefly depends on what percentage of your entities are likely going to be in deleted state at any given time.
One option is to store a field like @Index(IfTrue.class) boolean active; and add this filter to all queries:
ofy.load().type(Thing.class).filter("size >", 20).filter("active", true)
The downside of this is that it requires adding extra indexes - possibly several because you may now need multi-property indexes where single-property indexes would have sufficed.
Alternatively, you can store a 'deleted' flag and manually exclude deleted entities from query results. Fewer indexes to maintain, but it adds extra overhead to each query as you pull back records you don't want. If your deleted entries are sparse, this won't matter.
One last trick: you might find it best to index a deleted date, since it's probably most useful: @Index Date deleted; This lets you filter("deleted", null) to get the active items and also lets you filter by datestamp to find really old entities that you may wish to purge. However, be aware that this will cause the deleted date to appear in any multi-property indexes, possibly significantly increasing index size if you have a high percentage of deleted entities. In that case, you may wish to use @Index(IfNull.class) Date deleted; and map-reduce to purge sufficiently old entities.
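As a sketch, an entity following that last variant could look like this (Objectify annotations; Thing and its fields are hypothetical, not from the question):

```java
import java.util.Date;

import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import com.googlecode.objectify.annotation.Index;
import com.googlecode.objectify.condition.IfNull;

@Entity
public class Thing {
    @Id Long id;
    @Index long size;

    // Indexed only while null: active entities stay queryable via
    // filter("deleted", null); deleted ones drop out of the index and
    // are purged later by a map-reduce job instead of a query.
    @Index(IfNull.class) Date deleted;

    public boolean isActive() { return deleted == null; }
    public void softDelete() { deleted = new Date(); }
}
```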

I agree with StickFigure's answer. Take advantage of the difference between an "empty" index and a "null" index. The tradeoff is that each write will incur more datastore write operations: when you add an index, that's at least 2 additional write ops (for the ascending and descending indexes) every time you update that value, and when you remove the index entry, it's 2 more writes. Personally, I think this is worthwhile.
Query time should be fairly predictable whenever you do a query on a single property of an entity kind, because if you think about what's happening underneath the covers, you are browsing a list of items in sequential order before doing a parallel batch get of the entity data.

How to partially rebuild index in Hibernate Search 5.10?

I am working on a project where I need to use Hibernate Search, and I am going to index just one entity. It's mapped to a huge table with almost 20 million records, and more records are added to it daily, but not via the application and entity manager I am working with, so Hibernate Search can't index the new changes automatically. The problem is that rebuilding the whole index for the entity every day is going to take a long time.
I wanted to know: is there any way to keep my current index and partially rebuild the index documents for just the new changes?
If, at the end of the day, you are able to list all the entities that have been modified during the last 24 hours based on information from the database (a date/time of last change for each entity, for example), then yes, there are ways to do that.
First, you can do it "manually" by running your own Hibernate ORM query and calling FullTextSession.index on each element you want to see reindexed. You will have to do this in batches, preferably opening a transaction for each batch, if you have a large number of elements to reindex.
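That manual approach might look roughly like this in Hibernate Search 5 (a sketch; MyClass, lastChangeDateTime and the batch size of 100 are assumptions):

```java
FullTextSession fullTextSession = Search.getFullTextSession( session );
fullTextSession.setFlushMode( FlushMode.MANUAL );
fullTextSession.setCacheMode( CacheMode.IGNORE );

ScrollableResults results = fullTextSession
        .createQuery( "from MyClass e where e.lastChangeDateTime >= :since" )
        .setParameter( "since", since )
        .scroll( ScrollMode.FORWARD_ONLY );

int index = 0;
while ( results.next() ) {
    fullTextSession.index( results.get( 0 ) ); // queue this entity for reindexing
    if ( ++index % 100 == 0 ) {
        fullTextSession.flushToIndexes(); // apply the batch of index changes
        fullTextSession.clear();          // release memory held by the session
    }
}
fullTextSession.flushToIndexes();
results.close();
```

Running each batch in its own transaction, as suggested above, keeps memory usage flat for large change sets.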
Another, better option is to use the JSR352 integration, which will however require you to use a JSR352-compatible runtime (Spring Batch is not very standard-compliant and won't work; JBeret is known to work). By targeting your single entity and calling restrictedBy(Criterion) when building the parameters, you will be able to restrict the list of entities to reindex.
For example:
Properties jobProperties = MassIndexingJob.parameters()
.forEntity( MyClass.class )
.restrictedBy( Restrictions.ge( "lastChangeDateTime", LocalDateTime.now().minus( 26, ChronoUnit.HOURS ) ) ) // 26 hours to account for DST switches and other slight delays
.build();
long executionId = BatchRuntime.getJobOperator()
.start( MassIndexingJob.NAME, jobProperties );
The MassIndexer unfortunately doesn't offer such a feature yet. You can vote for the feature on ticket HSEARCH-499 and explain your use case in a comment: we'll try to prioritize features that benefit many users. And of course, you can always reach out to us to discuss how to implement this and contribute a patch :)

App Engine + Cloud Datastore performance: order in query or in memory?

Question about Google App Engine + Datastore. We have some queries with several equality filters. For this, we don't need to keep any composed index, Datastore maintains these indexes automatically, as described here.
The built-in indexes can handle simple queries, including all entities of a given kind, filters and sort orders on a single property, and equality filters on any number of properties.
However, we need the result to be sorted on one of these properties. I can do that (using Objectify) with .sort("prop") on the datastore query, which requires me to add a composite index and will make for a huge index once deployed. The alternative I see is retrieving the unordered list (max 100 entities in the resultset) and then sorting them in-memory.
Since our entity implements Comparable, I can simply use Collections.sort(entities).
My question is simple: which one is desired? And even if the datastore composite index would be more performant, is it worth creating all those indexes?
Thanks!
There is no right or wrong approach - solution depends on your requirements. There are several factors to consider:
Extra indexes take space and cost more both in storage costs and in write costs - you have to update every index on every update of an entity.
Sort on property is faster, but with a small result set the difference is negligible.
You can store sorted results in Memcache and avoid sorting them in every request.
You will not be able to use pagination without a composite index, i.e. you will have to retrieve all results every time for in-memory sort.
It depends on your definition of "desired". IMO, if you know the result set is a "manageable" size, I would just do the in-memory sort. Adding lots of indexes will have a cost impact; you can do a cost analysis first to check it.
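For the in-memory route, the sort really is a one-liner once the entity implements Comparable. A minimal, self-contained sketch (MyEntity and its single property are hypothetical):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical entity ordered by a single property, mirroring .sort("prop").
class MyEntity implements Comparable<MyEntity> {
    final String prop;
    MyEntity(String prop) { this.prop = prop; }
    @Override public int compareTo(MyEntity other) { return prop.compareTo(other.prop); }
}

class InMemorySortDemo {
    public static void main(String[] args) {
        // The equality-only query returns at most ~100 entities, unordered...
        List<MyEntity> entities = new ArrayList<>();
        entities.add(new MyEntity("banana"));
        entities.add(new MyEntity("apple"));
        entities.add(new MyEntity("cherry"));
        // ...so sort them in memory instead of maintaining a composite index.
        Collections.sort(entities);
        System.out.println(entities.get(0).prop); // prints apple
    }
}
```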

Performance of search in java list vs on database records using hibernate

Now I have a situation where I need to do some comparisons and result filtering that are not very simple. What I want is something like Lucene's search, except that I will develop it myself; it was not my decision, though I would have gone with Lucene.
What I will do is:
Find the element according to a full-word match on a certain field; if there is none, check whether the field starts with the search term, then whether it merely contains it.
Every field has its own weight according to the match type (full -> begins -> contains) and its priority to me.
After one field has matched, I will also check the weights of the other fields to compute a final total row weight.
Then I will return a Map of rows and their weights.
Now I realize that this is not easily done with Hibernate's HQL, meaning I would have to run multiple queries to achieve it.
So my question is: should I do it in Java, i.e. retrieve all records and do my calculations to find my target, or should I do it in Hibernate by executing multiple queries? Which is better in terms of performance and speed?
Unfortunately, I think the right answer is "it depends": how many words, what data structure, whether the data fits in memory, how often you have to do the search, etc.
I am inclined to think that a database is a better solution, even if Hibernate is not part of it. You might need to learn how to write better SQL. Perhaps the dynamic SQL that Hibernate generates for you isn't sufficient. Proper JOINs and indexing might make this perform nicely.
There might be a third way to consider: Lucene and indexing. I'd need to know more about your problem to decide.
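For reference, the full -> begins -> contains weighting the asker describes can be sketched in plain Java (field names and the 3/2/1 base scores are assumptions, not the asker's actual values):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WeightedMatcher {
    // Hypothetical base scores: full word match > begins-with > contains.
    static int matchScore(String field, String term) {
        if (field == null) return 0;
        String f = field.toLowerCase();
        String t = term.toLowerCase();
        if (f.equals(t)) return 3;
        if (f.startsWith(t)) return 2;
        if (f.contains(t)) return 1;
        return 0;
    }

    // Scores each row (field name -> value) against the term, multiplying by a
    // per-field priority weight; returns row -> total weight for matching rows.
    static Map<Map<String, String>, Integer> score(List<Map<String, String>> rows,
                                                   Map<String, Integer> fieldWeights,
                                                   String term) {
        Map<Map<String, String>, Integer> result = new LinkedHashMap<>();
        for (Map<String, String> row : rows) {
            int total = 0;
            for (Map.Entry<String, Integer> fw : fieldWeights.entrySet()) {
                total += matchScore(row.get(fw.getKey()), term) * fw.getValue();
            }
            if (total > 0) {
                result.put(row, total);
            }
        }
        return result;
    }
}
```

Doing this over all rows in memory is exactly the tradeoff discussed above: it only stays cheap while the full record set fits comfortably in memory.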

Google app engine: Poor Performance with JDO + Datastore

I have a simple data model that includes
USERS: store basic information (key, name, phone # etc)
RELATIONS: describe, e.g. a friendship between two users (supplying a relationship_type + two user keys)
COMMENTS: posted by users (key, comment text, user_id)
I'm getting very poor performance, for instance, if I try to print the first names of all of a user's friends. Say the user has 500 friends: I can fetch the list of friend user_ids very easily in a single query. But then, to pull out first names, I have to do 500 back-and-forth trips to the Datastore, each of which seems to take on the order of 30 ms. If this were SQL, I'd just do a JOIN and get the answer out fast.
I understand there are rudimentary facilities for performing two-way joins across un-owned relations in a relaxed implementation of JDO (as described at http://gae-java-persistence.blogspot.com) but they sound experimental and non-standard (e.g. my code won't work in any other JDO implementation).
Worse yet, what if I want to pull out all the comments posted by a user's friends. Then I need to get from User --> Relation --> Comments, i.e. a three-way join, which isn't even supported experimentally. The overhead of 500 back-and-forths to get a friend list + another 500 trips to see if there are any comments from a user's friends is already enough to push runtime >30 seconds.
How do people deal with these problems in real-world datastore-backed JDO applications? (Or do they?)
Has anyone managed to extract satisfactory performance from JDO/Datastore in this kind of (very common) situation?
-Bosh
First of all, for objects that are frequently accessed (like users), I rely on memcache. This should speed up your application quite a bit.
If you have to go to the datastore, the right way to do this should be through getObjectsById(). Unfortunately, it looks like GAE doesn't optimize this call. However, a contains() query on keys is optimized to fetch all the objects in one trip to the datastore, so that's what you should use:
List myFriendKeys = fetchFriendKeys();
Query query = pm.newQuery(User.class, ":p.contains(key)");
query.execute(myFriendKeys);
You could also rely on the low-level API get() that accept multiple keys, or do like me and use objectify.
A totally different approach would be to use an equality filter on a list property. This will match if any item in the list matches. So if you have a friendOf list property in your user entity, you can issue a single Query friendOf == theUser. You might want to check this: http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
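A sketch of that list-property approach with Objectify (the User kind and friendOf field are hypothetical): a multi-valued property is indexed per element, so an equality filter matches if any element in the list matches.

```java
@Entity
public class User {
    @Id Long id;
    String firstName;
    @Index List<Key<User>> friendOf; // keys of the users this user is a friend of
}

// One query replaces 500 individual round trips:
List<User> friends = ofy().load()
        .type(User.class)
        .filter("friendOf", theUserKey) // matches users whose list contains theUserKey
        .list();
```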
You have to minimize DB reads. That must be a huge focus for any GAE project - anything else will cost you. To do that, pre-calculate as much as you can, especially oft-read information. To solve the issue of reading 500 friends' names, consider that you'll likely be changing the friend list far less than reading it, so on each change, store all names in a structure you can read with one get.
If you absolutely cannot then you have to tweak each case by hand, e.g. use the low-level API to do a batch get.
Also, rather optimize for speed and not data size. Use extra structures as indexes, save objects in multiple ways so you can read it as quickly as possible. Data is cheap, CPU time is not.
Unfortunately Phillipe's suggestion
Query query = pm.newQuery(User.class, ":p.contains(key)");
is only optimized to make a single query when searching by primary key. Passing in a list of ten non-primary-key values, for instance, gives the following trace
(screenshot of the slow query trace: http://img293.imageshack.us/img293/7227/slowquery.png)
I'd like to be able to bulk-fetch comments, for example, from all a user's friends. If I do store a List on each user, this list can't be longer than 1000 elements long (if it's an indexed property of the user) as described at: http://code.google.com/appengine/docs/java/datastore/overview.html .
Seems increasingly like I'm using the wrong toolset here.
-B
Facebook has 28 terabytes of memory cache... However, making 500 trips to memcache isn't very cheap either. It can't be used to store a gazillion pieces of small items. "Denormalization" is the key. Such applications do not need to support ad-hoc queries; compute and store the results directly for the few supported queries.
In your case, you probably have just one type of query: return the data of this, that, and the other that should be displayed on a user page. You can precompute this big ball of mess, so that later a single query based on userId can fetch it all.
When userA makes a comment to userB, you retrieve userB's big ball of mess, insert userA's comment into it, and save it.
Of course, there are a lot of problems with this approach. For giant internet companies, they probably don't have a choice, generic query engines just don't cut it. But for others? Wouldn't you be happier if you can just use the good old RDBMS?
If it is a frequently used query, you can consider preparing indexes for the same.
http://code.google.com/appengine/articles/index_building.html
The indexed property limit is now raised to 5000.
However you can go even higher than that by using the method described in http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
Basically just have a bunch of child entities for the User called UserFriends, thus splitting the big list and raising the limit to n*5000, where n is the number of UserFriends entities.
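A sketch of that child-entity split (Objectify-style; the UserFriends kind and field names are hypothetical):

```java
// Shard the big list across child entities so each chunk stays
// under the per-entity indexed-property limit.
@Entity
public class UserFriends {
    @Id Long id;
    @Parent Key<User> owner;          // each chunk is a child of its User
    @Index List<Key<User>> friendOf;  // up to the per-entity limit of keys
}

// The equality filter still works in a single query, because every chunk is indexed:
List<UserFriends> chunks = ofy().load()
        .type(UserFriends.class)
        .filter("friendOf", theUserKey)
        .list();
```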

Best practice to realize a long-term history-mode for a O/RM system(Hibernate)?

I have mapped several Java classes like Customer, Assessment, Rating, ... to a database with Hibernate.
Now I am thinking about a history mode for all changes to the persistent data. The application is a web application. In the case of deleting (or editing) data, another user should have the possibility to see the changes and undo them. Since the changes are outside the scope of the current session, I don't know how to solve this with something like the Command pattern, which is recommended for undo functionality.
For single-value editing, an approach like the one in this question sounds OK. But what about the deletion of a whole persistent entity? The simplest way is to create a flag in the table marking whether the customer is deleted or not. The most complex way is to create a table for each class where deleted entities are stored. Is there anything in between? And how can I integrate these two things into an O/RM system (in my case Hibernate) comfortably, without messing around too much with SQL (which I want to avoid for portability reasons), while still having enough flexibility?
Is there a best practice?
One approach to maintaining audit/undo trails is to mark each version of an object's record with a version number. Finding the current version would be painful if this were a simple incrementing version number, so a reverse version numbering works best: version 0 is always the current one, and on an update the version numbers of all previous versions are incremented. Deleting an object is done by incrementing the version numbers on the current records and not inserting a new one at 0.
Compared to an attribute-by-attribute approach this makes for far simpler rollbacks or historic version views, but it does take more space.
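A toy in-memory illustration of the reverse-numbering idea (not Hibernate code; the structure is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Version 0 is always current; every update shifts older versions down.
class VersionedRecord {
    static class Version {
        int version;       // 0 == current
        final String data;
        Version(int v, String d) { version = v; data = d; }
    }

    final List<Version> versions = new ArrayList<>();

    void update(String newData) {
        for (Version v : versions) v.version++; // shift history down
        versions.add(new Version(0, newData));  // insert the new current version
    }

    void delete() {
        for (Version v : versions) v.version++; // no version 0 left => deleted
    }

    String current() {
        for (Version v : versions) if (v.version == 0) return v.data;
        return null; // object is deleted
    }
}
```

Note how "find current" is always a lookup for version 0, regardless of how long the history is.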
One way to do it would be to have a "change history" entity with properties for the id of the entity changed, the action (edit/delete), the property name, the original value and the new value, and maybe also a reference to the user performing the edit. A deletion would create entries for all properties of the deleted entity with action "delete".
This entity would provide enough data to perform undos and viewing of change history.
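Such an audit record might be modeled like this (a plain-Java sketch; all names are hypothetical):

```java
import java.util.Date;

// One record per changed property; a deletion produces one record
// per property of the deleted entity with action DELETE.
class ChangeHistory {
    enum Action { EDIT, DELETE }

    Long entityId;        // id of the entity that was changed
    Action action;
    String propertyName;
    String originalValue; // what an undo restores
    String newValue;
    String changedBy;     // user performing the edit
    Date changedAt = new Date();

    // Undo simply writes the original value back to the entity.
    String undoValue() { return originalValue; }
}
```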
Hmm, I'm looking for an answer to this too. So far the best I've found is the www.jboss.org/envers/ framework, but even that seems like more work than should be necessary.
