Many-to-many relationship using Objectify? - java

I am moving my application from a relational DB to Objectify / Google App Engine.
The application has a relationship which is modelled as follows:
One Message can be sent to many Users. Each User can have many Messages addressed to them.
I need to be able to scan for all Messages addressed to a particular User.
How do I do this with Objectify?

There are a number of ways to do it:
1. You can save a list of messages in the user object. This works nicely with your requirement to get all messages addressed to a user, since there is no need to run a query.
2. You can save a list of users in the message object. To get all the messages addressed to a single user, run a query.
3. You can save BOTH lists above. Remember, in App Engine there is usually no need to normalize or worry about disk space and duplicates. Almost always build your structure so that queries will be fast.
4. You can forget about lists and have Relationship entities, just like a join table in a relational database. This can still be a decent option in App Engine in some use cases, for example when the lists are just too big (thousands of items) and would bloat your objects or not even be queryable.
The most important factor in choosing an approach for the query you specified is how many messages will usually be addressed to a single user, and whether there is a maximum. If we are talking about an average of dozens or fewer and a maximum in the hundreds, a list of messages in the user object sounds like a good option (see the sketch below). If we are talking about more, and especially if the number is unbounded, it won't work so well and you will need an actual query.
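For example, a minimal sketch of option 1 in Objectify (entity and field names are illustrative, not from the question):
import java.util.ArrayList;
import java.util.List;
import com.googlecode.objectify.Key;
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;

@Entity
public class User {
    @Id String name;
    List<Key<Message>> inbox = new ArrayList<>(); // keys of messages addressed to this user
}

// "All messages addressed to a user" is then a batch get, not a query:
// Map<Key<Message>, Message> messages = ofy().load().keys(user.inbox);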

Beyond the answers already posted, I would suggest that you not include links from the User to the Messages, for three reasons:
1. Collections in GAE are hard-limited to 5,000 items. As soon as your user's inbox exceeds 5,000 items, your app will start throwing exceptions.
2. There is a performance cost to expanding the quantity of data in an entity: loading a batch of 500KB entities is slower than loading a batch of 5KB entities. Your usage of memcache will also be less effective, since you can fit fewer items in the same space, and User objects tend to get loaded a lot.
3. You can easily hit the transaction rate limit for a single entity (about 1/s). If 50 people send you a message at the same time, you will have massive contention as all 50 writers retry after optimistic-concurrency failures.
If you can live with a limit of 5,000 recipients for a single message, storing the Set of destination keys in the Message (and indexing this set so you can query for all messages of a user) is probably a great solution; a sketch follows below. There is almost certainly an advantage to assigning the message an @Parent of the sender.
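For example, a hedged sketch of that layout in Objectify (field names are illustrative):
import java.util.Set;
import com.googlecode.objectify.Key;
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import com.googlecode.objectify.annotation.Index;
import com.googlecode.objectify.annotation.Parent;

@Entity
public class Message {
    @Id Long id;
    @Parent Key<User> sender;          // parent = the sending user, as suggested
    @Index Set<Key<User>> destination; // indexed so the inbox can be queried
    String text;
}

// All messages addressed to one user (matches any element of the set):
// List<Message> inbox = ofy().load().type(Message.class)
//         .filter("destination", userKey).list();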
If you are Twitter-like and expect a message to have more than 5,000 recipients, or if your messages typically have a lot of recipients (so the message entity gets bloated), you may wish to consider the Relation Index Entity pattern that Brett Slatkin described in his Google I/O talk from 2009: https://www.youtube.com/watch?v=AgaL6NGpkB8

You have to maintain the relationship on your own, because depending on the application it may make sense to let users exist without messages, or even the opposite.
The approach suggested by the Objectify wiki (https://code.google.com/p/objectify-appengine/wiki/IntroductionToObjectify, Multi-Value Relationship) is to keep a collection (or array) of keys:
import com.googlecode.objectify.Key;
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;

@Entity
public class Message {
    @Id String timeStamp;
    Key<User>[] destination; // keys of the users this message was sent to
}

@Entity
public class User {
    @Id String name;
    Key<Message>[] inbox; // keys of the messages addressed to this user
}
Then, if you want to remove all of a user's messages when the user is removed, just delete them from the datastore before deleting the user. The procedure is exactly the same if you want to add a new message for a particular user.
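A rough sketch of maintaining both sides when a message is sent (a sketch only, assuming List-based fields rather than arrays for easier appends; ofy() is Objectify's static entry point):
// Save the message first so it has a key, then append that key to each
// recipient's inbox. Note this is not transactional across entity groups.
void send(Message message, List<User> recipients) {
    Key<Message> messageKey = ofy().save().entity(message).now();
    for (User recipient : recipients) {
        recipient.inbox.add(messageKey);
    }
    ofy().save().entities(recipients);
}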

Related

Collection processing or database request? Which one is better

This is my first post on stackoverflow, so please be nice to me :-)
So let me explain the context. I'm developing a web service with a standard layered architecture (resources, services, DAO layer...). I use JPA with the Hibernate implementation to map my object model to the database.
For a parent class A and a child class B, most of the time when I want to find an object B in the collection, I use the Stream API to filter the collection based on what I want. My question here is more general: is it better to search for an object by querying the database (from my point of view this will cause a lot of calls to the database but use less CPU), or to do the opposite and search over the object model, processing the collection in memory (fewer database calls, but more CPU)?
If you consider latency, the database will always be slower.
So you have to ask yourself some questions:
How far away is the database (latency)?
How big is the dataset?
How do I process it?
Do I have any major runtime issues?
As for "from my point of view this will cause a lot of calls to the database but use less CPU, or do the opposite by searching over the object model and processing the collection (fewer database calls, but more CPU)":
Your program is probably just not written very efficiently; I suggest you check the big-O complexity of your processing if you have any major runtime issues.
Your question is very broad, so it's hard to say which approach would be best for your use case.
Use the database to return the data you need, and Java to perform processing on it that would be complicated to do in a JPQL/SQL query.
Databases are designed to perform queries more efficiently than Java (stream or no stream).
Besides, fetching a lot of data from the database only to keep part of it is not efficient.
The database is usually faster, since it is optimized for retrieving specific data; one would usually add indexes to speed up querying on certain fields.
TL;DR: Filter your data in the database and process it from Java.
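To make the comparison concrete, here is a minimal sketch of both approaches (classes A and B come from the question; the status field and its values are hypothetical):
import java.util.List;
import java.util.stream.Collectors;
import javax.persistence.EntityManager;

class ChildLookup {
    // In memory: load the whole collection, then filter with the Stream API.
    List<B> filterInMemory(A parent) {
        return parent.getChildren().stream()
                .filter(b -> "ACTIVE".equals(b.getStatus()))
                .collect(Collectors.toList());
    }

    // In the database: let JPQL return only the matching rows.
    List<B> filterInDb(EntityManager em, A parent) {
        return em.createQuery(
                "select b from B b where b.parent = :p and b.status = :s", B.class)
                .setParameter("p", parent)
                .setParameter("s", "ACTIVE")
                .getResultList();
    }
}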
This isn't an easy question to answer, since there are many different factors that would influence my decision to go to the db or not. First, I think it's fair to say that, for almost every app I've worked on in the past 20 years, hitting the DB for information is the default strategy. More recently (say past 10 or so years) data access through web service calls has become common as well.
For me, the main question would be something along the lines of, "Are there any situations when I would not hit an external resource (DB, Service, or even file read) for data every time I need it?"
So, I'll outline some of the things I would consider.
Is the data search space very small?
If you are searching a data space of tens of different records, then this information might be a candidate for non-DB storage. On the other hand, once you get past a fairly small set of records, this approach becomes increasingly untenable. Examples of these "small sets" might be something like salutations (Mr., Ms., Dr., Mrs., Lord). I look for small sets of data that rarely change, which I, as a lazy developer, wouldn't mind typing into a configuration file. Once I get past something like 50 different records (US states, for example), I want to pull that info from a DB or service call.
Are the data cacheable?
If you have multiple requests that could legitimately use the exact same data, then leverage caching in your application. Examine the data and expected usage of your service for opportunities to leverage regularities in data and likely requests to cache data whenever possible. Remember to consider cache keys, how long items should be cached, and when cached items should be evicted.
In many web usage scenarios, it's not uncommon that each display includes a fairly large amount of cached information and a small amount of dynamic data. Menus and other navigation items are good candidates for caching. User-specific data, such as contract-specific pricing in an eCommerce app, are often poor candidates.
Can you pre-load some data into cache?
Some items can be read once and cached for the entire duration of your application. A list of US States and/or Canadian Provinces is a good example here. These almost never change, so once read from the db, you would rarely need to read them again. Consider application components that can load such data on startup, and then hold this data in an appropriate collection.
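A minimal sketch of such a startup preload (all names are illustrative; loadStatesFromDb stands in for your real DAO call):
import java.util.Collections;
import java.util.Map;

public final class StateCache {
    // Read once at startup, then held for the lifetime of the application.
    private static volatile Map<String, String> states = Collections.emptyMap();

    public static void preload() {
        states = Collections.unmodifiableMap(loadStatesFromDb()); // single DB read
    }

    public static Map<String, String> states() {
        return states;
    }

    private static Map<String, String> loadStatesFromDb() {
        // Hypothetical DAO call; wire up your real data access here.
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}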

Overcoming 30 sub-query limit for google datastore

Google Datastore started off looking so good and has become so frustrating, but maybe it's just that I'm used to relational databases. I'm pretty new to Datastore and NoSQL in general and have done a ton of research, but can't seem to find a solution to this problem.
Assume I have a User class that looks like this
@Entity
class User {
    @Id Long id;
    String firstName, lastName;
    List<Key<User>> friends; // keys of this user's friends
}
I have another class that will model Events that users have done like so
@Entity
class Event {
    @Id Long id; // id added here; a datastore entity needs one
    Key<User> user;
    Date eventTime;
    List<Key<User>> receivers;
}
and now what I'm trying to do is query for events that my friends have done.
In the usual relational way I would say:
select * from Event where user in (select friends from User where id = ?)
Taking that as a starting point I tried doing
// Key<User> userKey = ...
User user = ofy().load().key(userKey).now();
List<Key<User>> friends = user.getFriends();
List<Event> events = ofy().load().type(Event.class)
        .filter("user in", friends).order("-eventTime").list();
But then I heard about the 30-sub-query limit, which makes this unsustainable, since I assume eventually someone will have more than 30 friends; not to mention that using an 'in' clause guarantees you cannot get a cursor to continue loading events. I've done so much research and tried so many options, but I have yet to find a good way to approach this problem except to say "why Google, why."
Things I've considered :
Add an extra field in Event that is a copy of the user's friend list, and use a single equality filter on the multi-valued property to find events (extremely wasteful, since there may be many, many events).
Split the event query up into batches of 30 friends at a time, somehow ensure continued retrieval from a synthetic cursor based on time, and merge the results (problem: way too many edge cases, and it makes reading events very difficult).
I would really appreciate any input you could offer since I am 100% out of ideas
TL;DR ~ GAE has a limit on how many items an in-clause can handle, and fml.
You come from a relational database background, so the concept of denormalization is probably a bit painful - I know it was for me.
Right now you have a single table that contains all events from all users. This approach works well in relational databases but is a nightmare in the datastore for the reasons you named.
So to solve this concrete problem you could restructure your data as follows:
All users have two timelines. One for their own posts and one from friends' posts. (There could be a third timeline for public stuff.)
When a new event is published, it is written to the timeline of the user who created it, and to the timelines of all the receiving users. (You may want to add references to the third-party timeline entries in the user's own timeline, so you know what to delete when the user decides to delete an event.)
Now every user has access to complete timelines: his/her own, and the one built from third-party events. Those timelines are easy to query and will not require sub-selects at all.
There are downsides to this approach:
Writing costs are higher: you have to write far more timeline entries than before. You will probably want to do this in a task queue so you have enough time to write to all those timelines.
You are using a lot more storage, BUT storage is really cheap; I'm guessing the storage will be cheaper than running expensive queries in the long run.
What you get in return, thanks to this denormalization, is lightning-fast responses with simple queries. All that remains is to merge the responses from the different timelines in the UI (you could do it on the server side, but I would do it in the UI).
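A hedged sketch of that fan-out-on-write structure in Objectify (entity and field names such as TimelineEntry are illustrative, not from the answer):
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import com.googlecode.objectify.Key;
import com.googlecode.objectify.annotation.*;

@Entity
class TimelineEntry {
    @Id Long id;
    @Parent Key<User> owner; // each user owns their own timeline
    @Index Date eventTime;
    Key<Event> event;        // reference back to the shared Event entity
}

// Fan-out on write; as the answer suggests, run this from a task queue.
void fanOut(Key<Event> eventKey, Date eventTime, List<Key<User>> owners) {
    List<TimelineEntry> entries = new ArrayList<>();
    for (Key<User> owner : owners) {
        TimelineEntry entry = new TimelineEntry();
        entry.owner = owner;
        entry.eventTime = eventTime;
        entry.event = eventKey;
        entries.add(entry);
    }
    ofy().save().entities(entries);
}

// Reading a user's timeline is then a single ancestor query, no sub-selects:
// List<TimelineEntry> timeline = ofy().load().type(TimelineEntry.class)
//         .ancestor(userKey).order("-eventTime").limit(50).list();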

Modeling a system with users who can send messages to each other in GAE datastore

I was wondering whether you could help me work out a way to model the following in a GAE datastore such that it is scalable and can be updated frequently. I thought I had a solution, which I expressed in this question, but whilst waiting for replies I realised it might be overly complicated. I have explained below why I have kept it as a separate question.
Problem:
Building a system with users who can send many messages to each other. Each user must be able to retrieve their messages - like online chat. I would like to avoid contention when a user receives many messages over a short time.
Solution 1:
As mentioned here, I am wondering whether a sharded list can be used to implement this. By this I mean having messages stored as entity objects, with sender and receiver storing the keys of these objects (the messages sent between them) in a list. I thought of sharding because a user who receives many messages would have to update the list frequently, and a sharded approach could prevent datastore contention.
Problem: what happens when the list of keys to, say, a user's received messages gets large? Won't appending to it become slow? I could split the list over several entities, but this would take careful thought about allocation schemes and retrieval. I am willing to do this if it is the best way.
Alternative approach:
Store messages as entity objects (as above), but this time have them store properties which are indexed (date, from, to, etc.). Retrieve messages for a user using queries (date greater than..., from=..., etc.). This could work well, but I worry: will the indexes degrade as they grow extremely large, with many users sending many messages? It seems like it will degrade into an SQL-like system.
Any thoughts?
I have read about how to model complex relations in the GAE docs, but they use Python for the examples and I am having trouble abstracting the overall design pattern.
Many thanks to anyone with input on this
PS: at the moment I am using the low-level datastore API directly.
I have created a system similar to this before. The way I chose to implement it was to create a Conversation entity that is the parent of many Message entities. A conversation had two participants (although you could allow more), each of which was the key of a User entity.
Something like this (assuming Objectify):
@Entity
public class Conversation {
    @Id Long id;
    @Index Key<User> participant1;
    @Index Key<User> participant2;
    @Index String participant1ExternalId;
    @Index String participant2ExternalId;
}

@Entity
public class Message {
    @Id Long id;
    @Parent Ref<Conversation> conversation;
    @Index String senderExternalId;
    @Index String recipientExternalId;
    String message;
}
In this way, you can query all conversations for a participant in an eventually consistent fashion, and all messages received or sent (or both) for a conversation in a strongly consistent fashion. I had an extra requirement that users not be able to identify each other, so messaging used generated UUIDs (the externalId properties).
This way, sharding and the 1 write/sec limit apply at the conversation level. You could put unread counters onto the Conversation object for each user, or on each message if you need to (contention-wise it makes no real difference, so do whatever makes most sense).
If your users regularly exceed 1 message per second per conversation, you'll have a lot of other problems to solve beyond datastore contention, so it's probably a good starting point. In the general case, eventual consistency works very well for this sort of operation (i.e. checking for new messages), so you can lean heavily on that.
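For example, the two kinds of reads described above might look like this (a sketch assuming the entities shown and Objectify's ofy() entry point):
// Eventually consistent: all conversations in which a user participates.
// (Repeat for "participant2", or store a single indexed list of participants.)
List<Conversation> conversations = ofy().load().type(Conversation.class)
        .filter("participant1", userKey).list();

// Strongly consistent: all messages of one conversation, via an ancestor query.
List<Message> messages = ofy().load().type(Message.class)
        .ancestor(Key.create(Conversation.class, conversationId)).list();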

Building a sharded list in Google App Engine

I am looking for a good design pattern for sharding a list in Google App Engine. I have read about and implemented sharded counters as described in the Google docs here, but I am now trying to apply the same principle to a list. Below is my problem and a possible solution - please can I get your input?
Problem:
A user on my system could receive many messages, kind of like an online chat system. I'd like the server to record all incoming messages (they will contain several fields - from, to, etc.). However, I know from the docs that updating the same entity group too often can cause an exception due to datastore contention. This could happen when one user receives many messages in a short time, causing his entity to be written many times. So what about abstracting the sharded counter example above:
Define, say, five entities/entity groups.
For each message to be added, pick one entity at random, append the message to it, and write it back to the store.
To get the list of messages, read all the entities in and merge.
OK, some questions on the above:
Most importantly, is this the best way to go about things, or is there a more elegant/more efficient design pattern?
What would be an efficient way to filter the list of messages by one of the fields, say everything after a certain date?
What if I require a sharded set instead? Should I read in all entities and check whether the new item already exists on every write? Or just add it as above and then remove duplicates whenever the next request comes in to read?
Why would you want to put all messages in one entity group?
If you don't specify an ancestor, you won't need sharding, but the end user might see some lag when querying the messages due to eventual consistency.
It depends on whether that is an acceptable tradeoff.
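A minimal sketch of that ancestor-free layout in Objectify (entity and field names are illustrative):
import java.util.Date;
import com.googlecode.objectify.Key;
import com.googlecode.objectify.annotation.*;

@Entity
class ChatMessage {
    @Id Long id;          // root entity: no @Parent, so no shared entity group
    @Index Key<User> to;
    @Index Key<User> from;
    @Index Date sent;
    String text;
}

// Messages for one user after a certain date (eventually consistent):
// List<ChatMessage> inbox = ofy().load().type(ChatMessage.class)
//         .filter("to", userKey).filter("sent >", cutoff)
//         .order("sent").list();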

Google app engine: Poor Performance with JDO + Datastore

I have a simple data model that includes
USERS: store basic information (key, name, phone # etc)
RELATIONS: describe, e.g. a friendship between two users (supplying a relationship_type + two user keys)
COMMENTS: posted by users (key, comment text, user_id)
I'm getting very poor performance, for instance, if I try to print the first names of all of a user's friends. Say the user has 500 friends: I can fetch the list of friend user_ids very easily in a single query. But then, to pull out first names, I have to do 500 back-and-forth trips to the Datastore, each of which seems to take on the order of 30 ms. If this were SQL, I'd just do a JOIN and get the answer out fast.
I understand there are rudimentary facilities for performing two-way joins across un-owned relations in a relaxed implementation of JDO (as described at http://gae-java-persistence.blogspot.com) but they sound experimental and non-standard (e.g. my code won't work in any other JDO implementation).
Worse yet, what if I want to pull out all the comments posted by a user's friends? Then I need to go from User --> Relation --> Comments, i.e. a three-way join, which isn't even supported experimentally. The overhead of 500 round trips to get a friend list + another 500 trips to see if there are any comments from a user's friends is already enough to push the runtime over 30 seconds.
How do people deal with these problems in real-world datastore-backed JDO applications? (Or do they?)
Has anyone managed to extract satisfactory performance from JDO/Datastore in this kind of (very common) situation?
-Bosh
First of all, for objects that are frequently accessed (like users), I rely on memcache. This should speed up your application quite a bit.
If you have to go to the datastore, the right way to do this should be through getObjectsById(). Unfortunately, it looks like GAE doesn't optimize this call. However, a contains() query on keys is optimized to fetch all the objects in one trip to the datastore, so that's what you should use:
// Batch fetch by key: executes as a single round trip to the datastore.
List<Key> myFriendKeys = fetchFriendKeys();
Query query = pm.newQuery(User.class, ":p.contains(key)");
List<User> friends = (List<User>) query.execute(myFriendKeys);
You could also rely on the low-level API get() that accepts multiple keys, or do as I did and use Objectify.
A totally different approach would be to use an equality filter on a list property. This matches if any item in the list matches. So if you have a friendOf list property in your User entity, you can issue a single query friendOf == theUser; a sketch follows below. You might want to check this: http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
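A sketch of that list-property filter, written in Objectify style for brevity (the answer itself uses JDO; the friendOf property comes from the answer, the rest is illustrative):
import java.util.List;
import com.googlecode.objectify.Key;
import com.googlecode.objectify.annotation.*;

@Entity
class User {
    @Id Long id;
    String firstName, lastName;
    @Index List<Key<User>> friendOf; // the users this user is a friend of
}

// One query, no joins: every user whose friendOf list contains theUser,
// i.e. all of theUser's friends.
// List<User> friends = ofy().load().type(User.class)
//         .filter("friendOf", theUserKey).list();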
You have to minimize datastore reads. That must be a huge focus for any GAE project - anything else will cost you. To do that, pre-calculate as much as you can, especially often-read information. To solve the issue of reading 500 friends' names, consider that you'll likely change the friend list far less often than you read it, so on each change, store all the names in a structure you can read with one get.
If you absolutely cannot, then you have to tweak each case by hand, e.g. use the low-level API to do a batch get.
Also, optimize for speed rather than data size. Use extra structures as indexes; save objects in multiple ways so you can read them as quickly as possible. Data is cheap, CPU time is not.
Unfortunately Phillipe's suggestion
Query query = pm.newQuery(User.class, ":p.contains(key)");
is only optimized to make a single query when searching by primary key. Passing in a list of ten non-primary-key values, for instance, gives the following trace:
(screenshot of the query trace: http://img293.imageshack.us/img293/7227/slowquery.png)
I'd like to be able to bulk-fetch comments, for example, from all of a user's friends. If I do store a List on each user, this list can't be longer than 1000 elements (if it's an indexed property of the user), as described at: http://code.google.com/appengine/docs/java/datastore/overview.html
Seems increasingly like I'm using the wrong toolset here.
-B
Facebook has 28 terabytes of memory cache... However, making 500 trips to memcache isn't very cheap either. It can't be used to store a gazillion small items. "Denormalization" is the key. Such applications do not need to support ad-hoc queries; compute and store the results directly for the few supported queries.
In your case, you probably have just one type of query - return the data of this, that, and the others that should be displayed on a user page. You can precompute this big ball of mess, so that later a single query based on userId can fetch it all.
When userA makes a comment to userB, you retrieve userB's big ball of mess, insert userA's comment into it, and save it.
Of course, there are a lot of problems with this approach. Giant internet companies probably don't have a choice; generic query engines just don't cut it. But for others? Wouldn't you be happier if you could just use the good old RDBMS?
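A crude sketch of that precompute-on-write idea (entity and field names are purely illustrative):
import com.googlecode.objectify.annotation.*;

@Entity
class UserPage {
    @Id Long userId; // one precomputed entity per user
    byte[] pageBlob; // the serialized "big ball of mess" the user page needs
}

// When userA comments on userB: load userB's UserPage, merge the new comment
// into pageBlob, and save it back. A page read is then one get by userId.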
If it is a frequently used query, you can consider preparing indexes for it.
http://code.google.com/appengine/articles/index_building.html
The indexed property limit has since been raised to 5,000.
However, you can go even higher than that using the method described in http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
Basically, have a bunch of child entities for the User called UserFriends, splitting the big list and raising the limit to n*5000, where n is the number of UserFriends entities; see the sketch below.
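A hedged sketch of that splitting pattern, in Objectify style (the UserFriends name comes from the answer; the fields are illustrative):
import java.util.List;
import com.googlecode.objectify.Key;
import com.googlecode.objectify.annotation.*;

@Entity
class UserFriends {
    @Id Long id;
    @Parent Key<User> owner;         // the chunks hang off the User
    @Index List<Key<User>> friendOf; // up to 5000 keys per chunk
}

// The equality filter is unchanged; each matching chunk's @Parent
// identifies the user it belongs to:
// List<UserFriends> chunks = ofy().load().type(UserFriends.class)
//         .filter("friendOf", theUserKey).list();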
