Google Datastore started off looking so good and has become so frustrating, but maybe it's just that I'm used to relational databases. I'm pretty new to Datastore and NoSQL in general, and I've done a ton of research but can't seem to find a solution to this problem.
Assume I have a User class that looks like this
@Entity
class User {
    @Id Long id;
    String firstName, lastName;
    List<Key<User>> friends;
}
I have another class that will model Events that users have done like so
@Entity
class Event {
    @Id Long id; // every entity needs an id
    Key<User> user;
    Date eventTime;
    List<Key<User>> receivers;
}
and now what I'm trying to do is query for events that my friends have done.
In the usual relational way I would write:
select * from Event where user in (select friends from User where id = ?)
Taking that as a starting point I tried doing
// Key<User> userKey = ...
User user = ofy().load().key(userKey).now();
List<Key<User>> friends = user.getFriends();
List<Event> events = ofy().load().type(Event.class)
        .filter("user in", friends)
        .order("-eventTime")
        .list();
But I've heard about the 30 sub-query limit, which makes this unsustainable since I assume someone will eventually have more than 30 friends. On top of that, using an 'in' clause guarantees that you cannot get a cursor to continue loading events. I've done so much research and tried so many options, but I have yet to find a good way to approach this problem except to say "why, Google, why."
Things I've considered:
Add an extra field to Event that is a copy of the user's friend list, and use a single equality filter on that multi-valued property to find events (extremely wasteful, since there may be many, many events).
Split the Event query into batches of 30 friends at a time, work out a way to ensure continued retrieval from a synthetic cursor based on time, and merge the results (way too many edge cases, and it makes reading events very difficult).
I would really appreciate any input you could offer since I am 100% out of ideas
TL;DR ~ GAE has a limit on how many items an in-clause can handle, and fml.
You come from a relational database background, so the concept of denormalization is probably a bit painful - I know it was for me.
Right now you have a single table that contains all events from all users. This approach works well in relational databases but is a nightmare in the datastore for the reasons you named.
So to solve this concrete problem you could restructure your data as follows:
All users have two timelines. One for their own posts and one from friends' posts. (There could be a third timeline for public stuff.)
When a new event is published, it is written to the timeline of the user who created it and to the timelines of all the receiving users. (You may want to store references to those third-party timeline entries in the author's timeline, so you know what to delete when the user decides to delete an event.)
Now every user has access to complete timelines: their own, and the one built from third-party events. Those timelines are easy to query, and you will not need sub-selects at all.
There are downsides to this approach:
Writing costs are higher. You have to write each event to many timelines instead of one, and you will probably want to do this in a task queue so the request has enough time to write to all of them.
You're using a lot more storage, BUT storage is really cheap; I'm guessing that in the long run the storage will be cheaper than running expensive queries.
What you get in return is lightning-fast responses to simple queries through this denormalization. All that remains is to merge the responses from the different timelines in the UI (you can do it on the server side, but I would do it in the UI).
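As a minimal sketch (not from the original answer) of what that fan-out could look like with Objectify - the TimelineEntry entity and its field names are assumptions for illustration:

@Entity
class TimelineEntry {
    @Id Long id;
    @Parent Key<User> owner;   // each user is the parent of their own timeline entries
    Key<Event> event;          // reference back to the original Event
    @Index Date eventTime;
}

// On publish: write one entry to the author's timeline and one per receiver,
// ideally from a task queue so the user-facing request stays fast.
void fanOut(Event event, Key<Event> eventKey) {
    List<TimelineEntry> entries = new ArrayList<>();
    List<Key<User>> owners = new ArrayList<>(event.receivers);
    owners.add(event.user);
    for (Key<User> owner : owners) {
        TimelineEntry entry = new TimelineEntry();
        entry.owner = owner;
        entry.event = eventKey;
        entry.eventTime = event.eventTime;
        entries.add(entry);
    }
    ofy().save().entities(entries).now();
}

// Reading a timeline is then a single ancestor query, no sub-selects:
List<TimelineEntry> timeline = ofy().load().type(TimelineEntry.class)
        .ancestor(userKey)
        .order("-eventTime")
        .limit(50)
        .list();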
Related
In my app I have Case, and for each Case there can be 0 to 2 Claims. If a Case has 0 claims it runs pretty fast, with 1 claim it slows down, and with 2 it is awfully slow. Any idea how to make this faster? I didn't know whether my Case and Claim were going back and forth causing infinite recursion, so I added a @JsonManagedReference and @JsonBackReference, but that doesn't seem to help much with speed. Any ideas? Here is my Case.java:
@Entity
public class Case {
    @OneToMany(mappedBy = "_case", fetch = FetchType.EAGER)
    @Fetch(FetchMode.JOIN)
    @JsonManagedReference(value = "case-claim")
    public Set<Claim> claims;
}
In Claim.java:
@Entity
public class Claim implements Cloneable {
    @ManyToOne(optional = true)
    @JoinColumn(name = "CASE_ID")
    @JsonBackReference(value = "case-claim")
    private Case _case;
}
output of 0 claims:
https://gist.github.com/elmatt/2cafbe7ecb1fa0b7f6a8
output of 2 claims:
https://gist.github.com/elmatt/b000bc28909453effc95
Your problem has nothing to do with the relationship between Case and Claim.
FYI: 300ms is not "pretty fast." Your problem is that you expect Hibernate to magically and quickly deliver a complex object hierarchy to you, with no particular effort on your part. I view ORM as "The Big Lie" - it is super easy to use and works great on toy problems, but tends to fail miserably when you try to scale to interesting applications (like yours).
Don't abandon Hibernate, but realize that you are going to need to work harder than you thought you would in order to make it work for you.
I happen to work in a similar data domain (post-adjudication healthcare claim analysis and processing). You should be able to select this kind of data in well under 10ms per claim (with all associated dimensions) using MySQL on modest hardware from a table with >1 billion claims and the DB hosted on a separate server from the app.
How do you get from where you are to where you should be?
1. Minimize the number of round-trips to the database by minimizing the number of separate queries that are executed.
2. Hand-craft your important queries to grab just the rows and joins that you actually need (see the sketch after this list).
3. Use explain plan on every query to make sure that it hits the tables in the right order and every step is appropriately supported by an index.
4. Consider partitioning your big tables and include the partition criteria in your queries to enable partition-pruning to focus the query on the proper data.
5. Be very hesitant to let hibernate manage your relationships between your entities. I generally do not let hibernate deal with any relationships.
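As a sketch of point 2, assuming a JPA EntityManager is available (entityManager and caseId are placeholders, not from the question):

// One hand-crafted JPQL query with an explicit fetch join pulls a Case
// and its Claims in a single round trip, instead of relying on EAGER
// fetching everywhere.
List<Case> cases = entityManager.createQuery(
        "select distinct c from Case c "
      + "left join fetch c.claims "
      + "where c.id = :caseId", Case.class)
    .setParameter("caseId", caseId)
    .getResultList();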
A few years ago, I worked on a product that is an iPhone app where the user walks through workflows (e.g., a nurse taking a patient's vitals) and each screen made a round-trip to the app server to execute the workflow step and get the data for the next screen. Think about how little data you can work with on an iPhone screen. Yet the DB portion of the round-trip generally took 2-5 seconds to execute. Everyone there took it for granted, because "That is how long it has always taken." I dug into the code and found that each step was pulling in a significant portion of the database (and then was not used by the business logic).
The only time they tweaked the default Hibernate behavior was when they got an exception due to too many joins (yes, MySQL has a limit of 61 tables in one join).
The approach of creating your Java data model and simply ORM'ing it into the database generally works just fine on configuration data and the like, but tends to perform terribly for complex data models involving your transactional data. This is what is biting you now.
Your problem is totally fixable, and can be attacked incrementally - you don't have to tear apart the whole application to start making things better.
Can you enable Hibernate logging and provide the output? It should show the SQL queries being executed against your DB. Information about which DB you are using would also be useful. Once you have those, I would recommend profiling the queries to make sure your DB is set up appropriately; it sounds like a non-indexed query.
The size of the datasets would also help in narrowing down possible issues - the number of rows and so on.
I would also recommend timing the actual Hibernate call (this can be as crude as log statements immediately before and after) versus the overall processing, to identify whether it really is Hibernate or some other processing. Without further information and context, that is not clear here.
Now that you've posted your queries, we can see what is happening. It looks like the structure of your entities is more complex than the code snippet originally posted suggests. There are references to Person, Activities, HealthPlan and others in there.
As others have commented your query is triggering a very large select of a lot of data due to the nature of your model.
I recommend creating named queries for claims and then loading them using the ID of the Case (sketched below).
You should also review your Hibernate model and switch to FetchType.LAZY; otherwise Hibernate will create large queries such as the one you have posted. The catch is that if you try to access a related entity outside of the transaction, you will get a LazyInitializationException. You will need to consider each use case and ensure you load the data you need. Two common mistakes with Hibernate are to use FetchType.EAGER everywhere, or to start the transaction too early in order to avoid this. There is no single correct design approach, but I normally do the following:
JSP -> Controller -> [TX BOUNDARY] Service -> DAO
Your service method(s) should encapsulate the business logic needed to load the data you require before passing it back to the controller.
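A rough sketch of that direction, assuming plain JPA (the query name "Claim.byCaseId" and the service wiring are my inventions, not from the question):

// 1. Make the collection lazy so loading a Case stays cheap:
@OneToMany(mappedBy = "_case", fetch = FetchType.LAZY)
@JsonManagedReference(value = "case-claim")
public Set<Claim> claims;

// 2. A named query on Claim that loads the claims for one Case by id:
@Entity
@NamedQuery(name = "Claim.byCaseId",
            query = "select cl from Claim cl where cl._case.id = :caseId")
public class Claim implements Cloneable { /* fields as before */ }

// 3. Inside the transactional service method:
List<Claim> claims = entityManager
        .createNamedQuery("Claim.byCaseId", Claim.class)
        .setParameter("caseId", caseId)
        .getResultList();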
Again, per the other answer, I think you're expecting too much of Hibernate. It is a powerful tool but you need to understand how it works to get the best from it.
I was wondering whether you could help me work out a way to model the following in a GAE datastore such that it is scalable and can be updated frequently. I thought I had a solution, which I expressed in this question, but whilst waiting for replies I realised that it might be overly complicated. I have explained below why I have kept it as a separate question.
Problem:
Building a system with users who can send many messages to each other. Each user must be able to retrieve their messages - like online chat. I would like to avoid contention in the case where a user receives many messages over a short time.
Solution 1:
As mentioned here, I am wondering whether a sharded list can be used to implement this. By this I mean storing messages as entity objects, with sender and receiver each keeping the keys of these objects (the messages sent between them) in a list. I thought of sharding because a user who receives many messages would have to update the list frequently, and a sharded approach could prevent datastore contention.
Problem: what happens when the list of keys to a user's received messages gets large? Won't appending to it become slow? I could split the list over several entities, but this would take careful thought about allocation schemes and ways of retrieval. I am willing to do this if it is the best way.
Alternative approach:
Store messages as entity objects (as above), but this time give them indexed properties (date, from, to, etc.). Retrieve messages for a user using queries (date greater than..., from = ..., etc.). This could work well, but I worry: will the indexes degrade as they grow extremely large, with many users sending many messages? It seems like it would degrade into an SQL-like system.
Any thoughts?
I have read about how to model complex relations in the GAE docs, but they use Python for the examples and I am having trouble abstracting the overall design pattern.
Many thanks to anyone with input on this
PS at the moment using the low level datastore directly.
I have created a system similar to this before. The way I chose to implement it was to create a Conversation entity that was the parent of many Message entities. A Conversation had two participants (although you could do more), each of which was the key of a User entity.
Something like this (assuming ofy)
@Entity
public class Conversation {
    @Id Long id;
    @Index Key<User> participant1;
    @Index Key<User> participant2;
    @Index String participant1ExternalId;
    @Index String participant2ExternalId;
}
@Entity
public class Message {
    @Id Long id;
    @Parent Ref<Conversation> conversation;
    @Index String senderExternalId;
    @Index String recipientExternalId;
    String message;
}
In this way, you can query all conversations for a participant in an eventually consistent fashion, and all messages received or sent (or both) for a conversation in a strongly consistent fashion. I had an extra requirement that users not be able to identify each other, so messaging used generated UUIDs (the externalId properties).
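For example (a sketch assuming Objectify's static ofy() import; userKey and conversationKey are placeholders):

// Conversations where the user is a participant - eventually consistent.
// (Repeat with "participant2", or run both queries and merge.)
List<Conversation> conversations = ofy().load().type(Conversation.class)
        .filter("participant1", userKey)
        .list();

// All messages of one conversation - an ancestor query, strongly consistent:
List<Message> messages = ofy().load().type(Message.class)
        .ancestor(conversationKey)
        .list();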
So in this way, sharding and the 1 write/sec limit apply at the conversation level. You could put unread counters on the conversation object for each user, or on each message if you needed to (contention-wise it makes no real difference, so do whatever makes most sense).
If your users regularly exceed 1 message per second per conversation, you'll have a lot of other problems to solve beyond datastore contention, so this is probably a good starting point. In the general case, eventual consistency works very well for this sort of operation (i.e. checking for new messages), so you can lean heavily on it.
There is an application that I'm basically writing with Swing, JDBC and MySQL.
In DB there are tables like Article, Company, Order, Transaction, Client etc.
There are also Java classes that describe them.
User can create, update, delete information about them.
I'll give an example of my problem. An article is characterized by id, name, price, company and unit. When the user wants to save a new article, he chooses the company for this article from the list of all companies. In time, this list could become really big.
Now I could think of two ways to solve this.
When the application starts, it connects to the DB and loads all the data I will then work with:
public final class AllInformationController {
    public static final Collection<Company> COMPANIES = new HashSet<>(1_000_000);
    public static final Collection<Article> ARTICLES = new HashSet<>(1_000_000);
    public static final Collection<Order> ORDERS = new HashSet<>(1_000_000);
    public static final Collection<Transaction> TRANSACTIONS_HISTORY = new HashSet<>(10_000_000);
    // etc...

    private AllInformationController() {}
}
Then if user wants for example to change some Company data (like address or telephone etc.), after doing it the program should update the DB info for that company.
The second approach is basically to connect to the database every time the user queries or changes some information, so I would mostly be working with ResultSets.
I prefer the second way, but I'm just not sure it's the best one. I think there should be more productive, less expensive ways to work with the data.
The 2nd approach is better, although the sweet spot probably lies somewhere between the two. The 2nd approach allows multiple applications (or users of the same application) to modify the data at the same time, whereas the 1st approach may end up working with stale data if you load everything at once (especially if a user leaves the application open for a while). I would go with the 2nd approach and then figure out which optimizations to make.
Since you think the 1st approach may be usable, I assume you don't have too many users working with the tool at the same time. If that is the case, then perhaps you don't need the optimizations the 1st method would give you, as there won't be much database usage anyway.
As for working with ResultSets more often in the 2nd approach than in the 1st: it doesn't need to be that way. You can reuse the same methods from the 1st approach that translate your data into Java data structures.
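For illustration, a single row-mapping helper could serve both the bulk load of the 1st approach and the per-query reads of the 2nd (the column names and the Company constructor are assumptions):

// Translate one ResultSet row into a domain object; callable from a
// load-everything loop or from an individual query.
private static Company mapCompany(ResultSet rs) throws SQLException {
    return new Company(
            rs.getLong("id"),
            rs.getString("name"),
            rs.getString("address"),
            rs.getString("telephone"));
}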
You already made a very bad decision here:
And when user wants to save new article he chooses the company for this article
from the _list_ of all companies
A list works reasonably only if the number of choices is fairly limited; below 10-20 you may get away with a combo box. For thousands of choices a list is very cumbersome, and the further it grows, the slower and more unwieldy choosing from it becomes.
This is typically solved by some kind of search field (e.g. user types customer number, presses tab and information is fetched), possibly combined with a search dialog (with more search options and a way to select a result found as "it").
Since a search request will typically select only a few items, directly querying the DB is usually practical. For a search dialog you may need to artificially limit the number of results (using specific SQL clauses for paging).
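A minimal JDBC sketch of such a search query with paging - the table and column names, and the variables connection, searchTerm, pageSize and pageNumber, are made up for illustration:

String sql = "SELECT id, name FROM company WHERE name LIKE ? "
           + "ORDER BY name LIMIT ? OFFSET ?";
List<Company> companies = new ArrayList<>();
try (PreparedStatement ps = connection.prepareStatement(sql)) {
    ps.setString(1, searchTerm + "%");    // prefix search on the name
    ps.setInt(2, pageSize);               // cap the number of results
    ps.setInt(3, pageNumber * pageSize);  // paging offset
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            companies.add(new Company(rs.getLong("id"), rs.getString("name")));
        }
    }
}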
I am moving my application from a relational DB to objectify / google app engine.
The application has a relationship which is modelled as follows:
One Message can be sent to many Users. Each User can have many Messages addressed to them.
I need to be able to scan for all Messages addressed to a particular User.
How do I do this with Objectify?
There are a number of ways to do it.
You can save a list of messages in the user object. This will work nicely with your requirement to get all messages addressed to a user, as there is no need to do a query.
You can save a list of users in the message object. To get all the messages addressed to a single user, do a query.
You can save BOTH lists above. Remember, in App Engine there is usually no need to normalize or to worry about disk space and duplicates. Almost always build your structure so that queries will be fast.
You can forget about lists and have Relationship objects, just like a join table in a relational database. This can still be a decent option in App Engine in some use cases, for example when the lists are just too big (thousands of entries) and would bloat your objects or not even be query-able.
The most important variable in choosing an approach for the query you specified is how many messages will usually be addressed to a single user, and whether there is a maximum. If we are talking about an average of dozens or fewer, and a maximum of hundreds, a list of messages in the user object sounds like a good option. If more, and especially if unbounded, it won't work so well, and you will need to make an actual query.
Beyond the answers already posted I would suggest that you not include a link from User to the Message, for three reasons:
Collections in GAE are hard limited to 5000 items. As soon as your user's inbox exceeds 5k items your app will start throwing exceptions.
There is a performance cost to expanding the quantity of data in an entity; loading a bunch of 500k entities is slower than loading a bunch of 5k entities. Plus your usage of memcache will be less effective since you can fit fewer items in the same space. User objects tend to get loaded a lot.
You can easily hit the transaction rate limit for a single entity (1/s). If 50 people send you a message at the same time, you will have massive concurrency problems as all 50 retry with optimistic failures.
If you can live with a limit of 5000 recipients for a single message, storing the Set of destination keys in the Message (and indexing this set so you can query for all messages of a user) is probably a great solution. There is almost certainly an advantage also to assigning the message a @Parent of the sender.
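A sketch of that structure in Objectify (the field names are mine, not the answerer's):

@Entity
class Message {
    @Id Long id;
    @Parent Key<User> sender;          // parent = the sending user
    @Index Set<Key<User>> recipients;  // indexed multi-valued property
    String text;
}

// All messages addressed to a user: an equality filter on a multi-valued
// property matches if ANY element of the set equals the given key.
List<Message> inbox = ofy().load().type(Message.class)
        .filter("recipients", userKey)
        .list();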
If you are twitter-like and expect a message to have more than 5k recipients, or if your messages typically have a lot of recipients (thus the message entity is bloated), you may wish to consider the Relation Index Entity pattern that Brett Slatkin talked about in his Google I/O talk from 2009: https://www.youtube.com/watch?v=AgaL6NGpkB8
You have to maintain the relationship on your own. This is because, depending on the application, it might make sense to let users exist without messages, or even the opposite.
The suggested approach by Objectify Wiki (https://code.google.com/p/objectify-appengine/wiki/IntroductionToObjectify Multi-Value Relationship) is to keep a collection(or array) of keys
@Entity
public class Message {
    @Id String timeStamp;
    Key<User>[] destination;
}

@Entity
public class User {
    @Id String name;
    Key<Message>[] inbox;
}
Then, if you want to remove all of a user's messages when the user is removed, just remove them from the datastore before removing the user. Adding a new message for a particular user works exactly the same way.
I have a simple data model that includes
USERS: store basic information (key, name, phone # etc)
RELATIONS: describe, e.g. a friendship between two users (supplying a relationship_type + two user keys)
COMMENTS: posted by users (key, comment text, user_id)
I'm getting very poor performance. For instance, say I try to print the first names of all of a user's friends, and the user has 500 friends: I can fetch the list of friend user_ids very easily in a single query. But then, to pull out the first names, I have to make 500 round trips to the Datastore, each of which seems to take on the order of 30 ms. If this were SQL, I'd just do a JOIN and get the answer out fast.
I understand there are rudimentary facilities for performing two-way joins across un-owned relations in a relaxed implementation of JDO (as described at http://gae-java-persistence.blogspot.com) but they sound experimental and non-standard (e.g. my code won't work in any other JDO implementation).
Worse yet, what if I want to pull out all the comments posted by a user's friends? Then I need to go from User -> Relation -> Comments, i.e. a three-way join, which isn't even supported experimentally. The overhead of 500 round trips to get a friend list, plus another 500 trips to see if there are any comments from a user's friends, is already enough to push the runtime over 30 seconds.
How do people deal with these problems in real-world datastore-backed JDO applications? (Or do they?)
Has anyone managed to extract satisfactory performance from JDO/Datastore in this kind of (very common) situation?
-Bosh
First of all, for objects that are frequently accessed (like users), I rely on the memcache. This should speed up your application quite a bit.
If you have to go to the datastore, the right way to do this should be through getObjectsById(). Unfortunately, it looks like GAE doesn't optimize this call. However, a contains() query on keys is optimized to fetch all the objects in one trip to the datastore, so that's what you should use:
List<Key> myFriendKeys = fetchFriendKeys();
Query query = pm.newQuery(User.class, ":p.contains(key)");
List<User> friends = (List<User>) query.execute(myFriendKeys);
You could also rely on the low-level API get() that accepts multiple keys, or do like me and use Objectify.
A totally different approach would be to use an equality filter on a list property. This matches if any item in the list matches. So if you have a friendOf list property in your User entity, you can issue a single query: friendOf == theUser. You might want to check this: http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
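A sketch of that list-property query in JDO (assuming friendOf holds user keys and pm is a PersistenceManager):

// Returns every User whose friendOf list contains the given key; the
// equality filter matches any element of the multi-valued property.
Query q = pm.newQuery(User.class, "friendOf == :userKey");
List<User> friends = (List<User>) q.execute(theUserKey);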
You have to minimize DB reads. That must be a huge focus for any GAE project - anything else will cost you. To do that, pre-calculate as much as you can, especially oft-read information. To solve the issue of reading 500 friends' names: you will likely change the friend list far less often than you read it, so on each change, store all the names in a structure you can read with a single get.
If you absolutely cannot, then you have to tweak each case by hand, e.g. by using the low-level API to do a batch get.
Also, optimize for speed rather than data size. Use extra structures as indexes and save objects in multiple ways, so you can read them as quickly as possible. Data is cheap, CPU time is not.
Unfortunately Phillipe's suggestion
Query query = pm.newQuery(User.class, ":p.contains(key)");
is only optimized to make a single query when searching by primary key. Passing in a list of ten non-primary-key values, for instance, gives the following trace:
(screenshot of the slow-query trace: http://img293.imageshack.us/img293/7227/slowquery.png)
I'd like to be able to bulk-fetch comments, for example, from all of a user's friends. If I do store a List on each user, this list can't be longer than 1000 elements (if it's an indexed property of the user), as described at http://code.google.com/appengine/docs/java/datastore/overview.html.
Seems increasingly like I'm using the wrong toolset here.
-B
Facebook has 28 terabytes of memory cache... However, making 500 trips to memcached isn't very cheap either, and it can't be used to store a gazillion small items. "Denormalization" is the key. Such applications do not need to support ad-hoc queries; compute and store the results directly for the few queries you do support.
In your case, you probably have just one type of query: return the data of this, that and the other that should be displayed on a user page. You can precompute this big ball of mess, so that later a single query based on userId can fetch it all.
When userA makes a comment to userB, you retrieve userB's big ball of mess, insert userA's comment into it, and save it.
Of course, there are a lot of problems with this approach. Giant internet companies probably don't have a choice; generic query engines just don't cut it. But for others? Wouldn't you be happier if you could just use the good old RDBMS?
If it is a frequently used query, you can consider preparing indexes for the same.
http://code.google.com/appengine/articles/index_building.html
The indexed property limit is now raised to 5000.
However you can go even higher than that by using the method described in http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
Basically, have a bunch of child entities of the User called UserFriends, thus splitting the big list and raising the limit to n * 5000, where n is the number of UserFriends entities.
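A sketch of that pattern with Objectify (UserFriends is the name suggested above; the field names are illustrative):

@Entity
class UserFriends {
    @Id Long id;
    @Parent Key<User> owner;         // child entity of the user
    @Index List<Key<User>> friends;  // up to ~5000 indexed entries per entity
}

// Find every UserFriends shard containing a given friend, then map each
// shard back to its parent User key.
List<Key<User>> usersWithFriend = new ArrayList<>();
for (UserFriends shard : ofy().load().type(UserFriends.class)
        .filter("friends", friendKey)) {
    usersWithFriend.add(shard.owner);
}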