My team is writing an application with GAE (Java), and the experience has led me to question the scalability of entity relationship modeling (specifically many-to-many) in non-relational databases like BigTable.
The preferred solution for modeling unowned one-to-many and many-to-many relationships in the App Engine Datastore (see Entity Relationships in JDO) seems to be list-of-keys. However, Google warns:
"There are a few limitations to implementing many-to-many
relationships this way. First, you must explicitly retrieve the values
on the side of the collection where the list is stored since all you
have available are Key objects. Another more important one is that you
want to avoid storing overly large lists of keys..."
Speaking of overly large lists of keys: if you attempt to model this way and assume that you are storing one Long (8 bytes) for each key, then with a per-entity limit of 1 MB the theoretical maximum number of relationships per entity is ~130k (1 MB / 8 bytes ≈ 131,000). For a platform whose primary advantage is scalability, that's really not that many relationships. So now we are looking at possibly sharding entities that require more than 130k relationships.
A different approach (Relationship Model) is outlined in the article Modeling Entity Relationships as part of the Mastering the datastore series in the AppEngine developer resources. However, even here Google warns about the performance of relational models:
"However, you need to be very careful because traversing the
connections of a collection will require more calls to the datastore.
Use this kind of many-to-many relationship only when you really need
to, and do so with care to the performance of your application."
So by now you are asking: 'Why do you need more than 130k relationships per entity?' Well, I'm glad you asked. Let's take, for example, a CMS application with, say, 1 million users. (Hey, I can dream, right?!)
Users can upload content and share it with:
1. public
2. individuals
3. groups
4. any combination
Now someone logs in, and navigates to a dashboard that shows new uploads from people they are connected to in any group. This dashboard should include public content, and content shared specifically with this user or a group this user is a member of. Not too bad right? Let's dig into it.
public class Content {
    private Long id;
    private Long authorId;
    private List<Long> sharedWith; // can hold individual user ids or group ids
}
Now my query to get everything an id is allowed to see might look like this:
List<Long> idsThatGiveMeAccess = new ArrayList<Long>();
idsThatGiveMeAccess.add(myId);
idsThatGiveMeAccess.add(publicId); // let's say that sharing with 0L makes it public
for (Group g : groupsImIn)
    idsThatGiveMeAccess.add(g.getId());

List<Long> authorIdsThatIWantToSee = new ArrayList<Long>();
// add a bunch of authorIds

Query q = new Query("Content")
    .addFilter("authorId", Query.FilterOperator.IN, authorIdsThatIWantToSee)
    .addFilter("sharedWith", Query.FilterOperator.IN, idsThatGiveMeAccess);
Obviously I've already broken several rules. Namely, using two IN filters will blow up. Even a single IN filter at any size approaching the limits we are talking about would blow up. Aside from all that, let's say I want to limit and page through the results... no no! You can't do that if you use an IN filter. I can't think of any way to do this operation in a single query - which means you can't paginate it without extensive read-time processing and managing multiple cursors.
So here are the tools I can think of for doing this: denormalization, sharding, or relationship entities. However, even with these concepts I don't see how it is possible to model this data in a way that scales. Obviously it's possible; Google and others do it all the time. I just can't see how. Can anyone shed light on how to model this, or point me toward good resources for CMS-style access control on a NoSQL database?
Storing a list of ids as a property won't scale.
Why not simply store a new object for each new relationship (like in SQL)?
For your CMS, that object would store two properties: the id of the shared item and the user id. If an item is shared with 1000 users, you will have 1000 of these. Querying it for a given user is trivial. Listing permissions for a given item, or the list of what has been shared with a user, is easy too.
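A minimal sketch of such a relationship entity with the low-level Datastore API (kind and property names are my own; adapt to JDO/Objectify as needed):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Query;

public class ContentShares {
    private final DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    // One entity per (content, grantee) pair; a grantee is a user or group id.
    public void share(long contentId, long granteeId) {
        Entity share = new Entity("ContentShare");
        share.setProperty("contentId", contentId);
        share.setProperty("granteeId", granteeId);
        ds.put(share);
    }

    // Everything shared with a given user or group: a single equality query,
    // which can be paginated with a cursor, unlike an IN filter.
    public Iterable<Entity> sharedWith(long granteeId) {
        Query q = new Query("ContentShare")
            .addFilter("granteeId", Query.FilterOperator.EQUAL, granteeId);
        return ds.prepare(q).asIterable();
    }
}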
I have to process a huge amount of data distributed over 20 tables (~5 million records in total) and I need to load them efficiently.
I'm using Wildfly 14 and JPA/Hibernate.
Since, in the end, every single record will be used by the business logic (in the same transaction), I decided to pre-load the entire content of the required tables into memory, simply via:
em.createQuery("SELECT e FROM Entity e").getResultList().size();
After that, every object should be available in the transaction and thus be available via:
em.find(Entity.class, id);
But somehow this doesn't work, and there are still a lot of calls to the DB, especially for the relationships.
How can I efficiently load the whole content of the required tables, including the relationships, and make sure I got everything, i.e. that there will be no further DB calls?
What I already tried:
FetchType.EAGER: Still too many single selects / object graph too complex
EntityGraphs: Same as FetchMode.EAGER
Join fetch statements: Best results so far, since they simultaneously populate the relationships to the referred entities
2nd Level / Query Cache: Not working, probably the same problem as em.find
One thing to note is that the data is immutable (at least for a specific time) and could also be used in other transactions.
Edit:
My plan is to load and manage the entire data set in a @Singleton bean. But I want to make sure I'm loading it the most efficient way and be sure the entire data set is loaded. There should be no further queries necessary when the business logic is using the data. After a specific time (EJB timer), I'm going to discard the entire data set and reload the current state from the DB (always whole tables).
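A rough sketch of what I have in mind (MyEntity and the hourly schedule are placeholders):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import javax.annotation.PostConstruct;
import javax.ejb.Schedule;
import javax.ejb.Singleton;
import javax.ejb.Startup;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

@Singleton
@Startup
public class DataCache {

    @PersistenceContext
    private EntityManager em;

    // id -> entity; relationships must be populated (e.g. via join fetch)
    // before the entities are handed out, or callers will trigger lazy loads.
    private volatile Map<Long, MyEntity> cache = Collections.emptyMap();

    @PostConstruct
    void init() {
        reload();
    }

    @Schedule(hour = "*", persistent = false) // hourly; the interval is a placeholder
    void refresh() {
        reload();
    }

    private void reload() {
        Map<Long, MyEntity> fresh = new HashMap<>();
        for (MyEntity e : em.createQuery("SELECT e FROM MyEntity e", MyEntity.class)
                            .getResultList()) {
            fresh.put(e.getId(), e);
        }
        cache = fresh; // atomic swap: readers never see a half-filled map
    }

    public MyEntity find(Long id) {
        return cache.get(id);
    }
}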
Keep in mind that you'll likely need a 64-bit JVM and a large amount of memory. Take a look at the Hibernate 2nd level cache. Some things to check, since we don't have your code:
The @Cacheable annotation will clue Hibernate in so that the entity is cacheable (a sketch follows this list)
Configure 2nd level caching to use something like ehcache, and set the maximum memory elements to something big enough to fit your working set into it
Make sure you're not accidentally using multiple sessions in your code.
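As an illustration, the entity side might look like this (a sketch; the Customer entity is a placeholder, and the ehcache region sizing lives in your cache configuration):

import javax.persistence.Cacheable;
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

@Entity
@Cacheable // lets Hibernate put this entity into the 2nd level cache
@Cache(usage = CacheConcurrencyStrategy.READ_ONLY) // the data is immutable for a while
public class Customer {
    @Id
    private Long id;

    private String name;

    // getters/setters omitted
}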
If you need to process things in this way, you may want to consider changing your design to not rely on having everything in memory, not using Hibernate/JPA, or not using an app server. This will give you more control over how things are executed. This may even be a better fit for something like Hadoop. Without more information it's hard to say which direction would be best for you.
I understand what you're asking, but JPA/Hibernate isn't going to want to cache that much data for you, or at least I wouldn't expect a guarantee from it. Consider that you described 5 million records. What is the average length per record? 100 bytes gives 500 MB of memory that'll just crash your untweaked JVM. Probably more like 5000 bytes on average, and that's 25 GB of memory. You need to think about what you're asking for.
If you want it cached, you should do that yourself, or better yet just use the results when you have them. If you want memory-based data access, you should look at a technology specifically for that. http://www.ehcache.org/ seems popular, but it's up to you, and you should be sure you understand your use case first.
If you are trying to be database efficient, then you should just understand what you're doing, and design and test carefully.
Basically it should be a pretty easy task to load entire tables with one query per table and link the objects, but JPA works differently, as shown in this example.
The biggest problem is @OneToMany/@ManyToMany relations:
@Entity
public class Employee {
    @Id
    @Column(name = "EMP_ID")
    private long id;
    ...
    @OneToMany(mappedBy = "owner")
    private List<Phone> phones;
    ...
}

@Entity
public class Phone {
    @Id
    private long id;
    ...
    @ManyToOne
    @JoinColumn(name = "OWNER_ID")
    private Employee owner;
    ...
}
FetchType.EAGER
If defined as FetchType.EAGER, then for the query SELECT e FROM Employee e Hibernate generates the SQL statement SELECT * FROM EMPLOYEE and, right after it, SELECT * FROM PHONE WHERE OWNER_ID=? for every single Employee loaded, commonly known as the n+1 problem.
I could avoid the n+1 problem by using the JPQL-query SELECT e FROM Employee e JOIN FETCH e.phones, which will result in something like SELECT * FROM EMPLOYEE LEFT OUTER JOIN PHONE ON EMP_ID = OWNER_ID.
The problem is, this won't work for a complex data model with ~20 tables involved.
FetchType.LAZY
If defined as FetchType.LAZY, the query SELECT e FROM Employee e will just load all Employees as proxies, loading the related Phones only when phones is accessed, which in the end leads to the n+1 problem as well.
To avoid this, the obvious move is to load all the Phones into the same session with SELECT p FROM Phone p. But when accessing phones, Hibernate will still execute SELECT * FROM PHONE WHERE OWNER_ID=?, because Hibernate doesn't know that all the Phones are already in its current session.
Even when using the 2nd level cache, the statement will be executed against the DB, because Phone is indexed by its primary key in the 2nd level cache, not by OWNER_ID.
Conclusion
There is no mechanism like "just load all data" in Hibernate.
It seems there is no other way than to keep the relationships transient and connect them manually, or even to just use plain old JDBC.
EDIT:
I just found a solution which works very well. I defined all relevant @ManyToMany and @OneToMany relations as FetchType.EAGER combined with @Fetch(FetchMode.SUBSELECT), and all @ManyToOne relations with @Fetch(FetchMode.JOIN), which results in an acceptable loading time. Next to adding @javax.persistence.Cacheable(true) to all entities, I added @org.hibernate.annotations.Cache to every relevant collection, which enables collection caching in the 2nd level cache. I disabled 2nd level cache timeout eviction and "warm up" the 2nd level cache via a @Singleton EJB combined with @Startup on server start / deploy. Now I have 100% control over the cache: there are no further DB calls until I manually clear it.
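For reference, a sketch of how those annotations combine, reusing the Employee/Phone example from above:

import java.util.List;
import javax.persistence.*;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;
import org.hibernate.annotations.Fetch;
import org.hibernate.annotations.FetchMode;

@Entity
@Cacheable(true)
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
public class Employee {
    @Id
    private long id;

    // EAGER + SUBSELECT: the phones of all loaded employees are fetched
    // with one extra query instead of one query per employee.
    @OneToMany(mappedBy = "owner", fetch = FetchType.EAGER)
    @Fetch(FetchMode.SUBSELECT)
    @Cache(usage = CacheConcurrencyStrategy.READ_WRITE) // collection cache
    private List<Phone> phones;
}

@Entity
@Cacheable(true)
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
public class Phone {
    @Id
    private long id;

    // To-one relations are fetched via an SQL join.
    @ManyToOne
    @Fetch(FetchMode.JOIN)
    private Employee owner;
}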
I'm going to design a merchant application. After merchants register with the system, they will be able to add their products, discounts, prices, etc. And there are smart mobile apps for visiting each merchant and their products.
So regarding the database design (I hope to use MySQL), I have three options:
1. Use one database and a single table structure, maintaining the catalog with a column called merchant_id.
2. Use one database and create the same table structure for each merchant, with a unique prefix in the table names.
3. Use a separate database with its own table structure for each merchant, created when they register with the system. In this case, maintain a master DB to keep each merchant's DB information.
We are developing a single application to cater to all merchants' and customers' requests, and there will be a lot of merchants and customers interacting with the system.
Currently we are planning to use Spring MVC and Spring Data JPA.
So I'm struggling to make the correct decision in terms of scalability, maintainability, etc. Your expert advice/recommendations are highly appreciated.
1) Use one database and a single table structure to maintain the catalog
with a column called merchant_id.
This is the easiest route to take.
Pros
Low maintenance. Any changes to the DB make it to one schema / database.
Cons
Does not scale beyond X merchants and N transactions per second on the database.
2) Use one database and create the same table structure for each merchant
with a unique prefix in the table names.
This is a hybrid model of sorts; writing the SQL and trying to track which prefix belongs to which merchant can get messy if you do not handle it correctly.
Pros
Can scale a little better
Cons
Maintenance overhead on each table; e.g. adding a new column called created to the user table requires you to modify user_111, user_121, etc.
You can possibly mix up queries by attempting to join user_111 with access_121.
3) Use a separate database with its own table structure for each merchant
when they register with the system. In this case you maintain a master DB
to keep each merchant's DB information.
This provides the most scale but also gives you the most maintenance overhead.
Pros
Can scale each database individually based on the type of customer you have and the traffic they provide.
Cons
High maintenance for each database, because individual parameters are tweaked at the DB level too (SSD / shared buffers / fsync time with the disk / write caches, etc.).
If you're starting out by designing a system where you will not know what kind of traffic it will attract on day 1, choose #1. Should the traffic be unexpectedly large, you can always scale vertically and place the high-traffic customers on another DB later (through a hashing mechanism that puts the customers into DB buckets).
If you expect the site traffic to be large enough and already have capacity planned out for the customers, go for #3. You must bear the brunt of the maintenance overhead, but at least you get to scale each database based on the traffic that hits it.
I'm not a fan of #2 since I've seen that approach let down some products that implemented it.
In my opinion, option 1 is the way to go. The benefit I see is that you can run aggregate queries over this table to perform calculations over each merchant, e.g. your admin view wants to see the top 20 merchants with the highest number of products uploaded.
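For example (a sketch; the Product entity and merchantId field are assumed names), that top-20 view is a single aggregate query against the shared table:

import java.util.List;
import javax.persistence.EntityManager;

public class MerchantStats {
    // Top 20 merchants by number of uploaded products.
    public List<Object[]> topMerchants(EntityManager em) {
        return em.createQuery(
                "SELECT p.merchantId, COUNT(p) AS cnt "
              + "FROM Product p GROUP BY p.merchantId ORDER BY cnt DESC",
                Object[].class)
            .setMaxResults(20)
            .getResultList();
    }
}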
The drawback you might see in option 1 is that this table will be huge. This can be addressed with partitioning techniques and properly chosen indexes.
Options 2 and 3 are not nice because they introduce redundancy into your schema.
Also, consider that with JPA your entity classes naturally map to tables; table prefixes per merchant would be painful to hack around in JPA. This is also a +1 for option 1.
What benefits do you see in options 2 and 3? I don't really see any advantages, only drawbacks.
I was wondering whether you could help me work out a way to model the following in a GAE datastore such that it is scalable and can be updated frequently. I thought I had a solution which I expressed in this question but whilst waiting for replies I realise that it might be overly complicated. I have explained below why I have kept it as a separate question.
Problem:
Building a system with users who can send many messages to each other. Each user must be able to retrieve their messages, like online chat. I would like to avoid contention when a user may receive many messages over a short time.
Solution 1:
As mentioned here, I am wondering whether a sharded list can be used to implement this. By this I mean: have messages stored as entity objects, and have sender and receiver store the keys of these objects (the messages sent between them) in a list. I thought of sharding because a user who receives many messages would have to update the list frequently, and a sharded approach could prevent datastore contention.
Problem: what happens when the list of keys to a user's received messages gets large? Will appending to it not become slow? I could split the list over several entities, but this would take careful thought on allocation schemes and ways of retrieval. I am willing to do this if it is the best way.
Alternative approach:
Store messages as entity objects (as above), but this time give them indexed properties (date, from, to, etc.). Retrieve messages for a user using queries (date greater than..., from=..., etc.). This could work well, but I worry: will the indexes degrade as they grow extremely large, with many users sending many messages? It seems like it would degrade into an SQL-like system.
Any thoughts?
I have read about how to model complex relations in the GAE docs, but they use Python for the examples and I am having trouble abstracting the overall design pattern.
Many thanks to anyone with input on this
PS: at the moment I am using the low-level datastore directly.
I have created a system similar to this before. The way I chose to implement it was to create a Conversation entity that was the parent of many Message entities. A conversation had two participants (although you could do more), each of which was the key of a User entity.
Something like this (assuming Objectify):
@Entity
public class Conversation {
    @Id Long id;
    @Index Key<User> participant1;
    @Index Key<User> participant2;
    @Index String participant1ExternalId;
    @Index String participant2ExternalId;
}

@Entity
public class Message {
    @Id Long id;
    @Parent Ref<Conversation> conversation;
    @Index String senderExternalId;
    @Index String recipientExternalId;
    String message;
}
In this way, you can query all conversations for a participant in an eventually consistent fashion, and all messages received or sent (or both) for a conversation in a strongly consistent fashion. I had an extra requirement that users not be able to identify each other, so messaging used generated UUIDs (the externalId properties).
So in this way, sharding and the 1 write/sec limit apply at the conversation level. You could put unread counters onto the conversation object for each user, or on each message if you needed to (at a contention level it makes no real difference, so do whatever makes most sense).
If your users regularly exceed 1 message per second per conversation, you'll have a lot of other problems to solve beyond datastore contention, so it's probably a good starting point. In the general case, eventual consistency works very well for this sort of operation (i.e. checking for new messages), so you can lean heavily on it.
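For completeness, querying these entities might look roughly like this (Objectify 4-style API; adjust to your version):

import static com.googlecode.objectify.ObjectifyService.ofy;

import java.util.List;
import com.googlecode.objectify.Key;

public class ConversationQueries {

    // Conversations where the user is participant1; repeat for participant2,
    // or store a single indexed list property of participants instead.
    public List<Conversation> conversationsFor(Key<User> userKey) {
        return ofy().load().type(Conversation.class)
                .filter("participant1", userKey)
                .list();
    }

    // Ancestor query: a strongly consistent read of a conversation's messages.
    public List<Message> messagesIn(Long conversationId) {
        return ofy().load().type(Message.class)
                .ancestor(Key.create(Conversation.class, conversationId))
                .list();
    }
}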
I am moving my application from a relational DB to Objectify / Google App Engine.
The application has a relationship which is modelled as follows:
One Message can be sent to many Users. Each User can have many Messages addressed to them.
I need to be able to scan for all Messages addressed to a particular User.
How do I do this with Objectify?
There are a number of ways to do it.
1. You can save a list of messages in the user object. This works nicely with your requirement to get all messages addressed to a user, as there is no need to do a query.
2. You can save a list of users in the message object. To get all the messages addressed to a single user, do a query.
3. You can save BOTH lists above. Remember, in App Engine there is usually no need to normalize or to worry about disk space and duplicates. Almost always build your structure so that queries will be fast.
4. You can forget about lists and have Relationship objects, just like a table in a relational database. This can still be a decent option in App Engine in some use cases, for example when the lists are just too big (thousands of entries) and would bloat your objects and may not even be query-able.
The most important variable in deciding which approach to take, in relation to the query you specified, is how many messages will usually be addressed to a single user, and whether there will be a maximum number of messages. If we are talking about an average of dozens or fewer and a maximum of hundreds, a list of messages in the user object sounds to me like a good option. If we are talking about more, and especially if unlimited, it won't work so well, and you will need to make an actual query.
Beyond the answers already posted, I would suggest that you not include a link from User to Message, for three reasons:
Collections in GAE are hard limited to 5000 items. As soon as your user's inbox exceeds 5k items your app will start throwing exceptions.
There is a performance cost to expanding the quantity of data in an entity; loading a bunch of 500KB entities is slower than loading a bunch of 5KB entities. Plus your usage of memcache will be less effective, since you can fit fewer items in the same space. User objects tend to get loaded a lot.
You can easily hit the transaction rate limit for a single entity (1/s). If 50 people send you a message at the same time, you will have massive concurrency problems as all 50 writes retry after optimistic failures.
If you can live with a limit of 5000 recipients for a single message, storing the Set of destination keys in the Message (and indexing this set so you can query for all messages of a user) is probably a great solution. There is almost certainly an advantage also to assigning the message a @Parent of the sender.
If you are Twitter-like and expect a message to have more than 5k recipients, or if your messages typically have a lot of recipients (thus the message entity is bloated), you may wish to consider the Relation Index Entity pattern that Brett Slatkin talked about in his Google I/O talk from 2009: https://www.youtube.com/watch?v=AgaL6NGpkB8
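A rough sketch of that Relation Index Entity pattern with Objectify-style annotations (entity and property names are my own):

import java.util.List;
import com.googlecode.objectify.Key;
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import com.googlecode.objectify.annotation.Index;
import com.googlecode.objectify.annotation.Parent;

@Entity
public class Message {
    @Id Long id;
    String text; // the bulky payload stays here, unindexed
}

// Relation Index Entity: a child of the message that holds only the
// recipient keys, so queries never have to deserialize the payload.
@Entity
public class MessageRecipients {
    @Id Long id;
    @Parent Key<Message> message;
    @Index List<Key<User>> recipients;
}

// Keys-only query; each result's parent key identifies a message:
// ofy().load().type(MessageRecipients.class)
//      .filter("recipients", userKey)
//      .keys(); // then key.getParent() -> the Message key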
You have to maintain the relationship on your own, because depending on the application it could make sense to let users exist without messages, or even the opposite.
The approach suggested by the Objectify wiki (https://code.google.com/p/objectify-appengine/wiki/IntroductionToObjectify, Multi-Value Relationship) is to keep a collection (or array) of keys:
public class Message {
    @Id String timeStamp;
    Key<User>[] destination;
}

public class User {
    @Id String name;
    Key<Message>[] inbox;
}
Then, if you want to remove all of a user's messages when the user is removed, just remove them from the datastore before the user. Adding a new message for a particular user works exactly the same way.
I have a simple data model that includes
USERS: store basic information (key, name, phone # etc)
RELATIONS: describe, e.g. a friendship between two users (supplying a relationship_type + two user keys)
COMMENTS: posted by users (key, comment text, user_id)
I'm getting very poor performance if, for instance, I try to print the first names of all of a user's friends. Say the user has 500 friends: I can fetch the list of friend user_ids very easily in a single query. But then, to pull out the first names, I have to do 500 back-and-forth trips to the Datastore, each of which seems to take on the order of 30 ms. If this were SQL, I'd just do a JOIN and get the answer out fast.
I understand there are rudimentary facilities for performing two-way joins across un-owned relations in a relaxed implementation of JDO (as described at http://gae-java-persistence.blogspot.com) but they sound experimental and non-standard (e.g. my code won't work in any other JDO implementation).
Worse yet, what if I want to pull out all the comments posted by a user's friends? Then I need to get from User --> Relation --> Comments, i.e. a three-way join, which isn't even supported experimentally. The overhead of 500 round trips to get a friend list + another 500 trips to see if there are any comments from a user's friends is already enough to push the runtime beyond 30 seconds.
How do people deal with these problems in real-world datastore-backed JDO applications? (Or do they?)
Has anyone managed to extract satisfactory performance from JDO/Datastore in this kind of (very common) situation?
-Bosh
First of all, for objects that are frequently accessed (like users), I rely on memcache. This should speed up your application quite a bit.
If you have to go to the datastore, the right way to do this should be through getObjectsById(). Unfortunately, it looks like GAE doesn't optimize this call. However, a contains() query on keys is optimized to fetch all the objects in one trip to the datastore, so that's what you should use:
List<Key> myFriendKeys = fetchFriendKeys();
// contains() on keys is fetched in a single trip to the datastore
Query query = pm.newQuery(User.class, ":p.contains(key)");
query.execute(myFriendKeys);
You could also rely on the low-level API get() that accepts multiple keys, or do like me and use Objectify.
A totally different approach would be to use an equality filter on a list property. This matches if any item in the list matches. So if you have a friendOf list property in your User entity, you can issue a single query: friendOf == theUser. You might want to check this: http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
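With the low-level API, that idea looks roughly like this (kind and property names assumed):

import java.util.List;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;

public class FriendQueries {
    // friendOf is a multi-valued property on the User entity; an equality
    // filter matches if ANY value in the list equals the given key.
    public List<Entity> friendsOf(Key theUserKey) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Query q = new Query("User")
            .addFilter("friendOf", Query.FilterOperator.EQUAL, theUserKey);
        return ds.prepare(q).asList(FetchOptions.Builder.withLimit(500));
    }
}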
You have to minimize DB reads. That must be a huge focus for any GAE project - anything else will cost you. To do that, pre-calculate as much as you can, especially oft-read information. To solve the issue of reading 500 friends' names, consider that you'll likely be changing the friend list far less than reading it, so on each change, store all names in a structure you can read with one get.
If you absolutely cannot, then you have to tweak each case by hand, e.g. use the low-level API to do a batch get.
Also, optimize for speed rather than data size. Use extra structures as indexes; save objects in multiple ways so you can read them as quickly as possible. Data is cheap, CPU time is not.
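For instance (a sketch; the FriendNames kind and its properties are my own names), rewrite one denormalized entity per user whenever the friend list changes, and read it back with a single get:

import java.util.List;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.KeyFactory;

public class FriendNamesCache {
    private final DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    // Rewrite the cached names whenever the friend list changes (rare).
    public void store(String userId, List<String> friendNames) {
        Entity e = new Entity("FriendNames", userId); // keyed by user id
        e.setUnindexedProperty("names", friendNames); // no index needed
        ds.put(e);
    }

    // Reading all 500 names is now a single get by key.
    @SuppressWarnings("unchecked")
    public List<String> load(String userId) throws EntityNotFoundException {
        Entity e = ds.get(KeyFactory.createKey("FriendNames", userId));
        return (List<String>) e.getProperty("names");
    }
}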
Unfortunately Phillipe's suggestion
Query query = pm.newQuery(User.class, ":p.contains(key)");
is only optimized to make a single query when searching by primary key. Passing in a list of ten non-primary-key values, for instance, gives the following trace
(trace screenshot: http://img293.imageshack.us/img293/7227/slowquery.png)
I'd like to be able to bulk-fetch comments, for example, from all of a user's friends. If I do store a List on each user, this list can't be longer than 1000 elements (if it's an indexed property of the user), as described at http://code.google.com/appengine/docs/java/datastore/overview.html.
Seems increasingly like I'm using the wrong toolset here.
-B
Facebook has 28 terabytes of memory cache... However, making 500 trips to memcached isn't very cheap either. It can't be used to store a gazillion small items. "Denormalization" is the key. Such applications do not need to support ad-hoc queries. Compute and store the results directly for the few supported queries.
In your case, you probably have just one type of query: return the data of this, that, and the others that should be displayed on a user page. You can precompute this big ball of mess, so that later a single query based on userId can fetch it all.
When userA makes a comment to userB, you retrieve userB's big ball of mess, insert userA's comment into it, and save it.
Of course, there are a lot of problems with this approach. For giant internet companies, they probably don't have a choice, generic query engines just don't cut it. But for others? Wouldn't you be happier if you can just use the good old RDBMS?
If it is a frequently used query, you can consider preparing indexes for it:
http://code.google.com/appengine/articles/index_building.html
The indexed property limit has since been raised to 5000.
However, you can go even higher than that by using the method described in http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
Basically, just have a bunch of child entities for the User, called UserFriends, thus splitting the big list and raising the limit to n * 5000, where n is the number of UserFriends entities.
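A minimal JDO sketch of such a child entity (field names are mine):

import java.util.List;
import javax.jdo.annotations.IdGeneratorStrategy;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.Persistent;
import javax.jdo.annotations.PrimaryKey;
import com.google.appengine.api.datastore.Key;

// One child entity of User per 5000 friends; fetch them all with an
// ancestor query on the user's key.
@PersistenceCapable
public class UserFriends {
    @PrimaryKey
    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    private Key key; // child key; its parent is the owning User's key

    @Persistent
    private List<Key> friendKeys; // at most 5000 entries per entity
}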