I have an Android app where users will be able to send private messages to each other (for instance: A sends a message to B and C, and the three of them may comment on that message).
I use Google App Engine and the Google Datastore with Java (using the Objectify framework).
I have created a Member entity and a Message entity which contains an ArrayList<String> field representing the list of recipients' IDs (that is to say, the key field of the Member entity).
In order for a user to get all the messages where he is one of the recipients, I was planning on loading each Message entity from the datastore and then selecting them by checking whether the ArrayList<String> field contains the user's ID. However, considering there may be hundreds of thousands of messages stored, I was wondering whether that is even possible and whether it would take too much time?
The time to fetch results from the datastore depends only on the number of entities retrieved, not on the total number of entities stored, because every query MUST use an index. That's exactly what makes the datastore so scalable.
You will have to limit the number of messages retrieved per call and use a Cursor to fetch the next batch. You can send the cursor over to the Android client by converting it to a websafe string, so the client can indicate the starting point for the next request.
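For illustration, a minimal sketch with Objectify (field and variable names are hypothetical), filtering on an indexed recipient-ID list property and paging with a cursor:

// One page of messages for a recipient; webSafeCursor is null on the first call.
Query<Message> query = ofy().load().type(Message.class)
        .filter("recipientIds", userId)  // list property: matches if the list contains userId
        .limit(20);
if (webSafeCursor != null) {
    query = query.startAt(Cursor.fromWebSafeString(webSafeCursor));
}
QueryResultIterator<Message> it = query.iterator();
List<Message> page = new ArrayList<>();
while (it.hasNext()) {
    page.add(it.next());
}
String nextCursor = it.getCursor().toWebSafeString();  // send this back to the client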
I have a data structure that consists of a collection called "Polls". "Polls" has several documents with randomly generated IDs. Within those documents, there is an additional collection called "answers". Users vote on these polls, with the votes all written to the "answers" subcollection. I use the .runTransaction() method on the "answers" node, with the idea that this subcollection (for any given poll) is constantly being updated and written to by users.
I have been reading about social media structure for Firestore. However, I recently came across a new feature for Firestore, the "array_contains" query option.
While the post referenced above discusses a "following" feed for social media structure, I had a different idea in mind. I envision users writing (voting) to my main poll node; therefore, creating another "following" node and also having users write to it to update poll vote counts (using a Cloud Function) seems horribly inefficient, since I would constantly have to copy from the main node, where votes are being counted.
Would the "array_contains" query be another practical option for social media structure scalability? My thought is:
If user A follows user B, write to a direct array child in my "Users" node called "followers."
Before any poll is created by user B, user B's device reads the "followers" array from Firestore to get a list of all followers and populates them client-side in an Array object.
Then, when user B writes a new poll, add that "followers" array to the poll, so each new poll from user B will have an array attached to it containing the IDs of all followers.
What are the limitations on the "array_contains" query? Is it practical to have an array stored in Firebase that contains thousands of users / followers?
Would the "array_contains" query be another practical option for social media structure scalability?
Yes, of course. This is the reason the Firebase creators added this feature.
Seeing your structure, I think you can give it a try. But to respond to your questions:
What are the limitations on the "array_contains" query?
There are no limitations regarding the type of data you store.
Is it practical to have an array stored in Firebase that contains thousands of users / followers?
It's not about being practical or not; it's about limitations of another kind. The problem is that documents have limits, so there are limits on how much data you can put into a document. According to the official documentation regarding usage and limits:
Maximum size for a document: 1 MiB (1,048,576 bytes)
As you can see, you are limited to 1 MiB total of data in a single document. When it comes to storing text, that is quite a lot. So in your case, if you store only IDs, I think there will be no problem. But IMHO, as your array gets bigger, be careful about this limitation.
If you are storing large amounts of data in arrays, and those arrays will be updated by lots of users, there is another limitation you need to take care of: you are limited to one write per second on every document. So if you have a situation in which a lot of users are all trying to write/update data to the same document at once, you might start to see some of these writes fail. Be careful about this limitation too.
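For reference, an "array_contains" feed query on the Android client might look like this sketch (collection and field names follow the question; error handling omitted):

// Fetch every poll whose "followers" array contains the current user's id.
FirebaseFirestore db = FirebaseFirestore.getInstance();
db.collection("polls")
        .whereArrayContains("followers", currentUserId)
        .get()
        .addOnSuccessListener(snapshots -> {
            for (QueryDocumentSnapshot doc : snapshots) {
                // render this poll in the user's feed
            }
        });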
I built a real-time polls system; here is my implementation:
I made a polls collection where each document has a unique identifier, a title, and an array of answers.
Also, each document has a subcollection called answers, where each answer has a title and its total stored as distributed counters in its own shards subcollection.
Example:
polls/
  [pollID]
    - title: 'Some poll'
    - answers: ['yolo' ...]
    answers/
      [answerID]
        - title: 'yolo'
        - num_shards: 2
        shards/
          [1]
            - count: 2
          [2]
            - count: 16
I made another collection called votes, where each document ID is a composite key of userId_pollId, so I can keep track of whether the user has already voted on a poll.
Each document holds the pollId, the userId, the answerId...
When a document is created, I trigger a Cloud Function that grabs the pollId and the answerId and increments a random shard counter in that answer's shards subcollection, using a transaction.
Finally, on the client side, I reduce the count values of each answer's shards to calculate the total.
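The trigger itself runs as a Node.js Cloud Function, but the shard-increment step looks roughly like this sketch, written here with the Firestore Java API (ids and field names follow the structure above):

// Pick a random shard of the answer and increment it in a transaction.
int shardId = new Random().nextInt(numShards) + 1;
DocumentReference shardRef = db.collection("polls").document(pollId)
        .collection("answers").document(answerId)
        .collection("shards").document(String.valueOf(shardId));
db.runTransaction(transaction -> {
    DocumentSnapshot snap = transaction.get(shardRef);
    long count = snap.getLong("count");  // assumes the shard document already exists
    transaction.update(shardRef, "count", count + 1);
    return null;
});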
For the following stuff, you can do the same thing using a middleman collection called "following", where each document ID is a composite key of userAid_userBid, so you can easily track which user is following which other user without breaking Firestore's limits.
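A sketch of that write (field names are hypothetical); because the pair is encoded in the document ID, checking whether A follows B is a single direct lookup:

// Record that userA follows userB.
Map<String, Object> follow = new HashMap<>();
follow.put("follower", userAId);
follow.put("followed", userBId);
db.collection("following").document(userAId + "_" + userBId).set(follow);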
I'm currently developing an application in Java that connects to a MySQL database using JDBC and displays records in a JTable. The application is going to be run by more than one user at a time, and I'm trying to implement a way to see if the table has been modified, e.g. user one modifies a column such as stock level, and then user two, who loaded the same record beforehand, tries to change it based on the old level.
At the moment I'm storing the checksum of the table that's being displayed as a variable, and when a user tries to modify a record, it checks whether the stored checksum is the same as one generated just before the edit.
As I'm new to this, I'm not sure whether this is a correct way to do it; I have no experience in this matter.
Calculating the checksum of an entire table seems like a very heavy-handed solution and definitely something that wouldn't scale in the long term. There are multiple ways of handling this, but the core theme is to do as little work as possible to ensure that you can scale as the number of users increases. Imagine implementing the checksum-based solution on a table with a million rows continuously updated by hundreds of users!
One of the solutions (which requires minimal rework) would be to "check" the stock name against which the value is updated. In the background, you fire off a query to the table to see if the data for that particular stock has been updated after the table was populated. If yes, you can warn the user or mark the updated cell as dirty to indicate that the value has changed. The problem here is that the query won't be fired off until the user tries to save the updated value. You could poll the database to avoid that, but again, that is hardly an efficient solution.
As a more robust solution, I would recommend using a database which implements native "push notifications" to all the connected clients. Redis, a NoSQL store with built-in publish/subscribe, comes to mind for this.
Another tried and tested technique would be to forgo a direct database connection and use a middleware layer like a message queue (e.g. RabbitMQ). Message queues enable the design of systems which communicate using messages. So, for example, every update to a stock value in the JTable would be sent as a message to an "update database" queue. Once the update is done, a message would be sent to an "update notification" queue, to which all clients are connected. This enables all of them to know that the value of a given stock has been updated and to act accordingly. The advantage of this solution is that you keep your existing stack (Java, MySQL) and can implement notifications without polling the DB and killing it.
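Publishing such an update with the RabbitMQ Java client might look like this sketch (queue name and message format are hypothetical):

// Send a stock update as a message to a durable queue.
ConnectionFactory factory = new ConnectionFactory();
factory.setHost("localhost");
try (Connection conn = factory.newConnection();
     Channel channel = conn.createChannel()) {
    channel.queueDeclare("update_database", true, false, false, null);
    String update = stockId + ":" + newLevel;
    channel.basicPublish("", "update_database", null,
            update.getBytes(StandardCharsets.UTF_8));
}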
A checksum is one way to see if data has changed.
Anyway, I would suggest you add a column "last_update_date" that is always updated on every update of the record.
Then you just have to store this date (with datetime precision) and do the check against it.
You can also add a version number column: a simple counter incremented by 1 on each update.
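With a version column, the check-and-update can be a single statement; here is a sketch assuming a hypothetical "stock" table with columns id, level, and version:

// Optimistic locking: the UPDATE only succeeds if nobody changed the row
// since this user read it.
String sql = "UPDATE stock SET level = ?, version = version + 1 "
        + "WHERE id = ? AND version = ?";
try (PreparedStatement ps = connection.prepareStatement(sql)) {
    ps.setInt(1, newLevel);
    ps.setInt(2, stockId);
    ps.setInt(3, versionSeenByThisUser);
    if (ps.executeUpdate() == 0) {
        // Another user updated the record first: reload it and warn the user.
    }
}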
Note:
You can add an ON UPDATE trigger to maintain last_update_date; that should be 100% reliable. You may not need a trigger if you control all the updates yourself.
When used in network communication:
A checksum is a count of the number of bits in a transmission unit that is included with the unit so that the receiver can check whether the same number of bits arrived. If the counts match, it's assumed that the complete transmission was received.
So the same idea can be applied to check whether two objects differ; your approach is correct.
I am looking for a good design pattern for sharding a list in Google App Engine. I have read about and implemented sharded counters as described in the Google Docs here but I am now trying to apply the same principle to a list. Below is my problem and possible solution - please can I get your input?
Problem:
A user on my system could receive many messages, kind of like an online chat system. I'd like the server to record all incoming messages (they will contain several fields: from, to, etc.). However, I know from the docs that updating the same entity group too often can result in an exception caused by datastore contention. This could happen when one user receives many messages in a short time, causing his entity to be written to many times. So what about abstracting out the sharded counter example above:
Define, say, five entities/entity groups.
For each message to be added, pick one entity at random, append the message to it, and write it back to the store.
To get the list of messages, read all entities in and merge...
OK, some questions on the above:
Most importantly, is this the best way to go about things, or is there a more elegant/more efficient design pattern?
What would be an efficient way to filter the list of messages by one of the fields, say everything after a certain date?
What if I require a sharded set instead? Should I read in all entities and check if the new item already exists on every write? Or just add it as above and then remove duplicates whenever the next request comes in to read?
Why would you want to put all messages in one entity group?
If you don't specify an ancestor, you won't need sharding, but the end user might see some lag when querying the messages, due to eventual consistency.
It depends on whether that is an acceptable tradeoff.
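A minimal sketch of the no-ancestor approach with Objectify (entity and field names are hypothetical):

// Each Message is a root entity (no @Parent), so concurrent writes to
// different messages never contend on a single entity group.
Message msg = new Message(fromUser, toUser, text, new Date());
ofy().save().entity(msg).now();

// Without an ancestor, this query is eventually consistent:
List<Message> recent = ofy().load().type(Message.class)
        .filter("to", toUser)
        .order("-date")
        .limit(50)
        .list();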
I am moving my application from a relational DB to Objectify / Google App Engine.
The application has a relationship which is modelled as follows:
One Message can be sent to many Users. Each User can have many Messages addressed to them.
I need to be able to scan for all Messages addressed to a particular User.
How do I do this with Objectify?
There are a number of ways to do it.
You can save a list of messages in the user object. This will work nicely with your requirement to get all messages addressed to a user, as there is no need to do a query.
You can save a list of users in the message object. To get all the messages addressed to a single user, do a query.
You can save BOTH lists above. Remember, in App Engine there is usually no need to normalize or worry about disk space and duplicates. Almost always build your structure so that queries will be fast.
You can forget about lists and have Relationship objects, just like a table in a relational database. This can still be a decent option in App Engine in some use cases, for example when the lists are just too big (thousands of items), which will bloat your objects and may not even be query-able.
The most important variable determining which approach to take, in relation to the query you specified, is how many messages will usually be addressed to a single user, and whether there will be a maximum number of messages. If we are talking about an average of dozens or fewer, and a maximum of hundreds, a list of messages in the user object sounds to me like a good option. If we are talking about more, and especially if unlimited, it won't work so well, and you will need to make an actual query.
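A sketch of option 2 with Objectify (names are hypothetical): the Message carries an indexed list of recipient keys, and one query fetches a user's inbox.

@Entity
public class Message {
    @Id Long id;
    @Index List<Key<User>> recipients;  // indexed so it can be filtered on
    String text;
}

// All messages addressed to one user:
List<Message> inbox = ofy().load().type(Message.class)
        .filter("recipients", userKey)
        .list();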
Beyond the answers already posted I would suggest that you not include a link from User to the Message, for three reasons:
Collections in GAE are hard limited to 5000 items. As soon as your user's inbox exceeds 5k items your app will start throwing exceptions.
There is a performance cost to expanding the quantity of data in an entity; loading a bunch of 500KB entities is slower than loading a bunch of 5KB entities. Plus, your usage of memcache will be less effective, since you can fit fewer items in the same space. User objects tend to get loaded a lot.
You can easily hit the transaction rate limit for a single entity (1/s). If 50 people send you a message at the same time, you will have massive concurrency problems as all 50 writes retry after optimistic-concurrency failures.
If you can live with a limit of 5000 recipients for a single message, storing the Set of destination keys in the Message (and indexing this set so you can query for all messages of a user) is probably a great solution. There is almost certainly an advantage also to assigning the message a @Parent of the sender.
If you are twitter-like and expect a message to have more than 5k recipients, or if your messages typically have a lot of recipients (thus the message entity is bloated), you may wish to consider the Relation Index Entity pattern that Brett Slatkin talked about in his Google I/O talk from 2009: https://www.youtube.com/watch?v=AgaL6NGpkB8
You have to maintain the relationship on your own. This is because, depending on the application, it may make sense to let users exist without messages, or even the opposite.
The approach suggested by the Objectify wiki (https://code.google.com/p/objectify-appengine/wiki/IntroductionToObjectify, Multi-Value Relationship) is to keep a collection (or array) of keys:
@Entity
public class Message {
    @Id String timeStamp;
    Key<User>[] destination;
}

@Entity
public class User {
    @Id String name;
    Key<Message>[] inbox;
}
Then, if you want to remove all of a user's messages when the user is removed, just remove them from the datastore before removing the user. It is exactly the same if you want to add a new message for a particular user.
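For example, a minimal sketch of the removal, using the entities above:

// Delete the user's messages first, then the user itself.
ofy().delete().keys(user.inbox).now();
ofy().delete().entity(user).now();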
I have created an app using GAE. I am expecting 100k requests daily. At present, for each request the app needs to look up 4 tables and 8 different columns before performing the needed task.
These 4 tables are my master tables, with 5k, 500, 200, and 30 records respectively. Together they are under 1 MB (the limit).
Now I want to put my master records in memcache for faster access and to reduce RPC calls. When any user updates a master table, I'll replace the memcache object.
I need the community's suggestions on this.
Is it OK to change the current design?
How can I put 4 master table data in memcache?
Here is how the application works currently:
Hundreds of users access the same application page.
They provide a unique identification token and 3 more parameters (let's say p1, p2, and p3).
My servlet receives the request.
The application fetches the user table by token and checks the enabled state.
The application fetches another table (say, department) and checks for p1's existence. If it exists, it checks the enabled status.
If the above returns true, a service table is queried based on parameter p2 to check whether this service is enabled for this user, and the service end date is checked.
Based on p3's length, another table is checked for availability.
You shouldn't be thinking in terms of inserting tables into memcache. Instead, use an 'optimistic cache' strategy: any time you need to perform an operation that you want to cache, first attempt to look it up in memcache; if that fails, fetch it from the datastore, then store it in memcache. Here's an example:
from google.appengine.api import memcache
from google.appengine.ext import db

def cached_get(key):
    # Try the cache first; on a miss, fall back to the datastore
    # and populate the cache for next time.
    entity = memcache.get(str(key))
    if not entity:
        entity = db.get(key)
        memcache.set(str(key), entity)
    return entity
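Since your app is a Java servlet, the equivalent sketch on the Java runtime uses the low-level MemcacheService (fetchFromDatastore is a hypothetical helper standing in for your actual lookup):

MemcacheService cache = MemcacheServiceFactory.getMemcacheService();
Object record = cache.get(cacheKey);
if (record == null) {
    record = fetchFromDatastore(cacheKey);  // hypothetical datastore lookup
    cache.put(cacheKey, record);            // populate the cache for next time
}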
Note, though, that caching individual entities is fairly low return - the datastore is fairly fast at doing fetches. Caching query results or rendered pages will give a much better improvement in speed.