I have a data structure that consists of a collection called "Polls." "Polls" has several documents with randomly generated IDs. Within each of those documents there is a subcollection called "answers." Users vote on these polls, with the votes all written to the "answers" subcollection. I use the .runTransaction() method on the "answers" node, since this subcollection (for any given poll) is constantly being updated and written to by users.
I have been reading about social media structure for Firestore. However, I recently came across a new feature for Firestore, the "array_contains" query option.
While the post referenced above discusses a "following" feed for social media structure, I had a different idea in mind. I envision users writing (voting) to my main poll node. Creating another "following" node, and also having users write to that node to update poll vote counts (using a cloud function), seems horribly inefficient, since I would constantly be copying from the main node where the votes are counted.
Would the "array_contains" query be another practical option for social media structure scalability? My thought is:
If user A follows user B, write to a direct array child in my "Users" node called "followers."
Before any poll is created by user B, user B's device reads the "followers" array from Firestore to get a list of all users following, and populates it on the client side in an Array object.
Then, when user B writes a new poll, add that "followers" array to the poll, so each new poll from user B will have an array attached to it that contains the IDs of all the users following.
What are the limitations on the "array_contains" query? Is it practical to have an array stored in Firebase that contains thousands of users / followers?
Would the "array_contains" query be another practical option for social media structure scalability?
Yes, of course. This is the reason why the Firebase creators added this feature.
Seeing your structure, I think you can give it a try, but to respond to your questions:
What are the limitations on the "array_contains" query?
There are no limitations regarding the type of data you store.
Is it practical to have an array stored in Firebase that contains thousands of users / followers?
It is not about being practical or not; it is about other types of limitations. The problem is that documents have limits, so there is a limit to how much data you can put into a single document. According to the official documentation regarding usage and limits:
Maximum size for a document: 1 MiB (1,048,576 bytes)
As you can see, you are limited to 1 MiB total of data in a single document. When we are talking about storing text, you can store quite a lot. So in your case, if you store only IDs, I think that will be no problem. But IMHO, as your array gets bigger, be careful about this limitation.
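To get a feel for the 1 MiB limit, here is a rough back-of-the-envelope sketch. It assumes 28-character ASCII IDs and that each string in the array costs its UTF-8 length plus one byte; both numbers are assumptions for illustration, and field names and other document overhead are only modelled through the hypothetical `overhead_bytes` parameter.

```python
# Rough estimate of how many follower IDs fit in one document.
# Assumptions (not from the original post): IDs are fixed-length ASCII
# strings, and each string costs its byte length plus 1 byte of storage.

MAX_DOCUMENT_BYTES = 1_048_576  # 1 MiB, the limit quoted above


def max_ids_per_document(id_length: int, overhead_bytes: int = 0) -> int:
    """Return how many IDs of the given length fit under the document limit."""
    bytes_per_id = id_length + 1  # assumed per-string storage cost
    return (MAX_DOCUMENT_BYTES - overhead_bytes) // bytes_per_id


if __name__ == "__main__":
    # With 28-character IDs and no other fields in the document.
    print(max_ids_per_document(28))
```

Under these assumptions the array tops out in the tens of thousands of IDs, so a followers array works for modest audiences but not for accounts with millions of followers.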
If you are storing a large amount of data in arrays and those arrays are updated by lots of users, there is another limitation to take care of: you are limited to roughly 1 sustained write per second on each document. So if a lot of users are all trying to write/update data in the same document at once, you might start to see some of these writes fail. Be careful about this limitation too.
I did a real-time polls system, here is my implementation:
I made a polls collection where each document has a unique identifier, a title and an array of answers.
Also, each document has a subcollection called answers where each answer has a title and the total of distributed counters in their own shards subcollection.
Example :
polls/
  [pollID]
    - title: 'Some poll'
    - answers: ['yolo' ...]
    answers/
      [answerID]
        - title: 'yolo'
        - num_shards: 2
        shards/
          [1]
            - count: 2
          [2]
            - count: 16
I made another collection called votes, where each document ID is a composite key of userId_pollId, so I can keep track of whether the user has already voted on a poll.
Each document holds the pollId, the userId, the answerId...
When a document is created, I trigger a Cloud Function that grabs the pollId and the answerId and increments a random shard counter in this answerId's shards subcollection, using a transaction.
Finally, on the client side, I reduce the count values of all the shards of each answer of a poll to calculate the totals.
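The shard logic described above can be sketched in memory as follows. This is only an illustration of the idea: `ShardedCounter` is a hypothetical stand-in for one answer's shards subcollection, and it contains no Firestore calls or transactions.

```python
import random
from typing import List, Optional


class ShardedCounter:
    """In-memory stand-in for one answer's `shards` subcollection."""

    def __init__(self, num_shards: int) -> None:
        self.shards: List[int] = [0] * num_shards

    def increment(self, rng: Optional[random.Random] = None) -> None:
        # Pick a random shard so that concurrent writers rarely hit
        # the same shard document.
        chooser = rng or random
        index = chooser.randrange(len(self.shards))
        self.shards[index] += 1

    def total(self) -> int:
        # The client-side "reduce": sum the count of every shard.
        return sum(self.shards)


if __name__ == "__main__":
    counter = ShardedCounter(num_shards=2)
    for _ in range(18):
        counter.increment()
    print(counter.shards, counter.total())
```

More shards spread the writes more thinly, at the cost of reading more documents to compute the total.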
For the following stuff, you can do the same thing using a middle-man collection called "following", where each document is a composite key of userAid_userBid so you can track easily which user is following another user without breaking firestore's limits.
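A minimal sketch of the composite-key idea, using a plain Python set as a stand-in for the "following" collection. The function names are hypothetical, and the sketch assumes user IDs never contain an underscore; otherwise two different pairs could produce the same key.

```python
from typing import Set


def follow_key(follower_id: str, followed_id: str) -> str:
    """Build the composite document ID, e.g. 'userA_userB'."""
    return f"{follower_id}_{followed_id}"


def follow(store: Set[str], follower_id: str, followed_id: str) -> None:
    # Adding the same pair twice is harmless: the key already exists.
    store.add(follow_key(follower_id, followed_id))


def is_following(store: Set[str], follower_id: str, followed_id: str) -> bool:
    # A single key lookup answers the question; no array scan is needed.
    return follow_key(follower_id, followed_id) in store


if __name__ == "__main__":
    following: Set[str] = set()
    follow(following, "userA", "userB")
    print(is_following(following, "userA", "userB"))
```

The same pattern applies to the votes collection, where the key is userId_pollId.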
I have some persistent data in an RDBMS and in CSV files (they are independent objects, but I mention it because they live in different mediums). I cannot rely on what the RDBMS provides; in fact, I do not want to make a trip to the database for the next hour, even if the data gets old. I need to store the data in memory for performance and query the objects (read-only, no other operations) based on multiple columns, refreshing the data every hour.
In my case, what is a good way to store and query in-memory objects other than implementing my own object store and query methods? For instance, can you provide an example or link that replaces an SQL query such as
select * from employees where emplid like '%input%' or surname like '%input%' or email like '%input%';
Sorry for the dummy query but it explains what kind of queries are possible.
Go find yourself a key store implementation with the features you want. Use your query string as the key and the result as the value. Caffeine (https://github.com/ben-manes/caffeine) has quite a few features, including record timeouts (such as an hour).
For my own work, I use an LRU key store (limited to X entries) containing objects with the timeout information, and I manually decide whether a record is stale before I use it. An LRU is basically a linked list that moves "read" records to the head of the list and drops the tail when records are added beyond the maximum desired size. This keeps the popular records in the store longer.
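As a rough illustration of the LRU-with-timeout idea (not Caffeine's API), here is a minimal sketch. `TimedLRUCache` and its parameters are hypothetical names; the most recently used entry is kept at the end of an `OrderedDict` and the least recently used one is dropped from the front.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class TimedLRUCache:
    """LRU cache whose entries also expire after ttl_seconds."""

    def __init__(self, max_entries: int, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: "OrderedDict[str, tuple]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            # Stale record: drop it so the caller reloads from the source.
            del self._entries[key]
            return None
        # A read marks the record as most recently used.
        self._entries.move_to_end(key)
        return value

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        if len(self._entries) > self._max:
            # Evict the least recently used record.
            self._entries.popitem(last=False)
```

A caller would use the query string as the key, run the real query on a miss, and store the result with `put`.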
I am trying to reduce the datastore cost by using Projection. I have read that a Projection Query costs only 1 Read Operation but in my case the Projection cost goes more than 1. Here is the code:
Query<Finders> q = ofy().load().type(Finders.class).project("Password", "Country");
for (Finders finder : q) {
    resp.getWriter().println(finder.getCountry() + " " + finder.getPassword());
}
On executing this, the q object contains 6 items and to retrieve these 6 items it takes 6 Read operations as shown in Appstats.
Can anyone tell me what's wrong over here ?
To read all items (with a single read operation if they will all fit) call .list() on the query, to get a List<Finders>. You chose to iterate on the query instead, and that's quite likely to not rely on a single, possibly huge read from the datastore, but parcel things out more.
Where projections enter the picture is quite different: if you have entities with many fields, or some fields that are very large, and in a certain case you know you need only a certain subset of fields (esp. if it's one not requiring "some fields that are very large"), then a projection is a very wise idea because it avoids reading stuff you don't need.
That makes it more likely that a certain fetch of (e.g) 10 entities will take a single datastore read -- there are byte limits on how much can come from a single datastore read, so, if by carefully picking and choosing the fields you actually require, you're reading only (say) 10k per entity, rather than (say) 500k per entity, then clearly you may well need fewer reads from the datastore.
But if you don't do one big massive read with .list(), but an entity-by-entity read by iteration, then most likely you'll still get multiple reads -- essentially, by iterating, you've said you want that!-)
I am looking for a good design pattern for sharding a list in Google App Engine. I have read about and implemented sharded counters as described in the Google Docs here but I am now trying to apply the same principle to a list. Below is my problem and possible solution - please can I get your input?
Problem:
A user on my system could receive many messages, much like in an online chat system. I'd like the server to record all incoming messages (they will contain several fields: from, to, etc.). However, I know from the docs that updating the same entity group often can result in an exception caused by datastore contention. This could happen when one user receives many messages in a short time, causing his entity to be written to many times. So what about generalizing the sharded counter example above:
Define say five entities/entity groups
for each message to be added, pick one entity at random and append the message to it writing it back to the store,
To get list of messages, read all entities in and merge...
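The three steps above can be sketched in memory like this. Plain Python lists stand in for the five shard entities, and the `sent_at` field used for ordering is an assumed message field, not something the original post specifies.

```python
import random
from typing import Dict, List, Optional

NUM_SHARDS = 5  # "say five entities", as in the steps above


def new_shards(count: int = NUM_SHARDS) -> List[List[Dict]]:
    """Create the empty shard lists (one per entity)."""
    return [[] for _ in range(count)]


def add_message(shards: List[List[Dict]], message: Dict,
                rng: Optional[random.Random] = None) -> None:
    # Append to one shard picked at random to spread the writes.
    (rng or random).choice(shards).append(message)


def read_messages(shards: List[List[Dict]]) -> List[Dict]:
    # Read every shard and merge, ordering by the assumed sent_at field.
    merged = [message for shard in shards for message in shard]
    return sorted(merged, key=lambda message: message["sent_at"])


if __name__ == "__main__":
    shards = new_shards()
    for i in range(10):
        add_message(shards, {"from": "bob", "to": "alice", "sent_at": i})
    print([m["sent_at"] for m in read_messages(shards)])
```

Reads cost one fetch per shard, so the shard count trades write throughput against read cost.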
Ok some questions on the above:
Most importantly, is this the best way to go about things or is there a more elegant/more efficient design pattern?
What would be an efficient way to filter the list of messages by one of the fields, say everything after a certain date?
What if I require a sharded set instead? Should I read in all entities and check if the new item already exists on every write? Or just add it as above and then remove duplicates whenever the next request comes in to read?
Why would you want to put all messages in one entity group?
If you don't specify an ancestor, you won't need sharding, but the end user might see some lag when querying the messages due to eventual consistency.
Depends if that is an acceptable tradeoff.
What is an index in Elasticsearch? Does one application have multiple indexes or just one?
Let's say you built a system for some car manufacturer. It deals with people, cars, spare parts, etc. Do you have one index named manufacturer, or do you have one index for people, one for cars and a third for spare parts? Could someone explain?
Good question, and the answer is a lot more nuanced than one might expect. You can use indices for several different purposes.
Indices for Relations
The easiest and most familiar layout clones what you would expect from a relational database. You can (very roughly) think of an index like a database.
MySQL => Databases => Tables => Rows/Columns
ElasticSearch => Indices => Types => Documents with Properties
An ElasticSearch cluster can contain multiple Indices (databases), which in turn contain multiple Types (tables). These types hold multiple Documents (rows), and each document has Properties (columns).
So in your car manufacturing scenario, you may have a subarufactory index (Elasticsearch requires index names to be lowercase). Within this index, you have three different types:
People
Cars
Spare_Parts
Each type then contains documents that correspond to that type (e.g. a Subaru Impreza doc lives inside of the Cars type. This doc contains all the details about that particular car).
Searching and querying takes the format of: http://localhost:9200/[index]/[type]/[operation]
So to retrieve the Subaru document, I may do this:
$ curl -XGET localhost:9200/subarufactory/Cars/SubaruImpreza
Indices for Logging
Now, the reality is that Indices/Types are much more flexible than the Database/Table abstractions we are used to in RDBMs. They can be considered convenient data organization mechanisms, with added performance benefits depending on how you set up your data.
To demonstrate a radically different approach, a lot of people use ElasticSearch for logging. A standard format is to assign a new index for each day. Your list of indices may look like this:
logs-2013-02-22
logs-2013-02-21
logs-2013-02-20
ElasticSearch allows you to query multiple indices at the same time, so it isn't a problem to do:
$ curl -G 'localhost:9200/logs-2013-02-22,logs-2013-02-21/Errors/_search' --data-urlencode 'q="Error Message"'
This searches the logs from the last two days at the same time. This format has advantages due to the nature of logs: most logs are never looked at, and they are organized in a linear flow of time. Making an index per day is more logical and offers better performance for searching.
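As a small illustration of the index-per-day layout, the following sketch builds the comma-separated index list for the last N days. The helper names are hypothetical, and the `logs-` prefix and `Errors` type are simply taken from the example above; no Elasticsearch client is involved.

```python
from datetime import date, timedelta
from typing import List


def daily_indices(last_day: date, days: int, prefix: str = "logs-") -> List[str]:
    """Return index names for `days` consecutive days, newest first."""
    return [
        f"{prefix}{(last_day - timedelta(days=offset)).isoformat()}"
        for offset in range(days)
    ]


def search_path(indices: List[str], doc_type: str) -> str:
    """Build the multi-index search path: /idx1,idx2/type/_search."""
    return "/" + ",".join(indices) + "/" + doc_type + "/_search"


if __name__ == "__main__":
    names = daily_indices(date(2013, 2, 22), days=2)
    print(search_path(names, "Errors"))
```

Because old days live in their own indices, dropping stale logs becomes a matter of deleting whole indices rather than individual documents.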
Indices for Users
Another radically different approach is to create an index per user. Imagine you have some social networking site, and each user has a large amount of random data. You can create a single index for each user. Your structure may look like:
Zach's Index
Hobbies Type
Friends Type
Pictures Type
Fred's Index
Hobbies Type
Friends Type
Pictures Type
Notice how this setup could easily be done in a traditional RDBMS fashion (e.g. a "Users" index, with hobbies/friends/pictures as types). All users would then be thrown into a single, giant index.
Instead, it sometimes makes sense to split data apart for data organization and performance reasons. In this scenario, we are assuming each user has a lot of data, and we want them separate. ElasticSearch has no problem letting us create an index per user.
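@Zach's answer is valid for Elasticsearch 5.x and below. Types are deprecated from Elasticsearch 6.x onward: an index created in 6.x can hold only a single type, specifying types in requests is deprecated in 7.x, and types are removed entirely in 8.x. Quoting the Elasticsearch docs: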
Initially, we spoke about an “index” being similar to a “database” in an SQL database, and a “type” being equivalent to a “table”.
This was a bad analogy that led to incorrect assumptions.
To explain further: two columns with the same name in two different SQL tables can be independent of each other. In an Elasticsearch index that is not possible, since fields with the same name are backed by the same Lucene field. Thus an "index" in Elasticsearch is not quite the same as a "database" in SQL. If two types in one index define a field with the same name but different data types, the mappings conflict. To avoid this, the Elasticsearch documentation recommends one index per document type.
Refer: Removal of mapping types
An index is a data structure for storing the mapping of fields to the corresponding documents. The objective is to allow faster searches, often at the expense of increased memory usage and preprocessing time.
The number of indexes you create is a design decision that you should make according to your application requirements. You can have an index for each business concept... You can have an index for each month of the year...
You should invest some time getting acquainted with lucene and elasticsearch concepts.
Take a look at the introductory video and at this one with some data design patterns.
The answer above is very detailed. In short, an index can be defined as:
Index: It is a collection of different type of documents and document properties. Index also uses the concept of shards to improve the performance. For example, a set of document contains data of a social networking application.
Answer from tutorialspoint.com
Since an index is a collection of different types of documents, the answer to the question depends on how you want to categorize your data.
Do you have one index named manufacturer?
Yes, we will keep one document with the manufacturer data.
do you have one index for people, one for cars and a third for spare parts? Could someone explain?
Think of, for instance, a car from the same manufacturer being driven by many people on the road. There could be many indices, depending on the number of uses.
If we think about it carefully, we will find that all of the questions except the first are invalid.
Elasticsearch documents are very different from SQL rows, CSV files, or spreadsheets: from a single index, using its powerful query language, you can create millions of differently categorised document sets in a CSV-like style.
Because it is blazingly fast and fully indexed, we create only one index per customer, and from that we create many types of documents as needed.
For example:
All old people using the same model, or one old person using all models.
The permutations are infinite.
I have an application in which there are Courses, Topics, and Tags. Each Topic can be in many Courses and have many Tags. I want to look up every Topic that has a specific Tag x and is in specific Course y.
1. Naively, I give each Topic a list of Course ids and Tag ids, so I can select * from Topic where tagIds = x && courseIds = y. I think this query would require an exploding index: with 30 courses and 30 tags we're looking at ~900 index entries, right? At 50 x 20 that is already 1,000, and larger lists would quickly exceed the 5000-entry limit.
2. I could just select * from Topic where tagIds = x, and then use a for loop to go through the result, choosing only Topics whose courseIds.contain(y). This returns way more results than I'm interested in and spends a lot of time deserializing those results, but the index stays small.
3. I could select __KEY__ from Topic where tagIds = x AND select __KEY__ from Topic where courseIds = y and find the intersection in my application code. If the sets are small this might not be unreasonable.
4. I could make a sort of join table, TopicTagLookup, with a tagId and courseId field. The parent key of these entities would point to the relevant Topic. Then I would need to make one of these TopicTagLookup entities for every combination of courseId x tagId x relevant topic id. This is effectively like creating my own index. It would still explode, but there would be no 5000-entry limit. Now, however, I need to write 5000 entities to the same entity group, which would run up against the entity-group write-rate limit!
5. I could precalculate each query. A TopicTagQueryCache entity would hold a tagId, courseId, and a List<TopicId>. Then the query looks like select * from TopicTagQueryCache where tagId=x && courseId = y, fetching the list of topic ids, and then using a getAllById call on the list. Similar to #4, but I only have one entity per courseId x tagId. There's no need for entity groups, but now I have this potentially huge list to maintain transactionally.
Appengine seems great for queries you can precalculate. I just don't quite see a way to precalculate this query efficiently. The question basically boils down to:
What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?
Your assessment of your options is correct. If you don't need any sort criteria, though, option 3 is more or less already done for you by the App Engine datastore, with the merge join strategy. Simply do a query as you detail in option 1, without any sorts or inequality filters, and App Engine will do a merge join internally in the datastore, and return only the relevant results.
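To illustrate the general idea of a merge join, here is a minimal sketch that intersects two sorted key lists, such as the keys matching tagIds = x and the keys matching courseIds = y. It is an illustration of the technique only, not the datastore's actual implementation, and the function name is hypothetical.

```python
from typing import List


def merge_join(keys_a: List[int], keys_b: List[int]) -> List[int]:
    """Intersect two ascending key lists in a single pass."""
    result: List[int] = []
    i = j = 0
    while i < len(keys_a) and j < len(keys_b):
        if keys_a[i] == keys_b[j]:
            # Key appears in both index scans: it satisfies both filters.
            result.append(keys_a[i])
            i += 1
            j += 1
        elif keys_a[i] < keys_b[j]:
            i += 1  # advance the scan that is behind
        else:
            j += 1
    return result


if __name__ == "__main__":
    topics_with_tag = [1, 4, 7, 9, 12]
    topics_in_course = [2, 4, 9, 10, 12, 15]
    print(merge_join(topics_with_tag, topics_in_course))
```

Only the per-property indexes are needed for this, which is why no exploding composite index is required when there are no sort orders or inequality filters.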
Options 4 and 5 are similar to the relation index pattern documented in this talk.
I like #5 - you are essentially creating your own (exploding) index. It will be fast to query.
The only downsides are that you have to manually maintain it (next paragraph), and retrieving the Topic entity will require an extra query (first you query TopicTagQueryCache to get the topic ID and then you need to actually retrieve the topic).
Updating the TopicTagQueryCache you suggested shouldn't be a problem either. I wouldn't worry about doing it transactionally: this "index" will just be stale for a short period of time when you update a Topic (at worst, your Topic will temporarily show up in results it should no longer show up in, and perhaps take a moment before it shows up in new results which it should show up in; this doesn't seem so bad). You can even do this update on the task queue (to make sure this potentially large number of database writes all succeed, and so that you can quickly finish the request so your user isn't waiting).
As you said yourself, you should arrange your data to facilitate the scaling of your app. So, on the question of "What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?":
You can hold your own indexes of these sets by creating objects of CourseRef and TopicRef which consist of Key only, with the ID portion being an actual Key of the corresponding entity. These "Ref" entities will be under a specific tag, thus no actual Key duplicates. So the structure for a given Tag is : Tag\CourseRef...\TopicRef...
This way given a Tag and Course, you construct the Key Tag\CourseRef and do an ancestor Query which gets you a set of keys you can fetch. This is extremely fast as it is actually a direct access, and this should handle large lists of courses or topics without the issues of List properties.
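The key layout can be pictured with a small in-memory sketch, in which each key is a path of (kind, id) pairs and an "ancestor query" is a prefix match on that path. The helper names are hypothetical and no Datastore API is used; the linear scan below is only for illustration, whereas a real ancestor query would use the key index.

```python
from typing import List, Set, Tuple

KeyPath = Tuple[Tuple[str, str], ...]


def topic_ref_key(tag: str, course: str, topic: str) -> KeyPath:
    """Build the key path Tag\\CourseRef\\TopicRef for one membership."""
    return (("Tag", tag), ("CourseRef", course), ("TopicRef", topic))


def topics_for(store: Set[KeyPath], tag: str, course: str) -> List[str]:
    # Ancestor query stand-in: keep keys whose first two path
    # elements match Tag\CourseRef, then read the topic id.
    ancestor = (("Tag", tag), ("CourseRef", course))
    return sorted(path[2][1] for path in store if path[:2] == ancestor)


if __name__ == "__main__":
    store: Set[KeyPath] = {
        topic_ref_key("algebra", "math101", "topicA"),
        topic_ref_key("algebra", "math101", "topicB"),
        topic_ref_key("algebra", "math202", "topicC"),
    }
    print(topics_for(store, "algebra", "math101"))
```

Since the entities consist of keys only, a keys-only query is enough to get the matching topic ids.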
This method will require you to use the DataStore API to some extent.
As you can see, this answers one specific question, and the model will not help with other types of set operations.