Something I don't understand about querying a DynamoDB table: it seems necessary to include something like .withKeyConditionExpression("itemId = :v_id"), but since the partition key uniquely identifies all items in the table, wouldn't you always get back just one result?
Trying to do something like:
val expression = DynamoDBQueryExpression<PluginItem>()
    .withKeyConditionExpression("itemId > 0")
    .withFilterExpression("attributes.item_modification_date < :val1")
    .withExpressionAttributeValues(eav)
val paginatedResults = queryByExpression(expression)
I'm looking to query and paginate 100,000 items in the table. Can anyone point me in the right direction?
partition key uniquely identifies all items in the table
This isn't accurate; it depends on your table design. You will get a lot more flexibility if you design a table with a Partition Key and a Sort Key. That said, back to your statement: a Primary Key, not a Partition Key, uniquely identifies an item in the table. A Primary Key is the combination of Partition Key + Sort Key (also known as a Range Key).
Think of each partition as a bucket.
withKeyConditionExpression("itemId > 0")
This won't work. You can't do those kinds of comparisons on a partition key; the partition key can only be matched for equality. However, you can use those kinds of conditions on a sort key.
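For illustration only, here is a minimal sketch (in Java, using the same AWS SDK v1 DynamoDBMapper classes the Kotlin snippet above calls) of what a valid key condition looks like on a table that has both a partition key and a sort key. The sort key name createdAt, the example values, and the pre-built mapper (a DynamoDBMapper instance) are assumptions for the sketch, not the asker's actual schema:

import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBQueryExpression;
import com.amazonaws.services.dynamodbv2.datamodeling.PaginatedQueryList;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import java.util.HashMap;
import java.util.Map;

Map<String, AttributeValue> eav = new HashMap<>();
eav.put(":v_id", new AttributeValue().withN("123"));          // partition key: equality only
eav.put(":v_from", new AttributeValue().withN("1514764800")); // sort key: range operators allowed

DynamoDBQueryExpression<PluginItem> expression = new DynamoDBQueryExpression<PluginItem>()
        .withKeyConditionExpression("itemId = :v_id and createdAt > :v_from")
        .withExpressionAttributeValues(eav);

// DynamoDBMapper.query returns a lazily loaded, automatically paginated list.
PaginatedQueryList<PluginItem> results = mapper.query(PluginItem.class, expression);

Note that a Query is always scoped to a single partition key value; to walk every item in the table regardless of key, you would use a Scan (for example DynamoDBMapper.scan) instead, which is also paginated for you.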
There is a video from re:Invent 2018 that helped me get a better understanding of DynamoDB. I have watched that video quite a few times, especially the last 20 to 30 minutes of it.
Hope that helps. I have only been working with DynamoDB for a few months and there is so much more I have to learn.
We have a DynamoDB table whose primary key consists of a Hash key and a Range key:
Hash = date.random_number
Range = timestamp
How can we get items between timestamp X and timestamp Y? Since the hash key has a random number attached, the query has to be fired that many times. Is it possible to supply multiple hash values with a single RangeKeyCondition?
What would be most efficient in terms of cost and time?
Random number range is from 1 to 10.
If I understood correctly, you have a table with the following definition of Primary Keys:
Hash Key : date.random_number
Range Key : timestamp
One thing that you have to keep in mind is that, whether you are using GetItem or Query, you have to be able to calculate the Hash Key in your application in order to successfully retrieve one or more items from your table.
It makes sense to use the random numbers as part of your Hash Key so your records can be evenly distributed across the DynamoDB partitions, however, you have to do it in a way that your application can still calculate those numbers when you need to retrieve the records.
With that in mind, let's create the query needed for the specified requirements. The native AWS DynamoDB operations that you have available to obtain several items from your table are:
Query, BatchGetItem and Scan
In order to use BatchGetItem you would need to know beforehand the entire primary key (Hash Key and Range Key), which is not the case.
The Scan operation will literally go through every record of your table, something that in my opinion is unnecessary for your requirements.
Lastly, the Query operation allows you to retrieve one or more items from a table applying the EQ (equality) operator to the Hash Key and a number of other operators that you can use when you don't have the entire Range Key or would like to match more than one.
The operator options for the Range Key condition are: EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN
It seems to me that the most suitable for your requirements is the BETWEEN operator. That being said, let's see how you could build the query with the chosen SDK:
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.QueryOutcome;
import com.amazonaws.services.dynamodbv2.document.RangeKeyCondition;
import com.amazonaws.services.dynamodbv2.document.Table;

// "dynamoDB" is an initialized com.amazonaws.services.dynamodbv2.document.DynamoDB client.
Table table = dynamoDB.getTable(tableName);

String hashKey = "<YOUR_COMPUTED_HASH_KEY>";     // computed as date + "." + random_number
String timestampX = "<YOUR_TIMESTAMP_X_VALUE>";
String timestampY = "<YOUR_TIMESTAMP_Y_VALUE>";

RangeKeyCondition rangeKeyCondition =
        new RangeKeyCondition("RangeKeyAttributeName").between(timestampX, timestampY);

ItemCollection<QueryOutcome> items = table.query("HashKeyAttributeName", hashKey,
        rangeKeyCondition,
        null,  // FilterExpression - not used in this example
        null,  // ProjectionExpression - not used in this example
        null,  // ExpressionAttributeNames - not used in this example
        null); // ExpressionAttributeValues - not used in this example
You might want to look at the following post to get more information about DynamoDB Primary Keys:
DynamoDB: When to use what PK type?
QUESTION: My concern is having to query multiple times because of the random_number attached to the hash key. Is there a way to combine these queries and hit DynamoDB once?
Your concern is completely understandable; however, the only way to fetch all the records via BatchGetItem is by knowing the entire primary key (HASH + RANGE) of every record you intend to get. Although minimizing the HTTP round trips to the server might seem to be the best solution at first sight, the documentation actually suggests doing exactly what you are doing, to avoid hot partitions and uneven use of your provisioned throughput:
Design For Uniform Data Access Across Items In Your Tables
"Because you are randomizing the hash key, the writes to the table on
each day are spread evenly across all of the hash key values; this
will yield better parallelism and higher overall throughput. [...] To
read all of the items for a given day, you would still need to Query
each of the 2014-07-09.N keys (where N is 1 to 200), and your
application would need to merge all of the results. However, you will
avoid having a single "hot" hash key taking all of the workload."
Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
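To make that concrete for the scheme in this question (random numbers 1 to 10), a rough sketch of the "query each suffix and merge" approach, reusing the same Document API placeholders as the snippet above (table, attribute names, timestampX/timestampY), might look like this; the date value is just an example:

import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.ItemCollection;
import com.amazonaws.services.dynamodbv2.document.QueryOutcome;
import com.amazonaws.services.dynamodbv2.document.RangeKeyCondition;
import java.util.ArrayList;
import java.util.List;

// Query every "date.N" hash key (N = 1..10) over the same timestamp window and merge the results.
String date = "2014-07-09";
RangeKeyCondition between =
        new RangeKeyCondition("RangeKeyAttributeName").between(timestampX, timestampY);

List<Item> merged = new ArrayList<>();
for (int n = 1; n <= 10; n++) {
    String hashKey = date + "." + n;
    ItemCollection<QueryOutcome> items =
            table.query("HashKeyAttributeName", hashKey, between, null, null, null, null);
    for (Item item : items) {
        merged.add(item);
    }
}
// "merged" now holds every item between timestampX and timestampY for that date.

Since the ten queries are independent, they can also be issued in parallel (for example from an ExecutorService) if the extra round trips are a latency concern.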
Here is another interesting point suggesting moderate use of reads in a single partition... if you remove the random number from the hash key so you can get all records in one shot, you are likely to run into this issue, regardless of whether you are using Scan, Query or BatchGetItem:
Guidelines for Query and Scan - Avoid Sudden Bursts of Read Activity
"Note that it is not just the burst of capacity units the Scan uses
that is a problem. It is also because the scan is likely to consume
all of its capacity units from the same partition because the scan
requests read items that are next to each other on the partition. This
means that the request is hitting the same partition, causing all of
its capacity units to be consumed, and throttling other requests to
that partition. If the request to read data had been spread across
multiple partitions, then the operation would not have throttled a
specific partition."
And lastly, because you are working with time series data, it might be helpful to look into some best practices suggested by the documentation as well:
Understand Access Patterns for Time Series Data
For each table that you create, you specify the throughput
requirements. DynamoDB allocates and reserves resources to handle your
throughput requirements with sustained low latency. When you design
your application and tables, you should consider your application's
access pattern to make the most efficient use of your table's
resources.
Suppose you design a table to track customer behavior on your site,
such as URLs that they click. You might design the table with hash and
range type primary key with Customer ID as the hash attribute and
date/time as the range attribute. In this application, customer data
grows indefinitely over time; however, the applications might show
uneven access pattern across all the items in the table where the
latest customer data is more relevant and your application might
access the latest items more frequently and as time passes these items
are less accessed, eventually the older items are rarely accessed. If
this is a known access pattern, you could take it into consideration
when designing your table schema. Instead of storing all items in a
single table, you could use multiple tables to store these items. For
example, you could create tables to store monthly or weekly data. For
the table storing data from the latest month or week, where data
access rate is high, request higher throughput and for tables storing
older data, you could dial down the throughput and save on resources.
You can save on resources by storing "hot" items in one table with
higher throughput settings, and "cold" items in another table with
lower throughput settings. You can remove old items by simply deleting
the tables. You can optionally backup these tables to other storage
options such as Amazon Simple Storage Service (Amazon S3). Deleting an
entire table is significantly more efficient than removing items
one-by-one, which essentially doubles the write throughput as you do
as many delete operations as put operations.
Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
Can anyone help me get the last record of a table in Cassandra? I want to get the last row based on a primary key. I tried using ORDER BY on the key, but it shows an error saying that either IN or EQ is required for using ORDER BY. When I added an IN clause, it still showed an error. Please explain how to solve this with an example.
You can't :)
Data is only "ordered" inside a partition, via the clustering keys. You can do ORDER BY queries on a clustering column, assuming that the partition key and all prior clustering keys have an exact match. In other words, if you have a PK of (a, b, c), then you can do WHERE a='asd' AND b='cds' and then do range queries on c.
You can specify the clustering order, so for the latest row: if you cluster by timestamp descending, then simply selecting the first row will automatically give you the "last" one. For example:
create table timeseries (
    event_type text,
    insertion_time timestamp,
    event blob,
    PRIMARY KEY (event_type, insertion_time)
)
WITH CLUSTERING ORDER BY (insertion_time DESC);
Note: you'll need to specify the partition key (event_type), and the retrieved row will be the "latest" in that partition.
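For completeness, a small hedged sketch of that "select the first row" query using the DataStax Java driver (the 3.x API is assumed; the contact point, keyspace name and the "login" event_type are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("my_keyspace");

// The table clusters by insertion_time DESC, so the first row returned is the latest one.
Row latest = session.execute(
        "SELECT * FROM timeseries WHERE event_type = ? LIMIT 1", "login").one();

if (latest != null) {
    System.out.println(latest.getTimestamp("insertion_time"));
}
cluster.close();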
Think, however, about what it means to be "latest" in a distributed system. Any notion of "latest" is likely to be out of date, depending on your use case. This may or may not be acceptable.
If you're looking for the "latest" row via an arbitrary column that is not a descending clustering key, then I would recommend using Spark to do a fast "map reduce"-like computation.
I have an application in which there are Courses, Topics, and Tags. Each Topic can be in many Courses and have many Tags. I want to look up every Topic that has a specific Tag x and is in a specific Course y.
1. Naively, I give each Topic a list of Course ids and Tag ids, so I can select * from Topic where tagIds = x && courseIds = y. I think this query would require an exploding index: with 30 courses and 30 tags we're looking at ~900 index entries, right? At 50 x 20 I'm well over the 5000-entry limit.
2. I could just select * from Topic where tagIds = x, and then use a for loop to go through the result, choosing only Topics whose courseIds.contain(y). This returns way more results than I'm interested in and spends a lot of time deserializing those results, but the index stays small.
3. I could select __KEY__ from Topic where tagIds = x AND select __KEY__ from Topic where courseIds = y and find the intersection in my application code. If the sets are small this might not be unreasonable.
4. I could make a sort of join table, TopicTagLookup, with a tagId and courseId field. The parent key of these entities would point to the relevant Topic. Then I would need to make one of these TopicTagLookup entities for every combination of courseId x tagId x relevant topic id. This is effectively like creating my own index. It would still explode, but there would be no 5000-entry limit. Now, however, I need to write 5000 entities to the same entity group, which would run up against the entity-group write-rate limit!
5. I could precalculate each query. A TopicTagQueryCache entity would hold a tagId, courseId, and a List<TopicId>. Then the query looks like select * from TopicTagQueryCache where tagId = x && courseId = y, fetching the list of topic ids, and then using a getAllById call on the list. Similar to #3, but I only have one entity per courseId x tagId. There's no need for entity groups, but now I have this potentially huge list to maintain transactionally.
App Engine seems great for queries you can precalculate; I just don't quite see a way to precalculate this query efficiently. The question basically boils down to:
What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?
Your assessment of your options is correct. If you don't need any sort criteria, though, option 3 is more or less already done for you by the App Engine datastore, with the merge join strategy. Simply do a query as you detail in option 1, without any sorts or inequality filters, and App Engine will do a merge join internally in the datastore, and return only the relevant results.
Options 4 and 5 are similar to the relation index pattern documented in this talk.
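To make the merge-join point concrete, here is a rough sketch using the low-level Java Datastore API; the kind and property names come from the question, while tagId and courseId are placeholders for the x and y values:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.CompositeFilterOperator;
import com.google.appengine.api.datastore.Query.FilterOperator;
import java.util.List;

// Two equality filters and no sort order: the datastore can satisfy this with a merge join
// over the built-in single-property indexes, so no exploding composite index is required.
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

Query query = new Query("Topic")
        .setFilter(CompositeFilterOperator.and(
                FilterOperator.EQUAL.of("tagIds", tagId),
                FilterOperator.EQUAL.of("courseIds", courseId)));

List<Entity> topics = datastore.prepare(query).asList(FetchOptions.Builder.withLimit(1000));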
I like #5 - you are essentially creating your own (exploding) index. It will be fast to query.
The only downsides are that you have to manually maintain it (next paragraph), and retrieving the Topic entity will require an extra query (first you query TopicTagQueryCache to get the topic ID and then you need to actually retrieve the topic).
Updating the TopicTagQueryCache you suggested shouldn't be a problem either. I wouldn't worry about doing it transactionally - this "index" will just be stale for a short period of time when you update a Topic (at worst, your Topic will temporarily show up in results it should no longer show up in, and it may take a moment before it shows up in new results which it should show up in - this doesn't seem so bad). You can even do this update on the task queue (to make sure this potentially large number of database writes all succeed, and so that you can quickly finish the request so your user isn't waiting).
As you said yourself, you should arrange your data to facilitate the scaling of your app. So, to the question of "What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?":
You can maintain your own indexes of these sets by creating CourseRef and TopicRef entities that consist of a Key only, with the ID portion being the actual Key of the corresponding entity. These "Ref" entities live under a specific Tag, so there are no actual Key duplicates. So the structure for a given Tag is: Tag\CourseRef...\TopicRef...
This way, given a Tag and a Course, you construct the Key Tag\CourseRef and do an ancestor Query, which gets you a set of keys you can fetch. This is extremely fast since it is effectively direct access, and it should handle large lists of courses or topics without the issues of list properties.
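A rough sketch of that ancestor query with the low-level Datastore API, assuming the Tag, CourseRef and TopicRef kinds are keyed by their corresponding string ids as in the Tag\CourseRef\TopicRef structure above:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Query;
import java.util.List;

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

// Build the ancestor key Tag -> CourseRef for the given tag and course,
// then fetch only the keys of its TopicRef children.
Key tagKey = KeyFactory.createKey("Tag", tagId);
Key courseRefKey = KeyFactory.createKey(tagKey, "CourseRef", courseId);

Query query = new Query("TopicRef").setAncestor(courseRefKey).setKeysOnly();
List<Entity> topicRefs = datastore.prepare(query).asList(FetchOptions.Builder.withDefaults());

The ID portion of each returned TopicRef key is, per the structure above, the key of the corresponding Topic, so a batch get on those keys finishes the lookup.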
This method will require you to use the DataStore API to some extent.
As you can see, this answers a specific question, and the model will do no good for other types of set operations.
I'm doing some data migration after some data model refactoring and I'm taking a couple tables with composite primary keys and combining them into a larger table and giving it its own unique primary key. At this point, I've written some SQL to copy the old table data into a new table and assign a primary key using AUTO_INCREMENT. After the migration is done, I remove the AUTO_INCREMENT from the PK field. So, now that's all gravy, but the problem is that I need the hibernate sequence to know what the next available PK will be. We use the TABLE strategy generally for all of our entities and I'd like to stay consistent and avoid using AUTO_INCREMENT and the IDENTITY strategy for future objects. I've gotten away with temporarily setting the respective row in the generated "hibernate_sequences" table to the max id of the newly created table, but this is just a bandaid fix to the problem. Also, this results in the next IDs created to be much larger than the max id. I'm certain this is because I don't understand the HiLo id-assigning mechanism, which is why I'm posting here. Is there a way to set this up so that the Ids will be sequential? Or, where is the code that generates the HiLo value so that I can calculate what it should be to ensure sequential ids?
If I understood you correctly, the problem is that Hibernate doesn't generate sequential IDs for you. But that is how the hi/lo generator works, and I don't understand exactly why you don't like it.
Basically, the hi/lo generator is based on maintaining HIGH and LOW values separately. When LOW reaches its limit, it is reset and HIGH is incremented. The resulting key is produced by combining the HIGH and LOW values together. E.g., assume the key is a double word and HIGH and LOW are words: HIGH would be the left two bytes and LOW the right two bytes.
Jumps in the IDs depend on two factors: the max value for LOW, and the event that triggers changing the value of HIGH.
By default in Hibernate, the max value for LOW is Short.MAX_VALUE and LOW is reset on each generator initialization. The HIGH value is read from the table and incremented on each initialization; it is also incremented when LOW reaches its upper limit. All this means that on each restart of your application, you will have gaps in the IDs.
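To make those jumps concrete, here is a deliberately simplified model of a hi/lo generator; it is not Hibernate's actual MultipleHiLoPerTableGenerator code, and the static counter merely stands in for the row in the hibernate_sequences table:

// Simplified hi/lo sketch for illustration only; the real generator adds transactions,
// per-segment rows in hibernate_sequences, and different edge-case handling.
class SimpleHiLo {
    private static long tableValue = 0; // stands in for the persisted hibernate_sequences row
    private final int maxLo;            // batch size (Hibernate's default max_lo is Short.MAX_VALUE)
    private long hi;
    private int lo;

    SimpleHiLo(int maxLo) {
        this.maxLo = maxLo;
        this.lo = maxLo;                // force a "hi" fetch on the first call
    }

    long nextId() {
        if (lo >= maxLo) {              // batch exhausted (or a fresh generator after a restart)
            hi = tableValue++;          // read-and-increment; a DB round trip in real life
            lo = 0;
        }
        return hi * maxLo + (lo++) + 1;
    }
}

With maxLo = 50, the first generator instance hands out 1..50; restarting the application (a new instance) before that batch is used up bumps hi and jumps straight to 51, which is exactly the kind of gap described above.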
Looking at the code, it seems that if you use a value < 1 for max_lo, the key will be generated just by incrementing the hi value, which is read from the DB. You would probably like that behaviour :)
Have a look at the source code for org.hibernate.id.MultipleHiLoPerTableGenerator#generate
Using org.hibernate.id.MultipleHiLoPerTableGenerator#generate, I figured out that my batches were of size 50, so for my purposes max id / 50 + 1 produced a usable number to put into the sequence table to keep the IDs as close to sequential as possible.
I need to generate an encoding String for each item I insert into the database. For example:
x00001 for the first item
x00002 for the second item
x00003 for the third item
The way I chose to do this is by counting the rows. Before I insert the third item, I run a count against the database; I know there are already 2 rows, so the next encoding ends with 3.
But there is a problem: if I delete the second item, the fourth item will not be x00004, but x00003.
I could add an additional column to the table to store the next encoding, but I don't know if there are other, better solutions.
Most databases support some sort of auto-incrementing identity field. This field is normally also set up to be unique, so duplicate ids do not occur.
Consult your database documentation to see how it is done in your database and use that - don't reinvent the wheel when you have a good mechanism in place already.
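As a hedged sketch of that advice with MySQL and JDBC: let the table own an AUTO_INCREMENT primary key and read the generated id back to build the encoding string. The table and column names here are made up for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class EncodedInsert {
    // Assumes a table like: CREATE TABLE items (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(100));
    static String insertAndEncode(String jdbcUrl, String user, String password, String name)
            throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO items (name) VALUES (?)", Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, name);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                long id = keys.getLong(1);          // the id the database just assigned
                return String.format("x%05d", id);  // e.g. 3 -> "x00003"
            }
        }
    }
}

Because the counter normally keeps increasing even when rows are deleted, the fourth insert still becomes x00004 rather than x00003.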
What you want is SELECT MAX(id) or SELECT MAX(some_function(id)) inside the transaction.
As suggested in Oded's answer, a lot of databases have their own methods of providing sequences, which are more efficient and, depending on the DBMS, might support non-numeric ids.
Also, you could have the id broken down into Y and 00001 as separate columns, with both columns making up the primary key; then most databases would be able to provide the sequence.
However, this leads to the question of whether your primary key should have a meaning or not; Y suggests that there is some meaning in that part of the key (otherwise you would be content with a plain integer id).