I have the following use case:
In DynamoDB I want to hold a list of user events sorted in descending order, such that I see the latest events at the top. However, I am only interested in the latest 1000 events.
At the moment I have a table with the userId as the Hash key, and the timestamp of the user events as range key.
Is there any efficient way to keep the number of items in the range for a given userId to a maximum of 1000, with the latest events first?
I am using the Java low-level API, if that matters.
Your table schema looks right for this. You can query the table by userId and use the option
ScanIndexForward => False
This sorts the results in descending order on the range key (the timestamp),
and you can use the option
Limit => 1000
This returns only the latest 1000 events.
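Since you mention the Java low-level API, here is a minimal sketch (the table name, attribute names, and user id are placeholders; the classes come from com.amazonaws.services.dynamodbv2 and its model package):

AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

// Query the newest events first (descending on the timestamp range key) and stop after 1000.
QueryRequest request = new QueryRequest()
        .withTableName("UserEvents")                  // placeholder table name
        .withKeyConditionExpression("userId = :uid")
        .withExpressionAttributeValues(
                Collections.singletonMap(":uid", new AttributeValue().withS("user-123")))
        .withScanIndexForward(false)                  // descending order on the range key
        .withLimit(1000);                             // at most 1000 items

QueryResult result = client.query(request);
List<Map<String, AttributeValue>> latestEvents = result.getItems();

Note that Limit caps the items evaluated per request; if you ever paginate with LastEvaluatedKey, you would need to stop once you have collected 1000 items yourself.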
Hope that helps
Something I don't understand about querying a DynamoDB table: it seems necessary to include something like .withKeyConditionExpression("itemId = :v_id"), but since the partition key uniquely identifies all items in the table, wouldn't you always get just one result?
Trying to do something like:
val expression = DynamoDBQueryExpression<PluginItem>()
.withKeyConditionExpression("itemId > 0")
.withFilterExpression("attributes.item_modification_date < :val1")
.withExpressionAttributeValues(eav)
val paginatedResults = queryByExpression(expression)
I'm looking to query and paginate 100,000 items in the table, can anyone point me in the right direction?
partition key uniquely identifies all items in the table
so this isn't accurate. It depends on your table design. However, you will get a lot more flexibility if you design a table with a Partition Key and a Sort Key. That said, back to your statement: a Primary Key, not a Partition Key, uniquely identifies an item in the table. A Primary Key is the combination of Partition Key + Sort Key (also known as Range Key).
Think of each partition as a bucket.
withKeyConditionExpression("itemId > 0")
this won't work. You can't do those kinds of operations on a partition key; a key condition can only test the partition key for equality. However, you can do those kinds of conditions on a sort key.
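As a sketch of what a valid key condition looks like with the Java low-level API, assuming a hypothetical table where itemId is the partition key and modificationDate (a numeric timestamp) is the sort key:

AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

Map<String, AttributeValue> values = new HashMap<>();
values.put(":v_id", new AttributeValue().withS("plugin-123"));  // placeholder partition key value
values.put(":val1", new AttributeValue().withN("1609459200"));  // placeholder timestamp

QueryRequest request = new QueryRequest()
        .withTableName("PluginItems")                           // placeholder table name
        // EQ on the partition key is mandatory; range operators are only allowed on the sort key
        .withKeyConditionExpression("itemId = :v_id AND modificationDate < :val1")
        .withExpressionAttributeValues(values);
QueryResult result = client.query(request);

(If the goal is really to walk every item in the table, that is what Scan with pagination is for, rather than Query.)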
There is a video from re:Invent 2018 that helped me get a better understanding of DynamoDB. I have watched that video quite a few times, especially the last 20 to 30 minutes of it.
Hope that helps. I have only been working with dynamodb for a few months and there is so much more I have to learn.
We have a DynamoDB table whose primary key consists of a Hash key and a Range key:
Hash = date.random_number
Range = timestamp
How do we get items between timestamps X and Y? Since the hash key has a random number appended to it, the query has to be fired once for each possible hash value. Is it possible to supply multiple hash values with a single RangeKeyCondition?
What would be most efficient in terms of cost and time?
The random number ranges from 1 to 10.
If I understood correctly, you have a table with the following definition of Primary Keys:
Hash Key : date.random_number
Range Key : timestamp
One thing that you have to keep in mind is that, whether you are using GetItem or Query, you have to be able to calculate the Hash Key in your application in order to successfully retrieve one or more items from your table.
It makes sense to use the random numbers as part of your Hash Key so your records can be evenly distributed across the DynamoDB partitions, however, you have to do it in a way that your application can still calculate those numbers when you need to retrieve the records.
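For illustration, reconstructing every possible Hash Key for a given day might look like this (the date.random_number format and the 1-10 suffix range come from the question):

// Rebuild all hash keys for one day, e.g. "2014-07-09.1" ... "2014-07-09.10".
public static List<String> hashKeysFor(String date) {
    List<String> keys = new ArrayList<>();
    for (int n = 1; n <= 10; n++) {
        keys.add(date + "." + n);
    }
    return keys;
}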
With that in mind, let's create the query needed for the specified requirements. The native AWS DynamoDB operations that you have available to obtain several items from your table are:
Query, BatchGetItem and Scan
In order to use BatchGetItem you would need to know beforehand the entire primary key (Hash Key and Range Key), which is not the case.
The Scan operation will literally go through every record of your table, something that in my opinion is unnecessary for your requirements.
Lastly, the Query operation allows you to retrieve one or more items from a table by applying the EQ (equality) operator to the Hash Key, together with a number of other operators that you can use on the Range Key when you don't have its entire value or would like to match more than one item.
The operator options for the Range Key condition are: EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN
It seems to me that the most suitable operator for your requirements is BETWEEN. That said, let's see how you could build the query with the Java SDK:
// Document API classes live in com.amazonaws.services.dynamodbv2.document
AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
DynamoDB dynamoDB = new DynamoDB(client);
Table table = dynamoDB.getTable(tableName);

String hashKey = "<YOUR_COMPUTED_HASH_KEY>";
String timestampX = "<YOUR_TIMESTAMP_X_VALUE>";
String timestampY = "<YOUR_TIMESTAMP_Y_VALUE>";

RangeKeyCondition rangeKeyCondition =
        new RangeKeyCondition("RangeKeyAttributeName").between(timestampX, timestampY);

ItemCollection<QueryOutcome> items = table.query("HashKeyAttributeName", hashKey,
        rangeKeyCondition,
        null,  // FilterExpression - not used in this example
        null,  // ProjectionExpression - not used in this example
        null,  // ExpressionAttributeNames - not used in this example
        null); // ExpressionAttributeValues - not used in this example
You might want to look at the following post to get more information about DynamoDB Primary Keys:
DynamoDB: When to use what PK type?
QUESTION: My concern is querying multiple times because of the random_number attached to the key. Is there a way to combine these queries and hit DynamoDB once?
Your concern is completely understandable; however, the only way to fetch all the records via BatchGetItem is by knowing the entire primary key (HASH + RANGE) of every record you intend to get. Although minimizing the HTTP round trips to the server might seem to be the best solution at first sight, the documentation actually suggests doing exactly what you are doing to avoid hot partitions and uneven use of your provisioned throughput:
Design For Uniform Data Access Across Items In Your Tables
"Because you are randomizing the hash key, the writes to the table on
each day are spread evenly across all of the hash key values; this
will yield better parallelism and higher overall throughput. [...] To
read all of the items for a given day, you would still need to Query
each of the 2014-07-09.N keys (where N is 1 to 200), and your
application would need to merge all of the results. However, you will
avoid having a single "hot" hash key taking all of the workload."
Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
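Putting it together, the "query each suffix and merge" pattern the documentation describes could be sketched like this, reusing the table and rangeKeyCondition objects from the snippet above and the hashKeysFor helper sketched earlier (the date value is a placeholder):

// Query every date.N hash key for the day and merge the results in the application.
List<Item> allItems = new ArrayList<>();
for (String hashKey : hashKeysFor("2014-07-09")) {
    ItemCollection<QueryOutcome> page = table.query("HashKeyAttributeName", hashKey,
            rangeKeyCondition, null, null, null, null);
    for (Item item : page) {
        allItems.add(item);
    }
}
// Sort the merged list by timestamp here if a single global ordering is needed.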
Here is another interesting point suggesting moderate use of reads on a single partition... if you remove the random number from the hash key in order to get all records in one shot, you are likely to run into this issue, regardless of whether you use Scan, Query or BatchGetItem:
Guidelines for Query and Scan - Avoid Sudden Bursts of Read Activity
"Note that it is not just the burst of capacity units the Scan uses
that is a problem. It is also because the scan is likely to consume
all of its capacity units from the same partition because the scan
requests read items that are next to each other on the partition. This
means that the request is hitting the same partition, causing all of
its capacity units to be consumed, and throttling other requests to
that partition. If the request to read data had been spread across
multiple partitions, then the operation would not have throttled a
specific partition."
And lastly, because you are working with time series data, it might be helpful to look into some best practices suggested by the documentation as well:
Understand Access Patterns for Time Series Data
For each table that you create, you specify the throughput
requirements. DynamoDB allocates and reserves resources to handle your
throughput requirements with sustained low latency. When you design
your application and tables, you should consider your application's
access pattern to make the most efficient use of your table's
resources.
Suppose you design a table to track customer behavior on your site,
such as URLs that they click. You might design the table with hash and
range type primary key with Customer ID as the hash attribute and
date/time as the range attribute. In this application, customer data
grows indefinitely over time; however, the applications might show
uneven access pattern across all the items in the table where the
latest customer data is more relevant and your application might
access the latest items more frequently and as time passes these items
are less accessed, eventually the older items are rarely accessed. If
this is a known access pattern, you could take it into consideration
when designing your table schema. Instead of storing all items in a
single table, you could use multiple tables to store these items. For
example, you could create tables to store monthly or weekly data. For
the table storing data from the latest month or week, where data
access rate is high, request higher throughput and for tables storing
older data, you could dial down the throughput and save on resources.
You can save on resources by storing "hot" items in one table with
higher throughput settings, and "cold" items in another table with
lower throughput settings. You can remove old items by simply deleting
the tables. You can optionally backup these tables to other storage
options such as Amazon Simple Storage Service (Amazon S3). Deleting an
entire table is significantly more efficient than removing items
one-by-one, which essentially doubles the write throughput as you do
as many delete operations as put operations.
Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
Can anyone help me get the last record of a table in Cassandra? I want to get the last row based on a primary key. I tried using ORDER BY on the key, but it shows an error saying either IN or EQ is required for using ORDER BY. When I added an IN clause, it showed an error as well. Please explain how to solve this with an example.
You can't :)
Data is only "ordered" inside of a partition via the clustering keys. You can do ORDER BY queries on a clustering column assuming that the partition key and all prior clustering keys have an exact match. In other words, if you have a PK of (a, b, c), then you can do where a='asd' and b='cds' and then do range queries on c.
You can specify a clustering order on a partition, so for the latest row, if you cluster on timestamp DESC, then simply selecting the first row will automatically give you the "last" one. For example:
create table timeseries (
event_type text,
insertion_time timestamp,
event blob,
PRIMARY KEY (event_type, insertion_time)
)
WITH CLUSTERING ORDER BY (insertion_time DESC);
Note, you'll need to specify the partition key (event_type), and the retrieved row will be the "latest" in that partition.
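For example, with the DataStax Java driver (contact point, keyspace name, and event_type value are placeholders), LIMIT 1 returns the newest row because of the DESC clustering order:

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("my_keyspace");   // placeholder keyspace

// Rows in the partition are stored newest-first, so the first row is the latest one.
Row latest = session.execute(
        "SELECT * FROM timeseries WHERE event_type = ? LIMIT 1", "click").one();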
Think however, what it means to be "latest" in a distributed system. Any notion of "latest" is likely to be out of date depending on your use case. This may or may not be acceptable.
If you're looking for "latest" via an arbitrary column that is not a desc clustering key, then I would recommend using Spark to do a fast "map reduce" like computation.
I'm trying to store emails for a newsletter mailing app in Cassandra.
The current schema is:
CREATE TABLE emails (
email varchar,
comment varchar,
PRIMARY KEY (email));
I don't know how to get the emails ordered by the time they were added (so that emails can be processed in parallel on different nodes).
PlayOrm on Cassandra can do that sort of thing under the covers for you, as long as you are able to partition your data so you can still scale; you can then query into your partitions. The ORDER BY is not there yet, but a trick is to use where time > 0 to get everything after the 1970 epoch, which forces it to use the time index, and then traverse the cursor backwards for reverse order (or forwards for sorted order).
Cassandra orders on write based on your column comparator. You can't order results using any arbitrary column in your predicate. If you want to retrieve in time order, you must insert with your timestamp as your column name (or the first element in a composite name). You can also create a second CF that would store time-ordered records that you can query if needed. Unfortunately CQL gives the illusion of RDBMS-like query capability, when in reality it's still a column store with the associated query capabilities. My suggestion is to either avoid CQL (and use Thrift-based queries instead) or make sure you understand what it's doing under the covers.
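Whether you go through CQL or Thrift, the second, time-ordered column family mentioned above could be shaped roughly like this (sketched here as CQL issued from the DataStax Java driver's Session; the bucket column is a hypothetical addition to keep any single partition from holding every email):

// Hypothetical companion table that keeps emails in insertion-time order per bucket.
// A timeuuid avoids collisions when two emails are added at the same millisecond.
session.execute(
        "CREATE TABLE emails_by_time (" +
        "  bucket text, added_at timeuuid, email varchar," +
        "  PRIMARY KEY (bucket, added_at)" +
        ") WITH CLUSTERING ORDER BY (added_at ASC)");

// now() generates the time-based id server-side; reads come back time-ordered per bucket.
session.execute(
        "INSERT INTO emails_by_time (bucket, email, added_at) VALUES (?, ?, now())",
        "2013-01-15", "user@example.com");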
I'm using the new experimental taskqueue for java appengine and I'm trying to create tasks that aggregate statistics in my datastore. I'm trying to count the number of UNIQUE values within all the entitities (of a certain type) in my datastore. More concretely, say entity of type X has a field A. I want to count the NUMBER of unique values of A in my datastore.
My current approach is to create a task which queries for the first 10 entities of type X, creating a hashtable to store the unique values of A in, then passing this hashtable to the next task as the payload. This next task will count the next 10 entities and so on and so forth until I've gone through all the entities. During the execution of the last task, I'll count the number of keys in my hashtable (that's been passed from task to task all along) to find the total number of unique values of A.
This works for a small number of entities in my datastore, but I'm worried that the hashtable will get too big once I have a lot of unique values. What is the maximum allowable size for the payload of an App Engine task?
Can you suggest any alternative approaches?
Thanks.
According to the docs, the maximum task object size is 100K.
"Can you suggest any alternative approaches?".
Create an entity for each unique value, by constructing a key based on the value and using Model.get_or_insert. Then Query.count up the entities in batches of 1000 (or however many you can count before your request times out - more than 10), using the normal paging tricks.
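The get_or_insert mentioned here is from the Python API; since the question is about Java, a rough equivalent with the low-level Datastore API might look like this (the "UniqueA" kind name is a placeholder; datastore comes from DatastoreServiceFactory.getDatastoreService()):

// Insert a marker entity keyed by the value itself, unless one already exists.
public static void recordUniqueValue(DatastoreService datastore, String value) {
    Key key = KeyFactory.createKey("UniqueA", value);   // key derived from the value itself
    Transaction txn = datastore.beginTransaction();
    try {
        datastore.get(txn, key);               // already recorded: nothing to do
    } catch (EntityNotFoundException e) {
        datastore.put(txn, new Entity(key));   // first time this value is seen
    }
    txn.commit();
}

The count can then be taken with a keys-only query over the UniqueA kind in batches, as described above.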
Or use code similar to that given in the docs for get_or_insert to keep count as you go - App Engine transactions can be run more than once, so a memcached count incremented in the transaction would be unreliable. There may be some trick around that, though, or you could keep the count in the datastore provided that you aren't doing anything too unpleasant with entity parents.
This may be too late, but perhaps it can be of use. First, any time there is even a remote chance of wanting to walk serially through a set of entities, I suggest using an indexed date_created or date_modified auto-update field. From there you can create a model with a TextProperty to store your hash table using json.dumps(). All you need to do is pass along the last date processed and the model id of the hash-table entity. Do a query with date_created later than the last date, json.loads() the TextProperty, and accumulate the next 10 records. You could get a bit more sophisticated (e.g. handle date_created collisions by using the parameters passed and a slightly different query approach). Add a 1-second countdown to the next task to avoid any issues with updating the hash-table entity too quickly. HTH, -stevep
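A fragment of that resumable walk in the Java low-level Datastore API might look like this (the kind and property names are placeholders; the JSON-in-a-TextProperty handling and the task chaining are left out):

// Fetch the next 10 entities created after the timestamp the previous task handed over.
public static List<Entity> nextBatch(DatastoreService datastore, Date lastProcessed) {
    Query q = new Query("X")                                 // placeholder kind
            .setFilter(new Query.FilterPredicate(
                    "date_created", Query.FilterOperator.GREATER_THAN, lastProcessed))
            .addSort("date_created");                        // walk forward in creation order
    return datastore.prepare(q).asList(FetchOptions.Builder.withLimit(10));
}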