Cassandra query to get the last row of a table - java

Can anyone help me get the last record of a table in Cassandra? I want to get the last row based on the primary key. I tried using ORDER BY on the key, but it fails with an error saying that either IN or EQ is required to use ORDER BY. When I added an IN clause, it also showed an error. Please explain how to solve this, with an example.

You can't :)
Data is only "ordered" inside a partition, via the clustering keys. You can run ORDER BY queries on a clustering column, provided that the partition key and all prior clustering keys have an exact match. In other words, if you have a PK of (a,b,c) then you can do where a='asd' and b='cds' and then do range queries on c.
You can specify the clustering order of a partition, so for the latest row, if you cluster on a timestamp in descending order, then simply selecting the first row automatically gives you the "last" one. For example:
create table timeseries (
event_type text,
insertion_time timestamp,
event blob,
PRIMARY KEY (event_type, insertion_time)
)
WITH CLUSTERING ORDER BY (insertion_time DESC);
Note that you'll need to specify the partition key (event_type), and the retrieved row will be the "latest" in that partition.
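As a rough illustration of why "first row" means "latest row" here, the following Java sketch models the table in memory. It is not a Cassandra driver call: the class name TimeseriesModel, the LATEST_CQL constant and the use of a String in place of the blob column are assumptions made for the example.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// In-memory model of the timeseries table (hypothetical, not a driver API).
class TimeseriesModel {
    // CQL one would run against the real table; the partition key is mandatory.
    static final String LATEST_CQL =
        "SELECT * FROM timeseries WHERE event_type = ? LIMIT 1";

    // partition key -> rows sorted by clustering key, newest first (DESC)
    private final Map<String, TreeMap<Instant, String>> partitions = new HashMap<>();

    void insert(String eventType, Instant insertionTime, String event) {
        partitions
            .computeIfAbsent(eventType,
                k -> new TreeMap<Instant, String>(Comparator.reverseOrder()))
            .put(insertionTime, event);
    }

    // Models LIMIT 1 on a DESC-clustered partition: the first row is the latest.
    Optional<String> latest(String eventType) {
        TreeMap<Instant, String> rows = partitions.get(eventType);
        if (rows == null || rows.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(rows.firstEntry().getValue());
    }
}
```

The lookup only works per partition, which mirrors the restriction above: there is no cheap "latest row of the whole table".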
Think, however, about what it means to be "latest" in a distributed system. Any notion of "latest" is likely to be out of date by the time you read it. Depending on your use case, this may or may not be acceptable.
If you're looking for the "latest" row by an arbitrary column that is not a DESC clustering key, then I would recommend using Spark to do a fast map-reduce-like computation.

Related

DynamoDB query & partition keys, simple question

Something I don't understand about querying a DynamoDB table is that it seems necessary to include something like .withKeyConditionExpression("itemId = :v_id"), but since the partition key uniquely identifies all items in the table, wouldn't you always be searching for just one result?
Trying to do something like:
val expression = DynamoDBQueryExpression<PluginItem>()
.withKeyConditionExpression("itemId > 0")
.withFilterExpression("attributes.item_modification_date < :val1")
.withExpressionAttributeValues(eav)
val paginatedResults = queryByExpression(expression)
I'm looking to query and paginate 100,000 items in the table, can anyone point me in the right direction?
partition key uniquely identifies all items in the table
So this isn't accurate; it depends on your table design. You will get a lot more flexibility if you design a table with a partition key and a sort key. That said, back to your statement: a primary key, not a partition key, uniquely identifies an item in the table. In such a table the primary key is the combination of partition key + sort key (also known as range key).
Think of each partition as a bucket.
withKeyConditionExpression("itemId > 0")
This won't work. You can't use that kind of comparison on a partition key; a key condition only accepts an equality test there. However, you can use those kinds of conditions on a sort key.
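To make the bucket picture concrete, here is a small in-memory Java sketch of a table with a composite primary key. It is not the AWS SDK: the class PartitionedTableModel, its methods and the choice of a numeric sort key (for example a modification timestamp) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory model of partition key + sort key (hypothetical, not the AWS SDK).
class PartitionedTableModel {
    // partition key -> (sort key -> item); the pair identifies exactly one item
    private final Map<String, TreeMap<Long, String>> buckets = new HashMap<>();

    void put(String partitionKey, long sortKey, String item) {
        buckets.computeIfAbsent(partitionKey, k -> new TreeMap<Long, String>())
               .put(sortKey, item);
    }

    // Models a key condition of the form: partitionKey = :pk AND sortKey < :upper.
    // The partition key picks one bucket; only the sort key takes a range.
    List<String> queryBefore(String partitionKey, long upperExclusive) {
        TreeMap<Long, String> bucket = buckets.get(partitionKey);
        if (bucket == null) {
            return Collections.emptyList();
        }
        return new ArrayList<>(bucket.headMap(upperExclusive, false).values());
    }
}
```

One partition key can therefore return many items, which is why a query on it is not limited to a single result.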
A video from re:Invent 2018 helped me get a better understanding of DynamoDB. I have watched that video quite a few times, especially the last 20 to 30 minutes of it.
Hope that helps. I have only been working with DynamoDB for a few months and there is so much more I have to learn.

Update primary keys without creating duplicate rows?

I'm working on a Java project which needs to be able to alter all the primary keys in a table - and in most cases - some other values in the row as well.
The problem I have is that, if I update a row by selecting by its old primary key (SET pk=new_pk WHERE pk=old_pk) I get duplicate rows (since the old PK value may be equal to another row's new PK value and both rows are then updated).
I figured that in Oracle and some other DBs I might be able to do this with ROWNUM or something similar, but the system should work with most DB systems and right now we can't get this to work for MySQL.
I should add that I don't have access to change the schema of the DB - so, I can't add a column.
What I have tried:
Updating ResultSets directly with RS.updateRow(). This seems to work, but is very slow.
Hashing the PKs in the table, storing the hash in code and selecting on the hashed PK. This acts as a sort of signature, since a hashed PK indicates that the row has been read but not yet updated, and I can avoid the appropriate rows that way. The issue with this seems to have been hash collisions, as I was getting duplicate PKs.
PS:
I realise this sounds like either a terrible idea, or terrible design, but we really have no choice. The system I'm working on aims to anonymize private data and that may entail changing PKs in some tables. Don't fear, we do account for FK references.
In this case you can use a simple update with a delta equal to the maximum PK of the table being updated.
First select the delta:
select max(pk) as delta from table
and then use it in the update:
update table SET pk=pk+delta+1
Before this operation you need to disable constraints. And don't forget that you should also update foreign keys.
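The reason the shift cannot produce duplicates can be checked with a few lines of Java. This is only a sketch of the arithmetic, not database code: the class KeyShift is hypothetical, and it assumes numeric, non-negative keys that do not overflow.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper mirroring: UPDATE table SET pk = pk + delta + 1.
class KeyShift {
    // Assumes non-negative numeric keys and no long overflow.
    static List<Long> shiftAll(List<Long> oldKeys) {
        List<Long> shifted = new ArrayList<>();
        if (oldKeys.isEmpty()) {
            return shifted;
        }
        long delta = Collections.max(oldKeys); // select max(pk) as delta
        for (long pk : oldKeys) {
            // Every new key is greater than delta, so it cannot equal an old key.
            shifted.add(pk + delta + 1);
        }
        return shifted;
    }
}
```

For old keys 1, 2 and 5 the delta is 5, so the new keys are 7, 8 and 11, all above the old maximum.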

Most efficient way to determine if a row EXISTS and INSERT into MySQL using java JDBC

I need to query a table in a MySQL database that could hold anywhere from very few rows to upwards of hundreds of millions. I have the primary key, which is made up of two parts, a name and a number (compared as strings). For efficiency, I'm not exactly sure how costly an INSERT query actually is, but I have a few options for how to go about it:
I could query the database to see if the element EXISTS and then call an INSERT query if it doesn't.
I could try to brute force INSERT into the database and if it succeeds or fails, so be it.
I could, at program start, build a cache: grab the primary key columns and store them in a Map<String, List<Integer>>. Then I look up the name in the map; if it exists, I check whether the number is in its List<Integer>, and if the combination is missing, I run the INSERT query.
Option one really isn't on the table for what I would really implement, just on the list of possible choices. Option two would most likely average better for unique occurrences such that it isn't in the table already. Option three would favour if common occurrences are the case such that a lot are in the cache.
Bear in mind that the chosen option will be run potentially millions of times. Memory usage (for option 3) is not a concern: by my calculations it is insignificant relative to the capacity available.
Let the database do the work.
You should do the second method. If you don't want to get a failure, you can use on duplicate key update:
insert into t(pk1, pk2, . . . )
values ( . . . )
on duplicate key update pk1 = values(pk1);
Here the only purpose of on duplicate key update is to avoid returning an error; the assignment itself changes nothing.
Why is this the best solution? First, in a database, a primary key (or a set of columns declared unique) has an index structure. This is efficient for the database to use.
Second, this requires only one round-trip to the database.
Third, there are no race conditions, if you have multiple threads or applications that might be attempting to insert the same record(s).
Fourth, the method with on duplicate key update will work for inserting multiple rows at once. (Without on duplicate key update, a multi-value statement would fail if a single row is duplicated.) Combining multiple inserts into a single statement can be another big efficiency gain.
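For the JDBC side, a multi-row statement of this shape could be assembled as in the sketch below. The class UpsertSql and its build method are hypothetical names; the table and column names are assumed to be trusted identifiers, and the values are left as ? placeholders to be bound through a PreparedStatement.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper that builds a multi-row MySQL insert which ignores duplicates.
class UpsertSql {
    // Table and column names must be trusted identifiers; values are bound via '?'.
    static String build(String table, List<String> columns, int rowCount) {
        if (columns.isEmpty() || rowCount < 1) {
            throw new IllegalArgumentException("need at least one column and one row");
        }
        // One "(?, ?, ...)" group per row, with one placeholder per column.
        String row = "(" + String.join(", ", Collections.nCopies(columns.size(), "?")) + ")";
        String rows = String.join(", ", Collections.nCopies(rowCount, row));
        String first = columns.get(0);
        // The no-op assignment only suppresses the duplicate-key error.
        return "INSERT INTO " + table + " (" + String.join(", ", columns) + ")"
            + " VALUES " + rows
            + " ON DUPLICATE KEY UPDATE " + first + " = VALUES(" + first + ")";
    }
}
```

The resulting string would be passed to Connection.prepareStatement, the parameters bound row by row, and the whole batch sent with a single executeUpdate, which keeps it to one round-trip.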
Your second option is really the right way to go.
Rather than fetching all your results as in the third option, you could query with LIMIT 1. Since the combination of your name and number forms the primary key, the lookup returns at most one row; if the result is empty, you can insert your data. It would be a lot faster that way.
MySQL has a neat way to perform a special insertion. INSERT ON DUPLICATE KEY UPDATE is a MySQL extension to the INSERT statement. If you specify the ON DUPLICATE KEY UPDATE option in the INSERT statement and the new row causes a duplicate value in a UNIQUE or PRIMARY KEY index, MySQL performs an update of the old row based on the new values:
INSERT INTO table(column_list)
VALUES(value_list)
ON DUPLICATE KEY UPDATE column_1 = new_value_1, column_2 = new_value_2;

Cassandra select rows ordered by added date

I'm trying to store emails for a newsletter mailing app in Cassandra.
The current schema is:
CREATE TABLE emails (
email varchar,
comment varchar,
PRIMARY KEY (email));
I don't know how to get the emails ordered by the time they were added (so that emails can be processed in parallel on different nodes).
PlayOrm on Cassandra can do that sort of thing under the covers for you, as long as you are able to partition your data so you can still scale. You can query into your partitions. ORDER BY is not there yet, but a trick is to use where time > 0 instead, to get everything after the 1970 epoch. This forces it to use the time index, and you then just traverse the cursor backwards for reverse order (or forwards for sorted order).
Cassandra orders on write based on your column comparator. You can't order results using any arbitrary column in your predicate. If you want to retrieve in time order, you must insert with your timestamp as your column name (or the first element in a composite name). You can also create a second CF that would store time-ordered records that you can query if needed. Unfortunately CQL gives the illusion of RDBMS-like query capability, when in reality it's still a column store with the associated query capabilities. My suggestion is to either avoid CQL (and use Thrift-based queries instead) or make sure you understand what it's doing under the covers.

Generate encoding String according to creation order

I need to generate an encoding string for each item I insert into the database, for example:
x00001 for the first item
x00002 for the second item
x00003 for the third item
The way I chose to do this is by counting the rows. Before I insert the third item, I count the rows in the database; I know there are already 2 rows, so the next encoding ends with 3.
But there is a problem. If I delete the second item, the fourth item will not be x00004, but x00003.
I could add an additional column to the table to store the next encoding, but I don't know if there are better solutions.
Most databases support some sort of auto-incrementing identity field. This field is normally also set up to be unique, so duplicate ids do not occur.
Consult your database documentation to see how it is done in your database and use that - don't reinvent the wheel when you have a good mechanism in place already.
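Assuming the database hands back such an auto-incremented integer id, the display code from the question can be derived from it rather than from a row count. A minimal Java sketch, where the class name ItemCode and the five-digit padding are taken from the examples in the question rather than from any library:

```java
import java.util.Locale;

// Hypothetical helper: turns a database-generated id into a code such as x00001.
class ItemCode {
    // Pads the id to at least five digits; longer ids are kept in full.
    static String encode(long id) {
        if (id < 1) {
            throw new IllegalArgumentException("id must be positive");
        }
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "x%05d", id);
    }
}
```

Because the id is never reused after a delete, the fourth inserted item still gets x00004 even if the second one is gone.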
What you want is SELECT MAX(id) or SELECT MAX(some_function(id)) inside the transaction.
As suggested in Oded's answer, a lot of databases have their own methods of providing sequences, which are more efficient and, depending on the DBMS, might support non-numeric ids.
You could also have the id broken down into Y and 00001 as separate columns, with both columns making up the primary key; then most databases would be able to provide the sequence.
However, this leads to the question of whether your primary key should have a meaning or not; Y suggests that there is some meaning in part of the key (otherwise you would be content with a plain integer id).
