Cassandra select rows ordered by added date - java

I'm trying to store emails for a newsletter mailing app in Cassandra.
The current schema is:
CREATE TABLE emails (
    email varchar,
    comment varchar,
    PRIMARY KEY (email));
I don't know how to get emails ordered by the time they were added (so that emails can be processed in parallel on different nodes).

PlayOrm on Cassandra can do that sort of thing under the covers for you, as long as you are able to partition your data so you can still scale. You can query into your partitions. ORDER BY is not there yet, but a trick is to use WHERE time > 0 to get everything after the 1970 epoch, which forces it to use the time index, and then just traverse the cursor backwards for reverse order (or forwards for sorted order).

Cassandra orders on write based on your column comparator. You can't order results using any arbitrary column in your predicate. If you want to retrieve in time order, you must insert with your timestamp as your column name (or the first element in a composite name). You can also create a second CF that would store time-ordered records that you can query if needed. Unfortunately CQL gives the illusion of RDBMS-like query capability, when in reality it's still a column store with the associated query capabilities. My suggestion is to either avoid CQL (and use Thrift-based queries instead) or make sure you understand what it's doing under the covers.
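In CQL terms, that means maintaining a separate query table clustered by the added time. Below is a minimal sketch using the DataStax Java driver, assuming Cassandra 2.0+; the keyspace, the month-wide bucket scheme, and all names are illustrative, and you would split buckets further (per day or per shard) to spread load:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class EmailsByTime {
    public static void main(String[] args) {
        // Contact point and keyspace are placeholders.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("newsletter");

        // Query table: the bucket is the partition key; added_time is the first
        // clustering column, so rows are stored (and returned) in time order.
        session.execute(
            "CREATE TABLE IF NOT EXISTS emails_by_time (" +
            "  bucket varchar," +
            "  added_time timestamp," +
            "  email varchar," +
            "  comment varchar," +
            "  PRIMARY KEY (bucket, added_time, email))");

        session.execute(
            "INSERT INTO emails_by_time (bucket, added_time, email, comment) " +
            "VALUES ('2013-01', dateof(now()), 'a@example.com', 'subscribed')");

        // Rows within a bucket come back ordered by added_time.
        for (Row row : session.execute(
                "SELECT * FROM emails_by_time WHERE bucket = '2013-01'")) {
            System.out.println(row.getString("email"));
        }
        cluster.close();
    }
}
Each web node can then consume a different bucket, which gives you the parallel processing across nodes that the question asks about.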

Related

Fetch all Columns from Dynamo db based on GSI index and range key

I have only two attributes projected at the DynamoDB GSI level, but to build the expected response I need to get other columns from DynamoDB as well.
Let's say there are 20 columns in my table and only two are mentioned in the global secondary index. How can I achieve this using the GSI while loading data from the master table?
Do I need to use query requests, or is the other approach I'm thinking of better: pull data from the index and then look the items up in the primary table?
This is my existing code:
public List<DynamoDBObject> getData(String gsiHashKey) {
    DynamoDBObject dynamoDBObject = new DynamoDBObject();
    // Set the GSI hash key on the object used as the query's key values.
    dynamoDBObject.setGsiHashKey(gsiHashKey);
    final DynamoDBQueryExpression<DynamoDBObject> queryExpression =
            new DynamoDBQueryExpression<>();
    queryExpression.setIndexName("gsi_index_name");
    queryExpression.setHashKeyValues(dynamoDBObject);
    queryExpression.setConsistentRead(false); // GSI queries must be eventually consistent
    return mapper.query(DynamoDBObject.class, queryExpression);
}
Please suggest the best way to achieve this.
As you noted, with a GSI, if you chose not to project all of the base table's columns onto the index table, then those other columns are not available when querying the index. The reason for this isn't laziness on the part of the implementors; it's efficiency: with a GSI, the index table is distributed throughout the DynamoDB cluster differently from the base table, so when reading index table data there is no efficient way to also read from the base table at the same time. By the way, this is exactly where LSI differs from GSI: with an LSI the index and base tables are co-located and can be read together, so DynamoDB does give you a way to request unprojected columns as well.
So I think you are left with two options. One is to use the BatchGetItem request to read the base-table data after having read the index data. Note that when you query the index, you can always get back the base table's key attributes, so you can use those to read the complete items from the base table. BatchGetItem is probably the most efficient way to do those reads, instead of retrieving items one by one with GetItem.
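For the first option, building on the mapper code from the question, the two-step read might look like the sketch below; mapper, the index name, the setter, and the base table name are assumptions carried over from the snippet above:
public List<Object> getFullData(String gsiHashKey) {
    DynamoDBObject keyObject = new DynamoDBObject();
    keyObject.setGsiHashKey(gsiHashKey);

    DynamoDBQueryExpression<DynamoDBObject> queryExpression =
            new DynamoDBQueryExpression<DynamoDBObject>()
                    .withIndexName("gsi_index_name")
                    .withHashKeyValues(keyObject)
                    .withConsistentRead(false); // required for GSI queries

    // Step 1: query the index; the returned items carry at least the
    // base table's key attributes, even with a KEYS_ONLY projection.
    List<DynamoDBObject> indexItems = mapper.query(DynamoDBObject.class, queryExpression);

    // Step 2: batch-read the complete items from the base table using those keys.
    // DynamoDBMapper.batchLoad issues BatchGetItem requests under the hood and
    // returns the results keyed by table name.
    Map<String, List<Object>> itemsByTable = mapper.batchLoad(indexItems);
    return itemsByTable.get("your_table_name"); // hypothetical base table name
}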
The second option is, of course, to project more base attributes, or even all of them, onto the index table. This will increase your storage and possibly your read and write costs, so whether you want to do this depends on your application. In some specific cases it even makes sense to have two indexes on the same attribute, with different amounts of projected attributes.

Most efficient way to determine if a row EXISTS and INSERT into MySQL using java JDBC

I'm looking at querying a table in a MySQL database (I have the primary key, which is composed of two parts, a name and a number, compared as strings). The table could have anywhere from very few rows to upwards of hundreds of millions. Now, for efficiency, I'm not exactly sure how costly an INSERT query actually is, but I have a few options for going about it:
I could query the database to see if the element EXISTS and then call an INSERT query if it doesn't.
I could try to brute force INSERT into the database and if it succeeds or fails, so be it.
On program startup, I could build a cache: grab the primary key columns, store them in a Map<String, List<Integer>>, check whether the name exists as a key and, if so, whether the number is already in its List<Integer>; only when it isn't would I issue the INSERT query.
Option one really isn't on the table for what I would actually implement; it's just on the list of possible choices. Option two would most likely average out better when occurrences are mostly unique, i.e. not already in the table. Option three would win when duplicates are common, so that most lookups hit the cache.
Bear in mind that whichever option is chosen will be executed potentially millions of times. Memory usage aside (for option 3), my calculations suggest it's nothing significant relative to the capacity available.
Let the database do the work.
You should do the second method. If you don't want to get a failure, you can use on duplicate key update:
insert into t(pk1, pk2, . . . )
values ( . . . )
on duplicate key update pk1 = values(pk1);
The only purpose of the on duplicate key update clause here is to do nothing useful, while not returning an error.
Why is this the best solution? First, in a database, a primary key (or a set of columns declared unique) has an index structure, which is efficient for the database to use.
Second, this requires only one round-trip to the database.
Third, there are no race conditions if you have multiple threads or applications that might attempt to insert the same record(s).
Fourth, the method with on duplicate key update will work for inserting multiple rows at once. (Without on duplicate key update, a multi-row statement would fail if even a single row were a duplicate.) Combining multiple inserts into a single statement can be another big efficiency win.
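A minimal JDBC sketch of that approach (the table, its (name, num) composite key, and the connection details are all made up):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class UpsertBatch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "pass")) {
            String sql = "INSERT INTO t (name, num) VALUES (?, ?) "
                       + "ON DUPLICATE KEY UPDATE name = VALUES(name)"; // no-op on duplicates
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "alice");
                ps.setInt(2, 42);
                ps.addBatch();
                ps.setString(1, "alice");
                ps.setInt(2, 42); // duplicate key: silently ignored, no error
                ps.addBatch();
                ps.executeBatch();
            }
        }
    }
}
With Connector/J, adding rewriteBatchedStatements=true to the JDBC URL collapses the batch into multi-row INSERT statements, which is where the big efficiency win comes from.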
Your second option is really the right way to go.
Rather than fetching the whole result set as in your third option, you could use LIMIT 1: given that the combination of name and number forms the primary key, fetch with LIMIT 1, and if the result is empty, insert your data. It would be a lot faster that way.
MySQL has a neat way to perform a special kind of insertion. INSERT ... ON DUPLICATE KEY UPDATE is a MySQL extension to the INSERT statement. If you specify the ON DUPLICATE KEY UPDATE option and the new row would cause a duplicate value in a UNIQUE or PRIMARY KEY index, MySQL performs an update of the old row based on the new values:
INSERT INTO table(column_list)
VALUES(value_list)
ON DUPLICATE KEY UPDATE column_1 = new_value_1, column_2 = new_value_2;

Strange Cassandra ReadTimeoutExceptions, depending on which client is querying

I have a cluster of three Cassandra nodes with more or less default configuration. On top of that, I have a web layer consisting of two nodes for load balancing, both web nodes querying Cassandra all the time. After some time, with the data stored in Cassandra becoming non-trivial, one and only one of the web nodes started getting ReadTimeoutException on a specific query. The web nodes are identical in every way.
The query is very simple (? is a placeholder for a date, usually a few minutes before the current moment):
SELECT * FROM table WHERE time > ? LIMIT 1 ALLOW FILTERING;
The table is created with this query:
CREATE TABLE table (
    user_id varchar,
    article_id varchar,
    time timestamp,
    PRIMARY KEY (user_id, time));
CREATE INDEX articles_idx ON table(article_id);
When it times out, the client waits a bit more than 10s, which, not surprisingly, is the timeout configured in cassandra.yaml for most connects and reads.
There are a couple of things that are baffling me:
the query only times out when one of the web nodes executes it: one node always fails, the other always succeeds.
the query returns instantaneously when I run it from cqlsh (although it seems to hit only one node when I run it from there)
there are other queries which take 2-3 minutes (a lot longer than the 10s timeout) that do not time out at all
I cannot trace the query in Java because it times out. Tracing the query in cqlsh didn't provide much insight. I'd rather not change the Cassandra timeouts as this is production system and I'd like to exhaust non-invasive options first. The Cassandra nodes all have plenty of heap, their heap is far from full, and GC times seem normal.
Any ideas/directions will be much appreciated, I'm totally out of ideas. Cassandra version is 2.0.2, using com.datastax.cassandra:cassandra-driver-core:2.0.2 Java client.
A few things I noticed:
While you are using time as a clustering key, it doesn't really help you because your query is not restricting by your partition key (user_id). Cassandra only orders by clustering keys within a partition. So right now your query is pulling back the first row which satisfies your WHERE clause, ordered by the hashed token value of user_id. If you really do have tens of millions of rows, then I would expect this query to pull back data from the same user_id (or same select few) every time.
"although it seems it only hits one node when I run it from there" Actually, your queries should only hit one node when you run them. Introducing network traffic into a query makes it really slow. I think the default consistency in cqlsh is ONE. This is where Carlo's idea comes into play.
What is the cardinality of article_id? Remember, secondary indexes work the best on "middle-of-the-road" cardinality. High (unique) and low (boolean) are both bad.
The ALLOW FILTERING clause should not be used in (production) application-side code. Like ever. If you have 50 million rows in this table, then ALLOW FILTERING is first pulling all of them back, and then trimming down the result set based on your WHERE clause.
Suggestions:
Carlo might be on to something with the suggestion of trying a different (lower) consistency level. Try setting a consistency level of ONE in your application and see if that helps.
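With the 2.0.x DataStax Java driver you can set that per statement; a minimal sketch (session setup omitted, the query is the one from the question):
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

import java.util.Date;

public class LowConsistencyRead {
    // Runs the problem query at CL ONE; table and column names are from the question.
    static Row latestSince(Session session, Date since) {
        Statement stmt = new SimpleStatement(
                "SELECT * FROM table WHERE time > ? LIMIT 1 ALLOW FILTERING", since);
        stmt.setConsistencyLevel(ConsistencyLevel.ONE); // ask a single replica only
        return session.execute(stmt).one();
    }
}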
Either perform an ALLOW FILTERING query, or a secondary index query. They both suck, but definitely do not do both together. I would not use either. But if I had to pick, I would expect a secondary index query to suck less than an ALLOW FILTERING query.
To solve this adequately at the scale you are describing, I would duplicate the data into a query table. It looks like you are concerned with organizing time-sensitive data and with getting the most recent data back. A query table like this should do it:
CREATE TABLE tablebydaybucket (
    user_id varchar,
    article_id varchar,
    time timestamp,
    day_bucket varchar,
    PRIMARY KEY (day_bucket, time))
WITH CLUSTERING ORDER BY (time DESC);
Populate this table with your data, and then this query will work:
SELECT * FROM tablebydaybucket
WHERE day_bucket='20150519' AND time > '2015-05-19 15:38:49-0500' LIMIT 1;
This will partition your data by day_bucket and cluster your data by time. This way, you won't need ALLOW FILTERING or a secondary index. Also, your query is guaranteed to hit only one node, and Cassandra will not have to pull all of your rows back and apply your WHERE clause after the fact. And clustering on time in DESCending order helps your most-recent rows come back quicker.
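Populating the query table from application code could look something like the sketch below (DataStax Java driver assumed; the writer would typically run alongside your existing inserts):
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

import java.text.SimpleDateFormat;
import java.util.Date;

public class DayBucketWriter {
    private final PreparedStatement insert;
    private final Session session;

    DayBucketWriter(Session session) {
        this.session = session;
        // Prepared once, reused for every write.
        this.insert = session.prepare(
            "INSERT INTO tablebydaybucket (day_bucket, time, user_id, article_id) " +
            "VALUES (?, ?, ?, ?)");
    }

    // Derives the 'yyyyMMdd' bucket from the event time and writes the row.
    void write(String userId, String articleId, Date time) {
        String dayBucket = new SimpleDateFormat("yyyyMMdd").format(time);
        session.execute(insert.bind(dayBucket, time, userId, articleId));
    }
}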

Cassandra query to get the last row of a table

Can anyone help me get the last record of a table in Cassandra? I want to get the last row based on a primary key. I tried using ORDER BY on the key, but it shows an error saying either IN or EQ is required for ORDER BY. When I added an IN clause, it showed an error too. Please explain how to solve this, with an example.
You can't :)
Data is only "ordered" inside of a partition, via the clustering keys. You can do ORDER BY queries on clustering columns assuming that the partition key and all prior clustering keys have an exact match. In other words, if you have a PK of (a,b,c), then you can do where a='asd' and b='cds' and then do range queries on c.
You can specify clustering order on a partition, so for the latest row, if you cluster on the timestamp in descending order, then simply selecting the first row will automatically give you the "last" one. For example:
create table timeseries (
    event_type text,
    insertion_time timestamp,
    event blob,
    PRIMARY KEY (event_type, insertion_time)
)
WITH CLUSTERING ORDER BY (insertion_time DESC);
Note, you'll need to specify the partition key (event_type), and the retrieved row will be the "latest" in that partition.
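For example, with the DataStax Java driver (session setup omitted), fetching the latest row of a partition is just:
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class LatestEvent {
    // The clustering order (insertion_time DESC) makes the first row in the
    // partition the most recent one, so LIMIT 1 is enough.
    static Row latest(Session session, String eventType) {
        return session.execute(
                "SELECT * FROM timeseries WHERE event_type = ? LIMIT 1", eventType)
                .one();
    }
}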
Think, however, about what it means to be "latest" in a distributed system. Any notion of "latest" is likely to be out of date depending on your use case. This may or may not be acceptable.
If you're looking for "latest" via an arbitrary column that is not a desc clustering key, then I would recommend using Spark to do a fast "map reduce" like computation.

How to achieve a bounded model in DynamoDB?

I have the following use case:
In DynamoDB I want to hold a list of user events sorted in descending order, so that I see the latest events at the top. However, I am only interested in the latest 1000 events.
At the moment I have a table with the userId as the Hash key, and the timestamp of the user events as range key.
Is there any efficient way to keep the number of items in the range for a given userId to a maximum of 1000, with the latest events first?
I am using the Java low-level API, if that matters.
I'd say your table schema is perfect as it is: you can query the table with the userId and use the option
ScanIndexForward => False
This will sort your data in descending order on the range key (which is the timestamp),
and you can use option
Limit => 1000
This will return only the latest 1000 events.
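With the low-level Java client, that query could look like the following sketch; the table and attribute names are made up:
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;

import java.util.Collections;
import java.util.Map;

public class LatestEvents {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        QueryRequest request = new QueryRequest()
                .withTableName("UserEvents")              // hypothetical table name
                .withKeyConditionExpression("userId = :uid")
                .withExpressionAttributeValues(Collections.singletonMap(
                        ":uid", new AttributeValue().withS("user-123")))
                .withScanIndexForward(false)              // newest timestamps first
                .withLimit(1000);                         // at most the latest 1000 events
        QueryResult result = client.query(request);
        for (Map<String, AttributeValue> item : result.getItems()) {
            System.out.println(item);
        }
    }
}
Note that a single query response is also capped at 1 MB, so for large items you may need to follow LastEvaluatedKey across pages to collect the full 1000.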
Hope that helps
