Cassandra Read/Get Performance - java

My Cassandra table has the following schema:
CREATE TABLE cachetable1 (
id text,
lsn text,
lst timestamp,
PRIMARY KEY ((id))
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='{"keys":"ALL", "rows_per_partition":"ALL"}' AND
comment='' AND
dclocal_read_repair_chance=0.100000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.000000 AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
The above table contains 221 million rows (approx. 16 GB of data). The CassandraDaemon is running with 4 GB of heap space, and I have configured 4 GB of memory for the row cache. I am trying to run select queries from my Java code like this:
List<String> ls = new ArrayList<>();
long count = 0;
for (int i = 0; i < 1000; i++)
{
    int id = random.nextInt(20000000);
    for (int j = id; j <= id + 100; j++)
    {
        ls.add(j + "");
    }
    Statement s = QueryBuilder.select("lst", "lsn")
                              .from("ks1", "cachetable1")
                              .where(QueryBuilder.in("id", ls.toArray()));
    s.setFetchSize(100);
    ResultSet rs = sess.execute(s);
    for (Row rw : rs.all())
    {
        //System.out.println(rw.toString());
        count++;
    }
    ls.clear();
}
In the above code, I am trying to fetch 100,000 records, but the read/get performance is very bad: it takes 400-500 seconds to fetch 100,000 rows. Is there a better way to read/get records from Cassandra through Java? Is any tuning required other than the row cache size and Cassandra heap size?

You appear to want to retrieve your data in 100-row chunks. This sounds like a good candidate for a clustering column.
Change your schema to use an id as the partition key and a chunk index as a clustering column, i.e. PRIMARY KEY ( (id), chunk_idx ). When you insert the data, you will have to figure out how to map your single indexes into an id and chunk_idx (e.g. perhaps do a modulo 100 on one of your values to generate a chunk_idx).
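A minimal sketch of one possible mapping, assuming the sequential index values described in the question (the divide/modulo-by-100 split is just one choice, matching the 100-row chunk size):

```java
// Illustrative only: map a single sequential index into a partition key (id)
// and a clustering column (chunk_idx), so that 100 consecutive indexes land
// in the same partition.
public class ChunkMapping {
    static final int CHUNK_SIZE = 100;

    // Partition key: which 100-row chunk the index belongs to.
    static String idFor(int index) {
        return Integer.toString(index / CHUNK_SIZE);
    }

    // Clustering column: position of the row within its chunk.
    static int chunkIdxFor(int index) {
        return index % CHUNK_SIZE;
    }

    public static void main(String[] args) {
        // Indexes 0..99 share partition "0"; 200..299 share partition "2".
        System.out.println(idFor(57) + ":" + chunkIdxFor(57));   // 0:57
        System.out.println(idFor(257) + ":" + chunkIdxFor(257)); // 2:57
    }
}
```

With this layout, fetching one id returns 100 logically consecutive rows from a single partition.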
Now when you query for an id and don't specify a chunk_idx, Cassandra can efficiently return all 100 rows with one disk read on the partition. And you can still do range queries and retrievals of single rows within the partition by specifying the chunk_idx if you don't always want to read a whole chunk of rows.
So your first mistake is that you are generating 100 random partition reads with each query, and this will hit all the nodes and require a separate disk read for each one. Remember that just because you are querying for sequential index numbers doesn't mean the data is stored close together; with Cassandra it is exactly the opposite, where sequential partition keys are likely stored on different nodes.
The second mistake you are making is you are executing the query synchronously (i.e. you are issuing the query and waiting for the request to finish before you issue any more queries). What you want to do is use a thread pool so that you can have many queries running in parallel, or else use the executeAsync method in a single thread. Since your query is not efficient, waiting for the 100 random partition reads to complete is going to be a long wait, and a lot of the highly pipelined Cassandra capacity is going to be sitting there twiddling its thumbs waiting for something to do. If you are trying to maximize performance, you want to keep all the nodes as busy as possible.
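The thread-pool pattern can be sketched as follows. The per-partition query is simulated here so the structure is self-contained; in real code each task would call sess.execute(...) with a query for exactly one id:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of keeping many single-partition reads in flight at once via a
// thread pool. fetchPartition() is a stand-in for a real driver call such as
// sess.execute(select ... where id = ?).
public class ParallelReads {
    static int fetchPartition(String id) {
        // stand-in for a synchronous single-partition read;
        // pretend each partition returns one row
        return 1;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(16);
        List<Future<Integer>> futures = new ArrayList<>();
        for (int id = 0; id < 100; id++) {
            final String key = Integer.toString(id);
            // all 100 reads are submitted before any result is awaited
            futures.add(pool.submit(() -> fetchPartition(key)));
        }
        int count = 0;
        for (Future<Integer> f : futures) {
            count += f.get(); // collect results as they complete
        }
        pool.shutdown();
        System.out.println(count); // 100
    }
}
```

The same shape works with the driver's executeAsync from a single thread: submit all statements first, then iterate over the returned futures.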
Another thing to look into is using the TokenAwarePolicy when connecting to your cluster. This allows each query to go directly to a node that has a replica of the partition rather than to a random node that might have to act as a coordinator and get the data via an extra hop. And of course using consistency level ONE on reads is faster than higher consistency levels.
The row cache size and heap size are not the source of your problem, so that's the wrong path to go down.

I am going to guess that this is your culprit:
.where(QueryBuilder.in("id",ls.toArray()))
Use of the IN relation in the WHERE clause is widely known to be non-performant. In some cases, performing many parallel queries can be faster than using one IN query. From the DataStax SELECT documentation:
When not to use IN
...Using IN can degrade performance because usually many nodes must be
queried. For example, in a single, local data center cluster with 30
nodes, a replication factor of 3, and a consistency level of
LOCAL_QUORUM, a single key query goes out to two nodes, but if the
query uses the IN condition, the number of nodes being queried are
most likely even higher, up to 20 nodes depending on where the keys
fall in the token range.
So you have two options (assuming that living with this poor-performing query isn't one of them):
Rewrite your code to make multiple, parallel requests for each id.
Revisit your data model to see if you have another value that it makes sense to key your data by. For instance, if all of your ids in ls happen to share a common column value that is unique to them, that's a good candidate for a primary key. Basically, find another way to query all of the ids that you are looking for, and build a specific query table to support that.


Which is the optimized way to query using aerospike client?

I have a set (set1)
Bins :
bin1 (PK = key1)
bin2 (PK = key1)
bin3 (PK = key2)
bin4 (PK = key2)
Which is the more optimized way (in terms of query time, CPU usage, and failure cases for 1 client call vs 2 client calls) of querying the data from the Aerospike client, out of the below 2 approaches:
Approach 1: Make 1 get call using the Aerospike client which has bins = [bin1, bin2, bin3, bin4] and keys = [key1, key2]
Approach 2: Make 2 Aerospike client get calls. The first call will have bins = [bin1, bin2] and keys = [key1], and the second call will have bins = [bin3, bin4] and keys = [key2]
I find Approach 2 cleaner, since in Approach 1 we will try to get the record for all combinations (e.g. bin1 with key2 as primary key), which is extra computation, and the primary key set can be large. But the disadvantage of Approach 2 is two Aerospike client calls.
A. Batch reads vs. multiple single reads
This is kind of a false choice. Yes, you could make a batch call for [key1, key2] (1), and you shouldn't specify bin1, bin2, bin3, bin4, just get the full records without selecting bins. Or you could make two independent get() calls, one for key1, one for key2 (2).
However, there's no reason you need to read key1, wait for the result, then read key2. You can read them with a synchronous get(key1) in one thread, and a synchronous get(key2) in another thread. The Java client can handle multi-threaded use. Alternatively, you can async get(key1) and immediately async get(key2).
Batch reads (such as in (1)) are not as efficient as single reads when the number of records is small relative to the number of nodes in the cluster. The records are evenly distributed, so if you have a 4 node cluster and you make a batch request with 4 keys, you end up with parallel sub-batches of roughly 1 record per node. The overhead associated with batch reads isn't worth it when that's the case. See more about batch index in the docs and the knowledge base FAQ - batch-index tuning parameters. The FAQ - Differences between getting single record versus batch should answer your question.
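A toy illustration of why a small batch spreads thin across a cluster (the modulo-hash node assignment below is purely illustrative; Aerospike's real distribution uses the partition map, not Java's hashCode):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: with records spread evenly, a 4-key batch against a
// 4-node cluster decomposes into sub-batches of roughly one key per node,
// so each sub-batch pays per-request overhead for very little work.
public class BatchSpread {
    static Map<Integer, List<String>> spread(String[] keys, int nodes) {
        Map<Integer, List<String>> subBatches = new TreeMap<>();
        for (String k : keys) {
            int node = Math.floorMod(k.hashCode(), nodes); // stand-in hash
            subBatches.computeIfAbsent(node, n -> new ArrayList<>()).add(k);
        }
        return subBatches;
    }

    public static void main(String[] args) {
        spread(new String[] {"key1", "key2", "key3", "key4"}, 4)
            .forEach((node, ks) -> System.out.println("node " + node + " -> " + ks));
    }
}
```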
B. The number of records in an Aerospike database doesn't impact read performance!
You are worried that "the primary key set can be large". That is not a problem at all for Aerospike. In fact, one of the best things about Aerospike is that getting a single record from a database with 1 million records or one with 1 trillion records is pretty much the same big-O computational cost.
Each record has a 64 byte metadata entry in the primary index. The primary index is spread evenly across the nodes of the cluster, because data distribution in Aerospike is extremely even. Each node stores an even share of the partitions, out of 4096 logical partitions for each namespace in the cluster. The partitions are represented as a collection of red-black binary trees (sprigs) with a hash table leading to the correct sprig.
To find any record, the client hashes its key into a 20 byte digest. Using 12 bits of the digest, the client finds the partition ID, looks it up in the partition map it holds locally, and finds the correct node. Reading the record is now a single hop to the correct node. On that node, a service thread picks up the call from a channel of the network card and looks it up in the correct partition (again, finding the partition ID from the digest is a simple O(1) operation). It hops directly to the correct sprig (also O(1)) and then does a simple O(log n) binary tree lookup for the record's metadata. Now the service thread knows exactly where to find the record in storage, with a single read IO. I explained this read flow in more detail here (though in version 4.7 transaction queues and threads were removed; the service thread does all the work).
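The digest-to-partition step can be sketched like this. Aerospike actually hashes keys with RIPEMD-160, which is not in the JDK, so SHA-1 stands in here purely to produce a 20-byte digest; the illustrative point is the last step, where 12 bits of the digest select one of the 4096 partitions:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch of the key -> partition mapping described above. SHA-1 is only a
// stand-in for RIPEMD-160 to get a 20-byte digest; 12 bits of the digest
// then select a partition ID in [0, 4096).
public class PartitionId {
    static int partitionId(byte[] digest20) {
        // little-endian read of the first two digest bytes, masked to 12 bits
        return ((digest20[0] & 0xFF) | ((digest20[1] & 0xFF) << 8)) & 0x0FFF;
    }

    public static void main(String[] args) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1"); // stand-in hash
        byte[] digest = md.digest("user-key-42".getBytes(StandardCharsets.UTF_8));
        System.out.println(partitionId(digest)); // always in [0, 4096)
    }
}
```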
Another point is that the time spent looking up record metadata in the index is orders of magnitude less than getting the record from storage.
So, the number of records in the cluster doesn't change how long it takes to read a random record, from a data set of any size.
I wrote an article Aerospike Modeling: User Profile Store that shows how this fact is leveraged to make sub-millisecond reads at millions of transactions-per-second from a petabyte scale data store.

Select all records from offset to limit using a postgres index

I want to get all data from offset to limit from a table with about 40 columns and 1,000,000 rows. I tried to index the id column via Postgres and get the result of my select query via Java and an EntityManager.
My query needs about 1 minute to get the results, which is a bit too long. I tried to use a different index and also limited my query down to 100 rows, but it still takes this long. How can I fix this? Do I need a better index, or is anything wrong with my code?
CriteriaBuilder cb = entityManager.getCriteriaBuilder();
CriteriaQuery<Entity> q = cb.createQuery(Entity.class);
q.select(q.from(Entity.class));
TypedQuery<Entity> query = entityManager.createQuery(q);
List<Entity> entities = query.setFirstResult(offset).setMaxResults(limit).getResultList();
Right now you probably do not utilize the index at all. There is some ambiguity in how a Hibernate limit/offset translates to database operations (see this comment in the case of Postgres). It may imply overhead as described in detail in a reply to this post.
If you have a direct relationship of offset and limit to the values of the id column you could use that in a query of the form
SELECT e
FROM Entity e
WHERE e.id >= offset AND e.id < offset + limit
Given that the number of records asked for is significantly smaller than the total number of records in the table, the database will use the index.
The next thing is, that 40 columns is quite a bit. If you actually need significantly less for your purpose, you could define a restricted entity with just the attributes required and query for that one. This should take out some more overhead.
If you're still not within performance requirements, you could choose to use a JDBC connection/query instead of Hibernate.
Btw. you could log the actual SQL issued by JPA/Hibernate and use it to get an execution plan from Postgres; this will show you what the query actually looks like and whether an index will be utilized or not. Further, you could monitor the database's query execution times to get an idea of which fraction of the processing time is consumed by the database and which by your Java client plus data transfer overhead.
There is also a technique to mimic offset+limit paging, using paging based on the page's first record's key:
Map<Integer, String> mapPageTopRecNoToKey = new HashMap<>();
Then search for records >= the page's key and load page size + 1 records to find the key of the next page.
Going from page 1 to page 5 would take a bit more work, but would still be fast.
This of course is a terrible kludge, but at the time the technique was indeed a speed improvement on some databases.
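The keyset ("seek") paging idea above can be sketched in plain Java over an in-memory sorted set; a real implementation would instead run a query like WHERE id >= :pageKey ORDER BY id LIMIT :pageSize + 1 and use the extra row's id as the next page's key:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Sketch of keyset paging over an ordered id column, simulated on an
// in-memory sorted set. Each page fetches pageSize + 1 rows; the extra
// row's id is the key of the next page.
public class KeysetPaging {
    static List<Integer> page(NavigableSet<Integer> ids, int pageKey, int pageSize) {
        List<Integer> rows = new ArrayList<>();
        for (int id : ids.tailSet(pageKey, true)) { // id >= pageKey
            rows.add(id);
            if (rows.size() == pageSize + 1) break; // one extra: next page's key
        }
        return rows;
    }

    public static void main(String[] args) {
        NavigableSet<Integer> ids = new TreeSet<>();
        for (int i = 1; i <= 10; i++) ids.add(i);

        List<Integer> p1 = page(ids, 1, 3);       // rows 1,2,3 plus next key 4
        int nextKey = p1.get(p1.size() - 1);
        List<Integer> p2 = page(ids, nextKey, 3); // rows 4,5,6 plus next key 7
        System.out.println(p1); // [1, 2, 3, 4]
        System.out.println(p2); // [4, 5, 6, 7]
    }
}
```

Unlike offset-based paging, the database never has to skip over earlier rows, which is what keeps deep pages fast.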
In your case it would also be worth specifying only the needed fields in JPQL: select e.a, e.b is considerably faster.

Cassandra size limit per partition key

I have this table in Cassandra:
CREATE TABLE adress (
adress_id uuid,
adress_name text,
key1 text,
key2 text,
key3 text,
key4 text,
effective_date timestamp,
value text,
active boolean,
PRIMARY KEY ((adress_id, adress_name), key1, key2, key3, key4, effective_date)
)
As I understand it, Cassandra will distribute the data of the adress table based on the partition key, which is (adress_id, adress_name).
There is a risk when I insert too much data sharing the same (adress_id, adress_name).
I would like to check before inserting data. The check would happen like this:
How much data do I already have in Cassandra for the couple (adress_id, adress_name)? Let's suppose it's 5 MB.
I need to check that the size of the data I'm trying to insert doesn't exceed the Cassandra limit per partition, minus the existing data in Cassandra.
My question is how to query Cassandra to get the size of the data for the couple (adress_id, adress_name).
Also, what is the size limit of a partition in Cassandra?
As Alex Ott noted above, you should spend more time on the data model to avoid the possibility of huge partitions in the first place, by organizing your data differently, or by artificially splitting partitions to more pieces (e.g., time-series data often splits data into a separate partition each day, for example).
It is technically possible to figure out the existing size of a partition, but it will never be efficient. To understand why, you need to recall how Cassandra stores data. The content of a single partition isn't always stored in the same sstable (on-disk file) - data for the same partition may be spread across multiple files. One file may have a few rows, another file may have a few more rows, a third file may delete or modify some old rows, and so on. To figure out the size of the partition, Cassandra would need to read all this data, merge it together, and measure the size of the result. Cassandra does not normally do this on writes - it just writes the new update to memory (and eventually a new sstable), without reading the old data first. This is what makes writes in Cassandra so fast - and your idea of reading the entire partition before each write would drastically slow them down.
Finally, while Cassandra does not handle huge partitions very well, there is no inherent reason why it never could, if the developers wanted to solve this issue. The developers of the Cassandra clone Scylla are worried about this issue and are working to improve it, but even in Scylla the handling of huge partitions isn't perfect yet. Eventually it will be - almost: there will always be a limit on the size of a single partition (which, by definition, is stored on a single node), namely the size of a single disk. That limit too may become a serious problem if your data model is really broken and you can end up with a terabyte in a single partition.

DynamoDB Scan Query and BatchGet

We have a DynamoDB table whose primary key consists of a Hash and a Range:
Hash = date.random_number
Range = timestamp
How do I get items between timestamps X and Y? Since the hash key has a random_number attached, the query has to be fired once for each possible number. Is it possible to give multiple hash values and a single RangeKeyCondition?
What would be most efficient in terms of cost and time?
The random number range is from 1 to 10.
If I understood correctly, you have a table with the following definition of Primary Keys:
Hash Key : date.random_number
Range Key : timestamp
One thing that you have to keep in mind is that, whether you are using GetItem or Query, you have to be able to calculate the Hash Key in your application in order to successfully retrieve one or more items from your table.
It makes sense to use the random numbers as part of your Hash Key so your records can be evenly distributed across the DynamoDB partitions, however, you have to do it in a way that your application can still calculate those numbers when you need to retrieve the records.
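The "computable hash key" point can be sketched as follows: the application must be able to regenerate every date.random_number hash key at read time. With the random suffix in 1 to 10 (as in the question), reading a whole day means querying each of these ten keys and merging the results:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: enumerate all hash keys of the form "<date>.<n>" for a given day,
// so each can be queried with the same RangeKeyCondition and the results
// merged application-side.
public class HashKeys {
    static List<String> hashKeysForDate(String date, int suffixes) {
        List<String> keys = new ArrayList<>();
        for (int n = 1; n <= suffixes; n++) {
            keys.add(date + "." + n);
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> keys = hashKeysForDate("2014-07-09", 10);
        System.out.println(keys.size()); // 10
        System.out.println(keys.get(0)); // 2014-07-09.1
    }
}
```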
With that in mind, let's create the query needed for the specified requirements. The native AWS DynamoDB operations that you have available to obtain several items from your table are:
Query, BatchGetItem and Scan
In order to use BatchGetItem you would need to know beforehand the entire primary key (Hash Key and Range Key), which is not the case.
The Scan operation will literally go through every record of your table, something that in my opinion is unnecessary for your requirements.
Lastly, the Query operation allows you to retrieve one or more items from a table applying the EQ (equality) operator to the Hash Key and a number of other operators that you can use when you don't have the entire Range Key or would like to match more than one.
The operator options for the Range Key condition are: EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN
It seems to me that the most suitable for your requirements is the BETWEEN operator, that being said, let's see how you could build the query with the chosen SDK:
Table table = dynamoDB.getTable(tableName);

String hashKey = "<YOUR_COMPUTED_HASH_KEY>";
String timestampX = "<YOUR_TIMESTAMP_X_VALUE>";
String timestampY = "<YOUR_TIMESTAMP_Y_VALUE>";

RangeKeyCondition rangeKeyCondition =
        new RangeKeyCondition("RangeKeyAttributeName").between(timestampX, timestampY);

ItemCollection<QueryOutcome> items = table.query("HashKeyAttributeName", hashKey,
        rangeKeyCondition,
        null,  // FilterExpression - not used in this example
        null,  // ProjectionExpression - not used in this example
        null,  // ExpressionAttributeNames - not used in this example
        null); // ExpressionAttributeValues - not used in this example
You might want to look at the following post to get more information about DynamoDB Primary Keys:
DynamoDB: When to use what PK type?
QUESTION: My concern is querying multiple times because of the random_number attached to it. Is there a way to combine these queries and hit DynamoDB once?
Your concern is completely understandable; however, the only way to fetch all the records via BatchGetItem is by knowing the entire primary key (HASH + RANGE) of all the records you intend to get. Although minimizing the HTTP round trips to the server might seem to be the best solution at first sight, the documentation actually suggests doing exactly what you are doing to avoid hot partitions and uneven use of your provisioned throughput:
Design For Uniform Data Access Across Items In Your Tables
"Because you are randomizing the hash key, the writes to the table on
each day are spread evenly across all of the hash key values; this
will yield better parallelism and higher overall throughput. [...] To
read all of the items for a given day, you would still need to Query
each of the 2014-07-09.N keys (where N is 1 to 200), and your
application would need to merge all of the results. However, you will
avoid having a single "hot" hash key taking all of the workload."
Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
Here is another interesting point suggesting moderate use of reads in a single partition... if you remove the random number from the hash key to be able to get all records in one shot, you are likely to run into this issue, regardless of whether you are using Scan, Query or BatchGetItem:
Guidelines for Query and Scan - Avoid Sudden Bursts of Read Activity
"Note that it is not just the burst of capacity units the Scan uses
that is a problem. It is also because the scan is likely to consume
all of its capacity units from the same partition because the scan
requests read items that are next to each other on the partition. This
means that the request is hitting the same partition, causing all of
its capacity units to be consumed, and throttling other requests to
that partition. If the request to read data had been spread across
multiple partitions, then the operation would not have throttled a
specific partition."
And lastly, because you are working with time series data, it might be helpful to look into some best practices suggested by the documentation as well:
Understand Access Patterns for Time Series Data
For each table that you create, you specify the throughput
requirements. DynamoDB allocates and reserves resources to handle your
throughput requirements with sustained low latency. When you design
your application and tables, you should consider your application's
access pattern to make the most efficient use of your table's
resources.
Suppose you design a table to track customer behavior on your site,
such as URLs that they click. You might design the table with hash and
range type primary key with Customer ID as the hash attribute and
date/time as the range attribute. In this application, customer data
grows indefinitely over time; however, the applications might show
uneven access pattern across all the items in the table where the
latest customer data is more relevant and your application might
access the latest items more frequently and as time passes these items
are less accessed, eventually the older items are rarely accessed. If
this is a known access pattern, you could take it into consideration
when designing your table schema. Instead of storing all items in a
single table, you could use multiple tables to store these items. For
example, you could create tables to store monthly or weekly data. For
the table storing data from the latest month or week, where data
access rate is high, request higher throughput and for tables storing
older data, you could dial down the throughput and save on resources.
You can save on resources by storing "hot" items in one table with
higher throughput settings, and "cold" items in another table with
lower throughput settings. You can remove old items by simply deleting
the tables. You can optionally backup these tables to other storage
options such as Amazon Simple Storage Service (Amazon S3). Deleting an
entire table is significantly more efficient than removing items
one-by-one, which essentially doubles the write throughput as you do
as many delete operations as put operations.
Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html

Strange Cassandra ReadTimeoutExceptions, depending on which client is querying

I have a cluster of three Cassandra nodes with more or less default configuration. On top of that, I have a web layer consisting of two nodes for load balancing, both web nodes querying Cassandra all the time. After some time, with the data stored in Cassandra becoming non-trivial, one and only one of the web nodes started getting ReadTimeoutException on a specific query. The web nodes are identical in every way.
The query is very simple (? is placeholder for date, usually a few minutes before the current moment):
SELECT * FROM table WHERE time > ? LIMIT 1 ALLOW FILTERING;
The table is created with this query:
CREATE TABLE table (
user_id varchar,
article_id varchar,
time timestamp,
PRIMARY KEY (user_id, time));
CREATE INDEX articles_idx ON table(article_id);
When it times out, the client waits a bit more than 10s, which, not surprisingly, is the timeout configured in cassandra.yaml for most connects and reads.
There are a couple of things that are baffling me:
the query only times out when one of the web nodes executes it - one of the nodes always fails, the other always succeeds.
the query returns instantaneously when I run it from cqlsh (although it seems it only hits one node when I run it from there)
there are other queries issued that take 2-3 minutes (a lot longer than the 10s timeout) and do not time out at all
I cannot trace the query in Java because it times out. Tracing the query in cqlsh didn't provide much insight. I'd rather not change the Cassandra timeouts as this is production system and I'd like to exhaust non-invasive options first. The Cassandra nodes all have plenty of heap, their heap is far from full, and GC times seem normal.
Any ideas/directions will be much appreciated, I'm totally out of ideas. Cassandra version is 2.0.2, using com.datastax.cassandra:cassandra-driver-core:2.0.2 Java client.
A few things I noticed:
While you are using time as a clustering key, it doesn't really help you because your query is not restricting by your partition key (user_id). Cassandra only orders by clustering keys within a partition. So right now your query is pulling back the first row which satisfies your WHERE clause, ordered by the hashed token value of user_id. If you really do have tens of millions of rows, then I would expect this query to pull back data from the same user_id (or same select few) every time.
"although it seems it only hits one node when I run it from there" Actually, your queries should only hit one node when you run them. Introducing network traffic into a query makes it really slow. I think the default consistency in cqlsh is ONE. This is where Carlo's idea comes into play.
What is the cardinality of article_id? Remember, secondary indexes work the best on "middle-of-the-road" cardinality. High (unique) and low (boolean) are both bad.
The ALLOW FILTERING clause should not be used in (production) application-side code. Like ever. If you have 50 million rows in this table, then ALLOW FILTERING is first pulling all of them back, and then trimming down the result set based on your WHERE clause.
Suggestions:
Carlo might be on to something with the suggestion of trying a different (lower) consistency level. Try setting a consistency level of ONE in your application and see if that helps.
Either perform an ALLOW FILTERING query, or a secondary index query. They both suck, but definitely do not do both together. I would not use either. But if I had to pick, I would expect a secondary index query to suck less than an ALLOW FILTERING query.
To solve this adequately at the scale you are describing, I would duplicate the data into a query table, as it looks like you are concerned with organizing time-sensitive data and getting the most-recent data. A query table like this should do it:
CREATE TABLE tablebydaybucket (
user_id varchar,
article_id varchar,
time timestamp,
day_bucket varchar,
PRIMARY KEY (day_bucket , time))
WITH CLUSTERING ORDER BY (time DESC);
Populate this table with your data, and then this query will work:
SELECT * FROM tablebydaybucket
WHERE day_bucket='20150519' AND time > '2015-05-19 15:38:49-0500' LIMIT 1;
This will partition your data by day_bucket, and cluster your data by time. This way, you won't need ALLOW FILTERING or a secondary index. Also your query is guaranteed to hit only one node, and Cassandra will not have to pull all of your rows back and apply your WHERE clause after-the-fact. And clustering on time in DESCending order, helps your most-recent rows come back quicker.
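Deriving the day_bucket value used in the example query can be sketched like this; the 'yyyyMMdd' format matches the '20150519' literal above, but the exact bucket granularity and format is a modeling choice:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch: compute the day_bucket partition key from an event timestamp,
// so writes and reads agree on which partition a row belongs to.
public class DayBucket {
    static final DateTimeFormatter BUCKET = DateTimeFormatter.ofPattern("yyyyMMdd");

    static String dayBucket(LocalDateTime time) {
        return time.format(BUCKET);
    }

    public static void main(String[] args) {
        LocalDateTime t = LocalDateTime.of(2015, 5, 19, 15, 38, 49);
        System.out.println(dayBucket(t)); // 20150519
    }
}
```

Both the writer (when populating the table) and the reader (when building the WHERE clause) would call the same function, so the bucket values always line up.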
