I have this table in Cassandra:
CREATE TABLE adress (
adress_id uuid,
adress_name text,
key1 text,
key2 text,
key3 text,
key4 text,
effective_date timestamp,
value text,
active boolean,
PRIMARY KEY ((adress_id, adress_name), key1, key2, key3, key4, effective_date)
)
As I understand it, Cassandra will distribute the data of the adress table based on the partition key, which is (adress_id, adress_name).
There is a risk when I insert too much data sharing the same (adress_id, adress_name).
I would like to perform a check before inserting data. The check would happen like this:
find out how much data I already have in Cassandra for the pair (adress_id, adress_name); let's suppose it's 5 MB.
Then check that the size of the data I'm trying to insert doesn't exceed the Cassandra limit per partition minus the size of the existing data.
My question is: how do I query Cassandra to get the size of the data for the pair (adress_id, adress_name)?
And after that, what is the size limit of a partition in Cassandra?
As Alex Ott noted above, you should spend more time on the data model to avoid the possibility of huge partitions in the first place, by organizing your data differently or by artificially splitting partitions into more pieces (time-series data, for example, is often split into a separate partition per day, as sketched below).
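For instance, here is a minimal sketch of that per-day split applied to the table in the question; the added day column and the new table name are assumptions for illustration, not part of the original schema:

CREATE TABLE adress_by_day (
    adress_id uuid,
    adress_name text,
    day date,              -- derived from effective_date at write time
    key1 text,
    key2 text,
    key3 text,
    key4 text,
    effective_date timestamp,
    value text,
    active boolean,
    PRIMARY KEY ((adress_id, adress_name, day), key1, key2, key3, key4, effective_date)
)

Each (adress_id, adress_name) pair is then spread across one partition per day, at the cost of having to enumerate the days of interest when reading.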
It is technically possible to figure out the existing size of a partition, but it will never be efficient. To understand why, you need to recall how Cassandra stores data. The content of a single partition isn't always stored in the same sstable (on-disk file) - data for the same partition may be spread across multiple files. One file may have a few rows, another file may have a few more rows, a third file may delete or modify some old rows, and so on. To figure out the size of the partition, Cassandra would need to read all of this data, merge it together, and measure the size of the result. Cassandra does not normally do this on writes - it just writes the new update to memory (and eventually a new sstable), without reading the old data first. This is what makes writes in Cassandra so fast - and your idea of reading the entire partition before each write would drastically slow them down.
Finally, while Cassandra does not handle huge partitions very well, there is no inherent reason why it never could, if the developers wanted to solve this issue. The developers of the Cassandra clone Scylla are worried about this issue and are working to improve it, but even in Scylla the handling of huge partitions isn't perfect yet. Eventually, though, it will be. Almost: there will always be a limit on the size of a single partition (which, by definition, is stored on a single node), namely the size of a single disk. Even that limit may become a serious problem if your data model is really broken and you end up with a terabyte in a single partition.
I have a set (set1)
Bins:
bin1 (PK = key1)
bin2 (PK = key1)
bin3 (PK = key2)
bin4 (PK = key2)
Which is the more optimized way (in terms of query time, CPU usage, and failure cases for 1 client call vs. 2 client calls) to query the data with the Aerospike client, out of the 2 approaches below:
Approach 1: Make 1 get call using the Aerospike client which has bins = [bin1, bin2, bin3, bin4] and keys = [key1, key2]
Approach 2: Make 2 Aerospike client get calls. The first call will have bins = [bin1, bin2] and keys = [key1], and the second call will have bins = [bin3, bin4] and keys = [key2]
I find Approach 2 cleaner, since in Approach 1 we will try to get the record for all combinations (e.g. bin1 with key2 as the primary key), which is extra computation, and the primary key set can be large. But the disadvantage of Approach 2 is the two Aerospike client calls.
A. Batch reads vs. multiple single reads
This is kind of a false choice. Yes, you could make a batch call for [key1, key2] (1), but you shouldn't specify bin1, bin2, bin3, bin4; just get the full records without selecting bins. Or you could make two independent get() calls, one for key1, one for key2 (2).
However, there's no reason you need to read key1, wait for the result, then read key2. You can read them with a synchronous get(key1) in one thread, and a synchronous get(key2) in another thread. The Java client can handle multi-threaded use. Alternatively, you can async get(key1) and immediately async get(key2).
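To make the threaded variant concrete, here is a minimal sketch using the Aerospike Java client; the host, the "test" namespace, and the string user keys are assumptions for illustration:

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelGets {
    public static void main(String[] args) throws Exception {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000); // placeholder host
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // issue both single-record reads in parallel, one per thread
        Future<Record> f1 = pool.submit(() -> client.get(null, new Key("test", "set1", "key1")));
        Future<Record> f2 = pool.submit(() -> client.get(null, new Key("test", "set1", "key2")));
        Record r1 = f1.get(); // record holding bin1 and bin2
        Record r2 = f2.get(); // record holding bin3 and bin4
        pool.shutdown();
        client.close();
    }
}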
Batch reads (such as in (1)) are not as efficient as single reads when the number of records is no larger than the number of nodes in the cluster. The records are evenly distributed, so if you have a 4-node cluster and you make a batch request with 4 keys, you end up with parallel sub-batches of roughly 1 record per node. The overhead associated with batch reads isn't worth it in that case. See more about batch index in the docs and the knowledge base FAQ - batch-index tuning parameters. The FAQ - Differences between getting single record versus batch should answer your question.
B. The number of records in an Aerospike database doesn't impact read performance!
You are worried that "the primary key set can be large". That is not a problem at all for Aerospike. In fact, one of the best things about Aerospike is that getting a single record from a database with 1 million records or one with 1 trillion records is pretty much the same big-O computational cost.
Each record has a 64 byte metadata entry in the primary index. The primary index is spread evenly across the nodes of the cluster, because data distribution in Aerospike is extremely even. Each node stores an even share of the partitions, out of 4096 logical partitions for each namespace in the cluster. The partitions are represented as a collection of red-black binary trees (sprigs) with a hash table leading to the correct sprig.
To find any record, the client hashes its key into a 20 byte digest. Using 12 bits of the digest, the client finds the partition ID, looks it up in the partition map it holds locally, and finds the correct node. Reading the record is now a single hop to the correct node. On that node, a service thread picks up the call from a channel of the network card and looks it up in the correct partition (again, finding the partition ID from the digest is a simple O(1) operation). It hops directly to the correct sprig (also O(1)) and then does a simple O(log n) binary tree lookup for the record's metadata. Now the service thread knows exactly where to find the record in storage, with a single read IO. I explained this read flow in more detail here (though in version 4.7 transaction queues and threads were removed; the service thread does all the work).
Another point is that the time spent looking up record metadata in the index is orders of magnitude less than the time spent getting the record from storage.
So, the number of records in the cluster doesn't change how long it takes to read a random record, for a data set of any size.
I wrote an article Aerospike Modeling: User Profile Store that shows how this fact is leveraged to make sub-millisecond reads at millions of transactions-per-second from a petabyte scale data store.
I am trying to insert in batches (objects are stored in an ArrayList, and as soon as the count is divisible by 10000, I insert all of these objects into my table). But it takes more than 4 minutes to do so. Is there any approach which is faster?
arr.add(new Car(name, count, type));
if (count % 10000 == 0) {
    repository.saveAll(arr); // persist the accumulated batch
    arr.clear();
}
So here is what is happening. I am most curious to see the table definition inside Cassandra. But given your Car constructor,
new Car(name, count, type)
Given those column names, I'm guessing that name is the partition key.
The reason that is significant is that the hash of the partition key column is what Cassandra uses to figure out which node (token range) the data should be written to.
When you saveAll on 10000 Cars at once, there is no way you can guarantee that all 10000 of them are going to the same node. To deal with this, Spring Data Cassandra must be using a BATCH (or something like it) behind the scenes. If it is a BATCH, that essentially forces one Cassandra node (designated as the "coordinator") to route writes to the required nodes. Due to Cassandra's distributed nature, that is never going to be fast.
If you really need to store 10000 of them, the best way would be to send one write at a time asynchronously. Of course, you won't want 10000 threads all writing concurrently, so you'll want to throttle (limit) the number of active threads in your code. DataStax's Ryan Svihla has written a couple of articles detailing how to do this. I recommend this one: Cassandra: Batch Loading Without the Batch - The Nuanced Edition.
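As a rough illustration of that pattern, here is a minimal sketch of throttled asynchronous writes with the DataStax Java driver (3.x); the table and column names, the Car getters, and the in-flight cap of 128 are assumptions, not anything Spring Data generates:

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.List;
import java.util.concurrent.Semaphore;

static void writeAll(Session session, List<Car> cars) throws InterruptedException {
    Semaphore inFlight = new Semaphore(128); // throttle: at most 128 concurrent writes
    PreparedStatement ps = session.prepare(
            "INSERT INTO cars (name, count, type) VALUES (?, ?, ?)");
    for (Car car : cars) {
        inFlight.acquire(); // blocks until one of the in-flight writes completes
        ResultSetFuture f = session.executeAsync(
                ps.bind(car.getName(), car.getCount(), car.getType()));
        f.addListener(inFlight::release, MoreExecutors.directExecutor());
    }
    inFlight.acquire(128); // drain: wait for all remaining writes to finish
}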
tl;dr
Spring Data Cassandra's saveAll really shouldn't be used to persist several thousand writes. If I were using Spring Data Cassandra, I wouldn't even go beyond double-digits with saveAll, TBH.
Edit
Check out this answer for details on how to use Spring Boot/Data with Cassandra asynchronously: AsyncCassandraOperations examples
Background: I am using SQLite to store around 10M entries, where the size of each entry is around 1 KB. I am reading this data back in chunks of around 100K entries at a time, using multiple parallel threads. Reads and writes do not run in parallel, and all the writes are done before starting the reads.
Problem: I am experiencing too many disk reads. Around 3K reads happen each second, and I am reading only around 30 KB of data in those 3K reads (hence around 100 bytes per disk read). As a result, I am seeing really horrible performance (it takes around 30 minutes to read the data).
Question
Are there any SQLite settings/pragmas that I can use to avoid the small disk reads?
Are there any best practices for batched parallel reads in SQLite?
Does SQLite read all the results of a query in one go, or does it read the results in smaller chunks? If the latter is the case, where does it store the partial results of a query?
Implementation details: I am using SQLite with Java, and my application is running on Linux. The JDBC library is https://github.com/xerial/sqlite-jdbc (version 3.20.1).
P.S. I have already built the necessary indexes and verified that no table scans are going on (using EXPLAIN QUERY PLAN).
When you are searching for data with an index, the database first looks up the value in the index, and then goes to the corresponding table row to read all the other columns.
Unless the table rows happen to be stored in the same order as the values in the index, each such table read must go to a different page.
Indexes speed up searches only if the search reduces the number of rows. If you're going to read all (or most of) the rows anyway, a table scan will be much faster.
Parallel reads will be more efficient only if the disk can actually handle the additional I/O. On rotating disks, the additional seeks will just make things worse.
(SQLite tries to avoid storing temporary results. Result rows are computed on the fly (as much as possible) while you're stepping through the cursor.)
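To illustrate that last point with the xerial JDBC driver mentioned in the question, here is a minimal sketch of stepping through a result set row by row; the database file, table, and process() helper are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

static void readAll() throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:sqlite:data.db"); // hypothetical file
         Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery("SELECT id, payload FROM entries")) {  // hypothetical table
        while (rs.next()) {          // each next() steps the underlying SQLite cursor;
            process(rs.getBytes(2)); // rows are produced on the fly, not buffered up front
        }
    }
}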
My Cassandra table has the following schema:
CREATE TABLE cachetable1 (
id text,
lsn text,
lst timestamp,
PRIMARY KEY ((id))
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='{"keys":"ALL", "rows_per_partition":"ALL"}' AND
comment='' AND
dclocal_read_repair_chance=0.100000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.000000 AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
The table above contains 221 million rows (approx. 16 GB of data). The CassandraDaemon is running with 4 GB of heap space, and I have configured 4 GB of memory for the row cache. I am trying to run select queries from my Java code like this:
List<String> ls = new ArrayList<>();
for (int i = 0; i < 1000; i++)
{
    int id = random.nextInt(20000000);
    for (int j = id; j <= id + 100; j++)
    {
        ls.add(j + "");
    }
    Statement s = QueryBuilder.select("lst", "lsn").from("ks1", "cachetable1")
            .where(QueryBuilder.in("id", ls.toArray()));
    s.setFetchSize(100);
    ResultSet rs = sess.execute(s);
    List<Row> lsr = rs.all();
    for (Row rw : lsr)
    {
        //System.out.println(rw.toString());
        count++;
    }
    ls.clear();
}
In the above code, I am trying to fetch 0.1 million records. But the read performance is very bad: it takes 400-500 seconds to fetch 0.1 million rows. Is there any better way to read records from Cassandra through Java? Is some tuning required other than the row cache size and the Cassandra heap size?
You appear to want to retrieve your data in 100 row chunks. This sounds like a good candidate for a clustering column.
Change your schema to use id as the partition key and a chunk index as a clustering column, i.e. PRIMARY KEY ((id), chunk_idx). When you insert the data, you will have to figure out how to map your single index into an id and a chunk_idx (e.g. perhaps do a modulo 100 on one of your values to generate the chunk_idx).
Now when you query for an id and don't specify a chunk_idx, Cassandra can efficiently return all 100 rows with one disk read on the partition. And you can still do range queries and retrievals of single rows within the partition by specifying the chunk_idx if you don't always want to read a whole chunk of rows.
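A minimal sketch of that remodeling; the new table name and the divisor of 100 are assumptions for illustration:

CREATE TABLE ks1.cachetable2 (
    id text,
    chunk_idx int,
    lsn text,
    lst timestamp,
    PRIMARY KEY ((id), chunk_idx)
);

On insert, a sequential index j would then map to the two keys along these lines:

String id = String.valueOf(j / 100); // 100 consecutive indexes share one partition
int chunkIdx = j % 100;              // position within that partition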
So your first mistake is that you are generating 100 random partition reads with each query, which will hit all the nodes and require a separate disk read for each one. Remember that just because you are querying for sequential index numbers doesn't mean the data is stored close together. With Cassandra it is exactly the opposite: sequential partition keys are likely stored on different nodes.
The second mistake you are making is you are executing the query synchronously (i.e. you are issuing the query and waiting for the request to finish before you issue any more queries). What you want to do is use a thread pool so that you can have many queries running in parallel, or else use the executeAsync method in a single thread. Since your query is not efficient, waiting for the 100 random partition reads to complete is going to be a long wait, and a lot of the highly pipelined Cassandra capacity is going to be sitting there twiddling its thumbs waiting for something to do. If you are trying to maximize performance, you want to keep all the nodes as busy as possible.
Another thing to look into is using the TokenAwarePolicy when connecting to your cluster. This allows each query to go directly to a node that has a replica of the partition rather than to a random node that might have to act as a coordinator and get the data via an extra hop. And of course using consistency level ONE on reads is faster than higher consistency levels.
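With the DataStax Java driver (3.x), token awareness is just a builder option; the contact point below is a placeholder:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

Cluster cluster = Cluster.builder()
        .addContactPoint("10.0.0.1") // placeholder contact point
        .withLoadBalancingPolicy(
                new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
        .build();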
The row cache size and heap size are not the source of your problem, so that's the wrong path to go down.
I am going to guess that this is your culprit:
.where(QueryBuilder.in("id",ls.toArray()))
Use of the IN relation in the WHERE clause is widely known to be non-performant. In some cases, performing many parallel queries can be faster than using one IN query. From the DataStax SELECT documentation:
When not to use IN
...Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster with 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range.
So you have two options (assuming that living with this poor-performing query isn't one of them):
Rewrite your code to make multiple, parallel requests for each id (see the sketch after this list).
Revisit your data model to see if you have another value that it makes sense to key your data by. For instance, if all of your ids in ls happen to share a common column value that is unique to them, that's a good candidate for a primary key. Basically, find another way to query all of the ids that you are looking for, and build a specific query table to support that.
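For option 1, here is a rough sketch with executeAsync (driver 3.x); the ids collection stands in for your ls list:

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import java.util.ArrayList;
import java.util.List;

PreparedStatement ps = sess.prepare(
        "SELECT lst, lsn FROM ks1.cachetable1 WHERE id = ?");
List<ResultSetFuture> futures = new ArrayList<>();
for (String id : ids) {
    futures.add(sess.executeAsync(ps.bind(id))); // one single-partition read per id
}
for (ResultSetFuture f : futures) {
    for (Row row : f.getUninterruptibly()) { // iterate each result as it arrives
        count++;
    }
}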
I'm currently writing a Java project against MySQL in a cluster with ten nodes. The program simply pulls some information from the database, does some calculation, and then pushes some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How do I do multi-threading on different nodes?
I watched an interesting presentation on using Gearman to do Map/Reduce-style things on a MySQL database. It might be what you are looking for: see here. There is a recording on the MySQL webpage here (you have to register for mysql.com, though).
I'd think about doing that calculation in a stored procedure on the database server and skipping the step of bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc., you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.
Assume the table (A) you want to process has 10 million rows. Create a table B in the database to store the set of rows processed by a node. Write the Java program so that it first fetches the last row processed by other nodes and then adds an entry to the same table, informing other nodes which range of rows it is going to process (you can decide this number).

In our case, let's assume each node can process 1000 rows at a time. Node 1 fetches table B and finds it empty, so Node 1 inserts a row ('Node1', 1000), informing the others that it is processing rows where the primary key of A is <= 1000 (assuming the primary key of table A is numeric and ascending). Node 2 comes along and finds that 1000 primary keys are being processed by some other node, so it inserts a row ('Node2', 2000), informing the others that it is processing the rows between 1001 and 2000. Please note that access to table B should be synchronized, i.e. only one node can work on it at a time; the sketch below shows one way to do that.
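A hedged JDBC sketch of that claim step; table B's columns are made up, and InnoDB's SELECT ... FOR UPDATE supplies the synchronization:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// claims the next block of 1000 rows of A; returns the upper bound this node owns
static long claimRange(Connection conn, String nodeName) throws SQLException {
    conn.setAutoCommit(false);
    try (Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery(
                 "SELECT COALESCE(MAX(last_pk), 0) FROM B FOR UPDATE")) { // lock table B
        rs.next();
        long upper = rs.getLong(1) + 1000; // this node takes (previous max, max + 1000]
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO B (node, last_pk) VALUES (?, ?)")) {
            ps.setString(1, nodeName);
            ps.setLong(2, upper);
            ps.executeUpdate();
        }
        conn.commit(); // error handling and rollback omitted for brevity
        return upper;
    }
}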
Since you only have one MySQL server, make sure you're using the InnoDB engine to reduce table locking on updates.
Also, I'd try to keep your queries as simple as possible, even if you have to run more of them. This can increase the chances of query cache hits, as well as reduce the overall workload on the backend, offloading some of the query matching and work to the frontends (where you have more resources). It will also reduce the time a row lock is held, thereby decreasing contention.
The proposed Gearman solution is probably the right tool for this job, as it will allow you to transparently offload batch processing from MySQL to the cluster.
You could set up sharding with a MySQL instance on each machine, but the setup time, maintenance, and changes to the database access layer might be a lot of work compared to a Gearman solution. You might also want to look at the experimental Spider storage engine, which could allow you to use multiple MySQL servers in unison.
Unless your calculation is very complex, most of the time will be spent retrieving data from MySQL and sending the results back to MySQL.
As you have a single database, no amount of parallelism or clustering on the application side will make much difference.
So your best options would be to do the update in pure SQL if that is at all possible, or to use a stored procedure so that all processing takes place within the MySQL server and no data movement is required.
If this is not fast enough, then you will need to split your database among several instances of MySQL and come up with some scheme to partition the data based on some application key.