Regarding Cassandra Read Performance - java

I am working with sensor data (time series). The number of columns in the table is 3000,
e.g. nodeid, timestamp, sen1, sen2, ..., sen-n. nodeid and timestamp form the primary key, with clustering order by timestamp.
The number of records is 10,000.
When a SELECT query for a single column (SELECT timestamp, sen1 FROM <table>) is issued through the DataStax Java driver 3.0, it takes 15 seconds to respond; i.e. if I want to read all the tags, one tag at a time, all 3000 tags would take 3000 * 15 sec, approximately 12 to 13 hours. This is on a single-node cluster with 16GB RAM.
I allocated 10GB to the JVM, but the response time did not change. I used LeveledCompactionStrategy at table creation.
Hardware: Intel Core i7, a regular hard disk (not SSD), 8GB RAM.
How can I reduce the read/query time on this single-node cluster?

Obviously, there is a problem with the data modelling. IMO, a table with 3000 columns is bad. If your use case is like "SELECT timestamp, sen1 FROM <table>", then you should model it as Primary Key(Timestamp, SensorId).
Even for "SELECT timestamp, sen1", with your model Cassandra will still read all the other column values from disk into memory.
I am not sure what 'nodeId' is in your case; I hope it is not the Cassandra node id.
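For illustration, a minimal sketch of such a narrow model ("one row per sensor reading") with the DataStax Java driver 3.x follows; the keyspace, table, and column names here are hypothetical, not taken from the question, and this is only one way to key the data:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SensorReadingsSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("ks1")) {
            // One row per (node, sensor, timestamp) instead of 3000 sensor columns per row.
            session.execute("CREATE TABLE IF NOT EXISTS sensor_readings ("
                    + "nodeid text, sensor_id text, ts timestamp, value double, "
                    + "PRIMARY KEY ((nodeid, sensor_id), ts)) "
                    + "WITH CLUSTERING ORDER BY (ts DESC)");
            // Reading one sensor's history now touches a single partition.
            ResultSet rs = session.execute(
                    "SELECT ts, value FROM sensor_readings WHERE nodeid = ? AND sensor_id = ?",
                    "node-1", "sen1");
            for (Row row : rs) {
                System.out.println(row.getTimestamp("ts") + " " + row.getDouble("value"));
            }
        }
    }
}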

(SELECT timestamp,sen1 FROM table)
This is like getting all the data at once (in your case, 10,000 records).
So getting 1 column or 3000 columns will make the Cassandra server read through all the SSTables. The point is that it won't take 12 or 13 hours.
Still, 15 seconds seems unbelievable. Did you also include network latency and client-side writes in this measurement?
As mentioned in one of the answers, your model seems to be bad. (If you put timestamp as the partition key, the data becomes too sparse and getting a range of data will need to read from more than one partition. If you use only node_id as the partition key, the partition will host too much data and can cross the C* limit of 2 billion cells per partition.) My advice is:
Redesign your partition key. Please check this tutorial for a start.
https://academy.datastax.com/resources/getting-started-time-series-data-modeling
Add more nodes and increase the replication factor to see better read latencies.
Try to design your read query so that it reads from only one partition at a time, e.g.: SELECT * FROM table WHERE sensor_node_id = 'abc' AND year = 2016 AND month = 'June'
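As a rough sketch of that kind of bucketed model and a single-partition read (all names are illustrative, not a definitive design; this assumes an existing Session from the Java driver):
// Sketch only: a month-bucketed table so one node's data for one month sits in one partition.
session.execute("CREATE TABLE IF NOT EXISTS sensor_by_month ("
        + "sensor_node_id text, year int, month int, ts timestamp, value double, "
        + "PRIMARY KEY ((sensor_node_id, year, month), ts))");
// Restricting on the full partition key keeps the read on a single partition.
ResultSet rs = session.execute(
        "SELECT ts, value FROM sensor_by_month WHERE sensor_node_id = ? AND year = ? AND month = ?",
        "abc", 2016, 6);
for (Row row : rs) {
    System.out.println(row.getTimestamp("ts") + " " + row.getDouble("value"));
}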
Hope this helps!

Related

Select all records from offset to limit using a postgres index

I want to get all data from offset to limit from a table with about 40 columns and 1,000,000 rows. I tried indexing the id column in Postgres and getting the result of my select query via Java and an EntityManager.
My query needs about 1 minute to return results, which is a bit too long. I tried using a different index and also limited my query to 100 rows, but it still takes that long. How can I fix this? Do I need a better index, or is something wrong with my code?
CriteriaQuery<Entity> q = entityManager.getCriteriaBuilder().createQuery(Entity.class);
q.select(q.from(Entity.class));
TypedQuery<Entity> query = entityManager.createQuery(q);
List<Entity> entities = query.setFirstResult(offset).setMaxResults(limit).getResultList();
Right now you probably do not utilize the index at all. There is some ambiguity about how a Hibernate limit/offset translates to database operations (see this comment in the case of Postgres). It may imply overhead, as described in detail in a reply to this post.
If you have a direct relationship between offset/limit and the values of the id column, you could use that in a query of the form:
SELECT e
FROM Entity e
WHERE e.id >= offset AND e.id < offset + limit
Given that the number of records asked for is significantly smaller than the total number of records in the table, the database will use the index.
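For instance, executed through the entity manager this could look roughly like the following (a sketch; Entity, offset and limit are the names used above, and the id column is assumed to be numeric):
// Sketch: range on the indexed id column instead of setFirstResult/setMaxResults.
TypedQuery<Entity> query = entityManager.createQuery(
        "SELECT e FROM Entity e WHERE e.id >= :lower AND e.id < :upper ORDER BY e.id",
        Entity.class);
query.setParameter("lower", offset);
query.setParameter("upper", offset + limit);
List<Entity> entities = query.getResultList();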
The next thing is that 40 columns is quite a lot. If you actually need significantly fewer for your purpose, you could define a restricted entity with just the required attributes and query for that. This should remove some more overhead.
If you're still not within your performance requirements, you could choose to use a JDBC connection/query instead of Hibernate.
Btw, you could log the actual SQL issued by JPA/Hibernate and use it to get an execution plan from Postgres; this will show you what the query actually looks like and whether an index is utilized. Furthermore, you could monitor the database's query execution times to get an idea of which fraction of the processing time is consumed by the database and which by your Java client plus data-transfer overhead.
There is also a technique to mimic offset+limit paging, using paging based on the key of the page's first record.
Map<Integer, String> mapPageTopRecNoToKey = new HashMap<>();
Then search for records >= the page's key and load page size + 1 records; the extra record gives you the key of the next page.
Going from page 1 to page 5 would take a bit more work, but it would still be fast.
This is of course a terrible kludge, but at the time the technique was indeed a speed improvement on some databases.
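A rough JPA sketch of that technique, assuming a numeric id is used as the key and that Entity has a getId() accessor (names are illustrative):
// Sketch: remember the key of each page's first record and page by key instead of offset.
Map<Integer, Long> pageTopKey = new HashMap<>();
pageTopKey.put(1, 0L);                              // page 1 starts at the lowest key

int pageSize = 100;
int pageNo = 1;
List<Entity> rows = entityManager.createQuery(
        "SELECT e FROM Entity e WHERE e.id >= :topKey ORDER BY e.id", Entity.class)
    .setParameter("topKey", pageTopKey.get(pageNo))
    .setMaxResults(pageSize + 1)                    // the extra row reveals the next page's key
    .getResultList();
if (rows.size() > pageSize) {
    pageTopKey.put(pageNo + 1, rows.get(pageSize).getId());
    rows = rows.subList(0, pageSize);
}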
In your case it would also be worth specifying the needed fields in JPQL: select e.a, e.b is considerably faster than selecting whole entities.
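For example (a sketch; a and b stand in for whatever columns you actually need):
// Sketch: fetch only the needed fields instead of whole 40-column entities.
List<Object[]> rows = entityManager.createQuery(
        "SELECT e.a, e.b FROM Entity e ORDER BY e.id", Object[].class)
    .setFirstResult(offset)
    .setMaxResults(limit)
    .getResultList();
// rows.get(i)[0] holds e.a, rows.get(i)[1] holds e.b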

Deleting 190 million records from Oracle

We have some useless historical data in a database which adds up to 190 million (19 crore) rows, contributing about 33 GB. Now I have been given the task of deleting these rows in one go, and if anything breaks, I should be able to roll back the transaction.
I will select them based on a flag like deleted = '1', which by my estimate matches 190 million out of 200 million rows. So first I have to do a select operation and then delete those ids.
As mentioned in this article, it takes 4 hours to delete 1.5 million records, which is far fewer than in my case, and I am wondering how much time it would take to delete 190 million records if I proceed with a single-delete approach.
Should I use Spring Batch to select the ids of the rows and then delete them batch by batch, or issue a single statement passing the ids in an IN clause?
Which would be the better approach? Please suggest.
Why not move the required data from the historical table to a new table and drop the old table entirely? You could rename the new table to the old table's name later on.
You can copy the required data from the historical table into a new table, drop the old table entirely, and rename the new table to the old table's name later, as Raj said in the post above. This is the best way to do it.
You can also use the NOLOGGING and PARALLEL options to speed it up, for example:
create table History_new parallel 4 nologging as
select /*+ parallel(History 4) */ * from History where col1 = 1 and ... ;
If doing it in Java is not mandatory, I'd create a PL/SQL procedure, open a cursor and use DELETE ... WHERE CURRENT OF. Maybe it's not super fast, but it's safe because you will have no rollback segment problems. A normal DELETE, even without an explicit transaction, is an atomic operation that must be rolled back if something fails.
Maybe what you describe is usual, normal performance for Java, but on my notebook deleting 1M records takes about a minute, without Java, of course.
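If the delete does have to be driven from Java, one common alternative is a chunked JDBC delete with a commit per chunk, sketched below. The deleted flag follows the question, but the table name "history" and the chunk size are placeholders, and note the trade-off: committing per chunk means a failure only rolls back the current chunk rather than the whole run.
// Sketch: delete in bounded chunks so undo usage stays small
// (assumes java.sql imports and valid url/user/password).
try (Connection con = DriverManager.getConnection(url, user, password)) {
    con.setAutoCommit(false);
    try (PreparedStatement ps = con.prepareStatement(
            "DELETE FROM history WHERE deleted = '1' AND ROWNUM <= 50000")) {
        int deleted;
        do {
            deleted = ps.executeUpdate();   // removes at most 50,000 matching rows per round
            con.commit();                   // keep each transaction small
        } while (deleted > 0);
    }
}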
If you want to do it properly, I'd say you should use partitions. First, convert the plain table(s) into partitioned one(s) with all the data in one (current) partition. Then prepare "historical" partitions and move the unnecessary data into them. After that you'll be ready to do anything: you can take that data offline (and restore it when needed), you'll be able to exclude it in seconds using EXCHANGE PARTITION, and so on.

Strange Cassandra ReadTimeoutExceptions, depending on which client is querying

I have a cluster of three Cassandra nodes with more or less default configuration. On top of that, I have a web layer consisting of two nodes for load balancing, both web nodes querying Cassandra all the time. After some time, with the data stored in Cassandra becoming non-trivial, one and only one of the web nodes started getting ReadTimeoutException on a specific query. The web nodes are identical in every way.
The query is very simple (? is a placeholder for a date, usually a few minutes before the current moment):
SELECT * FROM table WHERE time > ? LIMIT 1 ALLOW FILTERING;
The table is created with this query:
CREATE TABLE table (
user_id varchar,
article_id varchar,
time timestamp,
PRIMARY KEY (user_id, time));
CREATE INDEX articles_idx ON table(article_id);
When it times out, the client waits a bit more than 10s which, not surprisingly, is the timeout configured in cassandra.yaml for most connects and reads.
There are a couple of things that are baffling me:
the query only times out when one of the web nodes executes it: one of the nodes always fails, the other always succeeds.
the query returns instantaneously when I run it from cqlsh (although it seems to hit only one node when I run it from there)
there are other queries which take 2-3 minutes (a lot longer than the 10s timeout) that do not time out at all
I cannot trace the query in Java because it times out. Tracing the query in cqlsh didn't provide much insight. I'd rather not change the Cassandra timeouts as this is a production system, and I'd like to exhaust non-invasive options first. The Cassandra nodes all have plenty of heap, their heap is far from full, and GC times look normal.
Any ideas/directions will be much appreciated, I'm totally out of ideas. Cassandra version is 2.0.2, using com.datastax.cassandra:cassandra-driver-core:2.0.2 Java client.
A few things I noticed:
While you are using time as a clustering key, it doesn't really help you because your query is not restricting by your partition key (user_id). Cassandra only orders by clustering keys within a partition. So right now your query is pulling back the first row which satisfies your WHERE clause, ordered by the hashed token value of user_id. If you really do have tens of millions of rows, then I would expect this query to pull back data from the same user_id (or same select few) every time.
"although it seems it only hits one node when I run it from there" Actually, your queries should only hit one node when you run them. Introducing network traffic into a query makes it really slow. I think the default consistency in cqlsh is ONE. This is where Carlo's idea comes into play.
What is the cardinality of article_id? Remember, secondary indexes work the best on "middle-of-the-road" cardinality. High (unique) and low (boolean) are both bad.
The ALLOW FILTERING clause should not be used in (production) application-side code. Like ever. If you have 50 million rows in this table, then ALLOW FILTERING is first pulling all of them back, and then trimming down the result set based on your WHERE clause.
Suggestions:
Carlo might be on to something with the suggestion of trying a different (lower) consistency level. Try setting a consistency level of ONE in your application and see if that helps.
Either perform an ALLOW FILTERING query, or a secondary index query. They both suck, but definitely do not do both together. I would not use either. But if I had to pick, I would expect a secondary index query to suck less than an ALLOW FILTERING query.
To solve this adequately at the scale you are describing, I would duplicate the data into a query table, since it looks like you are concerned with organizing time-sensitive data and getting at the most-recent data. A query table like this should do it:
CREATE TABLE tablebydaybucket (
user_id varchar,
article_id varchar,
time timestamp,
day_bucket varchar,
PRIMARY KEY (day_bucket , time))
WITH CLUSTERING ORDER BY (time DESC);
Populate this table with your data, and then this query will work:
SELECT * FROM tablebydaybucket
WHERE day_bucket='20150519' AND time > '2015-05-19 15:38:49-0500' LIMIT 1;
This will partition your data by day_bucket and cluster your data by time. This way, you won't need ALLOW FILTERING or a secondary index. Also, your query is guaranteed to hit only one node, and Cassandra will not have to pull all of your rows back and apply your WHERE clause after the fact. And clustering on time in DESCending order helps your most-recent rows come back quicker.
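From the Java driver, that query could be issued with a prepared statement along these lines (a sketch; session setup is omitted, the bucket value is just an example, and the cut-off is computed instead of hard-coded):
// Sketch: bind the day bucket and the cut-off time instead of using ALLOW FILTERING.
PreparedStatement ps = session.prepare(
        "SELECT * FROM tablebydaybucket WHERE day_bucket = ? AND time > ? LIMIT 1");
Date fiveMinutesAgo = new Date(System.currentTimeMillis() - 5 * 60 * 1000);
Row newest = session.execute(ps.bind("20150519", fiveMinutesAgo)).one();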

Cassandra Read/Get Performance

My Cassandra table has following schema
CREATE TABLE cachetable1 (
id text,
lsn text,
lst timestamp,
PRIMARY KEY ((id))
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='{"keys":"ALL", "rows_per_partition":"ALL"}' AND
comment='' AND
dclocal_read_repair_chance=0.100000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.000000 AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
The above table contains 221 million rows (approx. 16 GB of data). The CassandraDaemon is running with 4GB heap space and I have configured 4GB of memory for the row cache. I am trying to run select queries from my Java code like this:
for (int i = 0; i < 1000; i++)
{
    int id = random.nextInt(20000000 - 0) + 0;
    for (int j = id; j <= id + 100; j++)
    {
        ls.add(j + "");
    }
    Statement s = QueryBuilder.select("lst", "lsn").from("ks1", "cachetable1")
                              .where(QueryBuilder.in("id", ls.toArray()));
    s.setFetchSize(100);
    ResultSet rs = sess.execute(s);
    List<Row> lsr = rs.all();
    for (Row rw : lsr)
    {
        //System.out.println(rw.toString());
        count++;
    }
    ls.clear();
}
In the above code, I am trying to fetch 0.1 million records, but the read/get performance is very bad: it takes 400-500 seconds to fetch 0.1 million rows. Is there a better way to read/get records from Cassandra through Java? Is some tuning required other than the row cache size and the Cassandra heap size?
You appear to want to retrieve your data in 100 row chunks. This sounds like a good candidate for a clustering column.
Change your schema to use an id as the partition key and a chunk index as a clustering column, i.e. PRIMARY KEY ( (id), chunk_idx ). When you insert the data, you will have to figure out how to map your single indexes into an id and chunk_idx (e.g. perhaps do a modulo 100 on one of your values to generate a chunk_idx).
Now when you query for an id and don't specify a chunk_idx, Cassandra can efficiently return all 100 rows with one disk read on the partition. And you can still do range queries and retrievals of single rows within the partition by specifying the chunk_idx if you don't always want to read a whole chunk of rows.
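A rough sketch of that layout, assuming an existing Session; the new table name and the mapping of your original numeric ids into id and chunk_idx are illustrative:
// Sketch: each partition holds one ~100-row chunk, e.g. id = originalId / 100,
// chunk_idx = originalId % 100.
session.execute("CREATE TABLE IF NOT EXISTS ks1.cachetable2 ("
        + "id text, chunk_idx int, lsn text, lst timestamp, "
        + "PRIMARY KEY ((id), chunk_idx))");
// One partition read returns the whole chunk of ~100 rows.
ResultSet chunk = session.execute(
        "SELECT chunk_idx, lst, lsn FROM ks1.cachetable2 WHERE id = ?", "12345");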
So your mistake is you are generating 100 random partition reads with each query, and this will hit all the nodes and require a separate disk read for each one. Remember that just because you are querying for sequential index numbers doesn't mean the data is stored close together, and with Cassandra it is exactly the opposite, where sequential partition keys are likely stored on different nodes.
The second mistake you are making is you are executing the query synchronously (i.e. you are issuing the query and waiting for the request to finish before you issue any more queries). What you want to do is use a thread pool so that you can have many queries running in parallel, or else use the executeAsync method in a single thread. Since your query is not efficient, waiting for the 100 random partition reads to complete is going to be a long wait, and a lot of the highly pipelined Cassandra capacity is going to be sitting there twiddling its thumbs waiting for something to do. If you are trying to maximize performance, you want to keep all the nodes as busy as possible.
Another thing to look into is using the TokenAwarePolicy when connecting to your cluster. This allows each query to go directly to a node that has a replica of the partition rather than to a random node that might have to act as a coordinator and get the data via an extra hop. And of course using consistency level ONE on reads is faster than higher consistency levels.
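A rough sketch combining those two points against the original single-key table (driver and policy imports are omitted, and ids stands for whatever collection of keys you need to read):
// Sketch: token-aware routing plus asynchronous per-key reads at consistency ONE.
Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")
        .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
        .build();
Session session = cluster.connect("ks1");

PreparedStatement ps = session.prepare("SELECT lst, lsn FROM cachetable1 WHERE id = ?");
ps.setConsistencyLevel(ConsistencyLevel.ONE);

List<ResultSetFuture> futures = new ArrayList<>();
for (String id : ids) {
    futures.add(session.executeAsync(ps.bind(id)));   // fire all requests without waiting
}
for (ResultSetFuture future : futures) {
    Row row = future.getUninterruptibly().one();      // collect results as they arrive
    // process row ...
}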
The row cache size and heap size are not the source of your problem, so that's the wrong path to go down.
I am going to guess that this is your culprit:
.where(QueryBuilder.in("id",ls.toArray()))
Use of the IN relation in the WHERE clause is widely known to be non-performant. In some cases, performing many parallel queries can be faster than using one IN query. From the DataStax SELECT documentation:
When not to use IN
...Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster with 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range.
So you have two options (assuming that living with this poor-performing query isn't one of them):
Rewrite your code to make multiple, parallel requests for each id.
Revisit your data model to see if you have another value that it makes sense to key your data by. For instance, if all of your ids in ls happen to share a common column value that is unique to them, that's a good candidate for a primary key. Basically, find another way to query all of the ids that you are looking for, and build a specific query table to support that.

processing a large number of database entries with paging slows down with time

I am trying to process millions of records from my table (size is about 30 GB) and I am currently doing it using paging (mysql 5.1.36). The query I use in my for loop is
select blobCol from large_table
where name= 'someKey' and city= 'otherKey'
order by name
LIMIT <pageNumber*pageSize>, <pageSize>
This works perfectly fine for about 500K records. I am using a page size of 5000, and after page 100 the queries start slowing down dramatically. The first ~80 pages are extracted in 2-3 seconds each, but after around page 130, each page takes about 30 seconds to retrieve, at least until page 200. One of my queries has about 900 pages and that would take too long.
The table structure is (type is MyISAM)
name char(11)
id int // col1 & col2 is a composite key
city varchar(80) // indexed
blobCol longblob
What can I do to speed it up? The EXPLAIN for the query shows this:
select_type: SIMPLE
possible_keys: city
key : city
type: ref
key_len: 242
ref: const
rows: 4293720
Extra: using where; using filesort
In case it helps, the my.cnf for my server (24 GB ram, 2 quad core procs) has these entries
key_buffer_size = 6144M
max_connections = 20
max_allowed_packet = 32M
table_open_cache = 1024
sort_buffer_size = 256M
read_buffer_size = 128M
read_rnd_buffer_size = 512M
myisam_sort_buffer_size = 128M
thread_cache_size = 16
tmp_table_size = 128M
max_heap_table_size = 64M
Here is what I did, which reduced the total execution time by a factor of 10.
What I realized from the execution plan of my original query was that it was using filesort to sort all results and ignoring the indexes. That is a bit of a waste.
My test database: 5M records, 20 GB in size. Table structure is the same as in the question.
Instead of getting blobCol directly in the first query, I first get the value of 'name' for the beginning of every page. Run this query repeatedly until it returns 0 results, and each time add the result to a list:
SELECT name
FROM my_table
where id = <anyId> // I use the id column for partitioning so I need this here
order by name
limit <pageSize * pageNumber>, 1
Since the page number is not known in advance, start with value 0 and keep incrementing until the query returns nothing. You could also do a SELECT COUNT(*), but that itself might take long and will not help optimize anything. Each query took about 2 seconds to run once the page number exceeded ~60.
For me, the page size was 5000, so I got a list of 'name' strings at positions 0, 5001, 10001, 15001 and so on. The number of pages turned out to be 1000, and storing a list of 1000 results in memory is not expensive.
Now, iterate through the list and run this query
SELECT blobCol
FROM my_table
where name >= <pageHeader>
and name < <nextPageHeader>
and city="<any string>"
and id= 1
This will run N times, where N = size of list obtained previously. Since 'name' is the primary key col, and 'city' is also indexed, EXPLAIN shows that this calculation is performed in memory using the index.
Now, each query takes 1 second to run, instead of the original 30-40. So combining the pre-processing time of 2 seconds per page, total time per page is 3-4 seconds instead of 30-40.
If anyone has a better solution or if there is something glaringly wrong with this one, please let me know
You can make your query more precise so that only LIMIT PageSize is needed, with no offset:
SELECT col1,col2, col4
FROM large_table
WHERE col1>"SomeKey" OR
(col1="SomeKey" AND col2>="OtherKey")
ORDER BY col1,col2
LIMIT PageSize
but update "SomeKey" and "OtherKey" after each database call.
I've tried the same in the past with an Oracle 10g database and got the same result (my table had 60 million rows). The first pages were retrieved quickly, but as the page number increased, the query got too slow.
There's not much you can do with the indexes, as they look correct, and I'm not sure what you can achieve by tuning the database configuration.
I guess I had different requirements, but the only solution I found was to dump data to files.
If you have a limited set of values for col1, you can get rid of col1 and generate n tables, one for each known value of col1. If col1 is unknown, then I don't know the solution to this.
You can retrieve small sets of data from very large tables, but retrieving large sets of data takes a lot of time and pagination doesn't help you at all. You have to preprocess by dumping to files or generating other tables to partition data.
