We have approximately 1 million keys in Couchbase (CB), and each key has an expiry time of 10 days.
Basically, every 10 days we run a scheduler to re-insert data into CB for these 1 million keys.
Out of these, approximately 0.5 million keys are never used, and there is no way to identify in advance which keys will be used and which will not.
I have checked whether there is a way to define two TTLs for a key:
1. Delete a record after a fixed time period (expiry).
2. Delete a record if the key has not been used for a certain time period.
Is there some way to delete these unused keys?
This is something that is easily implemented in the application tier. Say you want all new keys to be deleted after 90 days, and any keys that are unused for 30 days to be deleted.
When you create the document, add a creation timestamp field and set a TTL of 30 days. Then, when your application reads or updates a document, set a new TTL on the document calculated as the creation timestamp + 90 days.
After 30 days, any unused documents will have expired, while documents that have been accessed or updated will have a TTL of up to 60 additional days.
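A minimal sketch of this pattern with the Couchbase Java SDK 3.x (the connection details, bucket name, key, and the createdAt field name are placeholders for illustration, not part of the original answer):

import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;
import com.couchbase.client.java.kv.GetResult;
import com.couchbase.client.java.kv.UpsertOptions;
import java.time.Duration;
import java.time.Instant;

public class SlidingTtlExample {
    public static void main(String[] args) {
        // Placeholder connection details and bucket name.
        Cluster cluster = Cluster.connect("couchbase://localhost", "user", "password");
        Collection collection = cluster.bucket("my-bucket").defaultCollection();

        // On create: store a creation timestamp and set an initial 30-day TTL.
        JsonObject doc = JsonObject.create()
                .put("createdAt", Instant.now().toString())
                .put("payload", "some data");
        collection.upsert("key::123", doc,
                UpsertOptions.upsertOptions().expiry(Duration.ofDays(30)));

        // On read/update: extend the TTL so the document lives until creation + 90 days.
        GetResult result = collection.get("key::123");
        Instant createdAt = Instant.parse(result.contentAsObject().getString("createdAt"));
        Duration remaining = Duration.between(Instant.now(), createdAt.plus(Duration.ofDays(90)));
        collection.touch("key::123", remaining);

        cluster.disconnect();
    }
}

Documents that are never read keep the initial 30-day TTL and expire on their own; documents that are touched survive until 90 days after creation.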
Thanks,
Ian McCloy (Principal Product Manager, Couchbase)
Related
I want to run a timer operation: after submitting a form, if the data has been approved within 6 hours then it should be updated in the database; otherwise, the record should be deleted. How do I do it?
Maybe you can add a column for the submission time. Before someone approves the submission, you can verify that the time column has not expired. In addition, you can add a column indicating which rows are approved.
I would store a timestamp with each record, and write code to purge all the records older than six hours. I would run that code regularly (say every 15 minutes). At record creation, I could also schedule a task for 6 hours later that would run the code.
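A minimal sketch of that approach using plain JDBC and a ScheduledExecutorService; the connection string, the submissions table, and the submitted_at/approved columns are hypothetical, and the interval syntax varies by database:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PurgeUnapprovedJob {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Run the purge every 15 minutes.
        scheduler.scheduleAtFixedRate(PurgeUnapprovedJob::purge, 0, 15, TimeUnit.MINUTES);
    }

    static void purge() {
        // Delete submissions that are still unapproved after 6 hours.
        String sql = "DELETE FROM submissions "
                + "WHERE approved = FALSE AND submitted_at < NOW() - INTERVAL '6' HOUR";
        try (Connection con = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/app", "user", "password");
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.executeUpdate();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}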
I have a Couchbase bucket which has a number of documents. Over a period of time, I see that these documents are rapidly taking up a lot of storage space. I am now working on setting a TTL for all new documents that will be stored. Is there a way to set a TTL for all the existing documents, or to delete the existing documents based on expiry time? Different documents have different expiry times based on the document type (ranging from 15 minutes to 1 month). Please can you suggest an approach that I can use?
You can set the Expiry on a document and then update that document. Of course, you'd have to go through all the documents and set the Expiry for each.
I don't know how to do this in Java, but it's probably similar to .NET:
// get the document into a variable named 'doc', then
doc.Expiry = 123;
_bucket.Replace(doc);
If you only have a few well-known documents, then this should be easy.
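Since the question mentions Java, a rough equivalent with the Couchbase Java SDK 2.x might look like the sketch below; the document id and the already-opened bucket are placeholders:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;

// 'bucket' is an already-opened Bucket.
// Re-save the document with a new expiry (in seconds)...
JsonDocument doc = bucket.get("document-id");
JsonDocument withExpiry = JsonDocument.create(doc.id(), 123, doc.content());
bucket.upsert(withExpiry);

// ...or, if the content does not need to change, just update the expiry.
bucket.touch("document-id", 123);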
You can also use a N1QL query to retrieve documents based on expiry time. See this blog post for more information, but the gist is a query like this:
SELECT META(default).id, *
FROM default
WHERE DATE_DIFF_STR(STR_TO_UTC(exp_datetime),MILLIS_TO_UTC(DATE_ADD_MILLIS(NOW_MILLIS(),30,"second")),"second") < 30
AND STR_TO_UTC(exp_datetime) IS NOT MISSING;
This would select documents that will expire in less than 30 seconds, so you could write a N1QL DELETE query that uses a similar WHERE clause.
UPDATE: A coworker at Couchbase pointed me to issue MB-16242. You can't set the expiry with a N1QL UPDATE yet. But as I said above, you can SELECT/DELETE documents based on the expiry.
I am working with sensor data (time series). The number of columns in the table is 3000,
for example: nodeid, timestamp, sen1, sen2, ..., sen-n. nodeid and timestamp form the primary key, with clustering order by timestamp.
The number of records is 10000.
When a SELECT query for a single column (SELECT timestamp, sen1 FROM <table>) is requested through the Cassandra DataStax Java driver 3.0, it takes 15 seconds to respond; i.e. if I want to read all the tags, one tag at a time, all 3000 tags require 3000 * 15 sec = approximately 12 to 13 hours. This is on a single-node cluster with 16 GB RAM.
I allocated 10 GB to the JVM, but the response time has not changed. I used LeveledCompactionStrategy at table creation time.
Hardware: Intel Core i7, a normal hard disk (not SSD), 8 GB RAM.
How can I reduce the read/query time on this single-node cluster?
Obviously, there is a problem with the data modelling. IMO, a table with 3000 columns is bad. If your use case is like "SELECT timestamp, sen1 FROM <table>", then you should model it with Primary Key(timestamp, sensor_id).
With "SELECT timestamp, sen1" in your current model, Cassandra will still read all the other column values from disk into memory.
I am not sure what 'nodeId' is in your case; I hope it's not a Cassandra node id.
(SELECT timestamp, sen1 FROM table)
This is like getting all the data at once (in your case 10000 records).
So selecting 1 column or 3000 columns will make the Cassandra server read through all the SSTables; the point is that it won't take 12 or 13 hours.
Still, 15 seconds seems unbelievable. Did you also include the network latency and the client-side write in this measurement?
As mentioned in one of the answers, your model seems bad. (If you put timestamp as the partition key, the data becomes too sparse and getting a range of data will need to read from more than one partition. If you use only node_id as the partition key, the partition will host too much data and can cross the Cassandra limit of 2 billion cells per partition.) My advice is:
Redesign your partition key. Please check this tutorial for a start.
https://academy.datastax.com/resources/getting-started-time-series-data-modeling
Add more nodes and increase the replication factor to see better read latencies.
Try to design your read query so that it reads from only one partition at a time, e.g.: SELECT * FROM table WHERE sensor_node_id = 'abc' AND year = 2016 AND month = 'June' (see the sketch below).
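A rough sketch of a single-partition read using the DataStax Java driver 3.x; the keyspace (assumed to already exist), table name, and the (sensor_id, month) partition-key layout are illustrative assumptions, not the only valid design:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SensorReadSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        try {
            // One row per reading; the compound partition key (sensor_id, month)
            // keeps each query on a single, bounded partition.
            session.execute("CREATE TABLE IF NOT EXISTS sensors.readings ("
                    + " sensor_id text, month text, ts timestamp, value double,"
                    + " PRIMARY KEY ((sensor_id, month), ts))"
                    + " WITH CLUSTERING ORDER BY (ts DESC)");

            // Reads exactly one partition: one sensor, one month.
            ResultSet rs = session.execute(
                    "SELECT ts, value FROM sensors.readings WHERE sensor_id = ? AND month = ?",
                    "sen1", "2016-06");
            for (Row row : rs) {
                System.out.println(row.getTimestamp("ts") + " -> " + row.getDouble("value"));
            }
        } finally {
            cluster.close();
        }
    }
}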
Hope this helps!
I am working on a project where I have to report hourly unique visitors per source; that is, I have to calculate the unique visitors for each source for each hour. Visitors are identified by a unique id. What should the design be so that the calculation of hourly unique visitors is efficient, considering the data is on the order of 20k entries per 8 hours?
At present I am using sourceid+visitorid as the row key.
Let's start by saying that 2.5k entries per hour is a pretty low volume of data (not even 1 per second). Unless you want to scale massively, your project would be easily achievable with a single SQL server.
Anyway, you have 2 options:
1. Non-realtime
Log every visitorid+source and run a job (like MapReduce) to analyze the data every hour, or every day depending on your needs. In this case you can even completely avoid HBase and just stick to Hadoop. You can log the data to a different file each hour, process it afterwards and store the results in SQL (or in HBase if you wish). Performance-wise this would be the best approach.
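As an illustration (the log format and file path are hypothetical), an hourly batch job could be as simple as the Java sketch below; for larger volumes the same aggregation maps naturally onto a MapReduce job:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.stream.Stream;

// Assumed log format: one "hour,source,visitorId" line per visit,
// e.g. "2016-06-01T13,google,visitor42".
public class HourlyUniques {
    public static void main(String[] args) throws Exception {
        Map<String, Set<String>> uniques = new HashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            lines.forEach(line -> {
                String[] parts = line.split(",");
                // key = hour + source, value = set of distinct visitor ids
                uniques.computeIfAbsent(parts[0] + "|" + parts[1], k -> new HashSet<>())
                       .add(parts[2]);
            });
        }
        uniques.forEach((key, visitors) ->
                System.out.println(key + " -> " + visitors.size() + " unique visitors"));
    }
}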
2. Realtime
Track the data in realtime by making use of HBase counters; in this case I'd consider using two tables:
Table unique_users: to track the last time a visitorid has visited the site (the rowkey would be visitorid+source or just visitorid, depending on whether a visitor id can have different sources or just one). This table can have a TTL of 3600 seconds if you want to automatically discard old data as soon as you can, but I would keep a few days of data.
Table date_source_stats: to track the unique visitorid per source per hour. This table can have a TTL of a few weeks or even years depending on your retention requirements.
When a visitor enters your site you read the unique_users table to check the last access date, if that date is older than 1 hour consider it a new visit and increment the counter for the date+hour+sourceid combination in the date_source_stats table. Afterwards, update the unique_users to set the last visit time to the current time.
That way, you can easily retrieve all the unique visits for a particular date+hour with a scan and get all the sources. You may also consider a source_date_stats table in case you want to perform queries for a specific source, e.g., an hourly report for the last 7 days for source X (you can even store all the stats in the same table by using different rowkeys).
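A rough sketch of this flow with the HBase Java client; the table and column-family names are placeholders, and an already-open HBase Connection is assumed:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import java.text.SimpleDateFormat;
import java.util.Date;

public class VisitTracker {
    private static final byte[] CF = Bytes.toBytes("d");

    // Assumes the two tables described above already exist.
    public static void recordVisit(Connection conn, String visitorId, String sourceId)
            throws Exception {
        long now = System.currentTimeMillis();
        try (Table uniqueUsers = conn.getTable(TableName.valueOf("unique_users"));
             Table stats = conn.getTable(TableName.valueOf("date_source_stats"))) {

            byte[] userRow = Bytes.toBytes(visitorId + "|" + sourceId);
            Result last = uniqueUsers.get(new Get(userRow));
            byte[] lastSeen = last.getValue(CF, Bytes.toBytes("last_seen"));

            // Count a new unique visit if never seen, or last seen more than an hour ago.
            if (lastSeen == null || now - Bytes.toLong(lastSeen) > 3_600_000L) {
                String hourBucket = new SimpleDateFormat("yyyy-MM-dd-HH").format(new Date(now));
                byte[] statsRow = Bytes.toBytes(hourBucket + "|" + sourceId);
                stats.incrementColumnValue(statsRow, CF, Bytes.toBytes("uniques"), 1L);
            }

            // Record the latest access time for this visitor.
            Put put = new Put(userRow);
            put.addColumn(CF, Bytes.toBytes("last_seen"), Bytes.toBytes(now));
            uniqueUsers.put(put);
        }
    }
}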
Please notice a few things about this approach:
I've not been too detailed about the schemas; let me know if you need me to be.
I would also store the total visits in another counter (which would always be incremented, regardless of whether the visit is unique or not); it's a useful value.
This proposal can easily be extended as much as you want to also track daily, weekly, and even monthly unique visitors; you'll just need more counters and rowkeys: date+sourceid, month+sourceid... In this case you can have multiple column families with distinct TTL properties to adjust the retention policy of each set.
This proposal could face hotspotting issues due to rowkeys being sequential if you have thousands of requests per second; you can read more about it here.
An alternative approach for date_source_stats could be to opt for a wide design in which you have just a sourceid as rowkey and the date_hour as columns.
I'm using Hibernate 4 to manage all the database connections.
In the table I'm creating, I'd like to keep only the last 24 hours of data for statistics calculation.
Is there a way to automatically delete older data from the table (obviously there's a field EVENTDATA of type DATETIME), or do I have to do this manually every x minutes?
You could use job scheduling with a cron trigger to achieve this. If you use a cron expression such as 0 0 0 * * ? (Quartz syntax), the delete job will fire every night at 00:00.
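A minimal sketch of such a job using Quartz together with Hibernate; the entity and property names (StatisticEvent, eventData) and the SessionFactory wiring via the JobDataMap are assumptions:

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

public class PurgeOldStatsJob implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        // The SessionFactory is expected to be placed in the JobDataMap when the job is scheduled.
        SessionFactory sessionFactory = (SessionFactory)
                context.getJobDetail().getJobDataMap().get("sessionFactory");
        Session session = sessionFactory.openSession();
        try {
            session.beginTransaction();
            // Delete rows whose EVENTDATA timestamp is older than 24 hours.
            session.createQuery("delete from StatisticEvent e where e.eventData < :cutoff")
                   .setParameter("cutoff",
                           new java.util.Date(System.currentTimeMillis() - 24L * 60 * 60 * 1000))
                   .executeUpdate();
            session.getTransaction().commit();
        } finally {
            session.close();
        }
    }
}

The job would then be registered with a Quartz CronTrigger built from the cron expression above (e.g. CronScheduleBuilder.cronSchedule("0 0 0 * * ?")).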