Couchbase - deleting old documents based on TTL - java

I have a Couchbase bucket with a number of documents. Over a period of time, I have seen these documents rapidly take up a lot of storage space. I am now working on setting a TTL for all new documents that will be stored. Is there a way to set a TTL for all the existing documents, or to delete existing documents based on an expiry time? Different documents have different expiry times depending on the document type (ranging from 15 minutes to 1 month). Can you suggest an approach I can use?

You can set the Expiry on a document and then update that document. Of course, you'd have to go through all the documents and set the Expiry for each.
I don't know how to do this in Java, but it's probably similar to .NET:
// get the document into a variable named 'doc', then
doc.Expiry = 123;
_bucket.Update(doc);
If you only have a few well-known documents, then this should be easy.
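With the Java SDK, a rough sketch (assuming the 2.x API; the bucket and document names are placeholders) would look something like this - touch() sets only the expiry, while an upsert() with an expiry rewrites the document:
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;

public class SetExpiryExample {
    public static void main(String[] args) {
        Cluster cluster = CouchbaseCluster.create("localhost");
        Bucket bucket = cluster.openBucket("default");

        // Option 1: touch() only resets the expiry (in seconds); the content is untouched.
        bucket.touch("some-doc-id", 15 * 60); // expire in 15 minutes

        // Option 2: re-upsert the document with an expiry.
        JsonDocument doc = bucket.get("some-doc-id");
        bucket.upsert(JsonDocument.create(doc.id(), 15 * 60, doc.content()));

        cluster.disconnect();
    }
}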
You can also use a N1QL query to retrieve documents based on expiry time. See this blog post for more information, but the gist is a query like this:
SELECT META(default).id, *
FROM default
WHERE DATE_DIFF_STR(STR_TO_UTC(exp_datetime),MILLIS_TO_UTC(DATE_ADD_MILLIS(NOW_MILLIS(),30,"second")),"second") < 30
AND STR_TO_UTC(exp_datetime) IS NOT MISSING;
This would select documents that will expire in less than 30 seconds, so you could write a N1QL DELETE query that uses a similar WHERE clause.
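As a rough sketch (reusing the exp_datetime field and the default bucket from the query above, and again assuming the 2.x Java SDK), such a DELETE could be run like this:
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.query.N1qlQuery;
import com.couchbase.client.java.query.N1qlQueryResult;

public class DeleteExpiredExample {
    public static void main(String[] args) {
        Bucket bucket = CouchbaseCluster.create("localhost").openBucket("default");
        // Delete documents whose application-level exp_datetime is already in the past.
        N1qlQueryResult result = bucket.query(N1qlQuery.simple(
            "DELETE FROM default " +
            "WHERE STR_TO_UTC(exp_datetime) IS NOT MISSING " +
            "AND STR_TO_UTC(exp_datetime) < MILLIS_TO_UTC(NOW_MILLIS())"));
        System.out.println("Delete succeeded: " + result.finalSuccess());
    }
}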
UPDATE: A coworker at Couchbase pointed me to issue MB-16242. You can't set the expiry with a N1QL UPDATE yet. But as I said above, you can SELECT/DELETE documents based on the expiry.

Related

Update item without modifying the expiry

Is there a way to update an item in Couchbase without altering its expiration time? I am using the Java SDK and Couchbase 3.
No, this is not possible right now. The simple reason is that the underlying protocol does not allow for it - every time the document is modified, its expiration time is reset.
The only reasonable workaround I can think of right now applies when your expiration times are long and a small change won't matter: when you create a view, you can grab the TTL as part of the meta information. So you load the current TTL and write the new document with this TTL (perhaps even subtracting the time your business processing took). This only approximates it (and it can also work with N1QL).
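A rough sketch of that workaround with the Java SDK (the design document and view names are made up, and it assumes a view whose map function emits meta.expiration, which views expose as an absolute Unix timestamp):
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewResult;
import com.couchbase.client.java.view.ViewRow;

public class PreserveTtlExample {
    public static void main(String[] args) {
        Bucket bucket = CouchbaseCluster.create("localhost").openBucket("default");
        // Hypothetical view "expiry" in design doc "docs" whose map function is:
        //   function (doc, meta) { emit(meta.id, meta.expiration); }
        ViewResult result = bucket.query(ViewQuery.from("docs", "expiry").key("some-doc-id"));
        for (ViewRow row : result) {
            int absoluteExpiry = ((Number) row.value()).intValue(); // Unix timestamp, 0 = no expiry
            JsonDocument doc = bucket.get("some-doc-id");
            doc.content().put("some_field", "new value"); // the modification you want to make
            // Expiry values above 30 days are treated as absolute timestamps,
            // so passing the old value back roughly preserves the expiration.
            bucket.upsert(JsonDocument.create(doc.id(), absoluteExpiry, doc.content()));
        }
    }
}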

Dynamodb Range query gives limited number of results

I'm trying to implement an application using the Google Guice framework with a DynamoDB database.
I have implemented an API for finding documents by a range query, i.e. a time period. When I query by month it returns a limited number of documents (3,695), and when I query again by start time and end time it returns the same number of documents, which does not contain newly created documents.
How can I implement the API so that it is not affected by this limitation of the application or of DynamoDB?
DynamoDB responses are limited to 1 MB per page. When your result set is bigger, you only get the results that fit within the first 1 MB.
The docs describe how to use the metadata of the response to see the real number of results, the starting key, and so on, and how to query the whole result set in batches/pages:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html#Pagination
Important excerpt of the docs:
If you query or scan for specific attributes that match values that
amount to more than 1 MB of data, you'll need to perform another Query
or Scan request for the next 1 MB of data. To do this, take the
LastEvaluatedKey value from the previous request, and use that value
as the ExclusiveStartKey in the next request. This will let you
progressively query or scan for new data in 1 MB increments.
When the entire result set from a Query or Scan has been processed,
the LastEvaluatedKey is null. This indicates that the result set is
complete (i.e. the operation processed the “last page” of data).
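A minimal Java sketch of that pagination loop (table, key, and attribute names are placeholders, and it assumes the AWS SDK for Java v1 DynamoDB client):
import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;

public class PaginatedQueryExample {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":pk", new AttributeValue().withS("2016-01"));          // e.g. the month bucket
        values.put(":start", new AttributeValue().withN("1451606400000")); // period start (millis)
        values.put(":end", new AttributeValue().withN("1454284800000"));   // period end (millis)

        Map<String, AttributeValue> lastKey = null;
        int total = 0;
        do {
            QueryRequest request = new QueryRequest()
                    .withTableName("documents")
                    .withKeyConditionExpression("periodKey = :pk AND createdAt BETWEEN :start AND :end")
                    .withExpressionAttributeValues(values)
                    .withExclusiveStartKey(lastKey);        // null on the first page
            QueryResult result = client.query(request);
            total += result.getItems().size();              // process this (up to) 1 MB page
            lastKey = result.getLastEvaluatedKey();          // null once the last page was read
        } while (lastKey != null);

        System.out.println("Total items across all pages: " + total);
    }
}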

HBase Table Design for maintaining hourly visitors count per source

I am working on a project where I have to report the hourly unique visitors per source. That is I have to calculate unique visitors for each source for each hour. Visitors are identified by a unique id. What should be the design so that calculation of hourly unique visitors is efficient considering the data is of the order of 20k entries per 8 hours.
At present I am using sourceid+visitorid as the row key.
Let's start by saying that 2,500 entries per hour is a pretty low volume of data (not even 1 per second). Unless you want to scale massively, your project would be easily achievable with a single SQL server.
Anyway, you have 2 options:
1. Non-realtime
Log every visitorid+source and run a job (like MapReduce) to analyze the data every hour, or every day, depending on your needs. In this case you can even avoid HBase completely and just stick to Hadoop. You can log the data to a different file each hour, process it afterwards and store the results in SQL (or in HBase if you wish). Performance-wise this would be the best approach.
2. Realtime
Track the data in real time by making use of HBase counters. In this case I'd consider using 2 tables:
Table unique_users: to track the last time a visitorid has visited the site (the rowkey would be visitorid+source, or just visitorid, depending on whether a visitor id can have different sources or just one). This table can have a TTL of 3600 seconds if you want to automatically discard old data as soon as you can, but I would keep a few days of data.
Table date_source_stats: to track the unique visitorid per source per hour. This table can have a TTL of a few weeks or even years depending on your retention requirements.
When a visitor enters your site, read the unique_users table to check the last access date; if that date is more than 1 hour old, consider it a new visit and increment the counter for the date+hour+sourceid combination in the date_source_stats table. Afterwards, update unique_users to set the last visit time to the current time.
That way, you can easily retrieve all the unique visits for a particular date+hour with a scan and get all the sources. You may also consider a source_date_stats table in case you want to perform queries for a specific source, e.g., an hourly report for the last 7 days for source X (you can even store all the stats in the same table by using different rowkeys).
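A rough sketch of that real-time flow with the HBase Java client (table names, the column family, and the rowkey layouts are placeholders for the design above):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VisitTracker {

    private static final byte[] CF = Bytes.toBytes("d");
    private static final byte[] LAST = Bytes.toBytes("last");
    private static final byte[] UNIQUES = Bytes.toBytes("uniques");

    public static void trackVisit(Connection conn, String visitorId, String sourceId,
                                  String dateHour) throws Exception {
        try (Table uniqueUsers = conn.getTable(TableName.valueOf("unique_users"));
             Table stats = conn.getTable(TableName.valueOf("date_source_stats"))) {

            byte[] userRow = Bytes.toBytes(visitorId + "|" + sourceId);
            Result last = uniqueUsers.get(new Get(userRow));
            long lastVisit = last.isEmpty() ? 0L : Bytes.toLong(last.getValue(CF, LAST));

            long now = System.currentTimeMillis();
            if (now - lastVisit > 3600_000L) {
                // First visit within the hour: bump the unique counter for date+hour+source.
                byte[] statsRow = Bytes.toBytes(dateHour + "|" + sourceId);
                stats.incrementColumnValue(statsRow, CF, UNIQUES, 1L);
            }
            // Always record the latest access time for this visitor+source.
            Put put = new Put(userRow);
            put.addColumn(CF, LAST, Bytes.toBytes(now));
            uniqueUsers.put(put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            trackVisit(conn, "visitor-42", "source-A", "2016021114"); // yyyyMMddHH
        }
    }
}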
Please note a few things about this approach:
I've not been too detailed about the schemas; let me know if you need me to be.
I would also store total visits in another counter (which would always be incremented, regardless of whether the visit is unique or not); it's a useful value.
This proposal can be easily extended as much as you want to also track daily, weekly, and even monthly unique visitors; you'll just need more counters and rowkeys: date+sourceid, month+sourceid, and so on. In this case you can have multiple column families with distinct TTL properties to adjust the retention policy of each set.
This proposal could face hotspotting issues due to rowkeys being sequential if you have thousands of requests per second; you can read more about it here.
An alternative approach for date_source_stats could be to opt for a wide design in which you have just a sourceid as rowkey and the date_hour as columns.

Partial update on field that is not indexed

Let's consider the following situation: there are two fields in an "article" document - content (string) and views (int). The views field is not indexed. The views field contains how many times this article has been read.
From the official docs:
We also said that documents are immutable: they cannot be changed,
only replaced. The update API must obey the same rules. Externally, it
appears as though we are partially updating a document in place.
Internally, however, the update API simply manages the same
retrieve-change-reindex process that we have already described.
But what if we do a partial update of a non-indexed field - will Elasticsearch reindex the entire document? For example, I want to update views every time someone reads an article. If the entire document is reindexed, I can't do real-time updates (as it's too heavy an operation), so I would have to work with a delay, for example updating all the articles the visitors have read every 3-5-10 minutes. Or am I misunderstanding something?
But what if we do a partial update of a non-indexed field - will Elasticsearch reindex the entire document?
Yes, whilst the views field is not indexed individually it is part of the _source field. The _source field contains the original JSON you sent to Elasticsearch when you indexed the document and is returned in the results if there is a match on the document during a search. The _source field is indexed with the document in Lucene. In your update script you are changing the _source field so the whole document will be re-indexed.
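For illustration, a partial update of just the views field (a sketch assuming the pre-5.x Java transport client; index, type, and field names are made up) still causes the whole document to be reindexed internally:
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.xcontent.XContentFactory;

public class UpdateViewsExample {
    // 'client' is an already-connected transport client.
    public static void setViews(Client client, String articleId, int newViewCount) throws Exception {
        UpdateRequest request = new UpdateRequest("articles", "article", articleId)
                .doc(XContentFactory.jsonBuilder()
                        .startObject()
                        .field("views", newViewCount)
                        .endObject());
        // Only 'views' is sent over the wire, but Elasticsearch merges it into _source
        // and reindexes the whole document.
        client.update(request).get();
    }
}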
Could you then evaluate the following strategy: every time someone reads the article I send an update to Elasticsearch, but I set refresh_interval to 30 seconds. Will this strategy be acceptable if about 1000 users read one article during a 30-second interval?
You are still indexing the 1000 documents: 1 document will be indexed as the current document, and 999 documents will be indexed, marked as deleted, and removed from the index during the next Lucene merge.

ElasticSearch: creating new inverted-index after every update

I'm stuck on one question in my understanding of the Elasticsearch indexing process. I've already read this article, which says that the inverted index stores all tokens of all documents and is immutable, so to update it we must remove it and reindex all the data to keep every document searchable.
But I've also read about partially updating documents (automatically marking the old version as "deleted" and inserting+indexing a new one). In that article there was no mention of reindexing all the previous data.
So I do not properly understand the following: when I update a document (a text document with 100,000 words) and I already have some other indexed documents in storage, is it true that every UPDATE or INSERT operation will trigger a reindexing process of all my documents?
Basically I rely on the default Elasticsearch settings (5 primary shards with one replica per shard and 2 nodes in the cluster).
You can just have that one document updated (that is, reindexed, which is basically the same as removing it from the index and adding it again); see: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/update-doc.html Elasticsearch takes care of the index for you, so you won't need to reindex every other document.
I'm not sure what you mean by a "save" operation; you may want to clarify it with an example.
As for the time required to update a document of 100K words, I suggest you try it out.
