Cassandra: How long does a page exist? - java

I paginate through a large collection of data (circa 500,000,000 rows) using PagingState and do some business intelligence during this process. To be able to resume the process, I created this table:
/**
* This table stores temporary paging state
*/
CREATE TABLE IF NOT EXISTS lp_operations.paging_state (
id text, // ID of process
pos bigint, // current position
page text, // paging state
info text, // info json
finished tinyint, // finished
PRIMARY KEY (id)
) WITH default_time_to_live = 28800; // 8 hours
…in which I store the current page (the string representation of the PagingState) and the JSON metadata associated with the calculation.
Questions
Can the 'page' value (the paging state) expire in Cassandra?
How long does it exist (by default)?

No, the Cassandra driver's paging state will not expire.
Every time you query with a paging state, Cassandra executes your query again; it does not store your result set. The paging state just tells Cassandra the position from which the driver wants to continue reading.
Due to internal implementation details, PagingState instances are not portable across native protocol versions. This could become a problem in the following scenario:
you’re using the driver 2.0.x and Cassandra 2.0.x, and therefore native protocol v2;
a user bookmarks a link to your web service that contains a serialized paging state;
you upgrade your server stack to use the driver 2.1.x and Cassandra 2.1.x, so you’re now using protocol v3;
the user tries to reload their bookmark, but the paging state was serialized with protocol v2, so trying to reuse it will fail.
Source: http://docs.datastax.com/en/developer/java-driver/3.2/manual/paging/
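A minimal sketch of the save/resume cycle with the DataStax Java driver 3.x; the table name big_table and the loadSavedState/persistState helpers are hypothetical placeholders for the poster's own paging_state bookkeeping:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PagingState;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class ResumablePaging {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            Statement stmt = new SimpleStatement("SELECT * FROM lp_operations.big_table");
            stmt.setFetchSize(5000); // page size

            String saved = loadSavedState(); // hypothetical: read 'page' from the paging_state table
            if (saved != null) {
                stmt.setPagingState(PagingState.fromString(saved));
            }

            ResultSet rs = session.execute(stmt);
            int remaining = rs.getAvailableWithoutFetching();
            for (Row row : rs) {
                // ... business intelligence on the row ...
                if (--remaining == 0) {
                    break; // stop at the end of the current page
                }
            }

            // The state of the *next* page; persist its string form to resume later
            PagingState next = rs.getExecutionInfo().getPagingState();
            if (next != null) {
                persistState(next.toString()); // hypothetical: write 'page' back to paging_state
            }
        }
    }

    private static String loadSavedState() { return null; }
    private static void persistState(String state) { /* write to lp_operations.paging_state */ }
}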

Related

How to evolve schema in Janusgraph?

I uploaded movie and user data to JanusGraph and initially created an index on movieId, but later I realised I need to index the movie title as well. I need to query by movie title, and without an index on it I get the warning "Query requires iterating over all vertices". So I added this code:
JanusGraphManagement mgmt = graph.openManagement();
PropertyKey title = mgmt.getPropertyKey("title");
JanusGraphManagement.IndexBuilder movieNameIndexBuilder = mgmt.buildIndex("title", Vertex.class)
.addKey(title);
movieNameIndexBuilder.unique();
JanusGraphIndex movieTitleIndex = movieNameIndexBuilder.buildCompositeIndex();
mgmt.setConsistency(movieTitleIndex, ConsistencyModifier.LOCK);
mgmt.commit();
I'm still getting the same warning, "Query requires iterating over all vertices", when I query on movie title.
Thank you
Got the solution from Janusgraph gitter channel:
The index isn't available immediately if the indexed property key was created in a previous management transaction, because JanusGraph might need to reindex existing data first. That is a process you have to trigger manually. You can read more about this in the Index Management chapter of the docs.
That is why it's recommended to create all indices in the same transaction where you create the property keys, if possible.
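A rough sketch of the manual reindex step mentioned above, using the JanusGraph management API and assuming the index was named "title" as in the question (error handling omitted):
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.schema.JanusGraphManagement;
import org.janusgraph.core.schema.SchemaAction;
import org.janusgraph.core.schema.SchemaStatus;
import org.janusgraph.graphdb.database.management.ManagementSystem;

public class ReindexTitle {
    public static void reindexTitle(JanusGraph graph) throws Exception {
        // Wait until the new index is at least REGISTERED across the cluster
        ManagementSystem.awaitGraphIndexStatus(graph, "title")
                .status(SchemaStatus.REGISTERED)
                .call();

        // Trigger a reindex job so vertices created before the index existed are picked up
        JanusGraphManagement mgmt = graph.openManagement();
        mgmt.updateIndex(mgmt.getGraphIndex("title"), SchemaAction.REINDEX).get();
        mgmt.commit();

        // Only once the index is ENABLED will queries stop iterating over all vertices
        ManagementSystem.awaitGraphIndexStatus(graph, "title")
                .status(SchemaStatus.ENABLED)
                .call();
    }
}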

How can I list database tables in a set of databases using the new DBCPConnectionPoolLookup in NiFi?

As of NiFi 1.7.1, the new DBCPConnectionPoolLookup enables dynamic selection of database connections: set a database.name attribute on a FlowFile, and when a consuming processor accesses a configured DBCPConnectionPoolLookup controller service, the value of that attribute is used to look up a connection via the lookup's configured properties, which map the possible attribute values to DBCPConnectionPool controller services.
I'd like to list the tables in each database that I've configured in the lookup, but the ListDatabaseTables processor does not accept incoming FlowFiles. This seems to mean that it's not usable for listing tables in a dynamic set of databases.
What is the best way to accomplish this?
ListDatabaseTables uses the JDBC API for getting table info from the metadata of an established JDBC connection. This hides the underlying method of how to actually get tables from a particular database.
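To illustrate what that JDBC metadata lookup boils down to, here is a minimal Java sketch; the connection URL and credentials are hypothetical, and inside NiFi the connection would come from the selected DBCPConnectionPool instead:
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class ListTables {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection; NiFi obtains this from the looked-up DBCPConnectionPool
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "pass")) {
            DatabaseMetaData meta = conn.getMetaData();
            // null catalog/schema patterns plus "%" return every table visible to the user
            try (ResultSet tables = meta.getTables(null, null, "%", new String[] {"TABLE"})) {
                while (tables.next()) {
                    System.out.println(tables.getString("TABLE_NAME"));
                }
            }
        }
    }
}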
If all your databases are of the same type and you have a list of them, you could generate one flow file per database, filling in the database.name attribute, then use ExecuteSQL with the DBCPConnectionPoolLookup to execute the corresponding SQL statement to list the tables for that database, such as SHOW TABLES. You can parse the records using any of the record-aware processors such as QueryRecord, UpdateRecord, ConvertRecord, etc., and if you need one table per flow file you can use SplitRecord. If the output is JSON, CSV, or XML, you could use EvaluateJsonPath, ExtractText, or EvaluateXPath respectively to get the table name into an attribute and continue on from there.
I wrote up NIFI-5519 to cover the proposal for ListDatabaseTables to optionally accept incoming connections, in the meantime you'd need 1 ListDatabaseTables instance to correspond to each of your DBCPConnectionPool instances.

No response with a query by ID on Azure DocumentDB

I'm currently facing very slow or no responses on a collection when looking up by ID. I have ~2 million documents in a partitioned collection. If I look up a document using the partitionKey and id, the response is immediate:
SELECT * FROM c WHERE c.partitionKey=123 AND c.id="20566-2"
if I try using only the id
SELECT * FROM c WHERE c.id="20566-2"
the response never returns; the Java client seems frozen, and I see the same behaviour using the Data Explorer in the Azure Portal. I also tried looking up by another field that is neither the id nor the partitionKey, and that query always returns. When I run the select from the Java client I always set the flag to enable cross-partition queries.
The next thing to try is avoiding the character "-" in the ID, to test whether that character blocks the query (although I didn't find anything about it in the documentation).
The issue is related to your Java code. The Azure DocumentDB Java SDK wraps the DocumentDB REST APIs; according to the REST API reference for Query Documents, and as @DanCiborowski-MSFT said, the header x-ms-documentdb-query-enablecrosspartition explains the cause of your issue, as described below.
Header: x-ms-documentdb-query-enablecrosspartition
Required/Type: Optional/Boolean
Description: If the collection is partitioned, this must be set to True to allow execution across multiple partitions. Queries that filter against a single partition key, or against single-partitioned collections do not need to set the header.
So you need to set it to True to enable querying across multiple partitions when there is no partitionKey in the WHERE clause, by passing an instance of the FeedOptions class to the queryDocuments method, as below.
FeedOptions queryOptions = new FeedOptions();
queryOptions.setEnableCrossPartitionQuery(true); // Enable query across multiple partitions
String collectionLink = collection.getSelfLink();
FeedResponse<Document> queryResults = documentClient.queryDocuments(
collectionLink,
"SELECT * FROM c WHERE c.id='20566-2'", queryOptions);

How to query two sesame repositories at a time?

I have two repositories in Sesame: one has the whole data set, and the other has data with a few fields that link to the primary data.
Example:
Primary Data Fields:
uri, skos:prefLabel, skos:altLabel, etc.
Secondary Data Fields:
uri, customField
So basically I want to query the secondary data on customField, which returns a uri that can then be mapped to the primary data to get the other details.
So they are linked data sets.
Is it possible to query linked repositories that both live in Sesame in a single query?
Using SPARQL 1.1 SERVICE queries
SPARQL 1.1 supports the SERVICE clause, which allows you to combine results from multiple SPARQL endpoints in a single query result. Because Sesame Server exposes every repository as a SPARQL endpoint, you can use this to do queries over multiple repositories.
For example, say you have a Sesame Server running at http://localhost:8080/openrdf-sesame with two repositories, Primary and Secondary. The SPARQL query endpoints for both repositories are http://localhost:8080/openrdf-sesame/repositories/Primary and http://localhost:8080/openrdf-sesame/repositories/Secondary, respectively.
You can execute a SPARQL query on one repository (say, Primary) that then refers to the other one inside the query, like this:
# skos prefix added; replace the default ':' prefix with your own namespace
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX : <http://example.org/>

SELECT *
WHERE {
  # data from the Primary dataset
  ?uri a skos:Concept ;
       skos:prefLabel ?prefLabel ;
       skos:altLabel ?altLabel .
  # data from the Secondary dataset
  SERVICE <http://localhost:8080/openrdf-sesame/repositories/Secondary> {
    ?uri :customField ?customFieldValue .
  }
}
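If you want to run that query from Java rather than the Workbench, a minimal sketch using the Sesame 2.x HTTPRepository API could look like this; the customField predicate is only a placeholder for your own vocabulary:
import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.http.HTTPRepository;

public class ServiceQueryExample {
    public static void main(String[] args) throws Exception {
        // Query the Primary repository; the SERVICE clause pulls in the Secondary one
        HTTPRepository primary = new HTTPRepository(
                "http://localhost:8080/openrdf-sesame", "Primary");
        primary.initialize();

        String query =
                "PREFIX skos: <http://www.w3.org/2004/02/skos/core#> "
              + "SELECT * WHERE { "
              + "  ?uri skos:prefLabel ?prefLabel . "
              + "  SERVICE <http://localhost:8080/openrdf-sesame/repositories/Secondary> { "
              + "    ?uri ?p ?customFieldValue . "  // put your customField predicate here
              + "  } "
              + "}";

        RepositoryConnection conn = primary.getConnection();
        try {
            TupleQueryResult result =
                    conn.prepareTupleQuery(QueryLanguage.SPARQL, query).evaluate();
            while (result.hasNext()) {
                BindingSet bs = result.next();
                System.out.println(bs.getValue("uri") + " -> " + bs.getValue("customFieldValue"));
            }
            result.close();
        } finally {
            conn.close();
        }
    }
}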
Using Sesame's FederationSail
An alternative is to set up a federated repository in Sesame, using the FederationSail. This is a way to group several Sesame databases together to form a "virtual" repository, a Federation. You can execute queries on the Federation and the result will include data from all member databases of the Federation (without the need to specify which endpoints you want to query, as you do with a SERVICE clause).
A Federation can be set up programmatically, or (if you're using Sesame Server and Workbench) via the Workbench. Just choose 'New repository', and pick the 'Federation store' option in the store type drop-down. Give it an id and a description, then on the next screen you get to pick which databases should be part of the federation.
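A rough sketch of the programmatic route, assuming Sesame 2.x with the sesame-sail-federation module on the classpath (class and method names are from memory of the 2.x API and may differ in later RDF4J releases):
import org.openrdf.repository.Repository;
import org.openrdf.repository.http.HTTPRepository;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.sail.federation.Federation;

public class FederationSetup {
    public static Repository createFederation() throws Exception {
        // Group the two existing repositories into one virtual repository
        Federation federation = new Federation();
        federation.addMember(new HTTPRepository("http://localhost:8080/openrdf-sesame", "Primary"));
        federation.addMember(new HTTPRepository("http://localhost:8080/openrdf-sesame", "Secondary"));

        SailRepository fedRepo = new SailRepository(federation);
        fedRepo.initialize();
        return fedRepo; // queries on this repository see data from both members
    }
}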

How to store all user activities in a website?

I have a web application built in Django + Python that interacts with web services (written in Java).
All the database management is done by the web services, i.e. all CRUD operations against the actual database go through them.
Now I have to track all user activities done on my website in some log table.
For example, if a user posts a new article, the web services create a new row in the Articles table, and alongside that I need to add a new row to the log table, something like "User Raman has posted a new article (with ID, title, etc.)".
I have to do this for all objects in my database, like "Article", "Media", "Comments", etc.
Note: I am using PostgreSQL.
So what is the best way to achieve this? (Should I do it in PostgreSQL or Java, and how?)
So, you have UI <-> Web Services <-> DB
Since the web services talk to the DB, and the web services contain the business logic (i.e. I guess you validate stuff there, create your queries and execute them), then the best place to 'log' activities is in the services themselves.
IMO, logging PostgreSQL transactions is a different thing. It's not the same as logging 'user activities' anymore.
EDIT: This still means you create DB schema for 'logs' and write them to DB.
Second EDIT: Catching log-worthy events in the UI and then logging them from there might not be the best idea either. You would have to rewrite the logging if you ever decide to replace the UI, or, for example, write an alternate UI for mobile devices.
For an audit table within the DB itself, have a look at the PL/pgSQL Trigger Audit Example
This logs every INSERT, UPDATE, DELETE into another table.
In your log table you can have various columns, including:
user_id (the user that did the action)
activity_type (the type of activity, such as view or commented_on)
object_id (the actual object that it concerns, such as the Article or Media)
object_type (the type of object; this can be used later, in combination with object_id, to look up the object in the database)
This way, you can keep track of all actions the users do. You'd need to update this table whenever something happens that you wish to track.
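If, as the first answer suggests, the logging lives in the Java web services, a minimal sketch of writing such a row could look like this; the activity_log table and its columns mirror the list above and are hypothetical:
import java.sql.Connection;
import java.sql.PreparedStatement;
import javax.sql.DataSource;

public class ActivityLogger {
    private final DataSource dataSource;

    public ActivityLogger(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // e.g. log(ramanId, "created", "Article", articleId)
    public void log(long userId, String activityType, String objectType, long objectId) {
        String sql = "INSERT INTO activity_log (user_id, activity_type, object_type, object_id, created_at) "
                   + "VALUES (?, ?, ?, ?, now())";
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, userId);
            ps.setString(2, activityType);
            ps.setString(3, objectType);
            ps.setLong(4, objectId);
            ps.executeUpdate();
        } catch (Exception e) {
            throw new RuntimeException("Failed to log user activity", e);
        }
    }
}
Ideally the call happens in the same transaction as the CRUD operation it describes, so the log and the data cannot drift apart.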
Whenever we had to do this, we overrode signals for every model and possible action.
https://docs.djangoproject.com/en/dev/topics/signals/
You can have the signal do whatever you want, from injecting some HTML into the page, to making an entry in the database. They're an excellent tool to learn to use.
I used django-audit-log and I am very satisfied.
Django-audit-log can track multiple models, each in its own additional table. All of these tables are pretty uniform, so it should be fairly straightforward to create a SQL view that shows data for all models.
Here is what I've done to track a single model ("Pauza"):
class Pauza(models.Model):
    started = models.TimeField(null=True, blank=False)
    ended = models.TimeField(null=True, blank=True)
    # ... more fields ...
    audit_log = AuditLog()
If you want changes to show in Django Admin, you can create an unmanaged model (but this is by no means required):
class PauzaAction(models.Model):
    started = models.TimeField(null=True, blank=True)
    ended = models.TimeField(null=True, blank=True)
    # ... more fields ...

    # fields added by Audit Trail:
    action_id = models.PositiveIntegerField(primary_key=True, default=1, blank=True)
    action_user = models.ForeignKey(User, null=True, blank=True)
    action_date = models.DateTimeField(null=True, blank=True)
    action_type = models.CharField(max_length=31, choices=(('I', 'create'), ('U', 'update'), ('D', 'delete'),), null=True, blank=True)
    pauza = models.ForeignKey(Pauza, db_column='id', on_delete=models.DO_NOTHING, default=0, null=True, blank=True)

    class Meta:
        db_table = 'testapp_pauzaauditlogentry'
        managed = False
        app_label = 'testapp'
Table testapp_pauzaauditlogentry is automatically created by django-audit-log, this merely creates a model for displaying data from it.
It may be a good idea to throw in some crude tamper protection:
class PauzaAction(models.Model):
    # ... all like above, plus:

    def save(self, *args, **kwargs):
        raise Exception('Permission Denied')

    def delete(self, *args, **kwargs):
        raise Exception('Permission Denied')
As I said, I imagine you could create a SQL view with the four action_ fields and an additional action_model field containing a varchar reference to the model itself (maybe just the original table name).
