SPARQL ARQ Query Execution - java

So I have this piece of Jena code which builds a query using a Triple, an ElementTriplesBlock, and finally QueryFactory.make(). I have a local Virtuoso instance set up, so my SPARQL endpoint is localhost, i.e. http://localhost:8890/sparql. The RDF data I am querying is generated by the Lehigh University Benchmark (LUBM) generator. Now I am trying to replace the triples in the query pattern based on some conditions: let's say the query is made of two BGPs or triple patterns, and one of the triple patterns gives zero results; I'd want to change that triple pattern to something else. How do I achieve this in Jena? My code looks like:
```
// Create your triples
Triple pattern1 = Triple.create(Var.alloc("X"),
        Node.createURI("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"),
        Node.createURI("http://swat.cse.lehigh.edu/onto/univ-bench.owl#AssociateProfessor"));
Triple pattern = Triple.create(Var.alloc("X"),
        Node.createURI("http://swat.cse.lehigh.edu/onto/univ-bench.owl#emailAddress"),
        Var.alloc("Y2"));
ElementTriplesBlock block = new ElementTriplesBlock();
block.addTriple(pattern1);
block.addTriple(pattern);
ElementGroup body = new ElementGroup();
body.addElement(block);

// Build a Query here
Query q = QueryFactory.make();
q.setPrefix("ub", "http://swat.cse.lehigh.edu/onto/univ-bench.owl#");
q.setQueryPattern(body);
q.setQuerySelectType();
q.addResultVar("X");
// ?X ub:emailAddress ?Y2 .

// Query to String
System.out.println(q.toString());

QueryExecution qexec = QueryExecutionFactory.sparqlService("http://localhost:8890/sparql", q);
Op op = Algebra.optimize(Algebra.compile(q));
System.out.println(op.toString());
```
So, to be clear, I am able to see the BGP in relational algebra form by using the Op op = Algebra.optimize(Algebra.compile(q)) line. The output looks like:
```
(project (?X)
  (bgp
    (triple ?X <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://swat.cse.lehigh.edu/onto/univ-bench.owl#AssociateProfessor>)
    (triple ?X <http://swat.cse.lehigh.edu/onto/univ-bench.owl#emailAddress> ?Y2)
  ))
```
Now how would I go about evaluating the execution of each triple? In this case, if I just wanted to print the number of results at each step of the query pattern execution, how would I do it? I did read some of the examples here. I guess one has to use an OpExecutor and a QueryIterator, but I am not sure how they all fit together. In this case I just want to iterate through each of the basic graph patterns and then output the basic graph pattern and the number of results that it returns from the endpoint. Any help or pointers would be appreciated.
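One way to get those per-pattern counts without diving into the OpExecutor machinery is to wrap each triple in its own SELECT * query and drain the result set to count it. This is only a sketch built on the same ARQ classes as the code above; the class and helper names (TriplePatternProfiler, queryForTriple, countForTriple) are mine, not Jena's:

```java
import com.hp.hpl.jena.graph.Triple;
import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.sparql.syntax.ElementGroup;
import com.hp.hpl.jena.sparql.syntax.ElementTriplesBlock;

public class TriplePatternProfiler {

    // Build a SELECT * query whose pattern is exactly one triple.
    static Query queryForTriple(Triple t) {
        ElementTriplesBlock block = new ElementTriplesBlock();
        block.addTriple(t);
        ElementGroup body = new ElementGroup();
        body.addElement(block);
        Query q = QueryFactory.make();
        q.setQueryPattern(body);
        q.setQuerySelectType();
        q.setQueryResultStar(true);
        return q;
    }

    // Execute the one-triple query against the endpoint and count its
    // solutions; ResultSetFormatter.consume(..) drains the iterator
    // and returns how many rows it saw.
    static int countForTriple(String endpoint, Triple t) {
        QueryExecution qexec =
                QueryExecutionFactory.sparqlService(endpoint, queryForTriple(t));
        try {
            return ResultSetFormatter.consume(qexec.execSelect());
        } finally {
            qexec.close();
        }
    }
}
```

Calling countForTriple("http://localhost:8890/sparql", pattern1) for each triple in the BGP tells you which pattern is empty; you can then swap that Triple out of the ElementTriplesBlock and rebuild the query.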

Related

Neo4j slow cypher query in embedded mode

I have a huge graph database with authors, which are connected to papers, and papers are connected to nodes which contain meta-information about the paper.
I tried to select authors which match a specific pattern, and therefore I executed the following Cypher statement in Java:
```
String query = "MATCH (n:AUTHOR) WHERE n.name =~ '(?i).*jim.*' RETURN n";
db.execute(query);
```
I get a result set with all "authors" back, but the execution is very slow. Is this because Neo4j writes the result into memory?
If I try to find nodes with the Java API, it is much faster. Of course, I am only able to search for the exact name, as in the following code example, but it is about 4 seconds faster than the query above. I tested it on a small database with about 50 nodes, of which only 6 are authors. The six authors are also in the index.
```
db.findNodes(NodeLabel.AUTHOR, NodeProperties.NAME, "jim knopf");
```
Is there a chance to speed up the Cypher query? Or a possibility to get all nodes matching a given pattern via the Java API and the findNodes() method?
Just for information: I created the index for the author's name in Java with graph.schema().indexFor(NodeLabel.AUTHOR).on("name").create();
Perhaps somebody could help. Thanks in advance.
EDIT:
I ran some tests today. If I execute the query PROFILE MATCH (n:AUTHOR) WHERE n.name = 'jim seroka' RETURN n; in the browser interface, I see only the operator NodeByLabelScan. It seems to me that Neo4j does not automatically use the index (the index for name is online). If I force the specific index and execute the query PROFILE MATCH (n:AUTHOR) USING INDEX n:AUTHOR(name) WHERE n.name = 'jim seroka' RETURN n;, the index is used. Normally Neo4j should automatically use the correct index. Is there any configuration to set?
I also did some testing in embedded mode again, to check the performance of the query there. I tried to select the author "jim seroka" with db.findNode(NodeLabel.AUTHOR, "name", "jim seroka");. It works, and it seems to me that the index is used, given an execution time of ~0.05 seconds.
But if I run the same query as I executed in the interface and mentioned before, using a specific index, it takes ~4.9 seconds. Why? I'm a little bit helpless. The database is local and there are only 6 authors. Is the connector slow, or is the creation of the connection wrong? OK, findNode() returns just a node while execute() returns a whole Result, but a four-second difference?
The following source code should show how the database is created and the query is executed.
```
public static GraphDatabaseService getNeo4jDB() {
    ....
    return new GraphDatabaseFactory().newEmbeddedDatabase(STORE_DIR);
}

private Result findAuthorNode(String searchValue) {
    db = getNeo4jDB();
    String query = "MATCH (n:AUTHOR) USING INDEX n:AUTHOR(name) WHERE n.name = 'jim seroka' RETURN n";
    return db.execute(query);
}
```
Your query uses a regular expression and therefore is not able to use an index:
```
MATCH (n:AUTHOR) WHERE n.name =~ '(?i).*jim.*' RETURN n
```
Neo4j 2.3 introduced the index-backed STARTS WITH string operator, so this query would be very performant:
```
MATCH (n:Author) WHERE n.name STARTS WITH 'jim' RETURN n
```
It's not quite the same as the regular expression, but it will have better performance.
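In embedded Java that could look like the sketch below. It assumes Neo4j 2.3, reuses the AUTHOR label from the question, and passes the prefix as a query parameter (the {prefix} placeholder syntax of that release) rather than concatenating it into the string, which lets Neo4j cache the execution plan across calls:

```java
import java.util.Collections;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;

public class AuthorPrefixSearch {

    static final String QUERY =
            "MATCH (n:AUTHOR) WHERE n.name STARTS WITH {prefix} RETURN n";

    // Index-backed prefix search; the prefix travels as a parameter,
    // not as part of the query text.
    static Result find(GraphDatabaseService db, String prefix) {
        return db.execute(QUERY,
                Collections.<String, Object>singletonMap("prefix", prefix));
    }
}
```

Note that STARTS WITH is case-sensitive, so this finds "jim seroka" but not "Jim Seroka".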

Lucene: Is there any way to know which subqueries have hit the document?

I have a MemoryIndex created like this:
```
Version version = Version.LUCENE_47;
Analyzer analyzer = new SimpleAnalyzer(version);
MemoryIndex index = new MemoryIndex();
index.addField("text", "Readings about Salmons and other select Alaska fishing Manuals", analyzer);
```
Then, I have a query containing a number of sub-queries, created from a set of concepts (each including an id, name, and description). Right now I have to loop over every concept, generate a query, and finally check whether it matches; if it does, I append it to a string used to store matches:
```
for (Concept concept : concepts) {
    Query query = queryGenerator.getQueryForConcept(concept);
    float score = query != null ? index.search(query) : 0.0f;
    if (score > 0) {
        matches.append(sep + concept.getId() + "|" + concept.getName());
        sep = "|";
    }
}
```
The problem is: the number of concepts is growing larger and larger, which affects performance. Is there any way I can create one single query, compare it to a document, and find out which concepts have matched the document?
I tried using a BooleanQuery as a whole, then adding all the subqueries derived from the concepts into it. It matches, but I don't know which subquery hit; and even if I did, how would I attach details like the "id" and "name" of a concept to it?
I'd much appreciate any answers.

Jena ARQ/TDB Query Optimization

I have a rather small graph containing roughly 500k triples. I've also generated the stats.opt file, and I'm running my code on a rather fast computer (quad core, 16 GB RAM, SSD drive). But for the query I'm building with the help of the Op interface, it takes forever to iterate over the result set. The result set has about 15,000 rows, and the iteration takes 4 s, which is unacceptable for end users. Executing the query takes merely 90 ms (I guess the real work is done by the cursor iteration?). Why is this so slow, and what can I do to speed up the result set iteration?
Here is the query:
```
SELECT ?apartment ?price ?hasBalcony ?lat ?long ?label ?hasImage ?park ?supermarket ?rooms ?area ?street
WHERE {
  ?apartment dssd:hasBalcony ?hasBalcony .
  ?apartment wgs84:lat ?lat .
  ?apartment wgs84:long ?long .
  ?apartment rdfs:label ?label .
  ?apartment dssd:hasImage ?hasImage .
  ?apartment dssd:hasNearby ?hasNearbyPark .
  ?hasNearbyPark dssd:hasNearbyPark ?park .
  ?apartment dssd:hasNearby ?hasNearbySupermarket .
  ?hasNearbySupermarket dssd:hasNearbySupermarket ?supermarket .
  ?apartment dssd:price ?price .
  ?apartment dssd:rooms ?rooms .
  ?apartment dssd:area ?area .
  ?apartment vcard:hasAddress ?address .
  ?address vcard:streetAddress ?street
  FILTER ( ?hasBalcony = true )
  FILTER ( ?price <= 1000.0e0 )
  FILTER ( ?price >= 650.0e0 )
  FILTER ( ?rooms <= 4.0e0 )
  FILTER ( ?rooms >= 3.0e0 )
  FILTER ( ?area <= 100.0e0 )
  FILTER ( ?area >= 60.0e0 )
}
```
(Is there a better way to query those bnodes, ?hasNearbyPark and ?hasNearbySupermarket?)
And the code to execute the query:
```
dataset.begin(ReadWrite.READ);
Model model = dataset.getNamedModel("http://example.com");
QueryExecution queryExecution = QueryExecutionFactory.create(buildQuery(), model);
ResultSet resultSet = queryExecution.execSelect();
while ( resultSet.hasNext() ) {
    QuerySolution solution = resultSet.next(); ...
```
On the ARQ Query Engine
First off, you seem to be misunderstanding how the ARQ engine works:
```
ResultSet resultSet = queryExecution.execSelect();
```
All the above does is prepare a query plan for how the engine will evaluate the query; it does not actually evaluate the query, hence why it is almost instantaneous.
Actual work on answering your question does not happen until you start calling hasNext() and next():
```
while ( resultSet.hasNext() ) {
    QuerySolution solution = resultSet.next(); ...
```
So the timings you quote are misleading: the query takes 4 s to evaluate, because that is how long it takes to iterate over all the results.
On your actual question
You haven't shown what your buildQuery() method does, but you say you are building the query as an Op structure programmatically rather than as a string? If that is the case, then the query engine may not actually be applying optimization, though off the top of my head I don't think this will be the issue. You can try adding op = Algebra.optimize(op); before you return the built Op, but I don't know that this will make much difference.
It looks like the optimizer should do a good job just given the raw query (not that your query has much scope for optimization other than join reordering) but if you are building it programmatically then you may be building an unusual algebra which the optimizer struggles with.
Similarly, I'm not sure if your stats.opt file will be honored, because you query over a specific model rather than the TDB dataset, so the query engine used might be the general-purpose engine rather than the TDB engine. I'm not an expert in TDB, so I can't tell whether this is the case or not.
Bottom Line
In general there is not enough information in your question to diagnose whether there is an actual issue in your setup or whether your query is just plain expensive. Reporting this as a minimal test case (minimal complete code plus sample data) to the users@jena.apache.org list for further analysis would be useful.
As a general comment on your query, lots of range filters are expensive to perform, which is likely where most of the time goes.
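To rule out the engine question, one experiment worth trying is executing against the Dataset itself rather than against a Model pulled out of it, so that ARQ can hand the whole query to the TDB engine (and your stats.opt). This is only a sketch under the assumption that the data lives in a TDB-backed, transactional Dataset; the run method and the GRAPH-wrapped query string are mine:

```java
import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ReadWrite;
import com.hp.hpl.jena.query.ResultSet;

public class DatasetQuery {

    // Executing over the Dataset (not a Model extracted from it) lets
    // the TDB query engine handle the whole query; the named graph is
    // targeted with a GRAPH clause instead of getNamedModel().
    static void run(Dataset dataset, String queryString) {
        dataset.begin(ReadWrite.READ);
        try {
            QueryExecution qexec =
                    QueryExecutionFactory.create(QueryFactory.create(queryString), dataset);
            try {
                ResultSet rs = qexec.execSelect();
                while (rs.hasNext()) {
                    rs.next(); // process each solution here
                }
            } finally {
                qexec.close();
            }
        } finally {
            dataset.end();
        }
    }
}
```

Wrap your original pattern as GRAPH <http://example.com> { ... } so it still targets the same named graph.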

Antlr AST Tree Approach To Complex Grammar

I have written a complex grammar. The grammar can be seen below:
```
grammar i;

options {
    output=AST;
}

@header {
    package com.data;
}

operatorLogic : 'AND' | 'OR';
value : STRING;
query : (select)*;
select : 'SELECT'^ functions 'FROM table' filters? ';';
operator : '=' | '!=' | '<' | '>' | '<=' | '>=';
filters : 'WHERE'^ conditions;
conditions : (members (operatorLogic members)*);
members : STRING operator value;
functions : '*';

STRING : ('a'..'z'|'A'..'Z')+;
WS : (' '|'\t'|'\f'|'\n'|'\r')+ {skip();}; // handle whitespace between keywords
```
The output is done using AST. The above is only a small sample; however, I am developing a much bigger grammar and need advice on how to approach this.
For example, according to the above grammar the following can be produced:
```
SELECT * from table;
SELECT * from table WHERE name = i AND name = j;
```
These queries could get more complex. I have implemented AST output in the Java code and can get the tree back. I wanted to separate the grammar and the logic so they are cohesive, so AST was the best approach.
The user will enter a query as a String, and my code needs to handle the query in the best way possible. As you can see, the functions rule is currently just *, which means select all. In the future this could expand to include other things.
How can my code handle this? What's the best approach?
I could do something like this:
```
String input = "SELECT * from table;";
if (input.startsWith("SELECT")) {
    select();
}
```
As you can see, this approach gets complicated, as I need to handle * and also the optional filters. The operatorLogic rule, which covers AND and OR, also needs to be handled.
What is the best way? I have looked online, but couldn't find any example on how to handle this.
Are you able to give any examples?
EDIT:
```
String input = "SELECT * FROM table;";
if (input.startsWith("SELECT")) {
    select();
}
else if (input.startsWith("SELECT *")) {
    findAll();
}
```
The easiest way to handle multiple starting rules ("SELECT ...", "UPDATE...", etc) is to let the ANTLR grammar do the work for you at a single, top-level starting rule. You pretty much have that already, so it's just a matter of updating what you have.
Currently your grammar is limited to one command-type of input ("SELECT...") because that's all you've defined:
```
query : (select)*; // query only handles "select" because that's all there is.
select : 'SELECT'^ functions 'FROM table' filters? ';';
```
If query is your starting rule, then accepting additional top-level input is a matter of defining query to accept more than select:
```
query : (select | update)*; // query now handles any number of "select" or "update" rules, in any order.
select : 'SELECT'^ functions 'FROM table' filters? ';';
update : 'UPDATE'^ ';'; // simple example of an update rule
```
Now the query rule can handle input such as SELECT * FROM table;, UPDATE;, or SELECT * FROM table; UPDATE;. When a new top-level rule is added, just update query to test for that new rule. This way your Java code doesn't need to test the input, it just calls the query rule and lets the parser handle the rest.
If you only want one type of input to be processed from the input, define query like this:
```
query : select* // read any number of selects, but no updates
      | update* // read any number of updates, but no selects
      ;
```
The rule query still handles SELECT * FROM table; and UPDATE;, but not a mix of commands, like SELECT * FROM table; UPDATE;.
Once you get your query_return AST tree from calling query, you now have something meaningful that your Java code can process, instead of a string. That tree represents all the input that the parser processed.
You can walk through the children of the tree like so:
```
iParser.query_return r = parser.query();
CommonTree t = (CommonTree) r.getTree();
for (int i = 0, count = t.getChildCount(); i < count; ++i) {
    CommonTree child = (CommonTree) t.getChild(i);
    System.out.println("child type: " + child.getType());
    System.out.println("child text: " + child.getText());
    System.out.println("------");
}
```
Walking through the entire AST is a matter of recursively calling getChild(...) on all parent nodes (my example above looks at the top-level children only).
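A recursive version of that loop might look like the following sketch; collectTexts is my own name, and it simply applies the getChild(...) recursion just described in a pre-order (node first, then children) walk:

```java
import java.util.ArrayList;
import java.util.List;
import org.antlr.runtime.tree.CommonTree;

public class AstWalker {

    // Pre-order, depth-first traversal: record this node's text, then
    // recurse into each child via getChild(i). (If the parser produced
    // a "nil" root, its getText() will be null; handle as needed.)
    static List<String> collectTexts(CommonTree node) {
        List<String> out = new ArrayList<String>();
        if (node == null) {
            return out;
        }
        out.add(node.getText());
        for (int i = 0, count = node.getChildCount(); i < count; ++i) {
            out.addAll(collectTexts((CommonTree) node.getChild(i)));
        }
        return out;
    }
}
```

You can substitute any per-node action (building your own command objects, dispatching on getType(), etc.) for the list-building here.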
Handling alternatives to * is no different than any other alternatives you've defined: just define the alternatives in the rule you want to expand. If you want functions to accept more than *, define functions to accept more than *. ;)
Here's an example:
```
functions : '*'    // "all"
          | STRING // some id
          ;
```
Now the parser can accept SELECT * FROM table; and SELECT foobar FROM table;.
Remember that your Java code has no reason to examine the input string. Whenever you're tempted to do that, look for a way to make your grammar do the examining instead. Your Java code will then look at the AST tree output for whatever it wants.

How to query mongodb with “like” using the java api without using Pattern Matching?

Currently I am using Java to connect to MongoDB. I want to write this SQL query in MongoDB using the Java driver:
```
select * from tableA where name like '%ab%'
```
Is there any solution to perform the same task through Java? The query in MongoDB itself is very simple, I know:
```
db.collection.find({name: /ab/})
```
but how do I perform the same task in Java?
Currently I am using pattern matching to perform the task, and the code is:
```
DBObject A = QueryBuilder.start("name").is(
        Pattern.compile("ab", Pattern.CASE_INSENSITIVE)).get();
```
but I think it makes the query very slow. Does a solution exist that does not use pattern matching?
You can use regular expressions. Take a look at the following:
http://www.mongodb.org/display/DOCS/Advanced+Queries#AdvancedQueries-RegularExpressions
Make sure you understand the potential performance impacts!
```
DBObject A = QueryBuilder.start("name").is(
        Pattern.compile("ab", Pattern.CASE_INSENSITIVE)).get();
```
I think this is one of the possible solutions; you need to create an index to achieve this.
Why do you fear regular expressions? Once the expression is compiled they are very fast, and if the expression is "ab" the result is similar to a function that searches for a substring in a string.
However, to do what you need you have two possibilities:
The first one, using regular expressions, as you mention in your question. I believe this is the best solution.
The second one, using $where queries.
With $where queries you can specify expressions like these:
```
db.foo.find({"$where" : "this.x + this.y == 10"})
db.foo.find({"$where" : "function() { return this.x + this.y == 10; }"})
```
and so you can use the JavaScript .indexOf() on string fields.
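From the Java driver, such a $where predicate could be assembled like this. This is only a sketch: containsExpr is my helper name, and note that the fragment is spliced directly into JavaScript source, so it must come from a trusted place:

```java
public class WhereExpr {

    // Build the JavaScript predicate for a $where clause that checks
    // whether `field` contains `fragment`. Caveat: $where runs JavaScript
    // per document and cannot use an index, so it is usually slower
    // than a regex query.
    static String containsExpr(String field, String fragment) {
        return "this." + field + ".indexOf('" + fragment + "') != -1";
    }
}
```

Pass the result as `new BasicDBObject("$where", WhereExpr.containsExpr("name", "ab"))` to collection.find(...).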
Here is a code snippet using the $regex clause (as mentioned by mikeycgto):
```
String searchString = "ab";
DBCollection coll = db.getCollection("yourCollection");
DBObject query = new BasicDBObject();
query.put("name",
        new BasicDBObject("$regex", String.format(".*((?i)%s).*", searchString)));
DBCursor cur = coll.find(query);
while (cur.hasNext()) {
    DBObject dbObj = cur.next();
    // your code to read the DBObject ..
}
```
As long as you are not opening and closing the connection per method call, the query should be fast.
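One further note (my addition, not from the answers above): MongoDB can serve a regex query from an index on the field only when the pattern is case-sensitive and anchored at the start, so a prefix search ("ab%") can be fast even though a contains search ("%ab%") cannot. A sketch of building such a pattern with the standard library:

```java
import java.util.regex.Pattern;

public class PrefixRegex {

    // Anchored at the start and case-sensitive: this regex shape lets
    // MongoDB turn the match into an index range scan on an indexed
    // field. Pattern.quote guards against regex metacharacters in the
    // user-supplied prefix.
    static Pattern prefixPattern(String prefix) {
        return Pattern.compile("^" + Pattern.quote(prefix));
    }
}
```

Use the resulting Pattern exactly like the one in the QueryBuilder snippet above; it only applies when a prefix match is acceptable to your use case.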
