Jena ARQ/TDB Query Optimization - java

I have a rather small graph containing roughly 500k triples. I've also generated the stats.opt file, and I'm running my code on a rather fast computer (quad core, 16 GB RAM, SSD). But for the query I'm building with the help of the Op interface, it takes forever to iterate over the result set. The result set has about 15,000 rows and the iteration takes 4 s, which is unacceptable for end users. Executing the query takes merely 90 ms (I guess the real work is done by the cursor iteration?). Why is this so slow, and what can I do to speed up the result set iteration?
Here is the query:
SELECT ?apartment ?price ?hasBalcony ?lat ?long ?label ?hasImage ?park ?supermarket ?rooms ?area ?street
WHERE
{ ?apartment dssd:hasBalcony ?hasBalcony .
?apartment wgs84:lat ?lat .
?apartment wgs84:long ?long .
?apartment rdfs:label ?label .
?apartment dssd:hasImage ?hasImage .
?apartment dssd:hasNearby ?hasNearbyPark .
?hasNearbyPark dssd:hasNearbyPark ?park .
?apartment dssd:hasNearby ?hasNearbySupermarket .
?hasNearbySupermarket dssd:hasNearbySupermarket ?supermarket .
?apartment dssd:price ?price .
?apartment dssd:rooms ?rooms .
?apartment dssd:area ?area .
?apartment vcard:hasAddress ?address .
?address vcard:streetAddress ?street
FILTER ( ?hasBalcony = true )
FILTER ( ?price <= 1000.0e0 )
FILTER ( ?price >= 650.0e0 )
FILTER ( ?rooms <= 4.0e0 )
FILTER ( ?rooms >= 3.0e0 )
FILTER ( ?area <= 100.0e0 )
FILTER ( ?area >= 60.0e0 )
}
(Is there a better way to query those bnodes: ?hasNearbyPark, ?hasNearbySupermarket)
And the code to execute the query:
dataset.begin(ReadWrite.READ);
Model model = dataset.getNamedModel("http://example.com");
QueryExecution queryExecution = QueryExecutionFactory.create(buildQuery(), model);
ResultSet resultSet = queryExecution.execSelect();
while ( resultSet.hasNext() ) {
QuerySolution solution = resultSet.next(); ...

On the ARQ Query Engine
First off, you seem to be misunderstanding how the ARQ engine works:
ResultSet resultSet = queryExecution.execSelect();
All the above does is prepare a query plan for how the engine will evaluate the query; it does not actually evaluate the query, which is why it is almost instantaneous.
The actual work of answering your query does not happen until you start calling hasNext() and next():
while ( resultSet.hasNext() ) {
QuerySolution solution = resultSet.next(); ...
So the timings you quote are misleading: the query takes 4 s to evaluate because that is how long it takes to iterate over all the results.
On your actual question
You haven't shown what your buildQuery() method does, but you say you are building the query programmatically as an Op structure rather than as a string? If that is the case then the query engine may not actually be applying optimization, though off the top of my head I don't think this will be the issue. You can try adding op = Algebra.optimize(op); before you return the built Op, but I don't know that this will make much difference.
It looks like the optimizer should do a good job just given the raw query (not that your query has much scope for optimization other than join reordering), but if you are building it programmatically then you may be producing an unusual algebra structure which the optimizer struggles with.
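For illustration, here is a minimal sketch of a programmatically built query with an explicit optimize step (current org.apache.jena package names; the single triple pattern and the predicate URI are invented placeholders, not the asker's actual buildQuery()):
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.query.Query;
import org.apache.jena.sparql.algebra.Algebra;
import org.apache.jena.sparql.algebra.Op;
import org.apache.jena.sparql.algebra.OpAsQuery;
import org.apache.jena.sparql.algebra.op.OpBGP;
import org.apache.jena.sparql.core.BasicPattern;
import org.apache.jena.sparql.core.Var;

public class BuildQueryExample {
    // Hypothetical stand-in for buildQuery(): build a one-triple BGP, run the ARQ
    // optimizer over the algebra, then turn it back into a Query object.
    static Query buildQuery() {
        BasicPattern bp = new BasicPattern();
        bp.add(Triple.create(Var.alloc("apartment"),
                NodeFactory.createURI("http://example.com/dssd#price"), // placeholder URI
                Var.alloc("price")));
        Op op = new OpBGP(bp);
        op = Algebra.optimize(op);      // apply ARQ's algebra transforms
        return OpAsQuery.asQuery(op);   // a Query that QueryExecutionFactory.create(...) accepts
    }
}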
Similarly, I'm not sure whether your stats.opt file will be honored, because you query over a specific model rather than the TDB dataset, so the query engine in use might be the general-purpose engine rather than the TDB engine. I'm not an expert in TDB, so I can't tell whether this is the case or not.
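If you want to rule that out, one option (just a sketch, untested against your data) is to execute against the TDB-backed Dataset itself, so that the TDB engine and its stats.opt-driven reordering are definitely in play; note that you would then need a GRAPH clause in the query, or TDB's union default graph setting, to reach your named graph:
dataset.begin(ReadWrite.READ);
// Execute over the whole dataset rather than a single extracted Model.
QueryExecution qe = QueryExecutionFactory.create(buildQuery(), dataset);
ResultSet rs = qe.execSelect();
while (rs.hasNext()) {
    QuerySolution row = rs.next();
    // ... consume the row
}
qe.close();
dataset.end();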
Bottom Line
In general there is not enough information in your question to diagnose whether there is an actual issue in your setup or your query is just plain expensive. Reporting this as a minimal test case (minimal complete code plus sample data) to the users@jena.apache.org list for further analysis would be useful.
As a general comment on your query, lots of range filters are expensive to evaluate, which is likely where most of the time goes.

Related

Activiti HistoricProcessInstanceQuery returned with missing processVariables

I am trying to query HistoricProcessInstances from the Activiti historyService, including the processVariables, but some of the processes have missing variables in the returned list. I monitored the database to see the SQL query that Activiti had created, and it turned out that the query joins 3 tables together and can only return 20,000 records. I have approximately 550 processes with 37 processVariables each, so that comes to 20,350 records.
In the monitored SQL query, a rnk (rank) is assigned to each row of the result, and it is always between 1 and 20,000.
...from ACT_HI_PROCINST RES
left outer join ACT_HI_VARINST VAR ON RES.PROC_INST_ID_ = VAR.EXECUTION_ID_ and VAR.TASK_ID_ is null
inner join ACT_HI_VARINST A0 on RES.PROC_INST_ID_ = A0.PROC_INST_ID_
WHERE RES.END_TIME_ is not NULL and
A0.NAME_= 'processOwner'and
A0.VAR_TYPE_ = 'string' and
A0.TEXT_ = 'user123'
) RES
) SUB WHERE SUB.rnk >= 1 AND
SUB.rnk < 20001 and
Is there any way to increase this threshold, or to create a HistoricProcessInstanceQuery that includes only specific processVariables?
My code snippet for the query:
processHistories = historyService.createHistoricProcessInstanceQuery()
        .processDefinitionKey(processKey)
        .variableValueEquals(VariableNames.processOwner, username)
        .includeProcessVariables()
        .finished()
        .orderByProcessInstanceStartTime().desc()
        .list();
You can use a NativeQuery from HistoryService.createNativeHistoricProcessInstanceQuery() and supply your own SQL (copy it from the actual historic process instance query, minus the rank WHERE clause).
Likely this is more a restriction imposed by your database than by Activiti/Flowable.
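A rough sketch of that approach (the SQL below is only a placeholder; copy the real statement from your database monitor, minus the rnk/row-limiting wrapper, and adjust column and parameter names as needed):
List<HistoricProcessInstance> processHistories =
        historyService.createNativeHistoricProcessInstanceQuery()
                .sql("SELECT RES.* FROM ACT_HI_PROCINST RES "
                   + "INNER JOIN ACT_HI_VARINST A0 ON RES.PROC_INST_ID_ = A0.PROC_INST_ID_ "
                   + "WHERE RES.END_TIME_ IS NOT NULL "
                   + "AND A0.NAME_ = #{varName} AND A0.TEXT_ = #{varValue} "
                   + "ORDER BY RES.START_TIME_ DESC")
                .parameter("varName", "processOwner")
                .parameter("varValue", "user123")
                .list();
Note that a native query returns the process instances without their variables, so you would have to load the variables you need separately.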

Can I search by multiple fields using the Elastic Search Java API?

Example:
| studentname | maths | computers |
|-------------|-------|-----------|
| s1          | 78    | 90        |
| s2          | 56    | 75        |
| s3          | 45    | 50        |
The above table represents data that is present in Elasticsearch.
If the user enters 60 and above, Elasticsearch should return only s1, because s1 is the only student who has scored more than 60 in both subjects.
How do I solve it using Java API?
NOTE: I was able to do this for individual subjects with:
QueryBuilder query = QueryBuilders.boolQuery()
        .should(
                QueryBuilders.rangeQuery(maths)
                        .from(50)
                        .to(100)
        );
You can have multiple .should() clauses in the bool query. So in your case:
QueryBuilder query = QueryBuilders.boolQuery()
        .should(
                QueryBuilders.rangeQuery(maths)
                        .from(61)
                        .to(100)
        )
        .should(
                QueryBuilders.rangeQuery(computers)
                        .from(61)
                        .to(100)
        );
Note:
RangeQueryBuilder (returned from QueryBuilders.rangeQuery()) also has the method #gte().
According to the docs: "In a boolean query with no must or filter clauses, one or more should clauses must match a document." So if you have no filter or must clauses, you might not get the desired behavior. The above answer assumes you are using some other clauses. The point is that you can combine clauses in the bool query; with the Java API this just means calling the clause method repeatedly.
You might want to use filter instead of should. This will lead to faster execution and caching (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html)
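For example, a minimal sketch using filter clauses so that both ranges must match (field names taken from the table above; exact builder methods may vary slightly between client versions):
QueryBuilder query = QueryBuilders.boolQuery()
        .filter(QueryBuilders.rangeQuery("maths").gte(60))      // at least 60 in maths
        .filter(QueryBuilders.rangeQuery("computers").gte(60)); // and at least 60 in computers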

SPARQL ARQ Query Execution

So I have this piece of Jena code, which basically tries to build a query using a Triple, an ElementTriplesBlock, and finally QueryFactory.make(). I have a local Virtuoso instance set up, so my SPARQL endpoint is localhost, i.e. http://localhost:8890/sparql. The RDF data I am querying is generated by the Lehigh University Benchmark generator. Now I am trying to replace the triples in the query pattern based on some conditions: let's say the query is made of two BGPs (triple patterns), and one of the triple patterns gives zero results; I'd then want to change that triple pattern to something else. How do I achieve this in Jena? My code looks like:
//Create your triples
Triple pattern1 = Triple.create(Var.alloc("X"),Node.createURI("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"),Node.createURI("http://swat.cse.lehigh.edu/onto/univ-bench.owl#AssociateProfessor"));
Triple pattern = Triple.create(Var.alloc("X"), Node.createURI("http://swat.cse.lehigh.edu/onto/univ-bench.owl#emailAddress"), Var.alloc("Y2"));
ElementTriplesBlock block = new ElementTriplesBlock();
block.addTriple(pattern1);
block.addTriple(pattern);
ElementGroup body = new ElementGroup();
body.addElement(block);
//Build a Query here
Query q = QueryFactory.make();
q.setPrefix("ub", "http://swat.cse.lehigh.edu/onto/univ-bench.owl#");
q.setQueryPattern(body);
q.setQuerySelectType();
q.addResultVar("X");
//?X ub:emailAddress ?Y2 .
//Query to String
System.out.println(q.toString());
QueryExecution qexec = QueryExecutionFactory.sparqlService("http://localhost:8890/sparql", q);
Op op = Algebra.optimize(Algebra.compile(q));
System.out.println(op.toString());
So, to be clear, I am able to see the BGP in relational algebra form via the Op op = Algebra.optimize(Algebra.compile(q)) line. The output looks like:
(project (?X)
(bgp
(triple ?X <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://swat.cse.lehigh.edu/onto/univ-bench.owl#AssociateProfessor>)
(triple ?X <http://swat.cse.lehigh.edu/onto/univ-bench.owl#emailAddress> ?Y2)
))
Now, how would I go about evaluating the execution of each triple? If I just wanted to print the number of results at each step of the query pattern execution, how would I do it? I did read some of the examples here; I guess one has to use an OpExecutor and a QueryIterator, but I am not sure how they all fit together. Essentially, I just want to iterate through each of the basic graph patterns and output the pattern along with the number of results it returns from the endpoint. Any help or pointers would be appreciated.
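One straightforward way to get such per-pattern counts, without going down to the OpExecutor level, is to wrap each triple pattern in its own SELECT query and count the solutions. A sketch (not from the original thread) using the same classes as the code above:
for (Triple t : new Triple[] { pattern1, pattern }) {
    // Build a SELECT * query containing just this one triple pattern.
    ElementTriplesBlock single = new ElementTriplesBlock();
    single.addTriple(t);
    ElementGroup group = new ElementGroup();
    group.addElement(single);
    Query probe = QueryFactory.make();
    probe.setQuerySelectType();
    probe.setQueryResultStar(true);
    probe.setQueryPattern(group);
    QueryExecution probeExec =
            QueryExecutionFactory.sparqlService("http://localhost:8890/sparql", probe);
    int count = ResultSetFormatter.consume(probeExec.execSelect()); // counts and exhausts the rows
    System.out.println(t + " -> " + count + " results");
    probeExec.close();
}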

Neo4j slow cypher query in embedded mode

I have a huge graph database with authors, which are connected to papers, and papers are connected to nodes which contain meta information about the paper.
I tried to select authors which match a specific pattern, so I executed the following Cypher statement in Java.
String query = "MATCH (n:AUTHOR) WHERE n.name =~ '(?i).*jim.*' RETURN n";
db.execute(query);
I get a result set with all "authors" back, but the execution is very slow. Is that because Neo4j writes the result into memory?
If I try to find nodes with the Java API, it is much faster. Of course, I am only able to search for the exact name, as in the following code example, but it is about 4 seconds faster than the query above. I tested it on a small database with about 50 nodes, of which only 6 are authors. The six authors are also in the index.
db.findNodes(NodeLabel.AUTHOR, NodeProperties.NAME, "jim knopf" );
Is there a way to speed up the Cypher query? Or a possibility to get all nodes that match a given pattern via the Java API and the findNodes() method?
Just for information: I created the index for the author's name in Java with graph.schema().indexFor(NodeLabel.AUTHOR).on("name").create();
Perhaps somebody could help. Thanks in advance.
EDIT:
I ran some tests today. If I execute the query PROFILE MATCH (n:AUTHOR) WHERE n.name = 'jim seroka' RETURN n; in the browser interface, I see only the NodeByLabelScan operator. It seems that Neo4j does not automatically use the index (the index for name is online). If I force the specific index and execute PROFILE MATCH (n:AUTHOR) USING INDEX n:AUTHOR(name) WHERE n.name = 'jim seroka' RETURN n;, the index is used. Normally Neo4j should pick the correct index automatically. Is there any configuration to set?
I also did some testing in embedded mode again, to check the performance of the query there. I tried to select the author "jim seroka" with db.findNode(NodeLabel.AUTHOR, "name", "jim seroka");. It works, and it seems to me that the index is used, given an execution time of ~0.05 seconds.
But if I run the same query I executed in the browser interface, using the specific index, it takes ~4.9 seconds. Why? I'm a bit at a loss. The database is local and there are only 6 authors. Is the connector slow, or is the way I create the connection wrong? Granted, findNode() returns just a node rather than executing a whole Result, but a four-second difference?
The following source code should show how the database will be created and the query is executed.
public static GraphDatabaseService getNeo4jDB() {
    ....
    return new GraphDatabaseFactory().newEmbeddedDatabase(STORE_DIR);
}

private Result findAuthorNode(String searchValue) {
    db = getNeo4jDB();
    String query = "MATCH (n:AUTHOR) USING INDEX n:AUTHOR(name) WHERE n.name = 'jim seroka' RETURN n";
    return db.execute(query);
}
Your query uses a regular expression and therefore is not able to use an index:
MATCH (n:AUTHOR) WHERE n.name =~ '(?i).*jim.*' RETURN n
Neo4j 2.3 introduced the index-backed STARTS WITH string operator, so this query would be very performant:
MATCH (n:Author) WHERE n.name STARTS WITH 'jim' RETURN n
Not quite the same as the regular expression, but will have better performance.
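In embedded mode, the same kind of query could be run roughly like this (a sketch only; db is the GraphDatabaseService from the question, and the {prefix} parameter keeps the plan cacheable):
try (Transaction tx = db.beginTx()) {
    Map<String, Object> params = new HashMap<>();
    params.put("prefix", "jim");
    Result result = db.execute(
            "MATCH (n:AUTHOR) WHERE n.name STARTS WITH {prefix} RETURN n", params);
    while (result.hasNext()) {
        Node author = (Node) result.next().get("n");
        System.out.println(author.getProperty("name"));
    }
    result.close();   // release the underlying result resources
    tx.success();
}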

SPARQL Query Execution time using Jena on a Virtuoso service

So I am using a Virtuoso SPARQL endpoint and I am using Jena to query it. I use QueryFactory and QueryExecution to create a SPARQL query:
Query query = QueryFactory.create(sparqlQueryString1);
QueryExecution qexec = QueryExecutionFactory.sparqlService("http://localhost:8890/sparql", query);
ResultSet results = qexec.execSelect();
Now I want to calculate the time taken to run this query. How does one measure that using Jena against Virtuoso? Is it possible? I did look at methods like getTimeOut1() and getTimeOut2(), but they don't seem to give me any good direction. As a hack I tried using Java's built-in System.currentTimeMillis(), but I am not sure that is the right way. Any pointers on how to find the execution time would be appreciated!
Results come back as a stream, so the timing needs to span from just before qexec.execSelect() to just after the app has finished handling the results, not just the call to execSelect():
// Timer here is ARQ's utility class (org.apache.jena.atlas.lib.Timer)
Timer timer = new Timer();
timer.startTimer();
ResultSet results = qexec.execSelect();
ResultSetFormatter.consume(results);   // force full evaluation of the result stream
long x = timer.finishTimer();          // time in milliseconds
It's not clear whether you want to time the full round-trip, or just the time Virtuoso spends on things...
Virtuoso 7 lets you get the compilation (query plan) and execution time of a query using the profile function.
You can also enable general query logging and profiling using the prof_enable function.
