I have a huge graphdatabase with authors, which are connected to papers and papers a connected to nodes which contains meta information of the paper.
I tried to select authors which match a specific pattern and therefore I executed the following cypher statement in java.
String query = "MATCH (n:AUTHOR) WHERE n.name =~ '(?i).*jim.*' RETURN n";
db.execute(query);
I get a resultSet with all "authors" back. But the execution is very slow. Is it, because Neo4j writes the result into the memory?
If I try to find nodes with the Java API, it is much faster. Of course, I am only able to search for the exact name like the following code example, but it is about 4 seconds faster as the query above. I tested it on a small database with about 50 nodes, whereby only 6 of the nodes are authors. The six author are also in the index.
db.findNodes(NodeLabel.AUTHOR, NodeProperties.NAME, "jim knopf" );
Is there a chance to speed up the cypher? Or a possiblity to get all nodes via Java API and the findNodes() method, which match a given pattern?
Just for information, I created the index for the name of the author in java with graph.schema().indexFor(NodeLabel.AUTHOR).on("name").create();
Perhaps somebody could help. Thanks in advance.
EDIT:
I run some tests today. If I execute the query PROFILE MATCH (n:AUTHOR) WHERE n.name = 'jim seroka' RETURN n; in the browser interface, I have only the operator NodeByLabelScan. It seems to me, that Neo4j does not automatic use the index (Index for name is online). If I use a the specific index, and execute the query PROFILE MATCH (n:AUTHOR) USING INDEX n:AUTHOR(name) WHERE n.name = 'jim seroka' RETURN n; the index will be used. Normally Neo4j should use automatically the correct index. Is there any configuration to set?
I also did some testing in the embedded mode again, to check the performance of the query in the embedded mode. I tried to select the author "jim seroka" with db.findNode(NodeLabel.AUTHOR, "name", "jim seroka");. It works, and it seems to me that the index is used, because of a execution time of ~0,05 seconds.
But if I run the same query, as I executed in the interface and mentioned before, using a specific index, it takes ~4,9 seconds. Why? I'm a little bit helpless. The database is local and there are only 6 authors. Is the connector slow or is the creation of connection wrong? OK, findNode() does return just a node and execute a whole Result, but four seconds difference?
The following source code should show how the database will be created and the query is executed.
public static GraphDatabaseService getNeo4jDB() {
....
return new GraphDatabaseFactory().newEmbeddedDatabase(STORE_DIR);
}
private Result findAuthorNode(String searchValue) {
db = getNeo4jDB();
String query = "MATCH (n:AUTHOR) USING INDEX n:AUTHOR(name) WHERE n.name = 'jim seroka' RETURN n";
return db.execute(query);
}
Your query uses a regular expression and therefore is not able to use an index:
MATCH (n:AUTHOR) WHERE n.name =~ '(?i).*jim.*' RETURN n
Neo4j 2.3 introduced index supported STARTS WITH string operator so this query would be very performant:
MATCH (n:Author) WHERE n.name STARTS WITH 'jim' RETURN n
Not quite the same as the regular expression, but will have better performance.
Related
Hey I'm currently writing a Java application that accesses NEO4J via Spring Neo4J Driver.
I have a couple of nodes with arrays. No I'm trying to write an cypher Query that deletes an Element from an array of a matched node. If the element was the last one i would love to delete the complete node. To achive this I'm using apoc.do.when. You can find a simplified version of my query below.
MATCH (n:NODE) WHERE "Peter" IN n.NAMES
CALL apoc.do.when(size(n.NAMES) > 1, 'SET n.NAMES = [x IN n.NAMES WHERE x <> "Peter"]', 'DETACH DELETE n') YIELD value
RETURN value
My query is overall working fine, but I don't get the Result summary back anymore in my Java application.
I'm calling the query the following way:
ResultSummary output = driver.session().run(query.withParameters(params)).consume();
I know that the query is executed and deleting a node. I validated that by Neo4J browser, but the result summary says:
serverInfo=InternalServerInfo{address='localhost:7687', version='Neo4j/3.5.17'}, databaseInfo=InternalDatabaseInfo{name='null'}, queryType=READ_WRITE, counters=null, plan=null, profile=null, notifications=[], resultAvailableAfter=143, resultConsumedAfter=1}
Updates: 0
Delete: 0
Therefore i can not validate from my Javacode if the operation was successful. I assume that apoc.do.when does not promote the result summary from the query correctly. Is there anyway to fix this or do i need to validate this with a second query?
You can modify the Cypher query to return a result that indicates whether an update or deletion occurred. For example:
MATCH (n:NODE) WHERE "Peter" IN n.NAMES
CALL apoc.do.when(
size(n.NAMES) > 1,
'SET n.NAMES = [x IN n.NAMES WHERE x <> "Peter"] RETURN "updated" AS res',
'DETACH DELETE n RETURN "deleted" AS res') YIELD value
RETURN value.res AS res
Then your Java client can iterate through (or stream) the resulting records and count the number of updates and deletions.
My scenario is like this.
Solr Indexing happens for a product and then product approval status is made unapproved from backoffice. After then, when you search the related words that is placed in description of the product or directly product code from website, you get a server error since the product that is made unapproved is still placed in solr.
If you perform any type of indexing from backoffice manually, it works again. But it is not a good solution since there might be lots of products whose status is changed or that is not a solution which happens instantly. If you use cronjob for indexing, that is not a fast solution again.You get server error until cronjob starts to work.
I would like to update solr index instantly for the attributes which changes frequently like price, status, etc. For instance, when an attribute changes, Is it a good way to start partial index immediately in java code? If it is, how? (by IndexerService?). For another solution, Is it a better idea to make http request to solr for the attribute?
In summary, I am looking for the best solution to perform partial index.
Any ideas?
For this case you need to write two new important SOLR-Configuration parts:
1) A new SOLR-Cronjob that trigger the indexing
2) A new SOLR-IndexerQuery for indexing with your special requirements.
When you have a look at the default stuff from hybris you see:
INSERT_UPDATE CronJob;code[unique=true];job(code);singleExecutable;sessionLanguage(isocode);active;
;backofficeSolrIndexerUpdateCronJob;backofficeSolrIndexerUpdateJob;false;en;false;
INSERT Trigger;cronJob(code);active;activationTime;year;month;day;hour;minute;second;relative;weekInterval;daysOfWeek;
;backofficeSolrIndexerUpdateCronJob;true;;-1;-1;-1;-1;-1;05;false;0;;
This part above is to configure when the job should run. You can modify him, that he should run ever 5 seconds for example.
INSERT_UPDATE SolrIndexerQuery; solrIndexedType(identifier)[unique = true]; identifier[unique = true]; type(code); injectCurrentDate[default = true]; injectCurrentTime[default = true]; injectLastIndexTime[default = true]; query; user(uid)
; $solrIndexedType ; $solrIndexedType-updateQuery ; update ; false ; false ; false ; "SELECT DISTINCT {PK} FROM {Product AS p JOIN VariantProduct AS vp ON {p.PK}={vp.baseProduct} } WHERE {p.modifiedtime} >= ?lastStartTimeWithSuccess OR {vp.modifiedtime} >= ?lastStartTimeWithSuccess" ; admin
The second part here is the more important. Here you define which products should be indexed. Here you can see that the UPDATE-Job is looking for every Product that was modified. Here you could write a new FlexibleSearch with your special requirements.
tl;tr Answear: You have to write a new performant solrIndexerQuery that could be trigger every 5 seconds
I'm using neo4j-jdbc 2.3.2 as my neo4j client for java. When I executed following cypher query
match(p:Person) where p.id_number='761201948V' return p.id; it will return P2547228 as node id. I feel like id is same as other properties of the node as I can use it inside where clauses. But here I'm expecting an integer which can use inside this query START p=node('node.id') return p; Is this id is an internal thing to neo4j db? and is there a way to retrieve this id?
From the following two cyphers what is most efficient one?(if both referring same node)
START p=node('2547223') return p;
match(p:Person) where p.id='P2547228' return p;
You have to use the ID(x) function for this. Note, that ID(x) and x.id are a complete different thing. The former returns the internal node/relationship id managed by Neo4j itself. The latter gives the id property which is managed by the user and not by the database itself.
Also note, that a node/relationship ID is always numeric.
Using START is pretty much old school and shouldn't be used any more (except for accessing manual indexes):
start p=node(2547228) return p
This one is a equivalent statement. It is highly efficient since it just needs to do a simple seek operation on the node store:
match(p:Person) where id(p)=2547228 return p;
Looking for a property requires either a node label scan or a schema index lookup:
match(p:Person) where p.id=2547228 return p;
Just check out the query plans on your own by prefixing the statement with PROFILE.
Here I have created a manual/legacy index and added some nodes with certain properties into it.
IndexManager indexy = graphdb.index();
Index<Node>indexery = indexy.forNodes("Main_Twitter_Index");
indexery.add(one,"Name",one.getProperty("Name"));
indexery.add(one,"Email",one.getProperty("Email"));
indexery.add(four,"Name",four.getProperty("Name"));
indexery.add(four,"Email",four.getProperty("Email"));
Now, to query the nodes of that index neo4j suggests query, which uses a key-value pair binding. My question is can I query the same nodes added into the manual index using a simple cypher query like,
START n=node:Main_Twitter_Index(Name = 'Akina')
RETURN n
Which version of Neo4j are you using? The method you describe is the typical index search for anything before 2.0, before they added schema indexing. Your query should work, even in 2.0. Are you having problems running it?
This question is similar to these two: 16283441, 15456345.
UPDATE: here's a database dump.
In a db of 190K nodes and 727K relationships (and 128MB of database disk usage), I'd like to run the following query:
START start_node=node(<id>)
MATCH (start_node)-[r:COOCCURS_WITH]-(partner),
(partner)-[s:COOCCURS_WITH]-(another_partner)-[:COOCCURS_WITH]-(start_node)
RETURN COUNT(DISTINCT s) as num_partner_partner_links;
In this db 90% of the nodes have 0 relationships, and the remaining 10% have from 1 up to 670, so the biggest network this query can return cannot possibly have more than 220K links (670*670)/2).
On nodes with less than 10K partner_partner_links the query takes 2-4 seconds, when wormed up.
For more connected nodes (20-45K links) it takes about 40-50sec (don't know how much it'd take for the most connected ones).
Specifying relationship direction helps a bit but not much (but then the query doesn't return what I need it to return).
Profiling the query on one of the biggest nodes says:
==> ColumnFilter(symKeys=[" INTERNAL_AGGREGATE48d9beec-0006-4dae-937b-9875f0370ea6"], returnItemNames=["num_partner_links"], _rows=1, _db_hits=0)
==> EagerAggregation(keys=[], aggregates=["( INTERNAL_AGGREGATE48d9beec-0006-4dae-937b-9875f0370ea6,Distinct)"], _rows=1, _db_hits=0)
==> PatternMatch(g="(partner)-['r']-(start_node)", _rows=97746, _db_hits=34370048)
==> TraversalMatcher(trail="(start_node)-[ UNNAMED3:COOCCURS_WITH WHERE true AND true]-(another_partner)-[s:COOCCURS_WITH WHERE true AND true]-(partner)", _rows=116341, _db_hits=117176)
==> ParameterPipe(_rows=1, _db_hits=0)
neo4j-sh (0)$
I don't see why would this be so slow, most of the stuff should be in the RAM anyway. Is it possible to have this in under 100ms or neo4j is not up to that? I could put up the whole db somewhere if that would help..
The biggest puzzle is that the same query runs slower when rewritten to use with different node symbols :)
START n=node(36)
MATCH (n)-[r:COOCCURS_WITH]-(m),
(m)-[s:COOCCURS_WITH]-(p)-[:COOCCURS_WITH]-(n)
RETURN COUNT(DISTINCT s) AS num_partner_partner_links;
START start_node=node(36)
MATCH (start_node)-[r:COOCCURS_WITH]-(partner),
(partner)-[s:COOCCURS_WITH]-(another_partner)-[:COOCCURS_WITH]-(start_node)
RETURN COUNT(DISTINCT s) AS num_partner_partner_links;
The former always runs in +4.2 seconds, and latter in under 3.8, no matter how many times I run one and another (interleaved)!?
SW/HW details: (advanced) Neo4j v1.9.RC2, JDK 1.7.0.10, a macbook pro with an SSD disk, 8GBRAM, 2 core i7, with the following neo4j config:
neostore.nodestore.db.mapped_memory=550M
neostore.relationshipstore.db.mapped_memory=540M
neostore.propertystore.db.mapped_memory=690M
neostore.propertystore.db.strings.mapped_memory=430M
neostore.propertystore.db.arrays.mapped_memory=230M
neostore.propertystore.db.index.keys.mapped_memory=150M
neostore.propertystore.db.index.mapped_memory=140M
wrapper.java.initmemory=4092
wrapper.java.maxmemory=4092
Change your query to the one below. On my laptop, with significatly lower specs than yours, the execution time halves.
START start_node=node(36)
MATCH (start_node)-[r:COOCCURS_WITH]-(partner)
WITH start_node, partner
MATCH (partner)-[s:COOCCURS_WITH]-(another_partner)-[:COOCCURS_WITH]-(start_node)
RETURN COUNT(DISTINCT s) AS num_partner_partner_links;
Also, using your settings doesn't affect performance much as compared to the default settings. I'm afraid that you can't get the performance you want, but this query is a step in the right direction.
Usually the traversal API will be faster than Cypher because you explicitly control the traversal. I've mimicked the query as follows:
public class NeoTraversal {
public static void main(final String[] args) {
final GraphDatabaseService db = new GraphDatabaseFactory()
.newEmbeddedDatabaseBuilder("/neo4j")
.loadPropertiesFromURL(NeoTraversal.class.getClassLoader().getResource("neo4j.properties"))
.newGraphDatabase();
final Set<Long> uniquePartnerRels = new HashSet<Long>();
long startTime = System.currentTimeMillis();
final Node start = db.getNodeById(36);
for (final Path path : Traversal.description()
.breadthFirst()
.relationships(Rel.COOCCURS_WITH, Direction.BOTH)
.uniqueness(Uniqueness.NODE_GLOBAL)
.evaluator(Evaluators.atDepth(1))
.traverse(start)) {
Node partner = start.equals(path.startNode()) ? path.endNode() : path.startNode();
for (final Path partnerPath : Traversal.description()
.depthFirst()
.relationships(Rel.COOCCURS_WITH, Direction.BOTH)
.uniqueness(Uniqueness.RELATIONSHIP_PATH)
.evaluator(Evaluators.atDepth(2))
.evaluator(Evaluators.includeWhereEndNodeIs(start))
.traverse(partner)) {
uniquePartnerRels.add(partnerPath.relationships().iterator().next().getId());
}
}
System.out.println("Execution time: " + (System.currentTimeMillis() - startTime));
System.out.println(uniquePartnerRels.size());
}
static enum Rel implements RelationshipType {
COOCCURS_WITH
}
}
This clearly outperforms the cypher query, thus this might be a good alternative for you. Optimization is likely still possible.
Seems like for anything but depth/breadth first traversal, neo4j is not that "blazing fast". I've solved the problem by precomputing all the networks and storing them into MongoDB. A node document describing a network looks like this:
{
node_id : long,
partners : long[],
partner_partner_links : long[]
}
Partners and partner_partner_links are ids of documents describing egdes. Fetching the whole network takes 2 queries: one for this document, and another for edge properties (which also holds node properties):
db.edge.find({"_id" : {"$in" : network.partner_partner_links}});