This question is similar to these two: 16283441, 15456345.
UPDATE: here's a database dump.
In a db of 190K nodes and 727K relationships (and 128MB of database disk usage), I'd like to run the following query:
START start_node=node(<id>)
MATCH (start_node)-[r:COOCCURS_WITH]-(partner),
(partner)-[s:COOCCURS_WITH]-(another_partner)-[:COOCCURS_WITH]-(start_node)
RETURN COUNT(DISTINCT s) as num_partner_partner_links;
In this db 90% of the nodes have 0 relationships, and the remaining 10% have from 1 up to 670, so the biggest network this query can return cannot possibly have more than about 220K links (670*670/2).
On nodes with fewer than 10K partner_partner_links the query takes 2-4 seconds when warmed up.
For more connected nodes (20-45K links) it takes about 40-50 seconds (I don't know how long it would take for the most connected ones).
Specifying the relationship direction helps a bit, but not much (and then the query doesn't return what I need it to return).
Profiling the query on one of the biggest nodes says:
==> ColumnFilter(symKeys=[" INTERNAL_AGGREGATE48d9beec-0006-4dae-937b-9875f0370ea6"], returnItemNames=["num_partner_links"], _rows=1, _db_hits=0)
==> EagerAggregation(keys=[], aggregates=["( INTERNAL_AGGREGATE48d9beec-0006-4dae-937b-9875f0370ea6,Distinct)"], _rows=1, _db_hits=0)
==> PatternMatch(g="(partner)-['r']-(start_node)", _rows=97746, _db_hits=34370048)
==> TraversalMatcher(trail="(start_node)-[ UNNAMED3:COOCCURS_WITH WHERE true AND true]-(another_partner)-[s:COOCCURS_WITH WHERE true AND true]-(partner)", _rows=116341, _db_hits=117176)
==> ParameterPipe(_rows=1, _db_hits=0)
neo4j-sh (0)$
I don't see why this would be so slow; most of the stuff should be in RAM anyway. Is it possible to get this under 100ms, or is Neo4j not up to that? I could put up the whole db somewhere if that would help.
The biggest puzzle is that the same query runs slower when rewritten with different node symbols :)
START n=node(36)
MATCH (n)-[r:COOCCURS_WITH]-(m),
(m)-[s:COOCCURS_WITH]-(p)-[:COOCCURS_WITH]-(n)
RETURN COUNT(DISTINCT s) AS num_partner_partner_links;
START start_node=node(36)
MATCH (start_node)-[r:COOCCURS_WITH]-(partner),
(partner)-[s:COOCCURS_WITH]-(another_partner)-[:COOCCURS_WITH]-(start_node)
RETURN COUNT(DISTINCT s) AS num_partner_partner_links;
The former always runs in over 4.2 seconds, and the latter in under 3.8, no matter how many times I run one or the other (interleaved)!?
SW/HW details: Neo4j (Advanced) v1.9.RC2, JDK 1.7.0.10, a MacBook Pro with an SSD, 8 GB RAM, 2-core i7, with the following Neo4j config:
neostore.nodestore.db.mapped_memory=550M
neostore.relationshipstore.db.mapped_memory=540M
neostore.propertystore.db.mapped_memory=690M
neostore.propertystore.db.strings.mapped_memory=430M
neostore.propertystore.db.arrays.mapped_memory=230M
neostore.propertystore.db.index.keys.mapped_memory=150M
neostore.propertystore.db.index.mapped_memory=140M
wrapper.java.initmemory=4092
wrapper.java.maxmemory=4092
Change your query to the one below. On my laptop, with significantly lower specs than yours, the execution time halves.
START start_node=node(36)
MATCH (start_node)-[r:COOCCURS_WITH]-(partner)
WITH start_node, partner
MATCH (partner)-[s:COOCCURS_WITH]-(another_partner)-[:COOCCURS_WITH]-(start_node)
RETURN COUNT(DISTINCT s) AS num_partner_partner_links;
Also, using your settings doesn't affect performance much compared to the default settings. I'm afraid you can't get the performance you want, but this query is a step in the right direction.
Usually the traversal API will be faster than Cypher because you explicitly control the traversal. I've mimicked the query as follows:
// Imports assume the Neo4j 1.9 embedded API used in the question; package locations differ in later versions.
import java.util.HashSet;
import java.util.Set;

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Path;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.graphdb.traversal.Evaluators;
import org.neo4j.kernel.Traversal;
import org.neo4j.kernel.Uniqueness;

public class NeoTraversal {

    public static void main(final String[] args) {
        final GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabaseBuilder("/neo4j")
                .loadPropertiesFromURL(NeoTraversal.class.getClassLoader().getResource("neo4j.properties"))
                .newGraphDatabase();

        final Set<Long> uniquePartnerRels = new HashSet<Long>();
        long startTime = System.currentTimeMillis();
        final Node start = db.getNodeById(36);

        // Depth 1: every direct COOCCURS_WITH partner of the start node.
        for (final Path path : Traversal.description()
                .breadthFirst()
                .relationships(Rel.COOCCURS_WITH, Direction.BOTH)
                .uniqueness(Uniqueness.NODE_GLOBAL)
                .evaluator(Evaluators.atDepth(1))
                .traverse(start)) {
            Node partner = start.equals(path.startNode()) ? path.endNode() : path.startNode();

            // Depth 2: paths partner-[s]-another_partner-[]-start_node; the first
            // relationship on each such path is one of the partner-partner links to count.
            for (final Path partnerPath : Traversal.description()
                    .depthFirst()
                    .relationships(Rel.COOCCURS_WITH, Direction.BOTH)
                    .uniqueness(Uniqueness.RELATIONSHIP_PATH)
                    .evaluator(Evaluators.atDepth(2))
                    .evaluator(Evaluators.includeWhereEndNodeIs(start))
                    .traverse(partner)) {
                uniquePartnerRels.add(partnerPath.relationships().iterator().next().getId());
            }
        }
        System.out.println("Execution time: " + (System.currentTimeMillis() - startTime));
        System.out.println(uniquePartnerRels.size());
    }

    static enum Rel implements RelationshipType {
        COOCCURS_WITH
    }
}
This clearly outperforms the Cypher query, so it might be a good alternative for you. Optimization is likely still possible.
It seems that for anything but depth/breadth-first traversal, Neo4j is not that "blazing fast". I've solved the problem by precomputing all the networks and storing them in MongoDB. A node document describing a network looks like this:
{
    node_id : long,
    partners : long[],
    partner_partner_links : long[]
}
Partners and partner_partner_links are ids of documents describing edges. Fetching the whole network takes two queries: one for this document, and another for the edge properties (which also hold the node properties):
db.edge.find({"_id" : {"$in" : network.partner_partner_links}});
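For illustration, a minimal sketch of that two-query fetch using the current MongoDB Java driver; the connection string, database name, and the node collection name are assumptions (only the edge collection is named above):
import java.util.List;

import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;

public class NetworkFetch {
    public static void main(String[] args) {
        // Connection string and database name are placeholders.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("networks");

            // Query 1: the precomputed network document for one node ("node" collection name assumed).
            Document network = db.getCollection("node")
                    .find(Filters.eq("node_id", 36L))
                    .first();

            // Query 2: the edge documents (with their properties) referenced by the network.
            List<Long> linkIds = network.getList("partner_partner_links", Long.class);
            for (Document edge : db.getCollection("edge").find(Filters.in("_id", linkIds))) {
                System.out.println(edge.toJson());
            }
        }
    }
}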
Related
I'm writing an Android app that supports exporting the app database to various formats. I don't want to run out of memory, but I want to page through the results easily without receiving updates when the data changes. So I put it in a service and came up with the following method of paging.
I use a LIMIT clause in my query to cap the number of results returned, and I'm sorting on the primary key, so it should be fast. I use a pair of nested for loops to execute the series of queries until no results are returned and to walk through the returned results, so that's linear. It's in a service, so it doesn't matter that I'm using queries that return results immediately here.
I feel like I might be doing something bad here. Am I?
// page through all results
for (List<CountedEventType> typeEvents = dao.getEventTypesPaged2(0);
        typeEvents.size() > 0;
        typeEvents = dao.getEventTypesPaged2(typeEvents.get(typeEvents.size() - 1).uid)
) {
    for (CountedEventType type : typeEvents) {
        // Do something for every result.
    }
}
Here's my dao method.
@Dao
interface ExportDao {
    @Query("SELECT * FROM CountedEventType WHERE uid > :lastUid ORDER BY uid ASC LIMIT 4")
    List<CountedEventType> getEventTypesPaged2(int lastUid);
}
I want to use Apache Spark on my cluster, which is made up of 5 low-spec machines. First I installed Cassandra 3.11.3 on my nodes, and all of the nodes are OK.
After that I inserted 100k records into my nodes with a Java API, without using Spark, and that all works too.
Now I want to execute a simple query like the following:
select * from myKeySpace.myTbl where field1='someValue';
Since my nodes are weak in hardware, I want to get just a few records from myTbl, like this:
select * from myKeySpace.myTbl where field1='someValue' limit 20;
I have tested this (A) but it is very slow (and I don't know the reason):
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue' limit 20");
and also this (B), where I think Spark fetches all the data and then applies the limit function, which is not my goal:
Dataset<Row> df1 = sparkSession.sql("select * from myKeySpace.myTbl where field1='someValue'").limit(20);
I think I could use Spark core (C) too. I also know that a method called perPartitionLimit was introduced in Cassandra 3.6 and greater (D).
As you know, since my nodes are weak, I don't want to fetch all the records from the Cassandra table and then apply a limit or something like that. I want to fetch just a small number of records from my table, so that my nodes can handle it.
So what is the best solution?
UPDATE:
I have tried the suggestion given by @AKSW in the comments:
SparkConf conf = new SparkConf()
.setAppName("SparkTest")
.set("spark.cassandra.connection.host","192.168.107.100");
long limit=20;
JavaSparkContext jsc = new JavaSparkContext(conf);
CassandraJavaRDD<CassandraRow> rdd1 = javaFunctions(jsc)
.cassandraTable("myKeySpace", "myTbl")
.select("id").perPartitionLimit(limit);
System.out.println("Count: " + rdd1.count()); //output is "Count: 100000" which is wrong!
jsc.stop();
but perPartitionLimit(limit) with limit=20 does not work, and all records are fetched!
My scenario is like this.
Solr indexing happens for a product, and then the product's approval status is set to unapproved from the backoffice. After that, when you search from the website for words that appear in the product's description, or directly for the product code, you get a server error, since the product that was unapproved is still present in Solr.
If you manually perform any kind of indexing from the backoffice, it works again. But that is not a good solution, since there might be lots of products whose status changes, and it does not happen instantly. Using a cronjob for indexing is not a fast solution either: you get server errors until the cronjob runs.
I would like to update the Solr index instantly for attributes which change frequently, like price, status, etc. For instance, when an attribute changes, is it a good approach to start a partial index immediately from Java code? If so, how (via IndexerService?)? As another option, is it a better idea to make an HTTP request to Solr for the attribute?
In summary, I am looking for the best solution to perform a partial index.
Any ideas?
For this case you need to write two new important pieces of Solr configuration:
1) A new Solr cronjob that triggers the indexing
2) A new SolrIndexerQuery for indexing with your special requirements
When you have a look at the default setup from hybris you see:
INSERT_UPDATE CronJob;code[unique=true];job(code);singleExecutable;sessionLanguage(isocode);active;
;backofficeSolrIndexerUpdateCronJob;backofficeSolrIndexerUpdateJob;false;en;false;
INSERT Trigger;cronJob(code);active;activationTime;year;month;day;hour;minute;second;relative;weekInterval;daysOfWeek;
;backofficeSolrIndexerUpdateCronJob;true;;-1;-1;-1;-1;-1;05;false;0;;
The part above configures when the job should run. You can modify it so that it runs every 5 seconds, for example.
INSERT_UPDATE SolrIndexerQuery; solrIndexedType(identifier)[unique = true]; identifier[unique = true]; type(code); injectCurrentDate[default = true]; injectCurrentTime[default = true]; injectLastIndexTime[default = true]; query; user(uid)
; $solrIndexedType ; $solrIndexedType-updateQuery ; update ; false ; false ; false ; "SELECT DISTINCT {PK} FROM {Product AS p JOIN VariantProduct AS vp ON {p.PK}={vp.baseProduct} } WHERE {p.modifiedtime} >= ?lastStartTimeWithSuccess OR {vp.modifiedtime} >= ?lastStartTimeWithSuccess" ; admin
The second part is the more important one. Here you define which products should be indexed. You can see that the update job looks for every Product that was modified. Here you could write a new FlexibleSearch query with your special requirements.
tl;dr answer: you have to write a new, performant SolrIndexerQuery that can be triggered every 5 seconds.
I have a huge graph database with authors, which are connected to papers, and papers are connected to nodes which contain the papers' meta information.
I tried to select authors which match a specific pattern, and therefore I executed the following Cypher statement in Java:
String query = "MATCH (n:AUTHOR) WHERE n.name =~ '(?i).*jim.*' RETURN n";
db.execute(query);
I get a result set with all "authors" back, but the execution is very slow. Is it because Neo4j writes the result into memory?
If I try to find nodes with the Java API, it is much faster. Of course, I am only able to search for the exact name, as in the following code example, but it is about 4 seconds faster than the query above. I tested it on a small database with about 50 nodes, of which only 6 are authors. The six authors are also in the index.
db.findNodes(NodeLabel.AUTHOR, NodeProperties.NAME, "jim knopf" );
Is there a way to speed up the Cypher query? Or a possibility to get all nodes which match a given pattern via the Java API and the findNodes() method?
Just for information, I created the index for the author's name in Java with graph.schema().indexFor(NodeLabel.AUTHOR).on("name").create();
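For reference, a minimal sketch of that index creation in context; it assumes NodeLabel.AUTHOR implements org.neo4j.graphdb.Label (as in the findNodes() call above) and uses the embedded transaction API:
import java.util.concurrent.TimeUnit;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;

// ...
GraphDatabaseService graph = getNeo4jDB();

// Schema operations need their own transaction, separate from data writes.
try (Transaction tx = graph.beginTx()) {
    graph.schema().indexFor(NodeLabel.AUTHOR).on("name").create();
    tx.success();
}

// Optionally block until the index has been populated before querying it.
try (Transaction tx = graph.beginTx()) {
    graph.schema().awaitIndexesOnline(10, TimeUnit.SECONDS);
    tx.success();
}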
Perhaps somebody could help. Thanks in advance.
EDIT:
I ran some tests today. If I execute the query PROFILE MATCH (n:AUTHOR) WHERE n.name = 'jim seroka' RETURN n; in the browser interface, I only see the NodeByLabelScan operator. It seems to me that Neo4j does not automatically use the index (the index for name is online). If I force the specific index and execute the query PROFILE MATCH (n:AUTHOR) USING INDEX n:AUTHOR(name) WHERE n.name = 'jim seroka' RETURN n;, the index is used. Normally Neo4j should use the correct index automatically. Is there any configuration to set?
I also did some testing in embedded mode again, to check the performance of the query there. I tried to select the author "jim seroka" with db.findNode(NodeLabel.AUTHOR, "name", "jim seroka");. It works, and it seems to me that the index is used, given an execution time of ~0.05 seconds.
But if I run the same query I executed in the interface and mentioned before, using the specific index, it takes ~4.9 seconds. Why? I'm a bit lost. The database is local and there are only 6 authors. Is the connector slow, or is the connection being set up incorrectly? OK, findNode() returns just a node while execute() produces a whole Result, but a four-second difference?
The following source code should show how the database will be created and the query is executed.
public static GraphDatabaseService getNeo4jDB() {
    ....
    return new GraphDatabaseFactory().newEmbeddedDatabase(STORE_DIR);
}

private Result findAuthorNode(String searchValue) {
    db = getNeo4jDB();
    String query = "MATCH (n:AUTHOR) USING INDEX n:AUTHOR(name) WHERE n.name = 'jim seroka' RETURN n";
    return db.execute(query);
}
Your query uses a regular expression and therefore is not able to use an index:
MATCH (n:AUTHOR) WHERE n.name =~ '(?i).*jim.*' RETURN n
Neo4j 2.3 introduced the index-backed STARTS WITH string operator, so this query would be very performant:
MATCH (n:Author) WHERE n.name STARTS WITH 'jim' RETURN n
Not quite the same as the regular expression, but will have better performance.
Some of the queries we run have 100,000+ results, and it takes forever to load them and then send them to the client. So I'm using ScrollableResults to get a paged results feature. But we're topping out at roughly 50k results (never exactly the same number of results).
I'm on an Oracle9i database, using the Oracle 10 drivers, and Hibernate is configured to use the Oracle9 dialect. I tried with the latest JDBC driver (ojdbc6.jar) and the problem was reproduced.
We also followed some advice and added an ordering clause, but the problem was still reproduced.
Here is a code snippet that illustrates what we do:
final int pageSize = 50;
Criteria crit = sess.createCriteria(ABC.class);
crit.add(Restrictions.eq("property", value));
crit.setFetchSize(pageSize);
crit.addOrder(Order.asc("property"));
ScrollableResults sr = crit.scroll();
...
...
ArrayList page = new ArrayList(pageSize);
do {
    for (Object entry : page)
        sess.evict(entry); // to avoid having our memory just explode out of proportion
    page.clear();
    for (int i = 0; i < pageSize && !metLastRow; i++) {
        if (sr.next())
            page.add(sr.get(0));
        else
            metLastRow = true;
    }
    metLastRow = metLastRow ? metLastRow : sr.isLast();
    sendToClient(page);
} while (!metLastRow);
So why does the result set tell me it's at the end when there should be so many more results?
Your code snippet is missing important pieces, like the definitions of resultSet and page. But I wonder anyway, shouldn't the line
if (resultSet.next())
be rather
if (sr.next())
?
As a side note, AFAIK cleaning up superfluous objects from the persistence context could be achieved simply by calling
session.flush();
session.clear();
instead of looping through the collection of objects to evict each one separately. (Of course, this requires that the query is executed in its own independent session.)
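For illustration, a minimal sketch of the paging loop from the question rewritten with flush()/clear() instead of per-entity evict(); it reuses the variable names from the question's snippet and assumes the scroll runs in its own session, so clear() cannot discard unrelated pending changes:
ScrollableResults sr = crit.scroll();
List page = new ArrayList(pageSize);
boolean metLastRow = false;
do {
    page.clear();
    for (int i = 0; i < pageSize && !metLastRow; i++) {
        if (sr.next())
            page.add(sr.get(0));
        else
            metLastRow = true;
    }
    sendToClient(page);
    // Detach everything loaded so far in one go instead of evicting entity by entity.
    sess.flush();
    sess.clear();
} while (!metLastRow);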
Update: OK, next round of guesses :-)
Can you actually check what rows are sent to the client and compare that against the result of the equivalent SQL query run directly against the DB? It would be good to know whether this code retrieves (and sends to the client) all rows up to a certain limit, or only some rows (like every 2nd) from the whole result set, or ... That could shed some light on the root cause.
Another thing you could try is
crit.setFirstResult(0).setMaxResults(200000);
As I had the same issue with a large project codebase based on List<E> instances,
I wrote a really limited List implementation with only iterator support, to browse a ScrollableResults without refactoring all the service implementations and method prototypes.
This implementation is available in my IterableListScrollableResults.java Gist
It also regularly flushes Hibernate entities from the session. Here is a way to use it, for instance when exporting all non-archived entities from the DB to a text file with a for loop:
Criteria criteria = getCurrentSession().createCriteria(LargeVolumeEntity.class);
criteria.add(Restrictions.eq("archived", Boolean.FALSE));
criteria.setReadOnly(true);
criteria.setCacheable(false);
List<E> result = new IterableListScrollableResults<E>(getCurrentSession(),
criteria.scroll(ScrollMode.FORWARD_ONLY));
for(E entity : result) {
dumpEntity(file, entity);
}
Hoping it may help.