Induced subgraphs in neo4j

Induced subgraphs in neo4j - java

I have a graph in neo4j, and for a given node N, I would like to find all nodes that are reachable in a path of no longer than P steps from N, as well as all links between that set of nodes. It seems like this could be possible with either Cypher or the Traversal framework; is one preferred over the other? I'm doing this from Java with an embedded database, and I will need to perform further queries on the subgraph. I've poked around and haven't found any conclusive answers.

I think cypher is the most concise way of getting your desired data, querying for variable length paths, some collecting and refining:
If n is the internal id of your node N and your P is 5:
START begin = node(n) // or e.g. index lookup
MATCH p = (begin)<-[r*..5]-(end) // match all paths of length up to 5
WITH distinct nodes(p) as nodes // collect the nodes contained in the paths
MATCH (x)<-[r]-(y) // find all relationships between nodes
WHERE x in nodes and y in nodes // which were found earlier
RETURN distinct x,r,y // and deduplicate as you find all pairs twice
It might not be the most efficient way, but at least the execution plan explanation in http://console.neo4j.org/ suggests that the y in nodes is considered before the MATCH (x)-[r]-(y).
I couldn't think of a way to avoid matching the relationships twice, therefore the distinct in the return statement.

Related

Tinkerpop/Gremlin: select vertices together with outgoing edge count

I try to find an efficient gremlin query that returns a traversal with the vertex and the number of outgoing edges. Or even better instead of the number of outgoing edges a boolean value if outgoing edges exist or not.
Background: I try to improve the performance of a program that writes some properties on the vertices and then iterates over the outgoing edges to remove some of it. In a lot of cases there are no outgoing edges and the iteration
for (Iterator<Edge> iE = v.edges(Direction.OUT); iE.hasNext();) { ... } consumes a significant part of the runtime. So instead of resolving the ids to vertices (with gts.V(ids) I want to collect the information about the existence of outgoing edges to skip the iteration, if possible.
My first try was:
gts.V(ids).as("v").choose(__.outE(), __.constant(true), __.constant(false)).as("e").select("v", "e");
Second idea was:
gts.V(ids).project("v", "e").by().by(__.outE().count());
Both seem to work, but is there a better solution that does not require the underlying graph implementation to fetch or count all edges?
(We currently use the sqlg implementation of tinkerpop/gremlin with Postgresql and both queries seem to fetch all outgoing edges from Postgresql. This may be a case where some optimization is missing. But my question is not sqlg specific.)

If you only need to know whether edges exist or not then you should limit() results in the by() modulator:
gremlin> g.V().project('v','e').by().by(outE().limit(1).count())
==>[v:v[1],e:1]
==>[v:v[2],e:0]
==>[v:v[3],e:0]
==>[v:v[4],e:1]
==>[v:v[5],e:0]
==>[v:v[6],e:1]
In this way you don't count all of the edges, just the first which is enough to answer your question. You can do true and false if you like with a minor modification:
gremlin> g.V().project('v','e').by().by(coalesce(outE().limit(1).constant(true),constant(false)))
==>[v:v[1],e:true]
==>[v:v[2],e:false]
==>[v:v[3],e:false]
==>[v:v[4],e:true]
==>[v:v[5],e:false]
==>[v:v[6],e:true]

How to compare all elements between two arrays?

I'm attempting to write a program which can identify all nodes in a graph that don't share any common neighbors, and in which all vertices are contained within the various subsystems in the graph. Assume all nodes are numerically labeled for simplicity.
For example, in a graph of a cube, the furthest corners share no common nodes and are part of subsystems that together contain all vertices.
I'm looking to write a program that compares each potential subsystem against all other potential subsystems, regardless of the graph, number of nodes or sides, and finds groups of subsystems whose central nodes don't share common neighbors. For simplicity's sake, assume the graphs aren't usually symmetrical, unlike the cube example, as this introduces functionally equivalent systems. However, the number of nodes in a subsystem, or elements in an array, can vary.
The goal for the program is to find a set of central nodes whose neighbors are each unique to the subsystem, that is no neighbor appears in another subsystem. This would also mean that the total number of nodes in all subsystems, central nodes and neighbors together, would equal the total number of vertices in the graph.
My original plan was to use a 2d array, where rows act as stand-ins for the subsystems. It would compare individual elements in an array against all other elements in all other arrays. If two arrays contain no similar elements, then index the compared array and its central node is recorded, otherwise it is discarded for this iteration. After the program has finished iterating through the 2d array against the first row, it adds up the number of elements from all recorded rows to see if all nodes in the graph are represented. So if a graph contains x nodes, and the number of elements in the recorded rows is less than x, then the program iterates down one row to the next subsystem and compares all values in the row against all other values like before.
Eventually, this system should print out which nodes can make up a group of subsystems that encompass all vertices and whose central nodes share no common neighbors.
My limited exposure to CS makes a task like this daunting, as it's just my way of solving puzzles presented by my professor. I'd find the systems by hand through guess-and-check methods, but with a 60+ node array...
Thanks for any help, and simply pointers in the right direction would be very much appreciated.

I don't have a good solution (and maybe there exists none; it sounds very close to vertex cover). So you may need to resort backtracking. The idea is the following:
Keep a list of uncovered vertices and a list of potential central node candidates. Both lists initially contain all vertices. Then start placing a random central node from the candidate list. This will erase the node and its one-ring from the uncovered list and the node plus its one-ring and two-ring from the candidate list. Do this until the uncovered list is empty or you run out of candidates. If you make a mistake, revert the last step (and possibly more).
In pseudo-code, this looks as follows:
findSolution(uncoveredVertices : list, centralNodeCandidates : list, centralNodes : list)
if uncoveredVertices is empty
return centralNodes //we have found a valid partitioning
if centralNodeCandidates is empty
return [failure] //we cannot place more central nodes
for every n in centralNodeCandidates
newUncoveredVertices <- uncoveredVertices \ { n } \ one-ring of n
newCentralNodeCandidates <- centralNodeCandidates \ { n } \ one-ring of n \ two-ring of n
newCentralNodes = centralNodes u { n }
subProblemSolution = findSolution(newUncoveredVertices, newCentralNodeCandidates, newCentralNodes)
if subProblemSolution is not [failure]
return subProblemSolution
next
return [failure] //none of the possible routes to go yielded a valid solution
Here, \ is the set minus operator and u is set union.
There are several possible optimizations:
If you know the maximum number of nodes, you can represent the lists as bitmaps (for a maximum of 64 nodes, this even fits into a 64 bit integer).
You may end up checking the same state multiple times. To avoid this, you may want to cache the states that resulted in failures. This is in the spirit of dynamic programming.

gremlin query to retrieve vertices which are having multiple edges between them

Consider the above graph. I would like a gremlin query that returns all nodes that have multiple edges between them as shown in the graph.
this graph was obtained using neo4j cypher query:
MATCH (d:dest)-[r]-(n:cust)
WITH d,n, count(r) as popular
RETURN d, n
ORDER BY popular desc LIMIT 5
for example:
between RITUPRAKA... and Asia there are 8 multiple edges hence the query has returned the 2 nodes along with the edges, similarly for other nodes.
Note: the graph has other nodes with only a single edge between them, these nodes will not be returned.
I would like same thing in gremlin.
I have used given below query
g.V().as('out').out().as('in').select('out','in').groupCount().unfold().filter(select(values).is(gt(1))).select(keys)
it is displaying
out:v[1234],in:v[3456] .....
but instead of displaying Ids of the node I want to display values of the node
like out:ICIC1234,in:HDFC234
I have modified query as
g.V().values("name").as('out').out().as('in').values("name").select('out','in').
groupCount().unfold().filter(select(values).is(gt(1))).select(keys)
but it is showing the error like classcastException, each vertex to be traversed use indexes for fast iteration

Your graph doesn't seem to indicate bi-directional edges are possible so I will answer with that assumption in mind. Here's a simple sample graph - please consider including one on future questions as it makes it much easier than pictures and textual descriptions for those reading your question to understand and to get started writing a Gremlin traversal to help you:
g.addV().property(id,'a').as('a').
addV().property(id,'b').as('b').
addV().property(id,'c').as('c').
addE('knows').from('a').to('b').
addE('knows').from('a').to('b').
addE('knows').from('a').to('c').iterate()
So you can see that vertex "a" has two outgoing edges to "b" and one outgoing edge to "c", thus we should get the "a b" vertex pair. One way to get this is with:
gremlin> g.V().as('out').out().as('in').
......1> select('out','in').
......2> groupCount().
......3> unfold().
......4> filter(select(values).is(gt(1))).
......5> select(keys)
==>[out:v[a],in:v[b]]
The above traversal uses groupCount() to count the number of times the "out" and "in" labelled vertices show up (i.e. the number of edges between them). It uses unfold() to iterate through the Map of <Vertex Pairs,Count> (or more literally <List<Vertex>,Long>) and filter out those that have a count greater than 1 (i.e. multiple edges). The final select(keys) drops the "count" as it is not needed anymore (i.e. we just need the keys which hold the vertex pairs for the result).
Perhaps another way to go is with this method:
gremlin> g.V().filter(outE()).
......1> project('out','in').
......2> by().
......3> by(out().
......4> groupCount().
......5> unfold().
......6> filter(select(values).is(gt(1))).
......7> select(keys)).
......8> select(values)
==>[v[a],v[b]]
This approach with project() forgoes the heavier memory requirements for a big groupCount() over the whole graph in favor of building a smaller Map over an individual Vertex that becomes eligible for garbage collection at the end of the by() (or essentially per initial vertex processed).

My suggestion is similar to Stephen's, but also includes the edges or rather the whole path (I guess the Cypher query returned the edges too).
g.V().as("dest").outE().inV().as("cust").
group().by(select("dest","cust")).by(path().fold()).
unfold().filter(select(values).count(local).is(gt(1))).
select(values).unfold()

Algorithm for minimal cost + maximum matching in a general graph

I've got a dataset consisting of nodes and edges.
The nodes respresent people and the edges represent their relations, which each has a cost that's been calculated using euclidean distance.
Now I wish to match these nodes together through their respective edges, where there is only one constraint:
Any node can only be matched with a single other node.
Now from this we know that I'm working in a general graph, where every node could theoretically be matched with any node in the dataset, as long as there is an edge between them.
What I wish to do, is find the solution with the maximum matches and the overall minimum cost.
Node A
Node B
Node C
Node D
- Edge 1:
Start: End Cost
Node A Node B 0.5
- Edge 2:
Start: End Cost
Node B Node C 1
- Edge 3:
Start: End Cost
Node C Node D 0.5
- Edge 2:
Start: End Cost
Node D Node A 1
The solution to this problem, would be the following:
Assign Edge 1 and Edge 3, as that is the maximum amount of matches ( in this case, there's obviously only 2 solutions, but there could be tons of branching edges to other nodes)
Edge 1 and Edge 3 is assigned, because it's the solution with maximum amount of matches and the minimum overall cost (1)
I've looked into quite a few algorithms including Hungarian, Blossom, Minimal-cost flow, but I'm uncertain which is the best for this case. Also there seems so be an awful lot of material to solving these kinds of problems in bipartial graph's, which isn't really the case in this matter.
So I ask you:
Which algorithm would be the best in this scenario to return the (a) maximum amount of matches and (b) with the lowest overall cost.
Do you know of any good material (maybe some easy-to-understand pseudocode), for your recomended algorithm? I'm not the strongest in mathematical notation.

For (a), the most suitable algorithm (there are theoretically faster ones, but they're more difficult to understand) would be Edmonds' Blossom algorithm. Unfortunately it is quite complicated, but I'll try to explain the basis as best I can.
The basic idea is to take a matching, and continually improve it (increase the number of matched nodes) by making some local changes. The key concept is an alternating path: a path from an unmatched node to another unmatched node, with the property that the edges alternate between being in the matching, and being outside it.
If you have an alternating path, then you can increase the size of the matching by one by flipping the state (whether or not they are in the matching) of the edges in the alternating path.
If there exists an alternating path, then the matching is not maximum (since the path gives you a way to increase the size of the matching) and conversely, you can show that if there is no alternating path, then the matching is maximum. So, to find a maximum matching, all you need to be able to do is find an alternating path.
In bipartite graphs, this is very easy to do (it can be done with DFS). In general graphs this is more complicated, and this is were Edmonds' Blossom algorithm comes in. Roughly speaking:
Build a new graph, where there is an edge between two vertices if you can get from u to v by first traversing an edge that is in the matching, and then traversing and edge that isn't.
In this graph, try to find a path from an unmatched vertex to a matched vertex that has an unmatched neighbor (that is, a neighbor in the original graph).
Each edge in the path you find corresponds to two edges of the original graph (namely an edge in the matching and one not in the matching), so the path translates to an alternating walk in the new graph, but this is not necessarily an alternating path (the distinction between path and walk is that a path only uses each vertex once, but a walk can use each vertex multiple times).
If the walk is a path, you have an alternating path and are done.
If not, then the walk uses some vertex more than once. You can remove the part of the walk between the two visits to this vertex, and you obtain a new graph (with part of the vertices removed). In this new graph you have to do the whole search again, and if you find an alternating path in the new graph you can "lift" it to an alternating path for the original graph.
Going into the details of this (crucial) last step would be a bit too much for a stackoverflow answer, but you can find more details on Wikipedia and perhaps having this high-level overview helps you understand the more mathematical articles.
Implementing this from scratch will be quite challenging.
For the weighted version (with the Euclidean distance), there is an even more complicated variant of Edmonds' Algorithm that can handle weights. Kolmogorov offers a C++ implementation and accompanying paper. This can also be used for the unweighted case, so using this implementation might be a good idea (even if it is not in java, there should be some way to interface with it).
Since your weights are based on Euclidean distances there might be a specialized algorithm for that case, but the more general version I mentioned above would also work and and implementation is available for it.

Cypher traversal

I'm implementing a something like a linked-list structure in a Neo4j graph. The graph is created by executing many statements similar to this:
CREATE (R1:root{edgeId:2})-[:HEAD]->
(:node{text: 'edge 2 head text', width:300})-[:NEXT{edge:2, hard:true}]->
(:node{text: 'edge 2 point 0'})-[:NEXT{edge:2}]->
(n0:node{text: 'edge 2 point 1'}),
(n0)-[:BRANCH]->(:root{edgeId:3}),
(n0)-[:NEXT{edge:2}]->
(:node{text: 'edge 2 point 2'})-[:NEXT{edge:2}]->
(:node{text: 'edge 2 point 3'})<-[:TAIL{edge:2}]->(R1)
Traversing an edge means starting with a root node, following its outgoing HEAD relationship to the first node, and following the chain of NEXT relationships until reaching a node with an incoming TAIL relationship from the root we started from.
i.e.:
MATCH path = (root:root:main)-[:HEAD]->(a:point)-[n:NEXT*]->(z:point)<-[:TAIL]-(root)
RETURN nodes(path), n
Every node has an outgoing NEXT relationship, but some nodes also have BRANCH relationships, which point to the root nodes of other edges.
In the above query, nodes(path) obviously returns all the nodes along the edge, and n lists the outgoing NEXT relationship for each node along it. How could I modify this query so that, in addition to the outgoing NEXT relationship, it also returns any outgoing BRANCH relationships
How can I modify the above query so that each record returned contains a node on the path along with a list of all outgoing relationships (both NEXT and BRANCH) from it?
Note that I don't want to traverse the BRANCH edges in this query, I just want it to tell me they're there.
(PS I'm implementing this strategy in Java, but so far have preferred executing Cypher queries directly rather than using the Traversal API. If I'm making this more difficult on myself by doing so, please bring it to my attention).

You can return path-expressions at any time.
MATCH path = (root:root:main)-[:HEAD]->(a:point)-[n:NEXT*]->(z:point)<-[:TAIL]-(root)
RETURN extract(x in nodes(path) | [x, x-[:BRANCH]->()]), n
This x-[:BRANCH]->() returns a collection of paths, so if you just want to access the relationship you'd have to do
[p in x-[:BRANCH]->() | head(rels(p)) ]
For an example of how to implement an activity stream as an unmanaged extension you might have a look at this: https://github.com/jexp/neo4j-activity-stream

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.