Cypher traversal - java

I'm implementing something like a linked-list structure in a Neo4j graph. The graph is created by executing many statements similar to this:
CREATE (R1:root{edgeId:2})-[:HEAD]->
(:node{text: 'edge 2 head text', width:300})-[:NEXT{edge:2, hard:true}]->
(:node{text: 'edge 2 point 0'})-[:NEXT{edge:2}]->
(n0:node{text: 'edge 2 point 1'}),
(n0)-[:BRANCH]->(:root{edgeId:3}),
(n0)-[:NEXT{edge:2}]->
(:node{text: 'edge 2 point 2'})-[:NEXT{edge:2}]->
(:node{text: 'edge 2 point 3'})<-[:TAIL{edge:2}]-(R1)
Traversing an edge means starting with a root node, following its outgoing HEAD relationship to the first node, and following the chain of NEXT relationships until reaching a node with an incoming TAIL relationship from the root we started from.
i.e.:
MATCH path = (root:root:main)-[:HEAD]->(a:point)-[n:NEXT*]->(z:point)<-[:TAIL]-(root)
RETURN nodes(path), n
Every node has an outgoing NEXT relationship, but some nodes also have BRANCH relationships, which point to the root nodes of other edges.
In the above query, nodes(path) returns all the nodes along the edge, and n lists the outgoing NEXT relationship for each node along it.
How can I modify the query so that each record returned contains a node on the path along with a list of all outgoing relationships (both NEXT and BRANCH) from it?
Note that I don't want to traverse the BRANCH edges in this query, I just want it to tell me they're there.
(PS I'm implementing this strategy in Java, but so far have preferred executing Cypher queries directly rather than using the Traversal API. If I'm making this more difficult on myself by doing so, please bring it to my attention).

You can return path-expressions at any time.
MATCH path = (root:root:main)-[:HEAD]->(a:point)-[n:NEXT*]->(z:point)<-[:TAIL]-(root)
RETURN extract(x in nodes(path) | [x, x-[:BRANCH]->()]), n
This x-[:BRANCH]->() returns a collection of paths, so if you just want to access the relationship you'd have to do
[p in x-[:BRANCH]->() | head(rels(p)) ]
For an example of how to implement an activity stream as an unmanaged extension you might have a look at this: https://github.com/jexp/neo4j-activity-stream

Related

Depth-first tree traversal in TinkerPop graph

Given a tree-structured TinkerPop graph with vertices connected by labeled parent-child relationships ([parent-PARENT_CHILD->child]), what's the idiomatic way to traverse and find all those nodes?
I'm new to graph traversals, so it seems more or less straightforward to traverse them with a recursive function:
Stream<Vertex> depthFirst(Vertex v) {
    Stream<Vertex> selfStream = Stream.of(v);
    Iterator<Vertex> childIterator = v.vertices(Direction.OUT, PARENT_CHILD);
    if (childIterator.hasNext()) {
        return selfStream.appendAll(
            Stream.ofAll(() -> childIterator)
                .flatMap(this::depthFirst)
        );
    }
    return selfStream;
}
(N.b. this example uses Vavr streams, but the Java stream version is similar, just slightly more verbose.)
I assume a graph-native implementation would be more performant, especially on databases other than the in-memory TinkerGraph.
However, when I look at the TinkerPop tree recipes, it's not obvious what combination of repeat() / until() etc. is the right one to do what I want.
If I then want to find only those vertices (leaf or branch) having a certain label, again, I can see how to do it with the function above:
Stream<Vertex> nodesWithMyLabel = depthFirst(root)
.filter(v -> "myLabel".equals(v.label()));
but it's far from obvious that this is efficient, and I assume there must be a better graph-native approach.
If you are using TinkerPop, it is best to just write your traversals with Gremlin. Let's use the tree described in the recipe:
g.addV().property(id, 'A').as('a').
addV().property(id, 'B').as('b').
addV().property(id, 'C').as('c').
addV().property(id, 'D').as('d').
addV().property(id, 'E').as('e').
addV().property(id, 'F').as('f').
addV().property(id, 'G').as('g').
addE('hasParent').from('a').to('b').
addE('hasParent').from('b').to('c').
addE('hasParent').from('d').to('c').
addE('hasParent').from('c').to('e').
addE('hasParent').from('e').to('f').
addE('hasParent').from('g').to('f').iterate()
To find all the children of "A", you simply do:
gremlin> g.V('A').repeat(out()).emit()
==>v[B]
==>v[C]
==>v[E]
==>v[F]
The traversal above basically says, "Start at the 'A' vertex and traverse on out edges until there are no more, and oh, by the way, emit each of those child vertices as you go." If you also want to include the root "A" itself, then you just need to switch things around a bit:
gremlin> g.V('A').emit().repeat(out())
==>v[A]
==>v[B]
==>v[C]
==>v[E]
==>v[F]
Going a step further, if you want to emit only certain vertices based on some filter (in your question you specified label) you can just provide a filtering argument to emit(). In this case, I only emit those vertices that have more than one incoming edge:
gremlin> g.V('A').emit(inE().count().is(gt(1))).repeat(out())
==>v[C]
==>v[F]
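For readers more at home in plain Java than in Gremlin, the three traversals above behave roughly like the sketch below. It is only an in-memory model, not TinkerPop code: the class EmitSketch, the maps OUT and IN_DEGREE, and the method walk are hypothetical names, and the maps simply hard-code the hasParent edges of the sample tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class EmitSketch {
    // Outgoing hasParent edges of the sample tree (hypothetical in-memory model).
    static final Map<String, List<String>> OUT = Map.of(
        "A", List.of("B"), "B", List.of("C"), "D", List.of("C"),
        "C", List.of("E"), "E", List.of("F"), "G", List.of("F"),
        "F", List.of());

    // Number of incoming edges per vertex in the same tree.
    static final Map<String, Integer> IN_DEGREE = Map.of(
        "A", 0, "B", 1, "C", 2, "D", 0, "E", 1, "F", 2, "G", 0);

    // Follows out edges from start. includeStart mimics placing emit() before repeat();
    // emitIf mimics the filtering argument passed to emit().
    static List<String> walk(String start, boolean includeStart, Predicate<String> emitIf) {
        List<String> result = new ArrayList<>();
        if (includeStart && emitIf.test(start)) {
            result.add(start);
        }
        collect(start, emitIf, result);
        return result;
    }

    private static void collect(String v, Predicate<String> emitIf, List<String> result) {
        for (String next : OUT.getOrDefault(v, List.of())) {
            if (emitIf.test(next)) {
                result.add(next);
            }
            collect(next, emitIf, result);
        }
    }

    public static void main(String[] args) {
        System.out.println(walk("A", false, v -> true));                // [B, C, E, F]
        System.out.println(walk("A", true, v -> true));                 // [A, B, C, E, F]
        System.out.println(walk("A", true, v -> IN_DEGREE.get(v) > 1)); // [C, F]
    }
}
```

The recursion here assumes a tree with no cycles, as in the sample data; a real graph would need a visited set or a depth limit.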
Here's what I ended up with, after a certain amount of trial and error:
GraphTraversal<Vertex, Vertex> traversal =
    graph.traversal().V(parent)
        .repeat(out(PARENT_CHILD)) // follow only edges labeled PARENT_CHILD
        .emit()
        .hasLabel("myLabel"); // filter for vertices labeled "myLabel"
Note that this is slightly different from the recursive version in the original question since I realized I don't actually want to include the parent in the result. (I think, from the Repeat Step docs, that I could include the parent by putting emit() before repeat(), but I haven't tried it.)

gremlin query to retrieve vertices which are having multiple edges between them

Consider the above graph. I would like a Gremlin query that returns all nodes that have multiple edges between them, as shown in the graph.
This graph was obtained using the following Neo4j Cypher query:
MATCH (d:dest)-[r]-(n:cust)
WITH d,n, count(r) as popular
RETURN d, n
ORDER BY popular desc LIMIT 5
For example, between RITUPRAKA... and Asia there are 8 edges, so the query returned those 2 nodes along with the edges, and similarly for the other nodes.
Note: the graph has other nodes with only a single edge between them; these nodes will not be returned.
I would like the same thing in Gremlin.
I have used the query given below:
g.V().as('out').out().as('in').select('out','in').groupCount().unfold().filter(select(values).is(gt(1))).select(keys)
It displays:
out:v[1234],in:v[3456] .....
But instead of the IDs of the nodes, I want to display their values,
like out:ICIC1234,in:HDFC234
I have modified the query as:
g.V().values("name").as('out').out().as('in').values("name").select('out','in').
groupCount().unfold().filter(select(values).is(gt(1))).select(keys)
but it throws an error like a ClassCastException, with a message along the lines of "each vertex to be traversed use indexes for fast iteration".
Your graph doesn't seem to indicate bi-directional edges are possible so I will answer with that assumption in mind. Here's a simple sample graph - please consider including one on future questions as it makes it much easier than pictures and textual descriptions for those reading your question to understand and to get started writing a Gremlin traversal to help you:
g.addV().property(id,'a').as('a').
addV().property(id,'b').as('b').
addV().property(id,'c').as('c').
addE('knows').from('a').to('b').
addE('knows').from('a').to('b').
addE('knows').from('a').to('c').iterate()
So you can see that vertex "a" has two outgoing edges to "b" and one outgoing edge to "c", thus we should get the "a b" vertex pair. One way to get this is with:
gremlin> g.V().as('out').out().as('in').
......1> select('out','in').
......2> groupCount().
......3> unfold().
......4> filter(select(values).is(gt(1))).
......5> select(keys)
==>[out:v[a],in:v[b]]
The above traversal uses groupCount() to count the number of times each pair of "out" and "in" labelled vertices shows up (i.e. the number of edges between them). It uses unfold() to iterate through the Map of <Vertex Pairs,Count> (or more literally <List<Vertex>,Long>) and keeps only the entries that have a count greater than 1 (i.e. multiple edges). The final select(keys) drops the count, as it is no longer needed (i.e. we just need the keys, which hold the vertex pairs for the result).
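The same counting logic can be mimicked in plain Java to make each step concrete. This is a sketch over an in-memory edge list, not a replacement for the Gremlin traversal: the class MultiEdgePairs and the method pairsWithMultipleEdges are hypothetical names, and each edge is modelled as a two-element list of vertex ids.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class MultiEdgePairs {
    // Each edge is a two-element list [out, in] of vertex ids (hypothetical model).
    // Counts edges per (out, in) pair and keeps the pairs seen more than once.
    static List<List<String>> pairsWithMultipleEdges(List<List<String>> edges) {
        Map<List<String>, Long> counts = edges.stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting())); // like groupCount()
        return counts.entrySet().stream()          // like unfold()
            .filter(entry -> entry.getValue() > 1) // like filter(select(values).is(gt(1)))
            .map(Map.Entry::getKey)                // like select(keys)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The sample graph: two edges a -> b and one edge a -> c.
        List<List<String>> edges = List.of(
            List.of("a", "b"), List.of("a", "b"), List.of("a", "c"));
        System.out.println(pairsWithMultipleEdges(edges)); // [[a, b]]
    }
}
```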
Perhaps another way to go is with this method:
gremlin> g.V().filter(outE()).
......1> project('out','in').
......2> by().
......3> by(out().
......4> groupCount().
......5> unfold().
......6> filter(select(values).is(gt(1))).
......7> select(keys)).
......8> select(values)
==>[v[a],v[b]]
This approach with project() forgoes the heavier memory requirements for a big groupCount() over the whole graph in favor of building a smaller Map over an individual Vertex that becomes eligible for garbage collection at the end of the by() (or essentially per initial vertex processed).
My suggestion is similar to Stephen's, but also includes the edges or rather the whole path (I guess the Cypher query returned the edges too).
g.V().as("dest").outE().inV().as("cust").
group().by(select("dest","cust")).by(path().fold()).
unfold().filter(select(values).count(local).is(gt(1))).
select(values).unfold()

Directed Acyclic Graph Traversal in Java Web Application

I am building a web application where you can build a directed graph in which a node represents some operation and an edge represents data flow between those operations. So for an edge {u,v}, u must run before v does. Click this link to see a sample graph
The START node represents the initial value, and the other nodes, except the output, perform the operation specified. The Output node outputs the value it receives as input.
Which algorithmic approach should I use to process a graph like that?
This is a perfect example of a topological sort. The most common algorithm for creating a set following the order requirements via traversing is Kahn's Algorithm. The pseudocode can be seen below and the information in the Wikipedia link should be enough to get you started.
L ← Empty list that will contain the sorted elements
S ← Set of all nodes with no incoming edges
while S is non-empty do
    remove a node n from S
    add n to tail of L
    for each node m with an edge e from n to m do
        remove edge e from the graph
        if m has no other incoming edges then
            insert m into S
if graph has edges then
    return error (graph has at least one cycle)
else
    return L (a topologically sorted order)
Note the "starting node" will be enforced by properly representing the incoming edges in the graph. It will be the only node in S to start. Please let me know in the comments if you would like any other information.
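Since the application is written in Java, the pseudocode above might be sketched as follows. This is a minimal version under a few assumptions: the graph is an adjacency map from each node to its successors, nodes are identified by strings, and the names KahnSort, sort, "add" and "multiply" are made up for illustration. Instead of physically removing edges, it decrements an in-degree counter, which has the same effect.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class KahnSort {
    // Returns the nodes in an order where u comes before v for every edge u -> v.
    // graph maps each node to the list of nodes it has edges to.
    static List<String> sort(Map<String, List<String>> graph) {
        Map<String, Integer> inDegree = new HashMap<>();
        for (String n : graph.keySet()) {
            inDegree.put(n, 0);
        }
        for (List<String> successors : graph.values()) {
            for (String m : successors) {
                inDegree.merge(m, 1, Integer::sum);
            }
        }

        Deque<String> ready = new ArrayDeque<>(); // S: nodes with no incoming edges
        for (String n : graph.keySet()) {
            if (inDegree.get(n) == 0) {
                ready.add(n);
            }
        }

        List<String> order = new ArrayList<>(); // L: the sorted result
        while (!ready.isEmpty()) {
            String n = ready.remove();
            order.add(n);
            for (String m : graph.getOrDefault(n, List.of())) {
                // Decrementing the counter stands in for "remove edge e from the graph".
                if (inDegree.merge(m, -1, Integer::sum) == 0) {
                    ready.add(m);
                }
            }
        }
        if (order.size() != inDegree.size()) {
            throw new IllegalStateException("graph has at least one cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new LinkedHashMap<>();
        graph.put("START", List.of("add", "multiply"));
        graph.put("add", List.of("OUTPUT"));
        graph.put("multiply", List.of("add"));
        graph.put("OUTPUT", List.of());
        System.out.println(sort(graph)); // [START, multiply, add, OUTPUT]
    }
}
```

Running the operations in the returned order guarantees that every node's inputs have been computed before the node itself runs.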

Java A* Implementation results in two connecting nodes

I searched Google and Stack Overflow for similar questions but didn't find any.
My implementation of A* works; however, when collecting the path from the start node to the end node, it simply loops through two nodes that are connected to each other (I can get from node A to node B but also from node B to node A).
I've followed Wikipedia's A* implementation, and my implementation of Dijkstra's algorithm uses the same method and worked perfectly, so I'm confused as to why this one does not.
My current output is this:
Node: 3093,
Node: 3085,
Node: 3093,
Node: 3085,
Node: 3093,
Node: 3085,
Node: 3093,
... repeated
Does anyone know why it doesn't properly store the .from?
I also want to store the edges that the program traversed to get the successful route. Does anyone know how I'd do that? Can I simply implement a storage that will add the correct edge?
Where you have the for loop with the comment
"//if the neighbor is in closed set, move to next neighbor", the break statement just breaks out of that for loop, and the code then continues to evaluate the neighbor even though it is in the closed set.
Setting a boolean there and using continue on the while loop instead will at least fix that problem.
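The question's code is not shown here, so the following is only a guess at the shape of the fix, with hypothetical names (ClosedSetSkip, neighborsToEvaluate) and integer node ids. It shows the flag-plus-continue pattern: break leaves the inner search loop, and the flag is what makes the outer loop move on to the next neighbor.

```java
import java.util.ArrayList;
import java.util.List;

final class ClosedSetSkip {
    // Returns the neighbors that still need evaluating, i.e. those not in closedSet.
    static List<Integer> neighborsToEvaluate(List<Integer> neighbors, List<Integer> closedSet) {
        List<Integer> toEvaluate = new ArrayList<>();
        for (int neighbor : neighbors) {
            boolean inClosedSet = false;
            for (int closed : closedSet) {
                if (closed == neighbor) {
                    inClosedSet = true;
                    break; // leaves only this inner loop
                }
            }
            if (inClosedSet) {
                continue; // this is what actually moves on to the next neighbor
            }
            // In A*, the tentative score update and the "from" assignment would go here.
            toEvaluate.add(neighbor);
        }
        return toEvaluate;
    }

    public static void main(String[] args) {
        // 3093 is already closed, so only the other two neighbors are evaluated.
        System.out.println(neighborsToEvaluate(List.of(3085, 3093, 3101), List.of(3093))); // [3085, 3101]
    }
}
```

If the closed set is kept in a java.util.HashSet, the inner loop and the flag collapse into a single contains check followed by continue.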

Induced subgraphs in neo4j

I have a graph in neo4j, and for a given node N, I would like to find all nodes that are reachable in a path of no longer than P steps from N, as well as all links between that set of nodes. It seems like this could be possible with either Cypher or the Traversal framework; is one preferred over the other? I'm doing this from Java with an embedded database, and I will need to perform further queries on the subgraph. I've poked around and haven't found any conclusive answers.
I think Cypher is the most concise way of getting your desired data: query for variable-length paths, then do some collecting and refining.
If n is the internal ID of your node N and your P is 5:
START begin = node(n) // or e.g. index lookup
MATCH p = (begin)<-[r*..5]-(end) // match all paths of length up to 5
WITH distinct nodes(p) as nodes // collect the nodes contained in the paths
MATCH (x)<-[r]-(y) // find all relationships between nodes
WHERE x in nodes and y in nodes // which were found earlier
RETURN distinct x,r,y // and deduplicate as you find all pairs twice
It might not be the most efficient way, but at least the execution plan explanation in http://console.neo4j.org/ suggests that the y in nodes is considered before the MATCH (x)-[r]-(y).
I couldn't think of a way to avoid matching the relationships twice, therefore the distinct in the return statement.
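To make the two stages of the query explicit (first collect the nodes within P steps, then keep every relationship whose endpoints are both in that set), here is a plain-Java sketch over an in-memory edge list. It does not use the Neo4j API; the names InducedSubgraph, nodesWithin and edgesBetween are hypothetical, each edge is a {from, to} pair, and edges are followed in their own direction, which is an assumption since the question does not specify one.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class InducedSubgraph {
    // Nodes reachable from start in at most maxSteps hops (breadth-first, level by level).
    static Set<String> nodesWithin(String start, int maxSteps, List<String[]> edges) {
        Set<String> seen = new LinkedHashSet<>();
        seen.add(start);
        List<String> frontier = List.of(start);
        for (int step = 0; step < maxSteps && !frontier.isEmpty(); step++) {
            List<String> next = new ArrayList<>();
            for (String[] e : edges) {
                // Expand only from the previous level; seen.add is true for new nodes.
                if (frontier.contains(e[0]) && seen.add(e[1])) {
                    next.add(e[1]);
                }
            }
            frontier = next;
        }
        return seen;
    }

    // Edges whose two endpoints both belong to nodes: the induced subgraph's edges.
    static List<String[]> edgesBetween(Set<String> nodes, List<String[]> edges) {
        List<String[]> kept = new ArrayList<>();
        for (String[] e : edges) {
            if (nodes.contains(e[0]) && nodes.contains(e[1])) {
                kept.add(e);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> edges = List.of(
            new String[] {"a", "b"}, new String[] {"b", "c"}, new String[] {"c", "d"},
            new String[] {"a", "c"}, new String[] {"d", "a"});
        Set<String> nodes = nodesWithin("a", 1, edges);
        System.out.println(nodes);                             // [a, b, c]
        System.out.println(edgesBetween(nodes, edges).size()); // 3
    }
}
```

Because each stored edge is visited once in edgesBetween, no deduplication step is needed in this model.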
