Depth-first tree traversal in TinkerPop graph - java

Given a tree-structured TinkerPop graph with vertices connected by labeled parent-child relationships ([parent-PARENT_CHILD->child]), what's the idiomatic way to traverse the tree and collect all of its nodes?
I'm new to graph traversals, but it seems more or less straightforward to do this with a recursive function:
Stream<Vertex> depthFirst(Vertex v) {
    Stream<Vertex> selfStream = Stream.of(v);
    Iterator<Vertex> childIterator = v.vertices(Direction.OUT, PARENT_CHILD);
    if (childIterator.hasNext()) {
        return selfStream.appendAll(
            Stream.ofAll(() -> childIterator)
                  .flatMap(this::depthFirst)
        );
    }
    return selfStream;
}
(N.b. this example uses Vavr streams, but the Java stream version is similar, just slightly more verbose.)
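For comparison, a plain java.util.stream version of the same recursion might look roughly like this (a sketch, untested; PARENT_CHILD is assumed to be the same edge-label constant as above):

import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

import org.apache.tinkerpop.gremlin.structure.Direction;
import org.apache.tinkerpop.gremlin.structure.Vertex;

class DepthFirstSketch {
    static final String PARENT_CHILD = "PARENT_CHILD"; // hypothetical label constant

    Stream<Vertex> depthFirst(Vertex v) {
        Iterator<Vertex> children = v.vertices(Direction.OUT, PARENT_CHILD);
        Stream<Vertex> childStream = StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(children, Spliterator.ORDERED), false);
        // concatenate the vertex itself with the recursively flattened subtrees
        return Stream.concat(Stream.of(v), childStream.flatMap(this::depthFirst));
    }
}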
I assume a graph-native implementation would be more performant, especially on databases other than the in-memory TinkerGraph.
However, when I look at the TinkerPop tree recipes, it's not obvious what combination of repeat() / until() etc. is the right one to do what I want.
If I then want to find only those vertices (leaf or branch) having a certain label, again, I can see how to do it with the function above:
Stream<Vertex> nodesWithMyLabel = depthFirst(root)
    .filter(v -> "myLabel".equals(v.label()));
but it's far from obvious that this is efficient, and I assume there must be a better graph-native approach.

If you are using TinkerPop, it is best to just write your traversals with Gremlin. Let's use the tree described in the recipe:
g.addV().property(id, 'A').as('a').
  addV().property(id, 'B').as('b').
  addV().property(id, 'C').as('c').
  addV().property(id, 'D').as('d').
  addV().property(id, 'E').as('e').
  addV().property(id, 'F').as('f').
  addV().property(id, 'G').as('g').
  addE('hasParent').from('a').to('b').
  addE('hasParent').from('b').to('c').
  addE('hasParent').from('d').to('c').
  addE('hasParent').from('c').to('e').
  addE('hasParent').from('e').to('f').
  addE('hasParent').from('g').to('f').iterate()
To find all the children of "A", you simply do:
gremlin> g.V('A').repeat(out()).emit()
==>v[B]
==>v[C]
==>v[E]
==>v[F]
The traversal above basically says, "Start at the 'A' vertex and traverse on out edges until there are no more, and, oh, by the way, emit each of those child vertices as you go." If you also want to get the root vertex "A" itself, you just need to switch things around a bit:
gremlin> g.V('A').emit().repeat(out())
==>v[A]
==>v[B]
==>v[C]
==>v[E]
==>v[F]
Going a step further, if you want to emit only certain vertices based on some filter (in your question you specified label) you can just provide a filtering argument to emit(). In this case, I only emit those vertices that have more than one incoming edge:
gremlin> g.V('A').emit(inE().count().is(gt(1))).repeat(out())
==>v[C]
==>v[F]

Here's what I ended up with, after a certain amount of trial and error:
GraphTraversal<Vertex, Vertex> traversal =
    graph.traversal().V(parent)
        .repeat(out(PARENT_CHILD)) // follow only edges labeled PARENT_CHILD
        .emit()
        .hasLabel("myLabel");      // filter for vertices labeled "myLabel"
Note that this is slightly different from the recursive version in the original question since I realized I don't actually want to include the parent in the result. (I think, from the Repeat Step docs, that I could include the parent by putting emit() before repeat(), but I haven't tried it.)
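For reference, an untested sketch of that variant (same PARENT_CHILD constant and "myLabel" filter as above; placing emit() before repeat() should also emit the starting vertex):

GraphTraversal<Vertex, Vertex> traversalWithRoot =
    graph.traversal().V(parent)
        .emit()                    // emitting before repeat() includes the start vertex
        .repeat(out(PARENT_CHILD))
        .hasLabel("myLabel");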

Related

how to print all results from a Gremlin traversal

I have a topology:
aws_vpc<------(composition)------test---(membership)---->location
So I use the query
graph.traversal().V().hasLabel("test").or(
__.out("membership").hasLabel("location"),
__.out("composition").hasLabel("aws_vpc"))
to select it. But how do I print all the elements' names?
I want the output: test, membership, location, composition, aws_vpc.
Is there a way to achieve this?
You've written a traversal that only detects whether "test" vertices have outgoing "membership" edges with adjacent "location" vertices OR outgoing "composition" edges with adjacent "aws_vpc" vertices, so all that traversal will return is the "test" vertices that match that filter. It does not "select" anything more than that. In fact, the or() is satisfied as soon as either __.out("membership").hasLabel("location") or __.out("composition").hasLabel("aws_vpc") returns something, in the order they are provided to or(), so you don't even traverse all of those paths (which is a good thing for a filtering operation).
If you want to return all of the data you describe, you need to write your query in such a way that it traverses it all and transforms it into a format to return. A simple way to do this in your case would be to use project():
g.V().hasLabel('test').
  project('data', 'memberships', 'compositions').
    by(__.elementMap()).
    by(__.outE("membership").as('e').
         inV().hasLabel("location").as('v').
         select('e', 'v').
           by(elementMap()).
         fold()).
    by(__.outE("composition").as('e').
         inV().hasLabel("aws_vpc").as('v').
         select('e', 'v').
           by(elementMap()).
         fold())
This takes each "test" vertex and transforms it to a Map with three keys: "data", "memberships" and "competitions" and then each by() modulator specifies what to do with the current "test" vertex being transformed and places it in the respective key. Note that I chose select() to get the edge and vertex combinations but that could have been a project() step as well if you liked. The key there is to end with fold() so that you reduce the stream of edge data for each "test" vertex to a List of data that can put in the related "memberships" and "compositions" keys.

Should I test an exact result of algorithm or just test some elements of the result?

I've written a BFS algorithm and I'd like to test the algorithm.
I've written tests in two approaches, because I realized that, for example, the way adjacent vertices are stored may change, so the order will be different and the result will be different, but not necessarily incorrect.
Test of full path:
@Test
void traverse_UndirectedGraph_CommonTraverse() {
    BreadthFirstSearch<String> breadthFirstSearch = new BreadthFirstSearch<>(undirectedGraph);
    assertIterableEquals(Lists.newArrayList("A", "B", "E"), breadthFirstSearch.traverse("A", "E"));
}
Test that the path starts with the initial vertex and ends with the terminal vertex:
@Test
void traverse_UndirectedGraph_ContainsInitialAndTerminalVertex() {
    BreadthFirstSearch<String> breadthFirstSearch = new BreadthFirstSearch<>(undirectedGraph);
    List<String> path = breadthFirstSearch.traverse("A", "E");
    assertEquals("A", path.get(0));
    assertEquals("E", path.get(path.size() - 1));
}
Is either of these two approaches correct?
If not, how would you test such algorithms?
Is either of these two approaches correct?
Probably. But that is hard to say without knowing your full requirements and your context, such as the classes/data structures your search relies on.
If not, how would you test such algorithms?
I would follow TDD.
Meaning: you start with writing tests first.
To be precise:
1. you write one simple test
2. you ensure the test fails
3. you then write just enough "production" code so that your test passes
4. you maybe refactor your code base (to improve its quality)
5. go back to step 1
In other words: you develop your algorithm while gradually walking from small, simple tests, to more advanced scenarios.
Beyond that, you can also look at this from a true "tester" perspective. Meaning: you totally ignore the implementation. Instead, you look at the problem, and at the contract that the production code should follow. And then you try to find examples for all important cases, and most importantly: edge cases. You write those down, and then run them against your implementation.
(Most likely: your two test cases are way too simple, and you would need many more.)
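As a purely illustrative example (class and method names here are hypothetical, not taken from your code), the first test in such a sequence could be as small as:

@Test
void traverse_StartEqualsTarget_ReturnsSingleVertexPath() {
    // hypothetical graph fixture with a single vertex "A" and no edges
    UndirectedGraph<String> graph = new UndirectedGraph<>();
    graph.addVertex("A");
    BreadthFirstSearch<String> bfs = new BreadthFirstSearch<>(graph);
    assertEquals(List.of("A"), bfs.traverse("A", "A"));
}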
For cases that have at most one traversal, the simplest verification is by exact match.
For cases that have multiple valid traversals, verification could be either by matching against the enumerated traversals, or by verifying the breadth-first property of the traversal.
For cases which have many valid traversals, verifying the breadth-first property seems to be necessary.
Working from the problem as stated:
Key features are that the graph is un-directed, and the search is breadth first.
No other characteristics of the graph are specified. The graph is assumed to possibly have cycles, and is assumed to not necessarily be connected. For simplicity, at most one edge is present between nodes, and no edge is present from a node to itself.
As basics, a traversal which is obtained by a breadth first search must be a subgraph of the searched graph. That is, each edge of the traversal must be an edge of the searched graph. Also, the initial node of the traversal must be the beginning node of the search and the final node of the traversal must be the target node.
In each case, the search must not get into an infinite loop, and must obtain a breadth first result. Or, must indicate that a traversal is not possible.
Testing should demonstrate a variety of cases, for example, traversal of a list, a tree, a loop, a bipartite graph, or of a complete graph.
One test methodology builds a collection of test graphs (enumerating at least the variety of cases described above), and builds a collection of test cases for each of the graphs. The test cases would supply the initial and final nodes of the case, and would supply the collection of valid traversals.
Supplying the collection of valid results is easy if there is zero, one, or perhaps a handful of valid paths. For particular graphs, there can be many traversals, and as an alternative there might need to be a way to verify the "breadth-first-ness" of a traversal, as opposed to enumerating the possible traversals.
For example:
A <-> B1, B2 <-> C1, C2 <-> D1, D2 <-> E
Here A <-> B1, B2 means that there is an edge between A and B1 and between A and B2. Similarly, B1, B2 <-> C1, C2 represents the complete bipartite graph of B1 and B2 with C1 and C2.
There are eight valid breadth-first traversals from A to E.
There are traversals which are not valid breadth first traversals, for example:
( A, B1, C1, B2, C2, D1, E )
Also for example, for the simple graph:
A <-> B, C
B <-> C
A breadth first traversal from A to C must yield ( A, C ) and not ( A, B, C ). A depth first traversal may obtain either ( A, C ) or ( A, B, C ) depending on whether the traversal steps from A to B first, or steps from A to C first.
One characterization is: if minimum distances from the initial node of the traversal are assigned to nodes, then a breadth-first traversal must never step from a node to a node that is closer to the initial node.
Labeling the first example with distances gives:
A(0) <-> B1(1), B2(1) <-> C1(2), C2(2) <-> D1(3), D2(3) <-> E(4)
Similarly labeling the second candidate traversal gives:
( A(0), B1(1), C1(2), B2(1), C2(2), D1(3), E(4) )
This is not a valid breadth-first traversal because the edge C1 -> B2 decreases the distance from the initial node.
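A sketch of that distance check (untested, assuming the graph is available as an undirected adjacency map Map<String, Set<String>>; all names are illustrative):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class BreadthFirstProperty {

    /** True if no step of the traversal moves to a node closer to the start node. */
    static boolean respectsBreadthFirstOrder(Map<String, Set<String>> adjacency,
                                             List<String> traversal) {
        Map<String, Integer> distance = minimumDistances(adjacency, traversal.get(0));
        for (int i = 1; i < traversal.size(); i++) {
            Integer from = distance.get(traversal.get(i - 1));
            Integer to = distance.get(traversal.get(i));
            if (from == null || to == null || to < from) {
                return false; // unreachable node, or a step back towards the start
            }
        }
        return true;
    }

    /** Minimum distances from the start node, computed by a plain breadth-first search. */
    private static Map<String, Integer> minimumDistances(Map<String, Set<String>> adjacency,
                                                         String start) {
        Map<String, Integer> distance = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        distance.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (String neighbour : adjacency.getOrDefault(current, Set.of())) {
                if (!distance.containsKey(neighbour)) {
                    distance.put(neighbour, distance.get(current) + 1);
                    queue.add(neighbour);
                }
            }
        }
        return distance;
    }
}

A test could then assert that respectsBreadthFirstOrder() holds for the returned path, instead of comparing against one hard-coded traversal.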

Tinkerpop/Gremlin: select vertices together with outgoing edge count

I'm trying to find an efficient Gremlin query that returns the vertex together with the number of outgoing edges, or, even better, a boolean value indicating whether outgoing edges exist instead of the count.
Background: I'm trying to improve the performance of a program that writes some properties on the vertices and then iterates over the outgoing edges to remove some of them. In a lot of cases there are no outgoing edges, and the iteration
for (Iterator<Edge> iE = v.edges(Direction.OUT); iE.hasNext();) { ... }
consumes a significant part of the runtime. So when resolving the ids to vertices (with gts.V(ids)) I also want to collect information about the existence of outgoing edges, to skip the iteration where possible.
My first try was:
gts.V(ids).as("v").choose(__.outE(), __.constant(true), __.constant(false)).as("e").select("v", "e");
Second idea was:
gts.V(ids).project("v", "e").by().by(__.outE().count());
Both seem to work, but is there a better solution that does not require the underlying graph implementation to fetch or count all edges?
(We currently use the sqlg implementation of tinkerpop/gremlin with Postgresql and both queries seem to fetch all outgoing edges from Postgresql. This may be a case where some optimization is missing. But my question is not sqlg specific.)
If you only need to know whether edges exist or not then you should limit() results in the by() modulator:
gremlin> g.V().project('v','e').by().by(outE().limit(1).count())
==>[v:v[1],e:1]
==>[v:v[2],e:0]
==>[v:v[3],e:0]
==>[v:v[4],e:1]
==>[v:v[5],e:0]
==>[v:v[6],e:1]
In this way you don't count all of the edges, just the first which is enough to answer your question. You can do true and false if you like with a minor modification:
gremlin> g.V().project('v','e').by().by(coalesce(outE().limit(1).constant(true),constant(false)))
==>[v:v[1],e:true]
==>[v:v[2],e:false]
==>[v:v[3],e:false]
==>[v:v[4],e:true]
==>[v:v[5],e:false]
==>[v:v[6],e:true]
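Since the question is about the Java API, here is a rough (untested) sketch of how the second query might be consumed to skip the per-vertex edge iteration; gts and ids are the traversal source and id collection from the question, and the anonymous traversal class __ is assumed to be imported:

List<Map<String, Object>> rows = gts.V(ids).
        project("v", "e").
            by().
            by(__.coalesce(__.outE().limit(1).constant(true), __.constant(false))).
        toList();

for (Map<String, Object> row : rows) {
    Vertex v = (Vertex) row.get("v");
    // ... write properties on v ...
    if ((Boolean) row.get("e")) {
        // only pay for the edge iteration when edges actually exist
        v.edges(Direction.OUT).forEachRemaining(edge -> {
            // ... remove some of the edges ...
        });
    }
}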

gremlin query to retrieve vertices which are having multiple edges between them

Consider the above graph. I would like a gremlin query that returns all nodes that have multiple edges between them as shown in the graph.
This graph was obtained using the following Neo4j Cypher query:
MATCH (d:dest)-[r]-(n:cust)
WITH d,n, count(r) as popular
RETURN d, n
ORDER BY popular desc LIMIT 5
For example: between RITUPRAKA... and Asia there are 8 edges, hence the query has returned the two nodes along with the edges; similarly for the other nodes.
Note: the graph has other nodes with only a single edge between them; these nodes will not be returned.
I would like the same thing in Gremlin.
I have used the query given below:
g.V().as('out').out().as('in').select('out','in').groupCount().unfold().filter(select(values).is(gt(1))).select(keys)
It is displaying:
out:v[1234],in:v[3456] .....
but instead of displaying the IDs of the nodes I want to display the values of the nodes,
like out:ICIC1234,in:HDFC234
I have modified the query as:
g.V().values("name").as('out').out().as('in').values("name").select('out','in').
groupCount().unfold().filter(select(values).is(gt(1))).select(keys)
but it is showing an error like: classCastException, each vertex to be traversed use indexes for fast iteration
Your graph doesn't seem to indicate that bi-directional edges are possible, so I will answer with that assumption in mind. Here's a simple sample graph - please consider including one in future questions, as it makes it much easier for those reading your question (compared to pictures and textual descriptions) to understand it and to get started writing a Gremlin traversal to help you:
g.addV().property(id,'a').as('a').
  addV().property(id,'b').as('b').
  addV().property(id,'c').as('c').
  addE('knows').from('a').to('b').
  addE('knows').from('a').to('b').
  addE('knows').from('a').to('c').iterate()
So you can see that vertex "a" has two outgoing edges to "b" and one outgoing edge to "c", thus we should get the "a b" vertex pair. One way to get this is with:
gremlin> g.V().as('out').out().as('in').
......1> select('out','in').
......2> groupCount().
......3> unfold().
......4> filter(select(values).is(gt(1))).
......5> select(keys)
==>[out:v[a],in:v[b]]
The above traversal uses groupCount() to count the number of times each "out"/"in" labelled vertex pair shows up (i.e. the number of edges between them). It uses unfold() to iterate through the Map of <Vertex Pair, Count> (or more literally <List<Vertex>, Long>) and filter to keep those entries with a count greater than 1 (i.e. multiple edges). The final select(keys) drops the count as it is not needed anymore (i.e. we just need the keys, which hold the vertex pairs for the result).
Perhaps another way to go is with this method:
gremlin> g.V().filter(outE()).
......1> project('out','in').
......2> by().
......3> by(out().
......4> groupCount().
......5> unfold().
......6> filter(select(values).is(gt(1))).
......7> select(keys)).
......8> select(values)
==>[v[a],v[b]]
This approach with project() forgoes the heavier memory requirements for a big groupCount() over the whole graph in favor of building a smaller Map over an individual Vertex that becomes eligible for garbage collection at the end of the by() (or essentially per initial vertex processed).
My suggestion is similar to Stephen's, but also includes the edges or rather the whole path (I guess the Cypher query returned the edges too).
g.V().as("dest").outE().inV().as("cust").
group().by(select("dest","cust")).by(path().fold()).
unfold().filter(select(values).count(local).is(gt(1))).
select(values).unfold()

Implementation of a generic factory for graph connectivity algorithms in Java

I have an interface for an object factory that creates graphs from a collection of objects given a vertex creation Function<Object,Vertex> and a linking BiPredicate<Vertex,Vertex>.
This design allows for the specification of arbitrary graph connectivity algorithms by supplying both of these functions, but as far as I've been able to implement it, this comes at the cost of having to loop over all pairs of objects in the input collection like this (classes Graph and Vertex are defined elsewhere):
Function<Object,Vertex> maker;     // defined by user.
BiPredicate<Vertex,Vertex> linker; // defined by user.

Graph makeGraph( Collection<Object> input ) {
    Graph g = new Graph();
    Collection<Vertex> vertices = input.stream()
            .map( ( Object t ) -> maker.apply( t ) )
            .collect( Collectors.toList() );
    for( Vertex ego : vertices ) {
        Collection<Vertex> alters = new ArrayList<>();
        alters.addAll( vertices );
        alters.remove( ego );
        for( Vertex alter : alters ) {
            if( linker.test( ego, alter ) ) {
                g.makeEdge( ego, alter );
            }
        }
    }
    return g;
}
I actually have two questions:
is there a more elegant way of iterating over all possible pairs (i,j) in a collection than my ugly solution of creating a new list, copying everything and removing i from the copy?
can anybody think of a way of optimizing that double iteration? Right now the execution time for this is O(n^2) even in the best case, because the implementation needs to accept a linking function without any knowledge about it. But maybe there are ways around this, e.g. specifying certain parameters to indicate that the iteration can break after the first failure of the linker test (for a co-occurrence network), etc.
Of course, if anyone can think of an alternative way of going about this, I'd be happy to hear it.
EDIT:
Forget the first question, Robert Navado's answer made me realize that I was wrong.
In order to clarify then: I am looking for a way of telling an implementation that the application of the linker function can be optimized under certain conditions (e.g. In the co-occurrence example mentioned above, "sort by position and break after first negative result").
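As a purely hypothetical sketch of what such a hint could look like (names and signatures are illustrative, and it assumes the linker is symmetric, as in a co-occurrence network, so each unordered pair only needs to be tested once):

// The caller supplies a Comparator that orders vertices so that, once the linker
// rejects a candidate, every later candidate would be rejected too; that allows
// the inner scan to break early.
Graph makeGraph( Collection<Object> input,
                 Comparator<Vertex> order,
                 boolean linkerFailureIsFinalUnderOrder ) {
    Graph g = new Graph();
    List<Vertex> vertices = input.stream()
            .map( maker )
            .sorted( order )
            .collect( Collectors.toList() );
    for( int i = 0; i < vertices.size(); i++ ) {
        Vertex ego = vertices.get( i );
        for( int j = i + 1; j < vertices.size(); j++ ) {
            Vertex alter = vertices.get( j );
            if( linker.test( ego, alter ) ) {
                g.makeEdge( ego, alter );
            } else if( linkerFailureIsFinalUnderOrder ) {
                break; // no later alter can link to ego under this ordering
            }
        }
    }
    return g;
}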
Well, unless you can have unlinked vertices in your graph, and provided your graph is sparse, I would suggest storing edges rather than vertices.
However, the maximum number of edges in a singly-linked clique is V*(V-1), so in the worst case you'll need O(V^2) iterations to link your graph, and even more for a multigraph.
As for the iteration syntax, the following should work as well:
for( Vertex alter : vertices ) {
    for( Vertex ego : vertices ) {
        // make the decision
    }
}
Take a look at the JUNG library for graph manipulation. It's probably outdated, but you can take a look at their data structures for inspiration.
