gremlin query to retrieve vertices which are having multiple edges between them


Consider the above graph. I would like a gremlin query that returns all nodes that have multiple edges between them as shown in the graph.
this graph was obtained using neo4j cypher query:
MATCH (d:dest)-[r]-(n:cust)
WITH d,n, count(r) as popular
RETURN d, n
ORDER BY popular desc LIMIT 5
for example:
there are 8 edges between RITUPRAKA... and Asia, hence the query has returned these 2 nodes along with the edges, and similarly for the other nodes.
Note: the graph has other nodes with only a single edge between them, these nodes will not be returned.
I would like same thing in gremlin.
I have used given below query
g.V().as('out').out().as('in').select('out','in').groupCount().unfold().filter(select(values).is(gt(1))).select(keys)
it is displaying
out:v[1234],in:v[3456] .....
but instead of displaying Ids of the node I want to display values of the node
like out:ICIC1234,in:HDFC234
I have modified query as
g.V().values("name").as('out').out().as('in').values("name").select('out','in').
groupCount().unfold().filter(select(values).is(gt(1))).select(keys)
but it fails with a ClassCastException, along with a message that each vertex will be traversed and that indexes should be used for fast iteration.

Your graph doesn't seem to indicate bi-directional edges are possible so I will answer with that assumption in mind. Here's a simple sample graph - please consider including one on future questions as it makes it much easier than pictures and textual descriptions for those reading your question to understand and to get started writing a Gremlin traversal to help you:
g.addV().property(id,'a').as('a').
addV().property(id,'b').as('b').
addV().property(id,'c').as('c').
addE('knows').from('a').to('b').
addE('knows').from('a').to('b').
addE('knows').from('a').to('c').iterate()
So you can see that vertex "a" has two outgoing edges to "b" and one outgoing edge to "c", thus we should get the "a b" vertex pair. One way to get this is with:
gremlin> g.V().as('out').out().as('in').
......1> select('out','in').
......2> groupCount().
......3> unfold().
......4> filter(select(values).is(gt(1))).
......5> select(keys)
==>[out:v[a],in:v[b]]
The above traversal uses groupCount() to count the number of times the "out" and "in" labelled vertices show up (i.e. the number of edges between them). It uses unfold() to iterate through the Map of <Vertex Pairs,Count> (or more literally <List<Vertex>,Long>) and filter out those that have a count greater than 1 (i.e. multiple edges). The final select(keys) drops the "count" as it is not needed anymore (i.e. we just need the keys which hold the vertex pairs for the result).
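For readers who think more easily in plain Java, the same counting idea can be sketched outside of Gremlin. This is only an illustration under assumptions: the class name MultiEdgePairs, the method name, and the representation of an edge as a two-element list are all hypothetical and not part of TinkerPop.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MultiEdgePairs {

    // Returns every (out, in) pair that is joined by more than one directed edge.
    // Each edge is a two-element list: [outVertexId, inVertexId] (an assumed representation).
    static List<List<String>> pairsWithMultipleEdges(List<List<String>> edges) {
        // Counterpart of groupCount(): pair -> number of edges between that pair.
        Map<List<String>, Long> counts = new LinkedHashMap<>();
        for (List<String> edge : edges) {
            counts.merge(edge, 1L, Long::sum);
        }
        // Counterpart of unfold().filter(select(values).is(gt(1))).select(keys).
        List<List<String>> result = new ArrayList<>();
        for (Map.Entry<List<String>, Long> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same shape as the sample graph: a->b twice, a->c once.
        List<List<String>> edges = List.of(
                List.of("a", "b"), List.of("a", "b"), List.of("a", "c"));
        System.out.println(pairsWithMultipleEdges(edges)); // prints [[a, b]]
    }
}
```

Like the groupCount() traversal, this keeps one map entry per distinct vertex pair in memory, which is the cost the project() variant tries to avoid.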
Perhaps another way to go is with this method:
gremlin> g.V().filter(outE()).
......1> project('out','in').
......2> by().
......3> by(out().
......4> groupCount().
......5> unfold().
......6> filter(select(values).is(gt(1))).
......7> select(keys)).
......8> select(values)
==>[v[a],v[b]]
This approach with project() forgoes the heavier memory requirements for a big groupCount() over the whole graph in favor of building a smaller Map over an individual Vertex that becomes eligible for garbage collection at the end of the by() (or essentially per initial vertex processed).

My suggestion is similar to Stephen's, but also includes the edges or rather the whole path (I guess the Cypher query returned the edges too).
g.V().as("dest").outE().inV().as("cust").
group().by(select("dest","cust")).by(path().fold()).
unfold().filter(select(values).count(local).is(gt(1))).
select(values).unfold()

Related

how to print all result from Gremlin traversal

I have a topology, it is
aws_vpc<------(composition)------test---(membership)---->location
So I use query
graph.traversal().V().hasLabel("test").or(
__.out("membership").hasLabel("location"),
__.out("composition").hasLabel("aws_vpc"))
to select it, but how to print all elements' name,
I want to output : test, membership, location,composition, aws_vpc.
Is there a way to achieve this?

You've written a traversal that only detects if "test" vertices have outgoing "membership" edges that have adjacent "location" vertices OR outgoing "composition" edges that have adjacent "aws_vpc" vertices, so all that traversal will return is "test" vertices that match that filter. It does not "select" anything more than that. In fact, the or() is immediately satisfied as soon as a single __.out("membership").hasLabel("location") or __.out("composition").hasLabel("aws_vpc") is returned in the order they are provided to or() so you don't even traverse all of those paths (which is a good thing for a filtering operation).
If you want to return all of the data you describe, you need to write your query in such a way so as to traverse it all and transform it into a format to return. A simple way to do this in your case would be to use project():
g.V().hasLabel('test').
  project('data','memberships', 'compositions').
    by(__.elementMap()).
    by(__.outE("membership").as('e').
       inV().hasLabel("location").as('v').
       select('e','v').
         by(elementMap()).
       fold()).
    by(__.outE("composition").as('e').
       inV().hasLabel("aws_vpc").as('v').
       select('e','v').
         by(elementMap()).
       fold())
This takes each "test" vertex and transforms it to a Map with three keys: "data", "memberships" and "compositions". Each by() modulator then specifies what to do with the current "test" vertex being transformed and places the result in the respective key. Note that I chose select() to get the edge and vertex combinations, but that could have been a project() step as well if you liked. The key there is to end with fold() so that you reduce the stream of edge data for each "test" vertex to a List of data that can be put in the related "memberships" and "compositions" keys.

Tinkerpop/Gremlin: select vertices together with outgoing edge count

I try to find an efficient gremlin query that returns a traversal with the vertex and the number of outgoing edges. Or even better instead of the number of outgoing edges a boolean value if outgoing edges exist or not.
Background: I try to improve the performance of a program that writes some properties on the vertices and then iterates over the outgoing edges to remove some of them. In a lot of cases there are no outgoing edges, and the iteration
for (Iterator<Edge> iE = v.edges(Direction.OUT); iE.hasNext();) { ... } consumes a significant part of the runtime. So instead of only resolving the ids to vertices (with gts.V(ids)), I want to collect the information about the existence of outgoing edges so that I can skip the iteration where possible.
My first try was:
gts.V(ids).as("v").choose(__.outE(), __.constant(true), __.constant(false)).as("e").select("v", "e");
Second idea was:
gts.V(ids).project("v", "e").by().by(__.outE().count());
Both seem to work, but is there a better solution that does not require the underlying graph implementation to fetch or count all edges?
(We currently use the sqlg implementation of tinkerpop/gremlin with Postgresql and both queries seem to fetch all outgoing edges from Postgresql. This may be a case where some optimization is missing. But my question is not sqlg specific.)

If you only need to know whether edges exist or not then you should limit() results in the by() modulator:
gremlin> g.V().project('v','e').by().by(outE().limit(1).count())
==>[v:v[1],e:1]
==>[v:v[2],e:0]
==>[v:v[3],e:0]
==>[v:v[4],e:1]
==>[v:v[5],e:0]
==>[v:v[6],e:1]
In this way you don't count all of the edges, just the first which is enough to answer your question. You can do true and false if you like with a minor modification:
gremlin> g.V().project('v','e').by().by(coalesce(outE().limit(1).constant(true),constant(false)))
==>[v:v[1],e:true]
==>[v:v[2],e:false]
==>[v:v[3],e:false]
==>[v:v[4],e:true]
==>[v:v[5],e:false]
==>[v:v[6],e:true]

Depth-first tree traversal in TinkerPop graph

Given a tree-structured TinkerPop graph with vertices connected by labeled parent-child relationships ([parent-PARENT_CHILD->child]), what's the idiomatic way to traverse and find all those nodes?
I'm new to graph traversals, so it seems more or less straightforward to traverse them with a recursive function:
Stream<Vertex> depthFirst(Vertex v) {
Stream<Vertex> selfStream = Stream.of(v);
Iterator<Vertex> childIterator = v.vertices(Direction.OUT, PARENT_CHILD);
if (childIterator.hasNext()) {
return selfStream.appendAll(
Stream.ofAll(() -> childIterator)
.flatMap(this::depthFirst)
);
}
return selfStream;
}
(N.b. this example uses Vavr streams, but the Java stream version is similar, just slightly more verbose.)
I assume a graph-native implementation would be more performant, especially on databases other than the in-memory TinkerGraph.
However, when I look at the TinkerPop tree recipes, it's not obvious what combination of repeat() / until() etc. is the right one to do what I want.
If I then want to find only those vertices (leaf or branch) having a certain label, again, I can see how to do it with the function above:
Stream<Vertex> nodesWithMyLabel = depthFirst(root)
.filter(v -> "myLabel".equals(v.label()));
but it's far from obvious that this is efficient, and I assume there must be a better graph-native approach.

If you are using TinkerPop, it is best to just write your traversals with Gremlin. Let's use the tree described in the recipe:
g.addV().property(id, 'A').as('a').
addV().property(id, 'B').as('b').
addV().property(id, 'C').as('c').
addV().property(id, 'D').as('d').
addV().property(id, 'E').as('e').
addV().property(id, 'F').as('f').
addV().property(id, 'G').as('g').
addE('hasParent').from('a').to('b').
addE('hasParent').from('b').to('c').
addE('hasParent').from('d').to('c').
addE('hasParent').from('c').to('e').
addE('hasParent').from('e').to('f').
addE('hasParent').from('g').to('f').iterate()
To find all the children of "A", you simply do:
gremlin> g.V('A').repeat(out()).emit()
==>v[B]
==>v[C]
==>v[E]
==>v[F]
The traversal above basically says, "Start at the 'A' vertex and traverse on out edges until there are no more, and oh, by the way, emit each of those child vertices as you go." If you also want to include the starting vertex "A" itself, then you just need to switch things around a bit:
gremlin> g.V('A').emit().repeat(out())
==>v[A]
==>v[B]
==>v[C]
==>v[E]
==>v[F]
Going a step further, if you want to emit only certain vertices based on some filter (in your question you specified label) you can just provide a filtering argument to emit(). In this case, I only emit those vertices that have more than one incoming edge:
gremlin> g.V('A').emit(inE().count().is(gt(1))).repeat(out())
==>v[C]
==>v[F]
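To make the effect of the emit() position easier to see, here is a plain Java analogue of the three traversals above. It is a sketch under assumptions, not TinkerPop code: the class RepeatEmitSketch, its method names, and the adjacency map are hypothetical, the graph is assumed to be acyclic, and the depth-first visiting order may differ from the order Gremlin returns on branching graphs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class RepeatEmitSketch {

    // includeStart = false mimics repeat(out()).emit(...): only vertices reached by a step are emitted.
    // includeStart = true mimics emit(...).repeat(out()): the start vertex is tested and emitted as well.
    static List<String> traverse(Map<String, List<String>> outEdges, String start,
                                 boolean includeStart, Predicate<String> emitIf) {
        List<String> emitted = new ArrayList<>();
        if (includeStart && emitIf.test(start)) {
            emitted.add(start);
        }
        walk(outEdges, start, emitIf, emitted);
        return emitted;
    }

    // Follows outgoing edges until there are none left (assumes no cycles).
    private static void walk(Map<String, List<String>> outEdges, String vertex,
                             Predicate<String> emitIf, List<String> emitted) {
        for (String next : outEdges.getOrDefault(vertex, List.of())) {
            if (emitIf.test(next)) {
                emitted.add(next);
            }
            walk(outEdges, next, emitIf, emitted);
        }
    }

    public static void main(String[] args) {
        // Same edges as the sample tree: a->b, b->c, d->c, c->e, e->f, g->f.
        Map<String, List<String>> outEdges = Map.of(
                "A", List.of("B"), "B", List.of("C"), "D", List.of("C"),
                "C", List.of("E"), "E", List.of("F"), "G", List.of("F"));
        System.out.println(traverse(outEdges, "A", false, v -> true)); // prints [B, C, E, F]
        System.out.println(traverse(outEdges, "A", true, v -> true));  // prints [A, B, C, E, F]
    }
}
```

The predicate plays the role of the filtering argument to emit(): vertices that fail it are still walked through, they are just not returned.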

Here's what I ended up with, after a certain amount of trial and error:
GraphTraversal<Vertex, Vertex> traversal =
graph.traversal().V(parent)
.repeat(out(PARENT_CHILD)) // follow only edges labeled PARENT_CHILD
.emit()
.hasLabel("myLabel"); // filter for vertices labeled "myLabel"
Note that this is slightly different from the recursive version in the original question since I realized I don't actually want to include the parent in the result. (I think, from the Repeat Step docs, that I could include the parent by putting emit() before repeat(), but I haven't tried it.)

How can I collect property values while traversing a graph with gremlin in Java?

Every vertex in my graph has at least a name property. I have a label L and a set S of name values. Now I want to collect the values of the name property of all vertices that can be reached (recursively) via a specific outgoing edge with edge label EL from the vertices with the names in set S.
My current solution for a single start node with name S1 looks like the following:
g.traversal().V().hasLabel(L)
.has("name", S1)
.repeat(__.optional(__.out(EL)))
.until(__.out(EL).count().is(0))
.path()
.forEachRemaining(path -> {
path.forEach(e -> System.out.println(((Vertex)e).property("name").value()));});
The println is only to see that this produces the expected result, normally I would collect the names in a Set.
Is there a better way to collect the values of the name property of all the vertices reachable via outgoing edges with label EL?
And what would be the best way to start with multiple vertices (where only the name is known from Set S)?
Currently, the structure is a tree, but if there may be cycles, does the code above prevent endless loops? If not, how can this be done?

Your approach is a good start.
To start from a set of multiple vertices, use the P.within() predicate. TinkerPop provides several other predicates.
Use simplePath() to prevent repeating through loops.
Use store() to keep track of items as it traverses the graph. The by("name") modulator will store the "name" property rather than the vertex.
To get out the result, use cap() to output the items it stored during the traversal. The result at this point is a Set which potentially contains duplicates. Use unfold() to turn the Set into an iterator that we can dedup() then finish with toSet().
graph.traversal().V().hasLabel(L).has("name", P.within(S)).
repeat( __.out(EL).simplePath().store("x").by("name") ).
until( __.outE(EL).count().is(0) ).
cap("x").unfold().dedup().toSet()
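As a point of comparison, the same goal (collect the names of everything reachable from several start vertices without looping forever) can be sketched in plain Java with a visited set. This is an assumption-laden illustration rather than a translation of the traversal: the class ReachableNames and the adjacency map keyed by name are hypothetical, and a global visited set is not the same thing as simplePath(), which filters per path, so results can differ on cyclic graphs.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ReachableNames {

    // Collects the names of all vertices reachable from any start vertex via outgoing edges.
    // A start vertex appears in the result only if some followed edge leads to it.
    static Set<String> collect(Map<String, List<String>> outEdges, Set<String> starts) {
        Set<String> collected = new TreeSet<>();   // a Set, so duplicates collapse automatically
        Set<String> expanded = new HashSet<>();    // vertices whose out-edges were already followed
        Deque<String> pending = new ArrayDeque<>(starts);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            if (!expanded.add(current)) {
                continue; // already expanded: this is what stops endless loops on cycles
            }
            for (String next : outEdges.getOrDefault(current, List.of())) {
                collected.add(next);
                pending.push(next);
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        // s1 -> x -> y -> x (a cycle), s2 -> z
        Map<String, List<String>> outEdges = Map.of(
                "s1", List.of("x"), "x", List.of("y"), "y", List.of("x"), "s2", List.of("z"));
        System.out.println(collect(outEdges, Set.of("s1", "s2"))); // prints [x, y, z]
    }
}
```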

How to generate random graphs?

I want to be able to generate random, undirected, and connected graphs in Java. In addition, I want to be able to control the maximum number of vertices in the graph. I am not sure what would be the best way to approach this problem, but here are a few I can think of:
(1) Generate a number between 0 and n and let that be the number of vertices. Then, somehow randomly link vertices together (maybe generate a random number per vertex and let that be the number of edges coming out of said vertex). Traverse the graph starting from an arbitrary vertex (say with Breadth-First-Search) and let our random graph G be all the visited nodes (this way, we make sure that G is connected).
(2) Generate a random square matrix (of 0's and 1's) with side length between 0 and n (somehow). This would be the adjacency matrix for our graph (the diagonal of the matrix should then either be all 1's or all 0's). Make a data structure from the graph and traverse the graph from any node to get a connected list of nodes and call that the graph G.
Any other way to generate a sufficiently random graph is welcomed. Note: I do not need a purely random graph, i.e., the graph you generate doesn't have to have any special mathematical properties (like uniformity of some sort). I simply need lots and lots of graphs for testing purposes of something else.
Here is the Java Node class I am using:
public class Node<T> {
T data;
ArrayList<Node> children= new ArrayList<Node>();
...}
Here is the Graph class I am using (you can tell why I am only interested in connected graphs at the moment):
public class Graph {
Node mainNode;
ArrayList<Node> V= new ArrayList<Node>();
public Graph(Node node){
mainNode= node;
}
...}
As an example, this is how I make graphs for testing purposes right now:
//The following makes a "kite" graph G (with "a" as the main node).
/* a-b
|/|
c-d
*/
Node<String> a= new Node("a");
Node<String> b= new Node("b");
Node<String> c= new Node("c");
Node<String> d= new Node("d");
a.addChild(b);
a.addChild(c);
b.addChild(a);
b.addChild(c);
b.addChild(d);
c.addChild(a);
c.addChild(b);
c.addChild(d);
d.addChild(c);
d.addChild(b);
Graph G1= new Graph(a);

Whatever you want to do with your graph, I guess its density is also an important parameter. Otherwise, you'd just generate a set of small cliques (complete graphs) using random sizes, and then connect them randomly.
If I'm correct, I'd advise you to use the Erdős-Rényi model: it's simple, not far from what you originally proposed, and allows you to control the graph density (so, basically: the number of links).
Here's a short description of this model:
Define a probability value p (the higher p, the denser the graph: 0=no link, 1=fully connected graph);
Create your n nodes (as objects, as an adjacency matrix, or anything that suits you);
Each pair of nodes is connected with an (independent) probability p. So, you have to decide on the existence of a link between them using this probability p. For example, you could randomly draw a value q between 0 and 1 and create the link iff q < p. Then do the same thing for each possible pair of nodes in the graph.
With this model, if your p is large enough, then it's highly probable your graph is connected (cf. the Wikipedia reference for details). In any case, if you have several components, you can also force its connectedness by creating links between nodes of distinct components. First, you have to identify each component by performing breadth-first searches (one for each component). Then, you select pairs of nodes in two distinct components, create a link between them and consider both components as merged. You repeat this process until you've got a single component remaining.
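The steps above can be sketched in Java roughly as follows. This is a minimal illustration under assumptions: it uses a boolean adjacency matrix instead of the Node and Graph classes from the question, and the class ErdosRenyiSketch and its method names are hypothetical. The way components are joined (a random node of each remaining component is linked to a random node of the already merged part) is just one possible choice.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Random;

final class ErdosRenyiSketch {

    // Each unordered pair of nodes gets a link independently with probability p.
    static boolean[][] generate(int n, double p, Random rng) {
        boolean[][] adj = new boolean[n][n];
        for (int u = 0; u < n; u++) {
            for (int v = u + 1; v < n; v++) {
                if (rng.nextDouble() < p) {
                    adj[u][v] = true;
                    adj[v][u] = true;
                }
            }
        }
        return adj;
    }

    // Labels every node with a component id (0, 1, 2, ...) using one BFS per component.
    static int[] components(boolean[][] adj) {
        int n = adj.length;
        int[] label = new int[n];
        Arrays.fill(label, -1);
        int nextId = 0;
        for (int s = 0; s < n; s++) {
            if (label[s] != -1) continue;
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(s);
            label[s] = nextId;
            while (!queue.isEmpty()) {
                int u = queue.remove();
                for (int v = 0; v < n; v++) {
                    if (adj[u][v] && label[v] == -1) {
                        label[v] = nextId;
                        queue.add(v);
                    }
                }
            }
            nextId++;
        }
        return label;
    }

    // Forces connectedness: links each further component to the part merged so far.
    static void connect(boolean[][] adj, Random rng) {
        int[] label = components(adj);
        int count = Arrays.stream(label).max().orElse(-1) + 1;
        if (count == 0) return; // empty graph, nothing to do
        List<List<Integer>> members = new ArrayList<>();
        for (int c = 0; c < count; c++) members.add(new ArrayList<>());
        for (int v = 0; v < label.length; v++) members.get(label[v]).add(v);
        List<Integer> merged = new ArrayList<>(members.get(0));
        for (int c = 1; c < count; c++) {
            List<Integer> other = members.get(c);
            int u = merged.get(rng.nextInt(merged.size()));
            int v = other.get(rng.nextInt(other.size()));
            adj[u][v] = true;
            adj[v][u] = true;
            merged.addAll(other);
        }
    }

    public static void main(String[] args) {
        Random rng = new Random();
        boolean[][] adj = generate(20, 0.05, rng);
        connect(adj, rng);
        boolean connected = Arrays.stream(components(adj)).allMatch(c -> c == 0);
        System.out.println("connected: " + connected); // prints connected: true
    }
}
```

Converting the matrix into the question's Node objects would then be a matter of calling addChild in both directions for every true entry.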

The only tricky part is ensuring that the final graph is connected. To do that, you can use a disjoint set data structure. Keep track of the number of components, initially n. Repeatedly pick pairs of random vertices u and v, adding the edge (u, v) to the graph and to the disjoint set structure, and decrementing the component count when that structure tells you u and v belonged to different components. Stop when the component count reaches 1. (Note that using an adjacency matrix simplifies managing the case where the edge (u, v) is already present in the graph: in this case, adj[u][v] will be set to 1 a second time, which as desired has no effect.)
If you find this creates graphs that are too dense (or too sparse), then you can use another random number to add edges only k% of the time when the endpoints are already part of the same component (or when they are part of different components), for some k.
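A minimal Java sketch of this disjoint-set approach might look like the following. The names are hypothetical, and two details are assumptions not stated above: self-loops (u == v) are skipped, and the density knob is expressed as a probability sameComponentKeep of keeping an edge whose endpoints are already in the same component.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;

final class UnionFindRandomGraph {

    // Finds the representative of x's set, shortening the path as it goes (path halving).
    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Adds random edges until a single component remains.
    static boolean[][] generate(int n, double sameComponentKeep, Random rng) {
        boolean[][] adj = new boolean[n][n];
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        int components = n;
        while (components > 1) {
            int u = rng.nextInt(n);
            int v = rng.nextInt(n);
            if (u == v) continue; // assumption: no self-loops
            int ru = find(parent, u);
            int rv = find(parent, v);
            if (ru == rv) {
                // Endpoints already connected: keep this edge only some of the time.
                if (rng.nextDouble() >= sameComponentKeep) continue;
            } else {
                parent[ru] = rv; // merge the two components
                components--;
            }
            adj[u][v] = true;    // setting an existing edge again has no effect
            adj[v][u] = true;
        }
        return adj;
    }

    // Checks connectivity with a depth-first search from node 0.
    static boolean isConnected(boolean[][] adj) {
        int n = adj.length;
        if (n == 0) return true;
        boolean[] seen = new boolean[n];
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(0);
        seen[0] = true;
        int visited = 1;
        while (!stack.isEmpty()) {
            int u = stack.pop();
            for (int v = 0; v < n; v++) {
                if (adj[u][v] && !seen[v]) {
                    seen[v] = true;
                    visited++;
                    stack.push(v);
                }
            }
        }
        return visited == n;
    }

    // Counts undirected edges (upper triangle of the matrix).
    static int edgeCount(boolean[][] adj) {
        int count = 0;
        for (int u = 0; u < adj.length; u++) {
            for (int v = u + 1; v < adj.length; v++) {
                if (adj[u][v]) count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        boolean[][] adj = generate(15, 0.5, new Random());
        System.out.println("connected: " + isConnected(adj)); // prints connected: true
    }
}
```

With sameComponentKeep set to 0 every redundant edge is rejected, so the result is a random spanning tree with exactly n - 1 edges; values closer to 1 give denser graphs.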

The following paper proposes an algorithm that uniformly samples connected random graphs with a prescribed degree sequence, with an efficient implementation. It is available in several libraries, like NetworKit or igraph.
Fast generation of random connected graphs with prescribed degrees.
Fabien Viger, Matthieu Latapy
Be careful when you make simulations on random graphs: if they are not sampled uniformly, then they may have hidden properties that impact simulations; alternatively, uniformly sampled graphs may be very different from the ones your code will meet in practice...
