Bulk operation for creating edges in gremlin - java

I have a fairly demanding task and unfortunately I can't get any further. Maybe you have a tip for me.
Goal:
Create a lot of edges with Apache Gremlin using only one message to the Gremlin server (a kind of bulk operation for creating edges).
The sourceId, the targetId and the type are stored in a list of Java POJOs.
Use Gremlin for Java.
Do not use IDs from the underlying engine; use some constant property PROP_ID for storing the user-given id.
My current approach was:
Create a list of maps, because Gremlin for Java can only inject objects when they are maps or arrays:
Object[] edgesMap = edges.stream().map(edge -> {
    Map<String, String> m = new HashMap<>();
    m.put("sourceId", edge.sourceId);
    m.put("targetId", edge.targetId);
    m.put("type", edge.type);
    return m;
}).collect(Collectors.toList()).toArray();
Now I wanted to inject the object into a traversal and iterate over the maps, creating an edge for every entry.
GraphTraversal<Vertex, Vertex> traversal = g.withSideEffect("edgeList", edgesMap).V().limit(1).sideEffect(
    select("edgeList").unfold().as("edge").sideEffect(
        g.V().has(PROP_ID, select("edge").select("targetId")).addE(select("edge").select("type")).from(select("edge").select("sourceId"))
    )
);
traversal.iterate();
But unfortunately I cannot use .has() in the anonymous traversal like this, because .select(...).select(...) does not inject a constant value but returns a traversal. I was told in the TinkerPop community that such a has() traversal will always evaluate to true for every node, so an edge ends up being created for every node.
I was told to use the where() step to get only the node whose PROP_ID property matches the value from the currently iterated map. But where() expects a P<String> predicate, and with my current knowledge I'm unable to get the select(...) values into that predicate.
Maybe someone can help me rewrite the traversal, or has an idea how else I can implement the requirements. Thanks! :)
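For what it's worth, one possible direction based on the where() suggestion, as a minimal sketch rather than a tested solution: it assumes a reasonably recent TinkerPop version (where addE() accepts a traversal for the label) and injects the List<Map<String, String>> directly instead of converting it to an array. The where(P.eq("edge")).by(...).by(...) pattern compares the vertex's PROP_ID (first by()) with the value selected from the injected map (second by()):

import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.select;
import org.apache.tinkerpop.gremlin.process.traversal.P;

// edgeMaps is the List<Map<String, String>> built above (before toArray())
g.inject(edgeMaps).unfold().as("edge")
    // source vertex: its PROP_ID must equal the map's "sourceId" value
    .V().where(P.eq("edge")).by(PROP_ID).by(select("sourceId")).as("src")
    // target vertex: its PROP_ID must equal the map's "targetId" value
    .V().where(P.eq("edge")).by(PROP_ID).by(select("targetId"))
    // the edge label is taken from the map's "type" value
    .addE(select("edge").select("type")).from("src")
    .iterate();

Whether the PROP_ID comparison can still use an index in this form depends on the graph provider; if it cannot, every edge triggers a vertex scan, so it is worth checking with profile() before relying on this for large batches.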

Related

Broadcasting a HashMap in Flink

I am using Flink v.1.4.0.
I am working with the DataSet API and one of the things I want to try is very similar to how broadcast variables are used in Apache Spark.
Practically, I want to apply a map function on a DataSet, go through each of the elements in the DataSet and search for it in a HashMap; if the search element is present in the Map then retrieve the respective value.
The HashMap is very big and I don't know if (since I haven't even built my solution) it needs to be Serializable to be transmitted and used by all workers concurrently.
In general, the solution I have in mind would look like this:
Map<String, T> hashMap = new ... ;
DataSet<Point> points = env.readCsv(...);
points
    .map(point -> hashMap.getOrDefault(point.getId(), 0))
    ...
but I don't know if this would work or if it is efficient in any way. After doing a bit of searching I found a much better example here, according to which one can use broadcast variables in Flink to broadcast a List as follows:
DataSet<Point> points = env.readCsv(...);
DataSet<Centroid> centroids = ... ; // some computation
points.map(new RichMapFunction<Point, Integer>() {
    private List<Centroid> centroids;

    @Override
    public void open(Configuration parameters) {
        this.centroids = getRuntimeContext().getBroadcastVariable("centroids");
    }

    @Override
    public Integer map(Point p) {
        return selectCentroid(centroids, p);
    }
}).withBroadcastSet(centroids, "centroids");
However, .getBroadcastVariable() seems to only work with a List.
Can someone provide an alternative solution with a HashMap?
How would that solution work?
What is the most efficient way to go about solving this?
Could one use a Flink Managed State to do something similar to how broadcast variables are used? How?
Finally, can I attempt multiple mappings with multiple broadcast variables in a pipeline?
Where do the values of hashMap come from? Two other possible solutions:
Reinitialise/recreate/regenerate the hashMap in each instance of your filtering/mapping operator, separately in the open() method. Probably more efficient per record, but it duplicates the initialisation logic.
Create two DataSets, one for the hashMap values and a second for the points, and join those two DataSets using the desired join strategy. As an analogy, what you are trying to do could be expressed by the SQL query SELECT * FROM points p, hashMap h WHERE h.key = p.id.
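If you do want to stick with the broadcast-variable approach from the question, one option is to broadcast the map entries as tuples and rebuild the HashMap once per task in open(). A minimal sketch, assuming the DataSet API as in the question:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

public class BroadcastMapExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // The map content, expressed as a DataSet of key/value tuples.
        DataSet<Tuple2<String, Integer>> mapEntries =
                env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2));

        DataSet<String> ids = env.fromElements("a", "b", "c");

        ids.map(new RichMapFunction<String, Integer>() {
            private final Map<String, Integer> lookup = new HashMap<>();

            @Override
            public void open(Configuration parameters) {
                // Broadcast variables always arrive as a List; rebuild the HashMap once per task.
                List<Tuple2<String, Integer>> entries =
                        getRuntimeContext().getBroadcastVariable("entries");
                for (Tuple2<String, Integer> e : entries) {
                    lookup.put(e.f0, e.f1);
                }
            }

            @Override
            public Integer map(String id) {
                return lookup.getOrDefault(id, 0);
            }
        }).withBroadcastSet(mapEntries, "entries").print();
    }
}

The rebuild in open() happens once per parallel task instance, so the per-record lookups stay HashMap-fast; the same pattern extends to multiple broadcast sets registered under different names.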

What data structure should I use for a map with variable set as keys?

My dataset looks like this:
Task-1, Priority1, (SkillA, SkillB)
Task-2, Priority2, (SkillA)
Task-3, Priority3, (SkillB, SkillC)
The calling application (client) will send in a list of skills, say (SkillD, SkillA).
Lookup:
Search through the dataset for SkillD first, and find nothing.
Search for SkillA. We will find two entries: Task-1 with Priority1 and Task-2 with Priority2.
Identify the task with the highest priority (in this case, Task-1).
Remove Task-1 from the dataset and return it to the client.
Design considerations:
There will be a lot of adds/updates/deletes to the dataset once the website goes live.
There are only a few skills (about 10), though not a static list, but for each skill there can be thousands of tasks. So the lookup/retrieval will have to be extremely fast.
I have considered a simple List with binarySearch(comparator) or a Map<Skill, SortedSet<Task>>, but I'm looking for more ideas.
What is the best way to design a data structure for this kind of dataset that allows a complex key and a sorted collection of tasks associated with that key?
How about changing the approach a bit?
You can use Guava, and a Multimap in particular.
Every experienced Java programmer has, at one point or another, implemented a Map<K, List<V>> or Map<K, Set<V>>, and dealt with the awkwardness of that structure. For example, Map<K, Set<V>> is a typical way to represent an unlabeled directed graph. Guava's Multimap framework makes it easy to handle a mapping from keys to multiple values. A Multimap is a general way to associate keys with arbitrarily many values.
There are two ways to think of a Multimap conceptually: as a collection of mappings from single keys to single values, or as a mapping from unique keys to collections of values.
I would suggest using a Multimap here; the answer to your problem lies in a powerful feature of Multimap called Views.
Good luck!
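A minimal sketch of how that could look, assuming Guava on the classpath, a recent Java version, and a hypothetical Task type with just a name and a numeric priority (lower value = more urgent):

import com.google.common.collect.MultimapBuilder;
import com.google.common.collect.SortedSetMultimap;

import java.util.Comparator;

// Hypothetical Task type, used only for illustration.
record Task(String name, int priority) {}

public class SkillIndex {
    public static void main(String[] args) {
        // One key per skill; the tasks for each skill are kept sorted by priority.
        SortedSetMultimap<String, Task> bySkill = MultimapBuilder
                .hashKeys()
                .treeSetValues(Comparator.comparingInt(Task::priority).thenComparing(Task::name))
                .build();

        bySkill.put("SkillA", new Task("Task-1", 1));
        bySkill.put("SkillB", new Task("Task-1", 1));
        bySkill.put("SkillA", new Task("Task-2", 2));
        bySkill.put("SkillB", new Task("Task-3", 3));
        bySkill.put("SkillC", new Task("Task-3", 3));

        // Lookup for (SkillD, SkillA): SkillD yields nothing, SkillA yields Task-1.
        for (String skill : new String[]{"SkillD", "SkillA"}) {
            if (!bySkill.get(skill).isEmpty()) {
                Task best = bySkill.get(skill).first();
                // Remove the chosen task from every skill it is registered under.
                bySkill.values().removeIf(t -> t.name().equals(best.name()));
                System.out.println(skill + " -> " + best);
                break;
            }
        }
    }
}

The views returned by get() and values() are live, so removing the task through values() updates the whole structure; adds, updates and deletes stay cheap because each skill key only touches its own sorted set.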
I would consider MongoDB. The data object for one of your rows sounds like a better fit for a JSON document than for a row in a table, because the skill set list may grow. In a classic relational DB you solve this in one of three ways: have ever-expanding columns to cover the maximum number of skill set columns (very ugly), have a separate table that groups skill sets matched to an ID, or store the skill sets as a comma-delimited list. Each of these is unpleasant. In MongoDB you can have array fields, and the items in the array are indexable.
So with this in mind I would do all the querying in MongoDB and let it deal with it all. I would create a POJO that would look like this:
public class TaskPriority {
    String taskId;
    String priorityId;
    List<String> skillIds;
}
In MongoDB you can index all these fields to get fast searching and querying.
If it is the case that you have to cache these items locally and do these queries off of Java data structures then what you can do is create an index for the items you care about that reference instances of the TaskPriority object.
For example, to track skill sets to their TaskPriority objects, the following Map can be used:
Map<String, TaskPriority> skillSetToTaskPriority;
You can repeat this for taskId and priorityId. You would have to manage these indexes. This is usually the job of your DB to do.
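As a minimal sketch of building such an index in plain Java (note one assumption: since a skill can map to thousands of tasks, a List is kept per key rather than the single value shown above, and the TaskPriority fields are assumed to be accessible):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TaskIndex {
    // Builds the in-memory skill -> TaskPriority index described above.
    public static Map<String, List<TaskPriority>> bySkill(List<TaskPriority> tasks) {
        Map<String, List<TaskPriority>> index = new HashMap<>();
        for (TaskPriority tp : tasks) {
            for (String skillId : tp.skillIds) {
                index.computeIfAbsent(skillId, k -> new ArrayList<>()).add(tp);
            }
        }
        return index;
    }
}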
Finally, you can then have POJOs and tables (or MongoDB collections) that map the taskId to a Task object containing any metadata about that task you may wish to have. The same is true for Priority and SkillSet. So that's 4 MongoDB collections: Tasks, Priorities, SkillSets, and TaskPriorities.

Elegant way to implement a navigable graph?

This is a design problem. I'm struggling to create a conceptual model for a problem I'm facing.
I have a graph of a number of objects (<1000). These objects are connected together in a myriad of ways. Each of these objects have some attributes.
I need to be able to access these object via both their connections and their attributes.
For example let us assume following objects -
{name: A, attributes:{black, thin, invalid}, connections: {B,C}}
{name: B, attributes:{white, thin, valid}, connections: {A}}
{name: C, attributes:{black, thick, invalid}, connections: {A,B}}
Now I should be able to query this graph in following ways -
Using attributes -
black - yields [A,C]
black.thick - yields C
Using connections -
A.connections[0].connections[0] - yields A
Using combination thereof -
black[0].connections[0] - yields B
My primary language is Java. But I don't think Java is capable of handling these kinds of beasts. Thus I'm trying to implement this in a dynamic language like Python.
I have also thought about using expression language evaluation like OGNL, or a Graph database. But I'm confused. I'm not interested in coding solutions. But what is the correct way to model such a problem?
It sounds like you have some object model which you want to query in different ways. One solution would be to use Java to create your model and then use a scripting language to support querying against this model in different ways. e.g: Java + Groovy would be my recommendation.
You could use the following Java class for the model.
public class Node {
    private String name;
    private final Set<String> attributes = new HashSet<String>();
    private final List<Node> connections = new ArrayList<Node>();

    // getters / setters for all
}
You should then populate a list of such objects with 'connections' property properly populated.
To support different kinds of scripting what you need to do is create a context for the scripts and then populated this context. Context is basically a map. The keys of the map become variables available to the script. The trick is to populate this context to support your querying requirements.
For example, in Groovy the binding is the context (refer to http://groovy.codehaus.org/Embedding+Groovy). So if you populate it the following way, your querying needs will be taken care of:
Context/Binding Map
1. <Node name(String), Node object instance(Node)>
2. <Attribute name(String), list of nodes having this attribute(List<Node>)>
When you evaluate a script saying 'A.connections[0]', the object stored against the key 'A' is looked up in the binding. Then the returned object's 'connections' property is accessed. Since that is a list, the '[0]' syntax on it is permitted in Groovy and returns the object at index 0. Likewise, to support the rest of your querying requirements you need to populate the context accordingly.
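A minimal sketch of that embedding, assuming the Node class above with the usual getters and setters, a recent Java version, and Groovy on the classpath:

import groovy.lang.Binding;
import groovy.lang.GroovyShell;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GraphQueryDemo {
    public static void main(String[] args) {
        Node a = new Node(); a.setName("A");
        Node b = new Node(); b.setName("B");
        a.getConnections().add(b);
        b.getConnections().add(a);
        a.getAttributes().add("black");
        b.getAttributes().add("white");

        // Context: node names -> nodes, attribute names -> lists of nodes.
        Binding binding = new Binding();
        binding.setVariable("A", a);
        binding.setVariable("B", b);
        Map<String, List<Node>> byAttribute = new HashMap<>();
        byAttribute.put("black", List.of(a));
        byAttribute.put("white", List.of(b));
        byAttribute.forEach(binding::setVariable);

        GroovyShell shell = new GroovyShell(binding);
        System.out.println(shell.evaluate("A.connections[0].connections[0].name")); // A
        System.out.println(shell.evaluate("black[0].connections[0].name"));         // B
    }
}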
It depends where you want your performance to be.
If you want fast queries, and don't mind a bit of extra time/memory when adding an object, keeping an array/list of pointers to objects with specific attributes might be a good idea (particularly if you know during design-time what the possible attributes could be). Then, when adding a new object, say:
{name: A, attributes:{black, thin, invalid}, connections: {B,C}}
add a new pointer to the black list, the thin list, and the invalid list. Quick queries on connections will probably require keeping a list/array of pointers as a member of the object class. Then when you create an object, add pointers for the correct objects.
If you don't mind slower queries and want to optimize performance when adding objects, a linked list might be a better approach. You can just loop through all of the objects, checking at each one if it satisfies the condition of the query.
In this case, it would still be a good idea to keep member pointers for the connections if (as your question would seem to indicate) you're looking to do multiple-level queries such as A.connections[0].connections[0]; these will result in extremely poor performance if done via nested loops.
Hopefully that helps, it really kind of depends on what kind of queries you're expecting to call most frequently.
There is no problem expressing this in Java. Just define classes representing nodes and sets of nodes. Assuming that there is a fixed set of attributes, it could look like this:
enum Attribute {
    BLACK, WHITE, THIN, VALID /* etc. */ ;
}

class Node {
    public final String name;
    public final EnumSet<Attribute> attrs = EnumSet.noneOf(Attribute.class);
    public final NodeSet connections = new NodeSet();

    public Node(String name) {
        this.name = name;
    }

    // ... methods for adding attributes and connections
}
and then a class that represents a set of nodes:
class NodeSet extends LinkedHashSet<Node> {

    /**
     * Returns the nodes that have at least one of the given attributes.
     */
    public NodeSet with(Attribute... as) {
        NodeSet out = new NodeSet();
        for (Node n : this) {
            for (Attribute a : as) {
                if (n.attrs.contains(a)) {
                    out.add(n);
                    break;
                }
            }
        }
        return out;
    }

    /**
     * Returns all nodes connected to this set.
     */
    public NodeSet connections() {
        NodeSet out = new NodeSet();
        for (Node n : this)
            out.addAll(n.connections);
        return out;
    }

    /**
     * Returns the first node in the set.
     */
    public Node first() {
        return iterator().next();
    }
}
(I haven't checked that the code compiles, it's just a sketch.) Then, assuming you have a NodeSet all of all the nodes, you can do things like
all.with(BLACK).first().connections()
I think that solving this problem with a graph makes sense. You mention the possibility of using a graph database which I think will allow you to better focus on your problem as opposed to coding infrastructure. A simple in-memory graph like TinkerGraph from the TinkerPop project would be a good place to start.
By using TinkerGraph you then get access to a query language called Gremlin (also see GremlinDocs), which can help answer the questions you posed in your post. Here's a Gremlin session in the REPL which shows how to construct the graph you presented and some sample graph traversals that yield the answers you wanted. This first part simply constructs the graph given your example:
gremlin> g = new TinkerGraph()
==>tinkergraph[vertices:0 edges:0]
gremlin> a = g.addVertex("A",['color':'black','width':'thin','status':'invalid'])
==>v[A]
gremlin> b = g.addVertex("B",['color':'white','width':'thin','status':'valid'])
==>v[B]
gremlin> c = g.addVertex("C",['color':'black','width':'thick','status':'invalid'])
==>v[C]
gremlin> a.addEdge('connection',b)
==>e[0][A-connection->B]
gremlin> a.addEdge('connection',c)
==>e[1][A-connection->C]
gremlin> b.addEdge('connection',a)
==>e[2][B-connection->A]
gremlin> c.addEdge('connection',a)
==>e[3][C-connection->A]
gremlin> c.addEdge('connection',b)
==>e[4][C-connection->B]
Now the queries:
// black - yields [A,C]
gremlin> g.V.has('color','black')
==>v[A]
==>v[C]
// black.thick - yields C
gremlin> g.V.has('width','thick')
==>v[C]
// A.connections[0].connections[0] - yields A
gremlin> a.out.out[0]
==>v[A]
// black[0].connections[0] - yields B
gremlin> g.V.has('color','black')[0].out[0]
==>v[B]
While this approach does introduce some learning curve if you are unfamiliar with the stack, I think you'll find that graphs fit as solutions to many problems and having some experience with the TinkerPop stack will be generally helpful for other scenarios you encounter.

Fast way to count the number of relationships for a node

I am trying to count the number of outgoing relationships of a particular type a node has. My code currently looks like this:
int count = 0;
for (Relationship r : node.getRelationships(RelationshipTypes.MODIFIES, Direction.OUTGOING))
{
    count++;
}
return count;
The return type of getRelationships is Iterable so I can't use size() or equivalent. I am trying to avoid having to pull every relationship out of the database because some nodes have lots of relationships ( > 5 million). Is there a faster way of doing this?
No. The way neo4j stores relationships on disk for a node is in a linked list, and they do not keep any type of statistics for nodes or relationships. In order to get a count, you will have to go through all relationships for the node, of that type.
Even if there is a cache, in which things are stored more efficiently, the system may still not provide a full picture. Your method is the best method.
I would try to store the outgoing relationships in a data structure and get the size of that structure. This may take more time when the objects are initialized, but it seems like the easiest way to quickly get the size.
If node.getRelationships(RelationshipTypes.MODIFIES, Direction.OUTGOING) returned a type of Collection, then to know the number of outgoing relationships of a particular type a node has, you could simply use the following:
int count = node.getRelationships(RelationshipTypes.MODIFIES, Direction.OUTGOING).size();
I see you are using the Neo4j API. The other way would be to go with the TinkerPop Gremlin query language, which is available both for Groovy and Scala, but it will do the same thing internally. As far as I know, Neo4j gives you access through an iterator for performance reasons. For instance, you could have millions of relationships but want to paginate through the results on the fly. It would be really slow if Neo4j always returned a collection of relationships. That's why it returns an iterator and gives you access to the relationships on the fly; they are not retrieved from the DB until you need them.
So I would say no. I hope I could help you.
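For completeness, a minimal sketch of an alternative, assuming a Neo4j version of 2.1 or newer where the embedded API exposes Node#getDegree; for older versions the loop in the question is the way to go:

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;

final class DegreeCount {
    // Counts outgoing relationships of one type without materialising them.
    // For dense nodes Neo4j keeps per-type/per-direction counts, so this does
    // not walk the whole relationship chain.
    static int outgoingCount(Node node, RelationshipType type) {
        return node.getDegree(type, Direction.OUTGOING);
    }
}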

How do you query object collections in Java (Criteria/SQL-like)?

Suppose you have a collection of a few hundred in-memory objects and you need to query this List to return objects matching some SQL or Criteria like query. For example, you might have a List of Car objects and you want to return all cars made during the 1960s, with a license plate that starts with AZ, ordered by the name of the car model.
I know about JoSQL, has anyone used this, or have any experience with other/homegrown solutions?
Filtering is one way to do this, as discussed in other answers.
Filtering is not scalable though. On the surface time complexity would appear to be O(n) (i.e. already not scalable if the number of objects in the collection will grow), but actually because one or more tests need to be applied to each object depending on the query, time complexity more accurately is O(n t) where t is the number of tests to apply to each object.
So performance will degrade as additional objects are added to the collection, and/or as the number of tests in the query increases.
There is another way to do this, using indexing and set theory.
One approach is to build indexes on the fields within the objects stored in your collection and which you will subsequently test in your query.
Say you have a collection of Car objects and every Car object has a field color. Say your query is the equivalent of "SELECT * FROM cars WHERE Car.color = 'blue'". You could build an index on Car.color, which would basically look like this:
'blue' -> {Car{name=blue_car_1, color='blue'}, Car{name=blue_car_2, color='blue'}}
'red' -> {Car{name=red_car_1, color='red'}, Car{name=red_car_2, color='red'}}
Then given a query WHERE Car.color = 'blue', the set of blue cars could be retrieved in O(1) time complexity. If there were additional tests in your query, you could then test each car in that candidate set to check if it matched the remaining tests in your query. Since the candidate set is likely to be significantly smaller than the entire collection, time complexity is less than O(n) (in the engineering sense, see comments below). Performance does not degrade as much, when additional objects are added to the collection. But this is still not perfect, read on.
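Before moving on, a minimal sketch of that index idea in plain Java (a recent Java version and a hypothetical Car type are assumed; a library such as CQEngine, described below, manages such indexes for you):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical Car type, used only for illustration.
record Car(String name, String color, int year) {}

public class ColorIndexDemo {
    public static void main(String[] args) {
        List<Car> cars = List.of(
                new Car("blue_car_1", "blue", 1965),
                new Car("blue_car_2", "blue", 1972),
                new Car("red_car_1", "red", 1969));

        // Build the index once: color -> candidate set.
        Map<String, List<Car>> byColor = new HashMap<>();
        for (Car c : cars) {
            byColor.computeIfAbsent(c.color(), k -> new ArrayList<>()).add(c);
        }

        // WHERE color = 'blue' AND year < 1970: O(1) index lookup,
        // then only the (small) candidate set is tested further.
        List<Car> result = new ArrayList<>();
        for (Car c : byColor.getOrDefault("blue", List.of())) {
            if (c.year() < 1970) {
                result.add(c);
            }
        }
        System.out.println(result); // [Car[name=blue_car_1, color=blue, year=1965]]
    }
}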
Another approach is what I would refer to as a standing query index. To explain: with conventional iteration and filtering, the collection is iterated and every object is tested to see if it matches the query. So filtering is like running a query over a collection. A standing query index would be the other way around, where the collection is instead run over the query, but only once for each object in the collection, even though the collection could be queried any number of times.
A standing query index would be like registering a query with some sort of intelligent collection, such that as objects are added to and removed from the collection, the collection would automatically test each object against all of the standing queries which have been registered with it. If an object matches a standing query then the collection could add/remove it to/from a set dedicated to storing objects matching that query. Subsequently, objects matching any of the registered queries could be retrieved in O(1) time complexity.
The information above is taken from CQEngine (Collection Query Engine). This basically is a NoSQL query engine for retrieving objects from Java collections using SQL-like queries, without the overhead of iterating through the collection. It is built around the ideas above, plus some more. Disclaimer: I am the author. It's open source and in maven central. If you find it helpful please upvote this answer!
I have used Apache Commons JXPath in a production application. It allows you to apply XPath expressions to graphs of objects in Java.
Yes, I know it's an old post, but technologies appear every day and the answer will change over time.
I think this is a good problem to solve it with LambdaJ. You can find it here:
http://code.google.com/p/lambdaj/
Here you have an example:
Look for active customers (iterable version):
List<Customer> activeCustomers = new ArrayList<Customer>();
for (Customer customer : customers) {
    if (customer.isActive()) {
        activeCustomers.add(customer);
    }
}
LambdaJ version
List<Customer> activeCustomers = select(customers,
        having(on(Customer.class).isActive()));
Of course, having this kind of beauty has a performance impact (a small one... about a factor of 2 on average), but can you find more readable code?
It has many many features, another example could be sorting:
Sort Iterative
List<Person> sortedByAgePersons = new ArrayList<Person>(persons);
Collections.sort(sortedByAgePersons, new Comparator<Person>() {
    public int compare(Person p1, Person p2) {
        return Integer.valueOf(p1.getAge()).compareTo(p2.getAge());
    }
});
Sort with lambda
List<Person> sortedByAgePersons = sort(persons, on(Person.class).getAge());
Update: since Java 8 you can use lambda expressions out of the box, like:
List<Customer> activeCustomers = customers.stream()
        .filter(Customer::isActive)
        .collect(Collectors.toList());
Continuing the Comparator theme, you may also want to take a look at the Google Collections API. In particular, they have an interface called Predicate, which serves a similar role to Comparator, in that it is a simple interface that can be used by a filtering method, like Sets.filter. They include a whole bunch of composite predicate implementations, to do ANDs, ORs, etc.
Depending on the size of your data set, it may make more sense to use this approach than a SQL or external relational database approach.
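A minimal sketch of that Predicate style, assuming current Guava (the successor of the Google Collections API), a recent Java version, and a hypothetical Car type with model, plate and year:

import com.google.common.base.Predicate;
import com.google.common.base.Predicates;
import com.google.common.collect.Collections2;

import java.util.Collection;
import java.util.List;

// Hypothetical Car type, used only for illustration.
record Car(String model, String plate, int year) {}

public class PredicateFilterDemo {
    public static void main(String[] args) {
        List<Car> cars = List.of(
                new Car("Mustang", "AZ-1234", 1965),
                new Car("Beetle", "CA-9876", 1968));

        // Composite predicate: made during the 1960s AND plate starts with "AZ".
        Predicate<Car> sixties = c -> c.year() >= 1960 && c.year() <= 1969;
        Predicate<Car> azPlate = c -> c.plate().startsWith("AZ");

        Collection<Car> matches = Collections2.filter(cars, Predicates.and(sixties, azPlate));
        System.out.println(matches); // [Car[model=Mustang, plate=AZ-1234, year=1965]]
    }
}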
If you need a single concrete match, you can have the class implement Comparator, then create a standalone object with all the hashed fields included and use it to return the index of the match. When you want to find more than one (potentially) object in the collection, you'll have to turn to a library like JoSQL (which has worked well in the trivial cases I've used it for).
In general, I tend to embed Derby into even my small applications, use Hibernate annotations to define my model classes and let Hibernate deal with caching schemes to keep everything fast.
I would use a Comparator that takes a range of years and license plate pattern as input parameters. Then just iterate through your collection and copy the objects that match. You'd likely end up making a whole package of custom Comparators with this approach.
The Comparator option is not bad, especially if you use anonymous classes (so as not to create redundant classes in the project), but eventually when you look at the flow of comparisons, it's pretty much just like looping over the entire collection yourself, specifying exactly the conditions for matching items:
for (Car car : cars) {
    if (1959 < car.getYear() && 1970 > car.getYear() &&
            car.getLicense().startsWith("AZ")) {
        result.add(car);
    }
}
Then there's the sorting... that might be a pain in the backside, but luckily there's class Collections and its sort methods, one of which receives a Comparator...
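For instance, a minimal sketch of that last step, assuming the Car class has a getModel() accessor:

import java.util.Collections;
import java.util.Comparator;

// result is the List<Car> filled by the loop above.
Collections.sort(result, new Comparator<Car>() {
    @Override
    public int compare(Car a, Car b) {
        return a.getModel().compareTo(b.getModel());
    }
});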
