Hierarchical Data structure design - java

I have hierarchical data with the following characteristics:
A node can have any number of children
The nodes can be marked as special. Once a node is marked special, the whole subtree starting from that node becomes special.
The following are the operations I want to perform:
1. Tree.get("a.b.d.g") should give me node g
2. Tree.set("a.b.d.g", value) should set node g's value
3. At any node, I should know which node is the root
4. At any node, I should know whether I am part of a special subtree
5. I should be able to copy/move a subtree into another tree
6. I can add new nodes or delete existing nodes at every level
7. I should be able to serialize this data
The only thing I can currently think of is a "hashmap of hashmaps" kind of data structure. I can always cache the answers to operations 3 and 4 (root node and special status) at every node; of course, I need to clear that cache when I copy or move a subtree.
Are there other implementations that give better performance for the above operations with a minimal memory footprint?

For basic modeling, you should use a Composite pattern:
public class TreeNode {
    private String id;
    private TreeNode parent;
    private List<TreeNode> treeNodes = new ArrayList<>();
    ...
}
Each node has a String id, a reference to its parent, and references to its children.
You can get the top root by following getParent() until it returns null (with a loop or recursion).
For parsing the path imagine something like:
public TreeNode get(final String path) {
    if (!path.isEmpty()) {
        for (TreeNode treeNode : treeNodes) {
            if (path.startsWith(treeNode.getId())) {
                return treeNode.get(path.substring(...));
            }
        }
    }
    return this;
}
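As a rough, self-contained sketch of the same idea, the class below splits the dotted path into segments and looks each one up in a child map, which avoids prefix clashes between ids such as "a" and "ab". The names PathNode, inSpecialSubtree and moveTo are hypothetical and not part of the answer above. The sketch assumes that ids contain no dots and that the first path segment names the node get() is called on.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical node type for illustration; not the TreeNode class from the answer.
class PathNode {
    final String id;
    PathNode parent;
    Object value;
    boolean special;
    final Map<String, PathNode> children = new LinkedHashMap<>();

    PathNode(String id) { this.id = id; }

    PathNode addChild(String childId) {
        PathNode child = new PathNode(childId);
        child.parent = this;
        children.put(childId, child);
        return child;
    }

    // Resolves a dotted path whose first segment is this node's own id.
    // Returns null when any segment is missing.
    PathNode get(String path) {
        String[] segments = path.split("\\.");
        if (!segments[0].equals(id)) return null;
        PathNode current = this;
        for (int i = 1; i < segments.length && current != null; i++) {
            current = current.children.get(segments[i]);
        }
        return current;
    }

    void set(String path, Object newValue) {
        PathNode target = get(path);
        if (target == null) throw new IllegalArgumentException("No node at " + path);
        target.value = newValue;
    }

    // Walks parent links; cost is proportional to depth, nothing is cached.
    PathNode root() {
        PathNode current = this;
        while (current.parent != null) current = current.parent;
        return current;
    }

    // True if this node or any ancestor is marked special.
    boolean inSpecialSubtree() {
        for (PathNode n = this; n != null; n = n.parent) {
            if (n.special) return true;
        }
        return false;
    }

    // Moves this subtree under a new parent, possibly in another tree.
    // No guard against moving a node under its own descendant.
    void moveTo(PathNode newParent) {
        if (parent != null) parent.children.remove(id);
        parent = newParent;
        newParent.children.put(id, this);
    }
}

class PathNodeDemo {
    public static void main(String[] args) {
        PathNode a = new PathNode("a");
        PathNode d = a.addChild("b").addChild("d");
        PathNode g = d.addChild("g");
        d.special = true;
        a.set("a.b.d.g", 42);
        System.out.println(a.get("a.b.d.g").value);   // 42
        System.out.println(g.root().id);              // a
        System.out.println(g.inSpecialSubtree());     // true
    }
}
```

Because the root and the special flag are computed by walking parent links instead of being cached, a move needs no cache invalidation; the trade-off is that both lookups cost time proportional to the node's depth.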
Now, if you are looking for a way to store this kind of (graph-like) data and run performant queries on it, you can consider using a graph database, as #sebgymn mentioned: Neo4j is a good fit for that in Java.
It uses a connected data model on a NoSQL store. Nodes store data in properties; relationships are also stored and explicitly named in Neo4j, and act as links between nodes. You can then execute queries on a node (its properties, its relationships to other nodes, and so on).
Here is a link to a presentation: http://fr.slideshare.net/neo4j/data-modeling-with-neo4j
A great tutorial: http://technoracle.blogspot.fr/2012/04/getting-started-with-neo4j-beginners.html
A test graph database to execute queries: http://www.neo4j.org/learn/cypher
For instance: you can try implementing a multi-level pattern tree in neo4j (as in your case it is important to check the top root: so it seems your model has different levels on the tree).

Related

jackson rest service returning URI instead of child "objects"

I have a rather complex problem that I cannot find a good solution for.
I have a tree like object structure with a:
Node being extended by
ConditionNode
ResultNode
possible future nodes
Node has several methods like:
getChildren() - returning a List of Node
getTreeName()
getUuid()
This schema works if I want to get the whole tree as json. However, when I want to get a single Node like ConditionNode, my goal is to return a list of URIs pointing to the children nodes instead of returning all the children that one node might have:
/get/conditionNode/uuid -> {uuid: "...", treeName: "someName",
    children: [{uri: "/foo/childNode1"}, {uri: "/foo/childNode2"}]}
So my idea was to have a third class, UrlNode, which also extends Node, but that results in UrlNode having more properties than I need (like uuid, treeName, ...).
So I have three goals.
I want my children to be composable, so everything can be a child and therefore has to be a Node.
I want to have a class that only contains the URI and return that single class as a child representation for single ChildNodes.
I want everything to be able to be reverse mapped. So not only have rest services return json, but also being able to map the returned json into the same objects again (mainly for testing).
Is there a common design pattern for this?
Thanks,
Sven

Elegant way to implement a navigable graph?

This is a design problem. I'm struggling to create a conceptual model for a problem I'm facing.
I have a graph of a number of objects (<1000). These objects are connected together in a myriad of ways. Each of these objects has some attributes.
I need to be able to access these objects via both their connections and their attributes.
For example, let us assume the following objects -
{name: A, attributes:{black, thin, invalid}, connections: {B,C}}
{name: B, attributes:{white, thin, valid}, connections: {A}}
{name: C, attributes:{black, thick, invalid}, connections: {A,B}}
Now I should be able to query this graph in following ways -
Using attributes -
black - yields [A,C]
black.thick - yields C
Using connections -
A.connections[0].connections[0] - yields A
Using combination thereof -
black[0].connections[0] - yields B
My primary language is Java. But I don't think Java is capable of handling these kinds of beasts. Thus I'm trying to implement this in a dynamic language like Python.
I have also thought about using expression language evaluation like OGNL, or a Graph database. But I'm confused. I'm not interested in coding solutions. But what is the correct way to model such a problem?
It sounds like you have some object model which you want to query in different ways. One solution would be to use Java to create your model and then use a scripting language to support querying against this model in different ways. e.g: Java + Groovy would be my recommendation.
You could use the following Java class for the model.
public class Node {
    private String name;
    private final Set<String> attributes = new HashSet<String>();
    private final List<Node> connections = new ArrayList<Node>();
    // getter / setter for all
}
You should then populate a list of such objects, with the 'connections' property of each one properly filled in.
To support different kinds of scripting, you need to create a context for the scripts and then populate this context. The context is basically a map. The keys of the map become variables available to the script. The trick is to populate this context so that it supports your querying requirements.
For example, in Groovy the binding is the context (refer to http://groovy.codehaus.org/Embedding+Groovy). So if you populate it the following way, your querying needs will be taken care of:
Context/Binding Map
1. <Node name(String), Node object instance(Node)>
2. <Attribute name(String), list of nodes having this attribute(List<Node>)>
When you evaluate a script such as 'A.connections[0]', the object stored against key 'A' is looked up in the binding. Then the returned object's 'connections' property is accessed. Since that is a list, the '[0]' syntax is permitted on it in Groovy, and it returns the object at index 0. Likewise, to support your other querying requirements you need to populate the context accordingly.
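A minimal sketch of how such a context map could be populated in plain Java is shown below. ModelNode and QueryContextBuilder are hypothetical names standing in for the model class above, and the sketch assumes that node names and attribute names never collide, since both kinds of key share one map.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the Node model class described above.
class ModelNode {
    private final String name;
    private final Set<String> attributes = new LinkedHashSet<>();
    private final List<ModelNode> connections = new ArrayList<>();

    ModelNode(String name, String... attrs) {
        this.name = name;
        for (String a : attrs) attributes.add(a);
    }
    public String getName() { return name; }
    public Set<String> getAttributes() { return attributes; }
    public List<ModelNode> getConnections() { return connections; }
}

class QueryContextBuilder {
    // Builds the two kinds of entries listed above:
    //   node name      -> the node itself
    //   attribute name -> list of nodes carrying that attribute
    static Map<String, Object> build(Collection<ModelNode> nodes) {
        Map<String, List<ModelNode>> byAttribute = new LinkedHashMap<>();
        Map<String, Object> context = new LinkedHashMap<>();
        for (ModelNode node : nodes) {
            context.put(node.getName(), node);
            for (String attr : node.getAttributes()) {
                byAttribute.computeIfAbsent(attr, k -> new ArrayList<>()).add(node);
            }
        }
        // Caveat: an attribute named like a node would overwrite that node's entry.
        context.putAll(byAttribute);
        return context;
    }
}
```

With Groovy on the classpath, the resulting map could presumably be wrapped in a groovy.lang.Binding and handed to a GroovyShell, so that a script such as 'black[0].connections[0]' resolves 'black' from the map; that wiring is left out here to keep the sketch dependency-free.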
It depends where you want your performance to be.
If you want fast queries, and don't mind a bit of extra time/memory when adding an object, keeping an array/list of pointers to objects with specific attributes might be a good idea (particularly if you know during design-time what the possible attributes could be). Then, when adding a new object, say:
{name: A, attributes:{black, thin, invalid}, connections: {B,C}}
add a new pointer to the black list, the thin list, and the invalid list. Quick queries on connections will probably require keeping a list/array of pointers as a member of the object class. Then when you create an object, add pointers for the correct objects.
If you don't mind slower queries and want to optimize performance when adding objects, a linked list might be a better approach. You can just loop through all of the objects, checking at each one if it satisfies the condition of the query.
In this case, it would still be a good idea to keep member pointers for the connections if (as your question would seem to indicate) you're looking to do multiple-level queries, i.e. A.connections[0].connections[0]. Such queries will perform extremely poorly if done via nested loops.
Hopefully that helps, it really kind of depends on what kind of queries you're expecting to call most frequently.
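The "one list per attribute" idea can be sketched as an inverted index, shown here in Java. AttributeIndex and its methods are hypothetical names, and the sketch identifies nodes by name only. A multi-attribute query such as black.thick becomes an intersection of the per-attribute sets.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical inverted index: attribute -> names of nodes having it.
class AttributeIndex {
    private final Map<String, Set<String>> nodesByAttribute = new HashMap<>();

    // Adding a node costs one set insertion per attribute.
    void add(String nodeName, String... attributes) {
        for (String attr : attributes) {
            nodesByAttribute.computeIfAbsent(attr, k -> new LinkedHashSet<>()).add(nodeName);
        }
    }

    // Returns the nodes that carry ALL of the given attributes.
    Set<String> query(String... attributes) {
        Set<String> result = null;
        for (String attr : attributes) {
            Set<String> matches = nodesByAttribute.getOrDefault(attr, Collections.emptySet());
            if (result == null) {
                result = new LinkedHashSet<>(matches); // copy so the index is not mutated
            } else {
                result.retainAll(matches);
            }
        }
        return result == null ? new LinkedHashSet<>() : result;
    }
}

class AttributeIndexDemo {
    public static void main(String[] args) {
        AttributeIndex index = new AttributeIndex();
        index.add("A", "black", "thin", "invalid");
        index.add("B", "white", "thin", "valid");
        index.add("C", "black", "thick", "invalid");
        System.out.println(index.query("black"));          // [A, C]
        System.out.println(index.query("black", "thick")); // [C]
    }
}
```

The extra memory is one set entry per (node, attribute) pair, which is the trade-off the answer describes: slightly more work on insertion in exchange for queries that never scan the whole node list.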
There is no problem expressing this in Java. Just define classes representing nodes and sets of nodes. Assuming that there is a fixed set of attributes, it could look like:
import java.util.EnumSet;
import java.util.LinkedHashSet;

enum Attribute {
    BLACK, WHITE, THIN, VALID /* etc. */ ;
}
class Node {
    public final String name;
    public final EnumSet<Attribute> attrs
        = EnumSet.noneOf(Attribute.class);
    public final NodeSet connections
        = new NodeSet();
    public Node(String name)
    {
        this.name = name;
    }
    // ... methods for adding attributes and connections
}
and then a class that represents a set of nodes:
class NodeSet extends LinkedHashSet<Node> {
    /**
     * Returns the nodes that have at least one of the given attributes.
     */
    public NodeSet with(Attribute... as) {
        NodeSet out = new NodeSet();
        for (Node n : this) {
            for (Attribute a : as)
                if (n.attrs.contains(a)) {
                    out.add(n);
                    break;
                }
        }
        return out;
    }
    /**
     * Returns all nodes connected to this set.
     */
    public NodeSet connections() {
        NodeSet out = new NodeSet();
        for (Node n : this)
            out.addAll(n.connections);
        return out;
    }
    /**
     * Returns the first node in the set.
     */
    public Node first() {
        return iterator().next();
    }
}
(I haven't checked that the code compiles, it's just a sketch.) Then, assuming you have a NodeSet named all that contains all the nodes, you can do things like
all.with(BLACK).first().connections()
I think that solving this problem with a graph makes sense. You mention the possibility of using a graph database which I think will allow you to better focus on your problem as opposed to coding infrastructure. A simple in-memory graph like TinkerGraph from the TinkerPop project would be a good place to start.
By using TinkerGraph you get access to a query language called Gremlin (also see GremlinDocs), which can help answer the questions you posed in your post. Here's a Gremlin session in the REPL that shows how to construct the graph you presented, along with some sample graph traversals that yield the answers you wanted. This first part simply constructs the graph from your example:
gremlin> g = new TinkerGraph()
==>tinkergraph[vertices:0 edges:0]
gremlin> a = g.addVertex("A",['color':'black','width':'thin','status':'invalid'])
==>v[A]
gremlin> b = g.addVertex("B",['color':'white','width':'thin','status':'valid'])
==>v[B]
gremlin> c = g.addVertex("C",['color':'black','width':'thick','status':'invalid'])
==>v[C]
gremlin> a.addEdge('connection',b)
==>e[0][A-connection->B]
gremlin> a.addEdge('connection',c)
==>e[1][A-connection->C]
gremlin> b.addEdge('connection',a)
==>e[2][B-connection->A]
gremlin> c.addEdge('connection',a)
==>e[3][C-connection->A]
gremlin> c.addEdge('connection',b)
==>e[4][C-connection->B]
Now the queries:
// black - yields [A,C]
gremlin> g.V.has('color','black')
==>v[A]
==>v[C]
// black.thick - yields C
gremlin> g.V.has('color','black').has('width','thick')
==>v[C]
// A.connections[0].connections[0] - yields A
gremlin> a.out.out[0]
==>v[A]
// black[0].connections[0] - yields B
gremlin> g.V.has('color','black')[0].out[0]
==>v[B]
While this approach does introduce some learning curve if you are unfamiliar with the stack, I think you'll find that graphs fit as solutions to many problems and having some experience with the TinkerPop stack will be generally helpful for other scenarios you encounter.

Storing parent child mapping in memory. To list all reachable child for a parent efficiently

I have parent and child mappings in a relational database, as below:
relationship_id | parent_id | child_id
1 | 100009 | 600009
2 | 100009 | 600010
3 | 600010 | 100008
For performance optimization, I would like to keep all these mappings in memory.
Here, a child can have more than one parent, and a parent can have more than two children.
I guess I should use a graph data structure.
Populating the structure in memory is a one-time activity. My concern is that when I ask for all children of a node (not only the immediate children), it should return them as fast as possible. Additions and deletions happen rarely.
What data structure and algorithm should I use?
I tried MultiHashMap to achieve O(1) search time, but it has more redundancy.
Have a graph data structure for parent-child relationships. Each GraphNode can just have an ArrayList of children.
Then have HashMap that maps ID to GraphNode.
You need to figure something out so you don't create a cycle (if this is possible) which will cause an infinite loop.
You'll need a custom Node class and a hashmap to store node references for easy lookup.
for each row in database
    if parent node exists in map
        get it
    else
        create it and add it
    if child node exists in map
        get it
    else
        create it and add it
    set relationship between parent and child
The node class would look something like:
public class Node {
    private int id;
    private List<Node> parents = new ArrayList<Node>();
    private List<Node> children = new ArrayList<Node>();
    //getters and setters
}
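To list every reachable child, a breadth-first traversal with a visited set is one option; the visited set also keeps the traversal finite if the data ever contains a cycle. The sketch below is a minimal illustration that stores adjacency lists keyed by id instead of Node objects; ParentChildGraph and its method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory graph: parent id -> list of direct child ids.
class ParentChildGraph {
    private final Map<Integer, List<Integer>> childrenByParent = new HashMap<>();

    // Call once per database row (parent_id, child_id).
    void addRelationship(int parentId, int childId) {
        childrenByParent.computeIfAbsent(parentId, k -> new ArrayList<>()).add(childId);
    }

    // Returns every node reachable from startId through child links.
    // Each node is visited at most once, so shared children and cycles are safe.
    Set<Integer> descendants(int startId) {
        Set<Integer> visited = new LinkedHashSet<>();
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(startId);
        while (!queue.isEmpty()) {
            int current = queue.poll();
            for (int child : childrenByParent.getOrDefault(current, Collections.emptyList())) {
                if (visited.add(child)) {
                    queue.add(child);
                }
            }
        }
        return visited;
    }
}

class ParentChildGraphDemo {
    public static void main(String[] args) {
        ParentChildGraph graph = new ParentChildGraph();
        graph.addRelationship(100009, 600009);
        graph.addRelationship(100009, 600010);
        graph.addRelationship(600010, 100008);
        System.out.println(graph.descendants(100009)); // [600009, 600010, 100008]
    }
}
```

The traversal touches each reachable node and edge once. If the same lookups are repeated often and the data rarely changes, the result per start node could also be cached, at the cost of extra memory.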

Utility for graph manipulation in java

I need a key ==>> value graph structure like the one in the following image:
The numbers in the circles are the keys of the nodes.
I want to access the value stored under the key path 2-7-6-5, and, given the key path 2-7, to retrieve a sub-graph containing the collection of key-values 2, 6-5 and 6-11. I wrote my own implementation with nested maps and it works fine, but my question is:
Is there any custom Map implementation or third-party library that solves this, so that I can clean my code of manual manipulation such as String.split, loops and conditional statements?
If you are really just looking for a 3rd-Party Java Library to work with graphs take a look at JUNG it has plenty of features for graph manipulation. However, it might be overkill for what you are trying to achieve.
Take a look at JGraph; it is really good for graph manipulation, and also for displaying graph structures in Swing:
<dependency>
    <groupId>jgraph</groupId>
    <artifactId>jgraph</artifactId>
    <version>5.13.0.0</version>
</dependency>
http://www.jgraph.com
This is a fairly simple graph construction and traversal problem. You do not need any libraries. You can do it in a simple Java class. For example:
http://it-essence.xs4all.nl/roller/technology/entry/three_tree_traversals_in_java
It sounds like you'd want to implement nodes as class instances and links as references. Using maps to implement graph edges would be quite complicated and inefficient. Little wonder you'd want to clean up your code. I'm not sure I understand your problem perfectly, but this ought to be close:
import java.util.HashSet;
import java.util.Set;

// Null nodes are the simplest type. They represent missing children.
class NullNode {
    // A null node has no key, so it never matches one.
    boolean hasKey(int k) { return false; }
    // Get the values of all leaves descended from this node as a set.
    Set<Integer> getValues() { return new HashSet<>(); }
    // Get the values descended from the node at the end of the given
    // key path as a set. For a null node, this should not be called.
    Set<Integer> getValues(int[] path, int i) { throw new UnsupportedOperationException(); }
    // Initiate the search for values. The only way that's okay
    // for null nodes is when the path is empty.
    Set<Integer> getValues(int[] path) {
        if (path.length == 0)
            return new HashSet<>();
        else
            throw new UnsupportedOperationException();
    }
}
// A regular node is a null node with a key. It should
// never be instantiated. Use Interior or Leaf nodes for that.
abstract class Node extends NullNode {
    int key;
    boolean hasKey(int k) { return key == k; }
    // Initiate the search for values. Only descend if the key matches.
    // The index passed on is the path position matched by the current node.
    Set<Integer> getValues(int[] path) {
        return (path.length > 0 && path[0] == key) ? getValues(path, 0) : new HashSet<>();
    }
}
// Interior nodes have two children, which may be Null, Interior, or Leaf.
class InteriorNode extends Node {
    NullNode left = new NullNode(), right = new NullNode();
    Set<Integer> getValues() {
        Set<Integer> v = left.getValues();
        v.addAll(right.getValues());
        return v;
    }
    Set<Integer> getValues(int[] path, int i) {
        if (i + 1 < path.length) {
            // Again we only descend if the key matches.
            if (left.hasKey(path[i + 1])) return left.getValues(path, i + 1);
            if (right.hasKey(path[i + 1])) return right.getValues(path, i + 1);
            return new HashSet<>();
        }
        return getValues(); // Get values from both children.
    }
}
// A leaf node has no children and a value.
class LeafNode extends Node {
    int value;
    Set<Integer> getValues() {
        Set<Integer> v = new HashSet<>();
        v.add(value);
        return v;
    }
    Set<Integer> getValues(int[] path, int i) {
        return (i + 1 >= path.length) ? getValues() : new HashSet<>();
    }
}
The best graph library I have found is not written in Java but in Scala, and it makes use of some powerful Scala features that are not available in Java, such as abstract types.
It is called Graph for Scala and it is extremely comprehensive. I have to warn you, though: while Scala and Java are interoperable (you can build them in the same project and call a Java class from a Scala class and vice versa), some problems might arise when calling Scala from Java if the Scala code relies on features that are not available in Java.
http://www.assembla.com/spaces/scala-graph/wiki
Is there any custom Map implementation or third-party library that solves this, so that I can clean my code of manual manipulation such as String.split, loops and conditional statements?
If you want to avoid rewriting the manipulation code each time, you can create your own library. You can easily create a library in Eclipse by exporting your classes into a JAR file, and I would presume it is just as trivial in NetBeans.
If you want to protect against changes to the graph after construction then you need to create an immutable data structure. With an immutable graph structure you have to view the Graph as a Universe, and each operation is a GraphOperation. You can never modify a Graph, only create a new Graph that results from crossing the Graph with your list of GraphOperations. Presuming your Graph structure holds unique node values, this will not pose too much of a problem, since you can happily describe relations using values. Your code will look something like this:
Graph graph2 = graph1.process(graph1.GetTopNode().RemoveLeft());
graph2 = graph2.process(graph2.GetNode(7).AddRight(8));
GetTopNode() returns an object that only provides a view of the nodes. RemoveLeft() returns a GraphOperation object, which Graph.process() uses to create a new graph from the operation. If you want, it could just return a Graph implementation that internally stores a link to graph1, and the list of GraphOperation instances that have been passed into it, allowing you to avoid copying the graph structures too often (pretty much like a string buffer).
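As a rough illustration of the "operations produce a new graph" idea, here is a minimal Java sketch. ImmutableGraph, GraphOperation, addChild and removeChild are hypothetical names that only loosely mirror the calls above, and the sketch assumes nodes are identified by unique integer values. For simplicity it copies the whole edge map on every operation instead of recording a chain of operations.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical operation: receives a private mutable copy of the edge map.
interface GraphOperation {
    void applyTo(Map<Integer, Set<Integer>> edges);

    static GraphOperation addChild(int parent, int child) {
        return edges -> edges.computeIfAbsent(parent, k -> new LinkedHashSet<>()).add(child);
    }

    static GraphOperation removeChild(int parent, int child) {
        return edges -> {
            Set<Integer> children = edges.get(parent);
            if (children != null) children.remove(child);
        };
    }
}

// Hypothetical immutable graph: node value -> set of child values.
final class ImmutableGraph {
    private final Map<Integer, Set<Integer>> edges;

    ImmutableGraph() { this(new HashMap<>()); }

    private ImmutableGraph(Map<Integer, Set<Integer>> edges) { this.edges = edges; }

    // Read-only view; callers cannot modify the graph through it.
    Set<Integer> childrenOf(int node) {
        return Collections.unmodifiableSet(edges.getOrDefault(node, Collections.emptySet()));
    }

    // Never modifies this graph: the operation runs on a deep copy,
    // and the copy becomes the state of the returned graph.
    ImmutableGraph process(GraphOperation operation) {
        Map<Integer, Set<Integer>> copy = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> entry : edges.entrySet()) {
            copy.put(entry.getKey(), new LinkedHashSet<>(entry.getValue()));
        }
        operation.applyTo(copy);
        return new ImmutableGraph(copy);
    }
}

class ImmutableGraphDemo {
    public static void main(String[] args) {
        ImmutableGraph graph1 = new ImmutableGraph();
        ImmutableGraph graph2 = graph1.process(GraphOperation.addChild(7, 8));
        System.out.println(graph1.childrenOf(7)); // []
        System.out.println(graph2.childrenOf(7)); // [8]
    }
}
```

Copying on every operation is the simplest way to guarantee that older graphs stay valid; the lazier variant described above, which keeps a link to the previous graph plus a list of pending operations, would trade that simplicity for fewer copies.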
If you are looking for a graph database with a manipulation API in Java, Neo4j might help you. It may be more than you bargained for if all you need is a small utility, but it is a full graph database.
It gives you advanced options for traversing graph nodes and relationships, and for auditing. Neo4j is used across organizations to store very complex hierarchical data, and for that kind of data its traversal performance is reported to be better than that of Oracle-based relational databases.

Load parent/child hierarchy with hibernate controlling leafs

I am trying to load a graph of parent/child objects from a database (similar to the DefaultMutableTreeNode object of Java). There is a simple one-to-many association between the two. The total number of levels of the graph is known, so I know exactly how many times to invoke the 'getChildren()' method.
What I want to do is to NOT call this method for the actual leaf nodes. Usually the graph consists of a few non-leaf nodes and several hundred leaf nodes. If I specify lazy=false in the hb mapping, I get hundreds of unnecessary queries from hb for the children of leaf nodes, whereas I know beforehand that they are not needed (since I know the total number of levels in the tree).
Unfortunately, I cannot use lazy=true and only loop until the parents of the leaf nodes, because I am working on a disconnected client/server model and using beanlib to load the whole object graph (which contains several other objects).
So I am trying to find a way to intercept the loading of the 'children' collection and instruct hb to stop when it reaches the leaf nodes. Is there a way to do that?
I am looking at 2 solutions:
What I have in mind is this: when I call the node.getChildren() method (within a hb session), normally hb will perform a db query to get the children. Is there a way to intercept this call and just not make it? I know that there are no children, so I just want it to fail fast (in fact I don't want to make it at all).
Thank you
Costas
Why don't you just use a boolean leaf property, and make your getChildren method return an empty list if leaf is true?
private boolean leaf;
private List<Node> children;
public List<Node> getChildren() {
    if (leaf) {
        // Note: the utility class is java.util.Collections, not Collection.
        return Collections.<Node>emptyList();
    }
    return children;
}
Unless your database is colocated with the Java code issuing these queries, issuing a query per node is probably a performance bottleneck, even if it is just one query per inner node. Since you know the maximum number of levels of the tree (let's assume 3 for the sake of example), the following ought to fetch the entire tree in a single query:
from Node n1
left join n1.children as n2
left join n2.children as n3
left join n3.children as n4
The disadvantage of that method is that the result set will repeat the data of each inner node for each of its descendants, i.e. the bandwidth taken is multiplied by the number of tree levels. If that is an issue because you have many levels, you could enable batch fetching for that collection, or even do something similar by hand:
List<Node> border = Collections.singletonList(rootNode);
while (!border.isEmpty()) {
    List<Integer> ids = new ArrayList<Integer>();
    for (Node n : border) {
        ids.add(n.getId());
    }
    // initialize the children collection in all nodes in border
    // ("join fetch" loads the collection; a list is bound with setParameterList)
    session.createQuery("from Node n left join fetch n.children where n.id in (:ids)").setParameterList("ids", ids).list();
    List<Node> newBorder = new ArrayList<Node>();
    for (Node n : border) {
        newBorder.addAll(n.getChildren());
    }
    border = newBorder;
}
This will issue as many queries as there are levels in the tree, and transmit the data for each node twice. (Some databases restrict the size of an in-clause. You'd have to batch within the level, then)
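The level-by-level idea, including batching within a level and stopping before the leaf level, can be sketched without Hibernate as follows. LevelLoader, load and the fetchChildren function are hypothetical; fetchChildren stands in for whatever query returns the children of a batch of parent ids, and depth is the known number of node levels in the tree.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical loader: fetches a tree one level at a time.
class LevelLoader {
    // Returns parent id -> child ids for every non-leaf node that was queried.
    // depth counts node levels; nodes on the last level are never queried.
    static Map<Integer, List<Integer>> load(
            int rootId, int depth, int batchSize,
            Function<List<Integer>, Map<Integer, List<Integer>>> fetchChildren) {
        Map<Integer, List<Integer>> tree = new LinkedHashMap<>();
        List<Integer> border = Collections.singletonList(rootId);
        for (int level = 1; level < depth && !border.isEmpty(); level++) {
            List<Integer> next = new ArrayList<>();
            // Split the level into batches so an in-clause size limit is respected.
            for (int from = 0; from < border.size(); from += batchSize) {
                List<Integer> batch =
                        border.subList(from, Math.min(from + batchSize, border.size()));
                Map<Integer, List<Integer>> rows = fetchChildren.apply(batch);
                for (Integer id : batch) {
                    List<Integer> children = rows.getOrDefault(id, Collections.emptyList());
                    tree.put(id, children);
                    next.addAll(children);
                }
            }
            border = next;
        }
        return tree;
    }
}
```

With a large enough batch size this issues one fetch per non-leaf level; smaller batches trade more round trips for shorter in-clauses. Because the loop stops at depth - 1, no query is ever issued for the children of leaf nodes.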
You can use AOP around advice on the getChildren call that does something like this (please note this is very rough pseudocode; you will have to fill in the "blanks"):
childrenResult = node.getChildren()
if (Hibernate.isInitialized(childrenResult)) {
    return childrenResult
} else {
    // Do something else here
}
What this will do is this: when you make a call to getChildren and the collection is not initialized, the call can be ignored or not allowed to continue processing. However, if the collection is initialized, the call is allowed to continue. One thing to note about Hibernate.isInitialized is that it will return true on ALL objects except lazy-loaded collections that have not been populated yet.
If you are not able to use AOP, you could always do this check on your own call to getChildren in your code.
