I have a small dilemma I would like advice on.
I'm implementing a directed graph and I want to make it extra generic - that is, Graph<T>, where T is the data in the node (vertex).
Adding a vertex to the graph will be add(T t). The graph will wrap T in a vertex that holds T inside.
Next I would like to run DFS on the graph - now here comes my dilemma -
Should I keep the "visited" mark in the vertex (as a member) or initiate some map while running the DFS (map of vertex -> status)?
Keeping it in the vertex is less generic (the vertex shouldn't be familiar with the DFS algorithm and its implementation). But creating a map (vertex -> status) is very space-consuming.
What do you think?
Thanks a lot!
If you need to run algorithms, especially the more complex ones, you will quickly find that you have to associate all kinds of data with your vertices. Having a generic way to store data with the graph items is a good idea, and the access time for reading and writing that data should ideally be O(1). A simple implementation could use a HashMap, which has O(1) access time in most cases, but the constant factor is relatively high.
For the yFiles Graph Drawing Library, they added a mechanism where the data is actually stored at the elements themselves, but you can allocate as many data slots as you like. This is similar to managing an Object[] with each element and using an index into that data array as the "map". If your graph does not change, another strategy is to store the index of each element in the graph with the element itself (just the integer) and then use that index to index into an array, where each "data map" is basically one array the size of the number of elements. Both techniques scale very well and provide the best possible access times, unless your data is really sparse (only a fraction of the elements actually need to store the data).
The "Object[] at Elements" approach:
In your vertex and edge class, add a field of type Object[] that is package-private.
Implement your own map interface that provides T getData(Vertex) and void setData(Vertex, T)
One implementation of that interface could be backed by a HashMap<Vertex,T> but the one I was suggesting actually only stores an integer index that is used to index into the Object[] arrays at the vertices.
In your graph class, add a method createMap that keeps track of the used and free indices and creates a new instance of the above class, whose getter and setter implementations use the package-private field of the Vertex class to actually access the data (see the sketch below)
The "One Array" approach:
Add a package private integer field to your Vertex class
Keep the integer fields in sync with the order of the vertices in your graph - the first Vertex has index 0, etc.
In the alternative map implementation, you initially allocate one T[] whose size is the number of vertices.
In the getter and setter implementations you take the index of the vertex and use it to access the values in the array (see the sketch below).
For the DFS algorithm I would choose the "one array" approach, as you could use a byte[] (or, if "visited" is all that is required, even a BitSet) for space efficiency, and you are likely to populate the data for all vertices in a DFS if your graph is connected. This should perform a lot better than a HashMap-based approach and does not require boxing and unboxing to store the data in the Object[].
I'm studying for technical interviews and graphs are kind of hard for me. I'm comfortable with adjacency matrices but confused by implementations of adjacency lists.
The problem is that most implementations of adjacency lists that I see online (Example1, Example2 and Example3) don't use nodes at all. They just use a HashMap of ints and LinkedLists. Is this even correct? The definition (Wikipedia) says that a graph consists of vertices or nodes. Moreover, most implementations of graphs using an adjacency matrix use nodes as opposed to ints (Example4). Am I missing something in this puzzle?
I understand that using ints as opposed to nodes is more space-efficient, but it leads to many more complications. For example, check out this piece of code from Example1:
// add an edge from vertex v1 to vertex v2
void addEdge(int v1, int v2) {
    adj.get(v1).add(v2);
}
Its purpose is to add an edge from v1 to v2, but it completely ignores the possibility that there may be more than one vertex with the same int value, in which case addEdge() can add an edge between unintended vertices.
So are the implementations of adjacency lists in Example1, 2 and 3 wrong? If they are right, would it be bad if I implemented an adjacency list using nodes instead of ints? I don't want my interviewers to think I'm some idiot lol
You can use a Node (that contains the datatype) or use the datatype (in your examples, Integer) straight away, and both will work.
Using a Node, however, is the better choice for several reasons:
It avoids the problems you rightly mentioned with duplicate data values
Using a Node is more object-oriented. It allows the Graph class to work with any datatype that the Node holds, which makes the code more portable since the graph can work with String, Long, Integer, etc.
To take advantage of the portability I mentioned above, a Node class should be defined like this:
class Node<T> {
    T data;
}
Therefore you should always use a Node (containing the datatype) in interviews, as it looks better and shows that you care about designing proper code.
Hope it helps!
In Goodrich and Tamassia's textbook Data Structures & Algorithms in Java, the adjacency list structure implementation of the graph ADT is shown in the diagram below:
An incidence object I(u), containing the list of edges incident to Vertex u, is referenced from the Vertex u object. This is the case for every vertex in the graph.
My question is, in a Java implementation of this ADT, what is the point in a separate Incident Object, I(u)?
Why can't incident edges be stored in a field in the Vertex object? I can't see how this would be problematic, and surely it would simplify the implementation?
Why can't incident edges be stored in a field in the Vertex object?
They can, and it doesn't make a particularly big difference either way. There may be restricting implementations, such as when you have an array of primitives for your vertices, or when the vertices are simply represented by an index, i.e. there is no vertex object at all (this can be done for efficient memory usage when no object is required) - in that case, you'll need to put the incidence object elsewhere.
I can't say for sure what the authors actually meant (assuming they don't say elsewhere in the book - I didn't check), but it's entirely possible that the arrow from the vertex to the incidence object means that the Vertex class contains a reference to the incidence object (i.e. has the incidence object as a member) - in other words, the diagram already represents the way you think it should work.
Yes, it is possible, but then it would not be an adjacency list implementation.
A problem with this implementation is that when a new edge is inserted into an empty graph, it doesn't add the terminal vertices to V. The method insertEdge should call insertVertex itself, or insertVertex must be called before insertEdge.
I have a large 2D grid, x-by-y. The user of the application will add data about specific points on this grid. Unfortunately, the grid is far too big to be implemented as a large x-by-y array because the system on which this is running does not have enough memory.
What is a good way to implement this so that only the points that have data added to them are stored in memory?
My first idea was to create a BST of the data points. A combined key such as (((long) x) << 32) + y would be used to compare the nodes.
I then concluded that this could lose efficiency if not well balanced so I came up with the idea of having a BST of comparable BSTs of points. The outer BST would compare the inner BSTs based on their x values. The inner BSTs would compare the points by their y values (and they would all have the same x). So when the programmer wants to see if there is a point at (5,6), they would query the outer BST for 5. If an inner BST exists at that point then the programmer would query the inner BST for 6. The result would be returned.
Can you think of any better way of implementing this?
Edit: In regards to HashMaps: Most HashMaps require having an array for the lookup. One would say "data[hash(Point)] = Point();" to set a point and then find the Point by hashing it to find the index. The problem, however, is that the array would have to be the size of the range of the hash function. If this range is less than the total number of data points that are added, then they would either have no room or have to be added to an overflow. Because I don't know the number of points that will be added, I would have to assume that this number is less than a certain amount and then set the array to that size. Again, this instantiates a very large array (although smaller than originally, if the assumption is that there will be fewer data points than x*y). I would like the structure to scale linearly with the amount of data and not take up a large amount of space when empty.
It looks like what I want is a SparseArray, as some have mentioned. Are they implemented similarly to having a BST inside of a BST?
Edit2: Map<> is an interface. If I were to use a Map then it looks like TreeMap<> would be the best bet. So I would end up with a TreeMap of TreeMaps of Points, similar to the Map-of-Maps suggestions that people have made, which is basically a BST inside of a BST. Thanks for the info, though, because I didn't know that TreeMap<> is basically the Java SDK's implementation of a BST.
Edit3: For anyone whom it may concern, the selected answer is the best method. First, one must create a Point class that contains (x,y) and implements Comparable. The points could be compared by something like (((long) x) << 32) + y. Then one would use a TreeMap to map each Point to its data. Searching this is efficient because the map is a balanced tree, so lookups cost O(log n). The user can also query all of this data, or iterate through it, using the TreeMap.entrySet() method, which returns the set of Points along with their data.
In conclusion, this allows for the space-efficient and search-efficient implementation of a sparse array, or in my case, a 2D array, that can also be iterated through efficiently.
Either a Quadtree, a k-d-tree or an R-tree.
Store indices into the large point array in one of these spatial structures.
Such spatial structures are advantageous if the data is not equally distributed, like geographic data that concentrates in cities and has no points in the sea.
Think about whether you can forget the regular grid and stay with the quadtree.
(Why do you need a regular grid? A regular grid is usually only a simplification.)
Under no circumstances use an Object to store each point.
Such an object needs about 20 bytes just for the fact that it is an object! A bad idea for a huge data set.
An int[] x and int[] y, or a single int[] xy array, is ideal with regard to memory usage.
Consider reading Hanan Samet's "Foundations of Multidimensional Data Structures" (at least the introduction).
You could use a Map<Pair, Whatever> to store your data (you will have to write the Pair class yourself). If you need to iterate the data in some specific order, make Pair Comparable and use a NavigableMap.
One approach could be Map<Integer, Map<Integer, Data>>. The key of the outer map is the row value, and the key of the inner map is the column value. The value in that inner map (of type Data in this case) corresponds to the data at (row, column). Of course, this won't help if you're looking at doing matrix operations or such; for that you'll need sparse matrices.
Another approach is to represent the row and column as a Coordinate class or a Point class. You will need to implement equals and hashCode (which should be trivial). Then you can represent your data as Map<Point, Data> or Map<Coordinate, Data>.
You could have a list of lists of an object, and that object can encode its horizontal and vertical position:
class MyClass
{
    int x;
    int y;
    ...
}
Maybe I'm being too simplistic here, but I think you can just use a regular HashMap. It would contain custom Point objects as keys:
class Point {
    int x;
    int y;
}
Then you override the equals method (and, correspondingly, the hashCode method) to be based on x and y. That way you only store points that have some data.
I think you are on the right track to doing this in a memory-efficient way - it can be implemented fairly easily using a map of maps, wrapped in a class to give a clean interface for lookups.
An alternative (and more memory-efficient) approach would be to use a single map whose key is a tuple (x,y). However, this would be less convenient if you need to make queries like 'give me all values where x == some value'.
You might want to look at FlexCompColMatrix, CompColMatrix and other sparse matrices implementations from the Matrix toolkit project.
The performance will really depend on the write/read ratio and on the density of the matrix, but if you're using a matrix package it will be easier to experiment by switching implementations.
My suggestion is to use Commons Math, the Apache Commons Mathematics Library. It can save your day by providing the mathematical machinery your application requires.
Our class is learning about hash tables, and one of my study questions involves creating a dictionary using a hash table with separate chaining. However, the catch is that we are not allowed to use Java's provided hash table classes. Rather, our lecture notes mention that separate chaining involves each cell in an array pointing to a linked list of entries.
Thus, my understanding is that I should create an array of size n (where n is prime), and insert an empty linked list into each position in the array. Then, I use my hash function to hash strings and insert them into the corresponding linked list in the proper array position. I created my hash function, and so far my Dictionary constructor takes in a size and creates an array of that size (actually, of size 4999, both prime and large as discussed in class). Am I on the right track here? Should I now insert a new linked list into each position and then work on insert/remove methods?
What you have sounds good so far.
Bear in mind that an array of object references has each cell set to null by default, and you can write your insert and remove functions to work with that. If you choose to create a linked list object that contains no data (sometimes called a sentinel node), it may be advantageous to create a single immutable (read-only) instance to put in every empty slot, rather than creating 4,999 separate instances with new (most of which won't hold any data).
It sounds like you are on the right track.
Some extra pointers:
It's not worth creating a LinkedList in each bucket until it is actually used, so you can leave the buckets as null until something is added to them. Just remember to write your accessor functions to take account of this.
It's not always efficient to create a large array immediately. It can be better to start with a small array, keep track of the capacity used, and enlarge the array when necessary (which involves re-bucketing the values into the new array)
It's a good idea to make your class implement the whole of the Map<K,V> interface - just to get some practice implementing the other standard Java collection methods.
My question is: what are the fundamental/concrete data structures (like arrays) used to implement abstract data structures such as the various maps and trees?
I'm looking for what's really used in the Java collections, not theoretical answers.
Based on a quick code review of the Sun/Oracle JDK. You can easily find the details yourself.
Lists/queues
ArrayList
A growing Object[] elementData field. It holds 10 elements by default and grows by around 50% when it cannot hold more objects, copying the old array into a bigger new one. It does not shrink when items are removed.
LinkedList
A reference to an Entry, which in turn holds references to the actual element and to the previous and next entries (if any).
ArrayDeque
Similar to ArrayList, but it also holds two pointers into the internal E[] elements array - head and tail. Adding and removing elements at either end is just a matter of moving these pointers. The array doubles in size when it becomes too small.
Maps
HashMap
A growing Entry[] table field holding the so-called buckets. Each bucket contains a linked list of entries whose keys have the same hash modulo the table size.
TreeMap
Entry<K,V> root reference holding the root of the red-black balanced tree.
ConcurrentHashMap
Similar to HashMap, but the table is split into segments, each of which is synchronized by an independent lock.
Sets
TreeSet
Uses TreeMap underneath (!)
HashSet
Uses HashMap underneath (!)
BitSet
Uses a long[] words field to be as memory-efficient as possible. Up to 64 bits can be stored in one array element.
There is of course one answer for each implementation. Look at the javadocs, they often describe these things. http://docs.oracle.com/javase/7/docs/api/