graph representation with minimum cost (time and space) - java

I have to represent a graph in Java, but neither as an adjacency list nor as an adjacency matrix.
The basic idea is that if deg[i] is the out-degree of vertex i, then its neighbours can be stored in edges[i][j], where 1 <= j <= deg[i]. But given that edges[][] must be initialized with some values, I don't know how to make it differ from an adjacency matrix.
Any suggestions?
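For reference, a minimal sketch (with made-up example degrees) of the jagged-array layout the question describes, assuming the out-degrees are known before the edges are filled in:

// Sketch of the jagged-array layout from the question: row i holds exactly deg[i]
// neighbour ids, so total storage is V + E entries rather than V*V.
public class JaggedGraph {
    public static void main(String[] args) {
        int n = 5;                      // example vertex count (hypothetical)
        int[] deg = {2, 1, 3, 0, 2};    // example out-degrees, assumed known up front
        int[][] edges = new int[n][];
        for (int i = 0; i < n; i++) {
            edges[i] = new int[deg[i]]; // only deg[i] slots for vertex i
        }
        edges[0][0] = 2;                // e.g. vertex 0 -> 2
        edges[0][1] = 4;                // e.g. vertex 0 -> 4
        System.out.println(edges[0].length); // prints 2, not n
    }
}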

To my knowledge there are only two ways to represent a graph as a matrix:
Either use an adjacency matrix
Or use an incidence matrix
You can build an incidence matrix like this:
     E1  E2  E3  E4
V1    1   0   0   1
V2    1   1   0   0
V3    0   1   1   0
V4    0   0   1   1

You are fighting against lower bounds on this question. The two main representations of a graph are already very good for their respective uses.
An adjacency list minimizes space. You will be hard pressed to use less memory than one pointer per edge. Space: O(V + E). Search: O(V).
An adjacency matrix is very fast, at the cost of V^2 space. Space: O(V^2). Search: O(1).
So, to make something that is better for both space and time, you will have to combine ideas from both. Also realize that it will only have better practical performance; theoretically you will not improve on O(1) search or O(V + E) size.
My idea would be to store all the graph nodes in one array. Then for each Node have an adjacency list represented as a bit vector. This would essentially be a matrix like representation, but only for those nodes that exist in the graph, giving you a smaller size than a matrix. Search would be slightly improved over an Adjacency list as the query node can be tested against the bit vector.
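A hedged sketch of that combination, using java.util.BitSet as the per-node bit vector; the class and method names here are only illustrative:

import java.util.BitSet;

// Sketch: nodes stored in one array, each with a bit-vector adjacency row.
// Membership test ("is j a neighbour of i?") is O(1) via BitSet.get.
public class BitVectorGraph {
    private final BitSet[] adj;   // adj[i].get(j) == true  <=>  edge i-j exists

    public BitVectorGraph(int n) {
        adj = new BitSet[n];
        for (int i = 0; i < n; i++) {
            adj[i] = new BitSet(n);
        }
    }

    public void addEdge(int i, int j) {   // undirected for this sketch
        adj[i].set(j);
        adj[j].set(i);
    }

    public boolean hasEdge(int i, int j) {
        return adj[i].get(j);
    }

    public static void main(String[] args) {
        BitVectorGraph g = new BitVectorGraph(4);
        g.addEdge(0, 2);
        System.out.println(g.hasEdge(2, 0)); // true
        System.out.println(g.hasEdge(1, 3)); // false
    }
}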
Also check out sparse matrix representations.


How a priority queue is used with a heap to solve min distance

Please bear with me, I am very new to data structures.
I am getting confused about how a priority queue is used to solve min distance. For example, if I have a matrix and want to find the min distance from the source to the destination, I know that I would run Dijkstra's algorithm, in which a queue lets me find the distance from the source to every element in the matrix.
However, I am confused about how a heap + priority queue is used here. For example, say that I start at (1,1) on a grid and want to find the min distance to (3,3). I know how to implement the algorithm in the sense of finding the neighbours, checking the distances, and marking cells as visited. But I have read about priority queues and min heaps and want to implement that.
Right now, my only understanding is that a priority queue has a key to position elements. My issue is that when I insert the first neighbours (1,0), (0,0), (2,1), (1,2), they are inserted into the pq based on a key (which would be distance in this case). So the next search would be the element in the matrix with the shortest distance. But with the pq, how can a heap be used here with more than 2 neighbours? For example, the children of (1,1) are the 4 neighbours stated above. This seems to go against the 2*i, 2*i + 1 and i/2 indexing.
In conclusion, I don't understand how a min heap + priority queue works with finding the min of something like distance.
0 1 2 3
_ _ _ _
0 - |2|1|3|2|
1 - |1|3|5|1|
2 - |5|2|1|4|
3 - |2|4|2|1|
You need a priority queue to get the minimum weight at every move, so a MinPQ is a good fit here.
A MinPQ internally uses the heap technique to put elements in the right position, via operations such as sink() and swim().
So the MinPQ is the data structure, and it uses the heap technique internally.
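To make that concrete, here is a minimal sketch of Dijkstra on a grid with java.util.PriorityQueue, using the 4x4 weights from the question as example data and assuming 4-directional moves; the queue simply orders cells by tentative distance, no matter how many grid neighbours each cell has:

import java.util.Arrays;
import java.util.PriorityQueue;

public class GridDijkstra {
    public static void main(String[] args) {
        int[][] w = {{2,1,3,2},{1,3,5,1},{5,2,1,4},{2,4,2,1}}; // example weights from the question
        int n = w.length;
        int[][] dist = new int[n][n];
        for (int[] row : dist) Arrays.fill(row, Integer.MAX_VALUE);

        // Each queue entry is {distance, row, col}; the comparator keys on distance only.
        PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        dist[0][0] = w[0][0];
        pq.add(new int[]{dist[0][0], 0, 0});

        int[] dr = {1, -1, 0, 0};
        int[] dc = {0, 0, 1, -1};
        while (!pq.isEmpty()) {
            int[] cur = pq.poll();                 // cheapest cell seen so far
            int d = cur[0], r = cur[1], c = cur[2];
            if (d > dist[r][c]) continue;          // stale entry, skip
            for (int k = 0; k < 4; k++) {          // up to 4 grid neighbours
                int nr = r + dr[k], nc = c + dc[k];
                if (nr < 0 || nr >= n || nc < 0 || nc >= n) continue;
                int nd = d + w[nr][nc];
                if (nd < dist[nr][nc]) {
                    dist[nr][nc] = nd;
                    pq.add(new int[]{nd, nr, nc}); // push; the heap shape is the queue's business
                }
            }
        }
        System.out.println(dist[3][3]); // total weight of the cheapest path, including the start cell
    }
}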
If I'm interpreting your question correctly, you're getting stuck at this point:
But with the pq, how can a heap be used here with more than 2 neighbours? For example, the children of (1,1) are the 4 neighbours stated above. This seems to go against the 2*i, 2*i + 1 and i/2 indexing.
It sounds like what's tripping you up is that there are two separate concepts here that you may be conflating. First, there's the notion of "two places in a grid might be next to one another." In that world, you have (up to) four neighbours for each location. Next, there's the shape of the binary heap, in which each node has two children whose locations are given by certain arithmetic computations on array indices. Those are completely independent of one another - the binary heap has no idea that the items it's storing come from a grid, and the grid has no idea that there's an array where each node has two children stored at particular positions.
For example, suppose you want to store locations (0, 0), (2, 0), (-2, 0) and (0, 2) in a binary heap, and that the weights of those locations are 1, 2, 3, and 4, respectively. Then the shape of the binary heap might look like this:
        (0, 0)
       Weight 1
      /        \
  (2, 0)      (0, 2)
 Weight 2    Weight 4
    /
 (-2, 0)
 Weight 3
This tree still gives each node two children; those children just don't necessarily map back to the relative positions of nodes in the grid.
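To make the index arithmetic concrete, a tiny sketch of how that heap would sit in a plain array, assuming 1-based indexing (which is what the 2*i, 2*i + 1 and i/2 formulas refer to):

public class HeapIndexDemo {
    public static void main(String[] args) {
        // 1-based heap array for the tree above (index 0 unused):
        // [1] (0,0) w=1   [2] (2,0) w=2   [3] (0,2) w=4   [4] (-2,0) w=3
        String[] heap = {null, "(0,0) w=1", "(2,0) w=2", "(0,2) w=4", "(-2,0) w=3"};
        int i = 2;                                     // the node (2,0)
        System.out.println("parent: " + heap[i / 2]);  // (0,0) w=1
        System.out.println("left:   " + heap[2 * i]);  // (-2,0) w=3
        // the right child would sit at 2*i + 1 == 5, which is past the end: no right child
    }
}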
More generally, treat the priority queue as a black box. Imagine that it's just a magic device that says "you can give me some new thing to store" and "I can give you the cheapest thing you've given me so far," and that's it. The fact that, internally, it coincidentally happens to be implemented as a binary heap is essentially irrelevant.
Hope this helps!

Linked lists as matrix and efficiency

When a big matrix needs to be used in an algorithm, we were told to speed things up by using linked lists if the matrix is sparse. Meaning that if most of the entries share the same value, we can store only the entries that differ from that value.
But how do we identify the point where using a sparse representation is not useful anymore?
For a square matrix of length n, how do we calculate the point where we can say that the matrix has too much non-zero data to be worth storing in a linked list?
I imagine we need to use the memory sizes of an object and of a link between two objects, then use our density factor. But what are the calculations to safely say "This matrix has x% non-zero data, it is better to use a linked list"?
The answer to your question depends on what you optimize for. Do you optimize for space or time?
Let's say you optimize for space. To keep the data of a square matrix of length n, you need n*n numbers (to simplify, let's say it's an integer for each value). In the case of a linked list, you need the actual value, the coordinates of the value in the matrix, and the pointer to the next non-zero value. To simplify, let's say each of those fields is of integer size. So for a linked list, you need 4 integers per stored value (plus additional data like the head of the linked list).
IMHO, once less than 1/4 of the values in the matrix are non-zero, it's more optimal to use a linked list than an array of arrays.
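As a sketch of that back-of-the-envelope comparison, counting storage in "integer slots" exactly as in the simplification above (the numbers in main are only an example):

public class SparseBreakEven {
    // Dense storage needs n*n slots; the linked list needs ~4 per non-zero value
    // (value, row, column, next pointer), ignoring the constant overhead of the list head.
    static long denseCost(long n)      { return n * n; }
    static long listCost(long nonZero) { return 4 * nonZero; }

    public static void main(String[] args) {
        long n = 1000;
        long nonZero = 200_000;                // 20% of the 1,000,000 entries
        System.out.println(denseCost(n));      // 1000000
        System.out.println(listCost(nonZero)); // 800000 -> the list wins below 25% density
    }
}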
Obviously, there are other options to keep the matrix values; then the ratio can be different.
To optimize for time, again, it depends which operations you want to run...

Can't understand Poisson part of Hash tables from Sun documentation

I am trying to understand how HashMap is implemented in Java. I decided that I will try to understand every line (of code and comments) from that class and obviously I faced resistance very soon. The following snippet is from HashMap class and talks about Poisson Distribution:
Ideally, under random hashCodes, the frequency of
nodes in bins follows a Poisson distribution
(http://en.wikipedia.org/wiki/Poisson_distribution) with a
parameter of about 0.5 on average for the default resizing
threshold of 0.75, although with a large variance because of
resizing granularity. Ignoring variance, the expected
occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
factorial(k)). The first values are:
0: 0.60653066
1: 0.30326533
2: 0.07581633
3: 0.01263606
4: 0.00157952
5: 0.00015795
6: 0.00001316
7: 0.00000094
8: 0.00000006
more: less than 1 in ten million
I am an average guy in Math and had to understand what Poisson distribution is first. Thanks to the simple video that explained it to me.
Now even after understanding how you calculate probability using Poisson I can't understand what is described above.
Can someone please explain this in simpler language and with an example if possible? It will make my task much more interesting.
A HashMap is organized as an array of "buckets" based on the hashCode of the elements being inserted. Each bucket is (by default) a linked list of elements. Each bucket would have very few elements (ideally, at most one) so that finding a particular element requires very little searching down a linked list.
To take a simple example, let's say we have a HashMap of capacity 4 and a load factor of 0.75 (the default) which means that it can hold up to 3 elements before being resized. An ideal distribution of elements into buckets would look something like this:
bucket | elements
-------+---------
0 | Z
1 | X
2 |
3 | Y
so any element can be found immediately without any searching within a bucket. On the other hand, a very poor distribution of elements would look like this:
bucket | elements
-------+---------
0 |
1 | Z -> X -> Y
2 |
3 |
This will occur if all of the elements happen to hash into the same bucket, so searching for element Y will require traversing down the linked list.
This might not seem like a big deal, but if you have a HashMap with a capacity of 10,000 elements and there are 7,500 elements in a single bucket on a linked list, searching for a particular element will degrade to linear search time -- which is what using a HashMap is trying to avoid.
One issue is that the hashCode for distributing elements into buckets is determined by the objects themselves, and objects' hashCode implementations aren't always very good. If the hashCode isn't very good, then elements can bunch up in certain buckets, and the HashMap will begin to perform poorly.
The comment from the code is talking about the likelihood of different lengths of linked lists appearing in each bucket. First, it assumes the hashCodes are randomly distributed -- which isn't always the case! -- and I think it also assumes that the number of elements in the HashMap is 50% of the number of buckets. Under these assumptions, according to that Poisson distribution, 60.6% of the buckets will be empty, 30.3% will have one element, 7.5% will have two elements, 1.2% will have three elements, and so forth.
In other words, given those (ideal) assumptions, the linked lists within each bucket will usually be very short.
In JDK 8 there is an optimization to turn a linked list into a tree above a certain threshold size, so that at least performance degrades to O(log n) instead of O(n) in the worst case. The question is, what value should be chosen as the threshold? That's what this discussion is all about. The current threshold value TREEIFY_THRESHOLD is 8. Again, under these ideal assumptions, a bucket with a linked list of length 8 will occur only 0.000006% of the time. So if we get a linked list that long, something is clearly not ideal!! It may mean, for instance, that the objects being stored have exceptionally bad hashCodes, so the HashMap has to switch from a linked list to a tree in order to avoid excessive performance degradation.
The link to the source file with the comment in question is here:
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/jdk8-b119/src/share/classes/java/util/HashMap.java
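For reference, a short sketch that reproduces the table in the comment from the formula exp(-0.5) * pow(0.5, k) / factorial(k):

public class PoissonBucketSizes {
    public static void main(String[] args) {
        double lambda = 0.5;             // expected entries per bucket at ~0.5 load
        double p = Math.exp(-lambda);    // the k = 0 term
        for (int k = 0; k <= 8; k++) {
            System.out.printf("%d: %.8f%n", k, p);
            p = p * lambda / (k + 1);    // next term: multiply by lambda / (k + 1)
        }
    }
}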
The accepted answer is great, but I just wanted to fill in why it's reasonable to use a Poisson distribution in particular, since I had the exact same question when reading that piece of code.
In the case that we have a fixed number of items k being inserted into a fixed number of buckets n, the number of items in a fixed bucket follows a Binomial distribution with k trials and probability of success 1/n. This is pretty easy to see: if the hash is random, then each item is put into our bucket with probability 1/n, and there are k items.
When k is large and the mean of the Binomial distribution is small, a good approximation is a Poisson distribution with the same mean.
In this case the mean is k/n, the load factor of the hash table. Taking 0.5 for the mean is reasonable because the table tolerates a load factor of at most 0.75 before resizing, so the table will be used a great deal with a load factor of around 0.5.
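A small sketch of that approximation, comparing Binomial(k, 1/n) with Poisson(k/n) for an illustrative k = 1000 items spread over n = 2000 buckets (so the mean is 0.5):

public class BinomialVsPoisson {
    public static void main(String[] args) {
        int k = 1000;                 // items inserted (illustrative)
        int n = 2000;                 // buckets (illustrative), so the mean is k/n = 0.5
        double p = 1.0 / n;
        double lambda = (double) k / n;
        for (int j = 0; j <= 5; j++) {
            double binom = choose(k, j) * Math.pow(p, j) * Math.pow(1 - p, k - j);
            double pois  = Math.exp(-lambda) * Math.pow(lambda, j) / factorial(j);
            System.out.printf("%d: binomial=%.8f  poisson=%.8f%n", j, binom, pois);
        }
    }

    static double choose(int n, int r) {        // n-choose-r as a double, fine for small r
        double c = 1;
        for (int i = 0; i < r; i++) c = c * (n - i) / (i + 1);
        return c;
    }

    static double factorial(int r) {
        double f = 1;
        for (int i = 2; i <= r; i++) f *= i;
        return f;
    }
}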

Best algorithm to get all combination pairs of nodes in an undirected graph (need to improve time complexity)

I have an undirected graph A that has: no multiple edges between any two nodes, no self-loops, and possibly some isolated nodes (nodes with degree 0).
I need to go through all possible combinations of pairs of nodes in graph A to assign a score to each non-existing link (if my graph has k nodes and n links, the number of such combinations is k*(k-1)/2 - n). The score I assign is based on the common neighbour nodes of the two nodes in the pair.
Ex: the score between A-D should be 1, the score between G-D should be 0 ...
The biggest problem is that my graph has more than 100,000 nodes, and my first attempt, which handled almost 10^10 combinations directly, was too slow.
My second thought is that since the score is based on the common neighbours of a node, I might only need to look at the neighbours to assign the scores that are different from 0. The rest can be set to 0 with no computation. But this could possibly repeat a combination.
Any idea on how to approach this is appreciated. Please keep in mind that the actual network has more than 100,000 nodes.
If you represent your graph as an adjacency list (rather than an adjacency matrix), you can make use of the fact that your graph has only 600,000 edges to hopefully reduce the computation time.
Let's take a node V[j] with neighbors V[i] and V[k]:
V[i] ---- V[j] ---- V[k]
To find all such pairs of neighbors you can take the list of nodes adjacent to V[j] and find all combinations of those nodes. To avoid duplicates you will have to generate the combinations rather than the permutations of the end nodes V[i] and V[k] by requiring that i < k.
Alternatively, you can start with node V[i] and find all of the nodes that have a distance of 2 from V[i]. Let S be the set of all the nodes adjacent to V[i]. For each node V[j] in S, create a path V[i]-V[j]-V[k] where:
V[k] is adjacent to V[j]
V[k] is not an element of S (to avoid directly connected nodes)
k != i (to avoid cycles)
k > i (to avoid duplications)
I personally like this approach better because it completes the adjacency list for one node before moving on to the next.
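A hedged sketch of this second approach in Java, assuming the graph is given as adjacency lists; the score here is simply the number of common neighbours, and the pair encoding is just one convenient choice:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CommonNeighbourScores {
    // Scores every non-adjacent pair (i, k) that shares at least one neighbour;
    // all other pairs implicitly score 0 and are never generated.
    static Map<Long, Integer> score(List<List<Integer>> adj) {
        int n = adj.size();
        Map<Long, Integer> scores = new HashMap<>();
        for (int i = 0; i < n; i++) {
            Set<Integer> direct = new HashSet<>(adj.get(i));      // S: neighbours of V[i]
            for (int j : adj.get(i)) {
                for (int k : adj.get(j)) {
                    if (k <= i) continue;                         // avoid duplicates and k == i
                    if (direct.contains(k)) continue;             // skip directly connected nodes
                    long key = (long) i * n + k;                  // encode the pair (i, k)
                    scores.merge(key, 1, Integer::sum);           // one more common neighbour j
                }
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // tiny example: edges 0-1 and 1-2, so the only scored pair is (0, 2) with score 1
        List<List<Integer>> adj = new ArrayList<>();
        for (int v = 0; v < 3; v++) adj.add(new ArrayList<>());
        adj.get(0).add(1); adj.get(1).add(0);
        adj.get(1).add(2); adj.get(2).add(1);
        System.out.println(score(adj)); // {2=1}  i.e. the pair (0,2) -> score 1
    }
}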
Given that you have ~600,000 edges in a graph with ~100,000 nodes, assuming an even distribution of edges across all of the nodes each node would have an average degree of 12. The number of possible paths for each node is then on the order of 10^2. Over 10^5 nodes that gives on the order of 10^7 total paths rather than the theoretical limit of 10^10 for a complete graph. Still a large number, but a thousand times faster than before.

Cosine Similarity of Vectors, with < O(n^2) complexity

Having looked around this site for similar issues, I found this: http://math.nist.gov/javanumerics/jama/ and this: http://sujitpal.blogspot.com/2008/09/ir-math-with-java-similarity-measures.html
However, it seems these run in O(n^2). I've been doing some document clustering and noticed this level of complexity wasn't feasible when dealing with even small document sets. Given that, for the dot product, we only need the vector terms contained in both vectors, it should be possible to put the vectors in a tree and thus compute the dot product with n log n complexity, where n is the smaller number of unique terms of the two documents.
Am I missing something? Is there a java library which does this?
thanks
If you store the vector elements in a hashtable, lookup is only O(1) on average anyway, no? Loop over all keys in the smaller document and see if they exist in the larger one..?
Hashmap is good, but it might take a lot of memory.
If your vectors are stored as key-value pairs sorted by key, then vector multiplication can be done in O(n): you just have to iterate in parallel over both vectors (the same iteration is used e.g. in the merge sort algorithm). The pseudocode for the multiplication:
i = 0
j = 0
result = 0
while i < length(vec1) && j < length(vec2):
    if vec1[i].key == vec2[j].key:
        result = result + vec1[i].value * vec2[j].value
        i = i + 1
        j = j + 1
    else if vec1[i].key < vec2[j].key:
        i = i + 1
    else:
        j = j + 1
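A Java version of the same idea, assuming each sparse vector is stored as parallel arrays of sorted term keys and values (these names are made up for illustration):

public class SparseDot {
    // keys1/keys2 must be sorted ascending; values are aligned with the keys.
    static double dot(int[] keys1, double[] vals1, int[] keys2, double[] vals2) {
        int i = 0, j = 0;
        double result = 0;
        while (i < keys1.length && j < keys2.length) {
            if (keys1[i] == keys2[j]) {
                result += vals1[i] * vals2[j];
                i++;                    // advance both sides on a matching key
                j++;
            } else if (keys1[i] < keys2[j]) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[]    k1 = {1, 4, 7};   double[] v1 = {2.0, 1.0, 3.0};
        int[]    k2 = {4, 5, 7};   double[] v2 = {5.0, 9.0, 1.0};
        System.out.println(dot(k1, v1, k2, v2)); // 1*5 + 3*1 = 8.0
    }
}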
If you are planning on using cosine similarity as a way of finding clusters of similar documents, you may want to consider looking into locality-sensitive hashing, a hash-based approach that was designed specifically with this in mind. Intuitively, LSH hashes the vectors in a way that with high probability places similar elements into the same bucket and distant elements into different buckets. There are LSH schemes that use cosine similarity as their underlying distance, so to find clusters you use LSH to drop things into buckets and then only compute the pairwise distances of elements in the same bucket. In the worst case this will be quadratic (if everything falls in the same bucket), but it's much more likely that you'll have a significant dropoff in work.
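A minimal sketch of the random-hyperplane flavour of LSH for cosine similarity (names and the sample vectors are made up; the signature is the sign pattern of dot products with random vectors, so nearly parallel vectors tend to share a signature and land in the same bucket):

import java.util.Random;

public class CosineLsh {
    // Bits of the signature come from the signs of dot products
    // with numBits random Gaussian hyperplanes.
    static int signature(double[] vec, double[][] planes) {
        int sig = 0;
        for (int b = 0; b < planes.length; b++) {
            double dot = 0;
            for (int d = 0; d < vec.length; d++) dot += vec[d] * planes[b][d];
            if (dot >= 0) sig |= (1 << b);     // one sign bit per hyperplane
        }
        return sig;
    }

    public static void main(String[] args) {
        int dim = 4, numBits = 8;
        Random rnd = new Random(42);
        double[][] planes = new double[numBits][dim];
        for (double[] p : planes)
            for (int d = 0; d < dim; d++) p[d] = rnd.nextGaussian();

        double[] a = {1.0, 2.0, 0.0, 0.5};
        double[] b = {1.1, 1.9, 0.0, 0.4};     // nearly parallel to a
        double[] c = {-1.0, 0.0, 3.0, -2.0};   // points in a very different direction
        System.out.println(signature(a, planes) == signature(b, planes)); // very likely true
        System.out.println(signature(a, planes) == signature(c, planes)); // very likely false
    }
}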
Hope this helps!
If you only want to recommend a limited number of items, say m items, for every item in a set of size n, the complexity need not be n^2 but m*n. Since m is a constant, the complexity is linear.
You can check out the project simbase https://github.com/guokr/simbase , it is a vector similarity nosql database.
Simbase uses the concepts below:
Vector set: a set of vectors
Basis: the basis for the vectors; vectors in one vector set have the same basis
Recommendation: a one-direction binary relationship between two vector sets which have the same basis
