Cosine Similarity of Vectors, with < O(n^2) complexity - java

Having looked around this site for similar issues, I found this: http://math.nist.gov/javanumerics/jama/ and this: http://sujitpal.blogspot.com/2008/09/ir-math-with-java-similarity-measures.html
However, it seems these run in O(n^2). I've been doing some document clustering and found this level of complexity isn't feasible even for small document sets. Since the dot product only needs the terms that appear in both vectors, it should be possible to put the vectors in a tree and compute the dot product in O(n log n), where n is the smaller number of unique terms in either of the two documents.
Am I missing something? Is there a java library which does this?
thanks

If you store the vector elements in a hashtable, lookup is only O(1) on average anyway, no? Loop over all keys in the smaller document and see if they exist in the larger one.
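A minimal sketch of that approach, assuming the vectors are stored as term-to-weight maps (the class and method names here are just illustrative, not from any library):

import java.util.Map;

public class SparseDot {

    // Dot product of two sparse term vectors stored as term -> weight maps.
    // Iterates over the smaller map and probes the larger one, so the cost is
    // O(min(|a|, |b|)) expected lookups.
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double w = large.get(e.getKey());
            if (w != null) {
                sum += e.getValue() * w;
            }
        }
        return sum;
    }

    static double norm(Map<String, Double> v) {
        double s = 0.0;
        for (double w : v.values()) {
            s += w * w;
        }
        return Math.sqrt(s);
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        return dot(a, b) / (norm(a) * norm(b));
    }
}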

Hashmap is good, but it might take a lot of memory.
If your vectors are stored as key-value pairs sorted by key, then vector multiplication can be done in O(n): you just have to iterate in parallel over both vectors (the same kind of iteration is used, e.g., in the merge step of merge sort). The pseudocode for multiplication:
i = 0
j = 0
result = 0
while i < length(vec1) && j < length(vec2):
    if vec1[i].key == vec2[j].key:
        result = result + vec1[i].value * vec2[j].value
        i = i + 1
        j = j + 1
    else if vec1[i].key < vec2[j].key:
        i = i + 1
    else:
        j = j + 1
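A Java sketch of the same merge-style walk, assuming each vector is kept as a sorted array of term ids with a parallel array of weights (this representation is just an assumption for illustration; drop the method into any class):

// Merge-style dot product over two sparse vectors, each given as a sorted
// array of term ids with a parallel array of weights. Runs in O(n1 + n2).
static double dotSorted(int[] keys1, double[] vals1, int[] keys2, double[] vals2) {
    double result = 0.0;
    int i = 0, j = 0;
    while (i < keys1.length && j < keys2.length) {
        if (keys1[i] == keys2[j]) {
            result += vals1[i] * vals2[j];
            i++;
            j++;
        } else if (keys1[i] < keys2[j]) {
            i++;
        } else {
            j++;
        }
    }
    return result;
}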

If you are planning on using cosine similarity as a way of finding clusters of similar documents, you may want to consider looking into locality-sensitive hashing, a hash-based approach that was designed specifically with this in mind. Intuitively, LSH hashes the vectors in a way that with high probability places similar elements into the same bucket and distant elements into different buckets. There are LSH schemes that use cosine similarity as their underlying distance, so to find clusters you use LSH to drop things into buckets and then only compute the pairwise distances of elements in the same bucket. In the worst case this will be quadratic (if everything falls in the same bucket), but it's much more likely that you'll have a significant dropoff in work.
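As a rough illustration of the random-hyperplane style of LSH that targets cosine similarity (the class and parameter names below are made up for this sketch, not taken from any particular library):

import java.util.Random;

public class CosineLsh {

    private final double[][] hyperplanes;  // one random hyperplane per signature bit

    // Assumes numBits <= 32 so the signature fits in an int.
    public CosineLsh(int numBits, int dim, long seed) {
        Random rnd = new Random(seed);
        hyperplanes = new double[numBits][dim];
        for (int b = 0; b < numBits; b++) {
            for (int d = 0; d < dim; d++) {
                hyperplanes[b][d] = rnd.nextGaussian();
            }
        }
    }

    // Each bit records on which side of a random hyperplane the vector falls.
    // Vectors with high cosine similarity agree on most bits, so their
    // signatures tend to land in the same bucket.
    public int signature(double[] v) {
        int sig = 0;
        for (int b = 0; b < hyperplanes.length; b++) {
            double dot = 0.0;
            for (int d = 0; d < v.length; d++) {
                dot += hyperplanes[b][d] * v[d];
            }
            if (dot >= 0) {
                sig |= 1 << b;
            }
        }
        return sig;
    }
}

Grouping vectors by signature (or by bands of signature bits) gives the candidate buckets whose pairwise similarities you then compute exactly.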
Hope this helps!

If you only want to recommend a limited number of items, say m items, for every item in a set of size n, the complexity need not be n^2 but m*n. Since m is a constant, the complexity is linear.
You can check out the project simbase (https://github.com/guokr/simbase), a vector-similarity NoSQL database.
Simbase uses the following concepts:
Vector set: a set of vectors
Basis: the basis for the vectors; vectors in one vector set share the same basis
Recommendation: a one-directional binary relation between two vector sets that have the same basis

Related

Time complexity of traversing a trie

Would it be O(26n) where 26 is the number of letters of the alphabet and n is the number of levels of the trie? For example this is the code to print a trie:
public void print()
{
    // Recursively visit each non-null child subtree first...
    for (int i = 0; i < 26; i++)
    {
        if (this.next[i] != null)
        {
            this.next[i].print();
        }
    }
    // ...then print the word stored at this node, if any.
    if (this.word != null)
    {
        System.out.println(this.word.getWord());
    }
}
So looking at this code makes me think my approximation of the time complexity is correct, in the worst case where all 26 children are populated for n levels.
Would it be O(26n) where 26 is the number of letters of the alphabet and n is the number of levels of the trie?
No. Each node in the trie must be visited, and O(1) work performed for each one (ignoring the work attributable to processing the children, which is accounted separately). The number of children does not matter on a per-node basis, as long as it is bounded by a constant (e.g. 26).
How many nodes are there altogether? Generally more than the number of words stored in the trie, and possibly a lot more. For a naively implemented, perfectly balanced, complete trie with n levels below the root, each level has 26 times as many nodes as the previous one, so the total number of nodes is 1 + 26 + 26^2 + ... + 26^n. That is O(26^(n+1)) == O(26^n), i.e. "exponential" in the number of levels, which also corresponds to the length of the longest word stored within.
But one is more likely to be interested in measuring the complexity in terms of the number of words stored in the trie. With a careful implementation, it is possible to have nodes only for those words and for each maximal initial substring that is common to two or more of those words. In that event, every node has either zero children or at least two, so for w total words, the total number of nodes is bounded by w + w/2 + w/4 + ..., which converges to 2w. Therefore, a traversal of a trie with those structural properties costs O(2w) == O(w).
Moreover, with a little more thought, it is possible to conclude that the particular structural property I described is not really necessary to have O(w) traversal.
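To make the node-visiting argument concrete, here is a small illustrative trie with a map of children (this is a sketch, not the asker's exact class):

import java.util.HashMap;
import java.util.Map;

class TrieNode {
    final Map<Character, TrieNode> children = new HashMap<>();
    String word;  // non-null only if a word ends at this node

    void insert(String w) {
        TrieNode node = this;
        for (char c : w.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.word = w;
    }

    // Visits every node exactly once and does O(1) work per node (plus the
    // cost of printing), so the traversal is linear in the number of nodes.
    void print() {
        for (TrieNode child : children.values()) {
            child.print();
        }
        if (word != null) {
            System.out.println(word);
        }
    }
}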
I am not familiar with tries, but big O notation mainly depicts approximately how quickly the running time or resource consumption grows relative to the input size. The way I think of it, it refers to the general shape of the curve on a graph rather than to exact points on the graph. O(1) looks like a flat line, while O(n) looks like a line at a 45-degree angle, and so on.
source: https://medium.com/dataseries/how-to-calculate-time-complexity-with-big-o-notation-9afe33aa4c46
Now for the algorithm in the question. At first glance I would say it is O(1) (constant time), because the number of iterations of the loop is constant (always 26). However, the loop calls this.next[i].print(), which could completely change the answer depending on its complexity, and it uncovers an important question we need to answer: what is n?
I am going to assume that this.next[i] is of the same type as this, making this.next[i].print() essentially a recursive call. In that case the time it takes to finish executing depends entirely on the number of instances that have to be visited. The algorithm resembles a depth-first search but does not guard against infinite recursion; presumably that is justified by additional knowledge about the next[i] instances (nodes), such as each instance being referenced by at most one other instance. Under that assumption the runtime complexity is on the order of O(n), where n is the number of instances (nodes).
... assuming that this.word.getWord() runs in constant time as well. If it depends on the word itself, the runtime could instead be O(n * w), where n is the number of nodes and w is the length of the words.

Why is time complexity of get() in HashMap O(1) [duplicate]

I've seen some interesting claims on SO regarding Java HashMaps and their O(1) lookup time. Can someone explain why this is so? Unless these hashmaps are vastly different from any of the hashing algorithms I was brought up on, there must always exist a dataset that contains collisions.
In which case, the lookup would be O(n) rather than O(1).
Can someone explain whether they are O(1) and, if so, how they achieve this?
A particular feature of a HashMap is that, unlike, say, balanced trees, its behavior is probabilistic. In these cases it's usually most helpful to talk about complexity in terms of the probability of a worst-case event occurring. For a hash map, that of course is a collision, with respect to how full the map happens to be. A collision is pretty easy to estimate.
p_collision = n / capacity
So a hash map with even a modest number of elements is pretty likely to experience at least one collision. Big O notation allows us to do something more compelling. Observe that for any arbitrary, fixed constant k:
O(n) = O(k * n)
We can use this feature to improve the performance of the hash map. We could instead think about the probability of at most 2 collisions.
p_(collision x 2) = (n / capacity)^2
This is much lower. Since the cost of handling one extra collision is irrelevant to Big O performance, we've found a way to improve performance without actually changing the algorithm! We can generalize this to:
p_(collision x k) = (n / capacity)^k
And now we can disregard some arbitrary number of collisions and end up with vanishingly tiny likelihood of more collisions than we are accounting for. You could get the probability to an arbitrarily tiny level by choosing the correct k, all without altering the actual implementation of the algorithm.
We talk about this by saying that the hash map has O(1) access with high probability.
You seem to mix up worst-case behaviour with average-case (expected) runtime. The former is indeed O(n) for hash tables in general (i.e. not using a perfect hashing) but this is rarely relevant in practice.
Any dependable hash table implementation, coupled with a half decent hash, has a retrieval performance of O(1) with a very small factor (2, in fact) in the expected case, within a very narrow margin of variance.
In Java, how does HashMap work?
It uses hashCode to locate the corresponding bucket (inside the internal array of buckets).
Each bucket is a linked list (or, starting from Java 8, a balanced red-black tree under some conditions) of the items residing in that bucket.
The items are scanned one by one, using equals for comparison.
When adding more items, the HashMap is resized (doubling the size) once a certain load percentage is reached.
So, sometimes it will have to compare against a few items, but generally, it's much closer to O(1) than O(n) / O(log n).
For practical purposes, that's all you should need to know.
Remember that O(1) does not mean that each lookup only examines a single item - it means that the average number of items checked remains constant w.r.t. the number of items in the container. So if it takes on average 4 comparisons to find an item in a container with 100 items, it should also take an average of 4 comparisons to find an item in a container with 10,000 items, and for any other number of items (there's always a bit of variance, especially around the points at which the hash table rehashes, and when there's a very small number of items).
So collisions don't prevent the container from having O(1) operations, as long as the average number of keys per bucket remains within a fixed bound.
I know this is an old question, but there's actually a new answer to it.
You're right that a hash map isn't really O(1), strictly speaking, because as the number of elements gets arbitrarily large, eventually you will not be able to search in constant time (and O-notation is defined in terms of numbers that can get arbitrarily large).
But it doesn't follow that the real time complexity is O(n)--because there's no rule that says that the buckets have to be implemented as a linear list.
In fact, Java 8 implements the buckets as balanced trees once they exceed a threshold, which makes the worst-case time O(log n).
Lookup is O(1 + n/k), where k is the number of buckets.
If the implementation sets k = n/alpha, then it is O(1 + alpha) = O(1), since alpha is a constant.
If the number of buckets (call it b) is held constant (the usual case), then lookup is actually O(n).
As n gets large, the number of elements in each bucket averages n/b. If collision resolution is done in one of the usual ways (linked list for example), then lookup is O(n/b) = O(n).
The O notation is about what happens when n gets larger and larger. It can be misleading when applied to certain algorithms, and hash tables are a case in point. We choose the number of buckets based on how many elements we're expecting to deal with. When n is about the same size as b, then lookup is roughly constant-time, but we can't call it O(1) because O is defined in terms of a limit as n → ∞.
Elements inside the HashMap are stored as an array of linked lists (nodes); each linked list in the array represents a bucket for the unique hash value of one or more keys.
When adding an entry to the HashMap, the hashcode of the key is used to determine the location of the bucket in the array, something like:
location = (arraylength - 1) & keyhashcode
Here the & represents bitwise AND operator.
For example: 100 & "ABC".hashCode() = 64 (location of the bucket for the key "ABC")
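A tiny sketch of that index computation (note that the real HashMap also spreads the high bits of the hash before masking, and keeps table lengths at powers of two; the raw hashCode and the value 100 are used here purely to mirror the example above):

public class BucketIndexDemo {
    public static void main(String[] args) {
        int lengthMinusOne = 100;        // plays the role of (arraylength - 1) in the formula
        int hash = "ABC".hashCode();     // 64578
        int location = lengthMinusOne & hash;
        System.out.println(location);    // prints 64
    }
}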
During the get operation it uses the same approach to determine the bucket location for the key. In the best case, each key has a unique hashcode that results in a unique bucket, so the get method spends time only on determining the bucket location and retrieving the value, which is constant, O(1).
In the worst case, all keys have the same hashcode and are stored in the same bucket; this results in traversing the entire list, which leads to O(n).
In Java 8, the linked-list bucket is replaced with a tree if it grows to more than 8 entries, which reduces the worst-case search cost to O(log n).
We've established that the standard description of hash table lookups being O(1) refers to the average-case expected time, not the strict worst-case performance. For a hash table resolving collisions with chaining (like Java's hashmap) this is technically O(1+α) with a good hash function, where α is the table's load factor. Still constant as long as the number of objects you're storing is no more than a constant factor larger than the table size.
It's also been explained that strictly speaking it's possible to construct input that requires O(n) lookups for any deterministic hash function. But it's also interesting to consider the worst-case expected time, which is different than average search time. Using chaining this is O(1 + the length of the longest chain), for example Θ(log n / log log n) when α=1.
If you're interested in theoretical ways to achieve constant time expected worst-case lookups, you can read about dynamic perfect hashing which resolves collisions recursively with another hash table!
It is O(1) only if your hashing function is very good. The Java hash table implementation does not protect against bad hash functions.
Whether you need to grow the table when you add items or not is not relevant to the question because it is about lookup time.
This basically goes for most hash table implementations in most programming languages, as the algorithm itself doesn't really change.
If there are no collisions present in the table, you only have to do a single look-up, therefore the running time is O(1). If there are collisions present, you have to do more than one look-up, which drives down the performance towards O(n).
It depends on the algorithm you choose to avoid collisions. If your implementation uses separate chaining, then the worst-case scenario happens when every data element is hashed to the same value (a poor choice of hash function, for example). In that case, data lookup is no different from a linear search over a linked list, i.e. O(n). However, the probability of that happening is negligible, and the best and average cases for lookups remain constant, i.e. O(1).
Strictly, O(1) holds only in the theoretical case where hashcodes are always different and every hash code maps to its own bucket. Otherwise it is of constant order in practice, i.e. as the hashmap grows, its search cost remains roughly constant.
Academics aside, from a practical perspective, HashMaps should be accepted as having an inconsequential performance impact (unless your profiler tells you otherwise.)
Of course the performance of the hashmap will depend on the quality of the hashCode() function for the given object. However, if the function is implemented such that the possibility of collisions is very low, it will have very good performance (this is not strictly O(1) in every possible case, but it is in most cases).
For example, the default implementation in the Oracle JRE uses a random number (which is stored in the object instance so that it doesn't change, although it also disables biased locking, but that's another discussion), so the chance of collisions is very low.

Can't understand Poisson part of Hash tables from Sun documentation

I am trying to understand how HashMap is implemented in Java. I decided that I will try to understand every line (of code and comments) from that class and obviously I faced resistance very soon. The following snippet is from HashMap class and talks about Poisson Distribution:
Ideally, under random hashCodes, the frequency of
nodes in bins follows a Poisson distribution
(http://en.wikipedia.org/wiki/Poisson_distribution) with a
parameter of about 0.5 on average for the default resizing
threshold of 0.75, although with a large variance because of
resizing granularity. Ignoring variance, the expected
occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
factorial(k)). The first values are:
0: 0.60653066
1: 0.30326533
2: 0.07581633
3: 0.01263606
4: 0.00157952
5: 0.00015795
6: 0.00001316
7: 0.00000094
8: 0.00000006
more: less than 1 in ten million
I am an average guy at math and first had to understand what a Poisson distribution is. Thanks to the simple video that explained it to me.
Now even after understanding how you calculate probability using Poisson I can't understand what is described above.
Can someone please explain this in simpler language and with an example if possible? It will make my task much more interesting.
A HashMap is organized as an array of "buckets" based on the hashCode of the elements being inserted. Each bucket is (by default) a linked list of elements. Each bucket would have very few elements (ideally, at most one) so that finding a particular element requires very little searching down a linked list.
To take a simple example, let's say we have a HashMap of capacity 4 and a load factor of 0.75 (the default) which means that it can hold up to 3 elements before being resized. An ideal distribution of elements into buckets would look something like this:
bucket | elements
-------+---------
0 | Z
1 | X
2 |
3 | Y
so any element can be found immediately without any searching within a bucket. On the other hand, a very poor distribution of elements would look like this:
bucket | elements
-------+---------
0 |
1 | Z -> X -> Y
2 |
3 |
This will occur if all of the elements happen to hash into the same bucket, so searching for element Y will require traversing down the linked list.
This might not seem like a big deal, but if you have a HashMap with a capacity of 10,000 elements and there are 7,500 elements in a single bucket on a linked list, searching for a particular element will degrade to linear search time -- which is what using a HashMap is trying to avoid.
One issue is that the hashCode for distributing elements into buckets is determined by the objects themselves, and objects' hashCode implementations aren't always very good. If the hashCode isn't very good, then elements can bunch up in certain buckets, and the HashMap will begin to perform poorly.
The comment from the code is talking about the likelihood of different lengths of linked lists appearing in each bucket. First, it assumes the hashCodes are randomly distributed -- which isn't always the case! -- and I think it also assumes that the number of elements in the HashMap is 50% of the number of buckets. Under these assumptions, according to that Poisson distribution, 60.6% of the buckets will be empty, 30.3% will have one element, 7.5% will have two elements, 1.2% will have three elements, and so forth.
In other words, given those (ideal) assumptions, the linked lists within each bucket will usually be very short.
In JDK 8 there is an optimization to turn a linked list into a tree above a certain threshold size, so that at least performance degrades to O(log n) instead of O(n) in the worst case. The question is, what value should be chosen as the threshold? That's what this discussion is all about. The current threshold value TREEIFY_THRESHOLD is 8. Again, under these ideal assumptions, a bucket with a linked list of length 8 will occur only 0.000006% of the time. So if we get a linked list that long, something is clearly not ideal!! It may mean, for instance, that the objects being stored have exceptionally bad hashCodes, so the HashMap has to switch from a linked list to a tree in order to avoid excessive performance degradation.
The link to the source file with the comment in question is here:
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/jdk8-b119/src/share/classes/java/util/HashMap.java
The accepted answer is great but I just wanted to fill in why it's reasonable to use a Poisson distribution in particular since I had the exact same question when reading that piece of code.
In the case where we have a fixed number of items k being inserted into a fixed number of buckets n, the number of items in a given bucket should follow a binomial distribution with k trials and probability of success 1/n. This is pretty easy to see: if the hash is random, then each item is put into our bucket with probability 1/n, and there are k items.
When k is large and the mean of the binomial distribution is small, a good approximation is a Poisson distribution with the same mean.
In this case the mean is k/n, i.e. the load factor of the hash table. Taking 0.5 for the mean is reasonable because the table tolerates a load factor of at most 0.75 before resizing, so the table will be used a great deal with a load factor of around 0.5.
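For reference, a quick sketch that reproduces the numbers quoted in the HashMap comment using the formula exp(-0.5) * pow(0.5, k) / factorial(k):

public class PoissonBins {
    public static void main(String[] args) {
        double lambda = 0.5;      // expected entries per bin at ~50% load
        double factorial = 1.0;
        for (int k = 0; k <= 8; k++) {
            if (k > 0) {
                factorial *= k;   // running k!
            }
            double p = Math.exp(-lambda) * Math.pow(lambda, k) / factorial;
            System.out.printf("%d: %.8f%n", k, p);   // matches 0.60653066, 0.30326533, ...
        }
    }
}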

graph representation with minimum cost (time and space)

I have to represent a graph in Java, but neither as an adjacency list nor as an adjacency matrix.
The basic idea is that if deg[i] is the exit degree of vertex i, then its neighbors can be stored in edges[i][j] where i <= j <= deg[i]. But given that edges[][] must be initialized with some values, I don't know how to make it differ from an adjacency matrix.
Any suggestions?
To my knowledge there are only two ways to represent a graph in a language:
Either use an adjacency matrix
Or use an incidence matrix
You can make an incidence matrix with one row per vertex and one column per edge, where an entry marks that the vertex is an endpoint of that edge, for example:
     E1  E2  E3  E4
V1    1   0   1   0
V2    1   1   0   0
V3    0   1   1   1
V4    0   0   0   1
You are fighting against lower bounds on this question. The two main representations of a graph are already very good for their respective uses.
Adjacency list minimizes space. You will be hard pressed to use less memory than one pointer per edge. Space: O(V+E). Search: O(V).
Adjacency matrix is very fast, at the cost of V^2 space. Space: O(V^2). Search: O(1).
So, to make something that is better for both space and time you will have to combine ideas from both. Also realize that it will only have better practical performance; theoretically you will not improve on O(1) search or O(V+E) size.
My idea would be to store all the graph nodes in one array. Then for each node have an adjacency list represented as a bit vector. This would essentially be a matrix-like representation, but only for those nodes that exist in the graph, giving you a smaller size than a full matrix. Search would be slightly improved over an adjacency list, since the query node can be tested against the bit vector.
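A rough sketch of that idea using java.util.BitSet for the per-node adjacency bits (the class and method names are just illustrative):

import java.util.BitSet;

public class BitSetGraph {
    private final BitSet[] adjacency;  // adjacency[i].get(j) == true iff edge i -> j exists

    public BitSetGraph(int numNodes) {
        adjacency = new BitSet[numNodes];
        for (int i = 0; i < numNodes; i++) {
            adjacency[i] = new BitSet(numNodes);
        }
    }

    public void addEdge(int from, int to) {
        adjacency[from].set(to);
    }

    // Constant-time membership test against the bit vector.
    public boolean hasEdge(int from, int to) {
        return adjacency[from].get(to);
    }
}

A BitSet uses roughly one bit per potential neighbor, so for very sparse graphs a sorted int array or a hash set per node may still be smaller.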
Also check out sparse matrix representations.

Java: distance metric algorithm design

I am trying to work out the following problem in Java (although it could be done in pretty much any other language):
I am given two arrays of integer values, xs and ys, representing data points on the x-axis. Their lengths might not be identical, though both are > 0, and they need not be sorted. What I want to calculate is a minimum distance measure between the two sets of points. What I mean by that is, for each x I find the closest y in the set of ys and compute a distance, for instance (x-y)^2. For instance:
xs = [1,5]
ys = [10,4,2]
should return (1-2)^2 + (5-4)^2 + (5-10)^2
The distance measure is not important; it's the algorithm I am interested in. I was thinking about sorting both arrays and advancing indexes in both of them somehow, to achieve something better than brute force (for each element in xs, scan all elements in ys to find the min), which is O(len1 * len2).
This is my own problem I am working on, not a homework question. All your hints would be greatly appreciated.
I assume that HighPerformanceMark (first comment on your question) is right and you actually take the larger array, find for each of its elements the closest one in the smaller array, and sum up some f(dist) over those distances.
I would suggest your approach:
Sort both arrays
indexSmall = 0
sum = 0
// sum up
for all elements e in bigArray {
    // advance the index as long as the next element is closer
    while (indexSmall + 1 < length(smallArray)
           && dist(e, smallArray(indexSmall)) > dist(e, smallArray(indexSmall + 1))) {
        indexSmall++
    }
    sum += f(dist(e, smallArray(indexSmall)));
}
which is O(max(len1,len2) * log(max(len1,len2))) for the sorting. The rest is linear in the larger array's length. Here dist(x,y) would be something like abs(x-y), and f(d) = d^2 or whatever you want.
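A minimal Java sketch of that two-pointer idea, using squared distance for f (this is just an illustration of the approach above, not a drop-in library method):

import java.util.Arrays;

public class ClosestPointSum {

    // For each element of the larger (sorted) array, add the squared distance
    // to its nearest neighbor in the smaller (sorted) array.
    static long minDistanceSum(int[] xs, int[] ys) {
        int[] big = xs.length >= ys.length ? xs.clone() : ys.clone();
        int[] small = xs.length >= ys.length ? ys.clone() : xs.clone();
        Arrays.sort(big);
        Arrays.sort(small);
        long sum = 0;
        int j = 0;
        for (int e : big) {
            // Advance while the next small element is strictly closer.
            while (j + 1 < small.length
                    && Math.abs(small[j + 1] - e) < Math.abs(small[j] - e)) {
                j++;
            }
            long d = (long) e - small[j];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Example from the question: (1-2)^2 + (5-4)^2 + (5-10)^2 = 27
        System.out.println(minDistanceSum(new int[]{1, 5}, new int[]{10, 4, 2}));
    }
}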
Your proposed idea sounds good to me. You can sort the lists in O(n log n) time. Then you can do a single iteration over the longer list, using a sliding index on the other to find the "pairs". As you progress through the longer list, you will never have to backtrack on the other. So now your whole algorithm is O(n log n + n) = O(n log n).
Your approach is pretty good and has O(n1*log(n1)+n2*log(n2)) time complexity.
If the arrays have different lengths, another approach is to:
sort the shorter array;
traverse the longer array from start to finish, using binary search to locate the nearest item in the sorted short array.
This has O((n1+n2)*log(n1)) time complexity, where n1 is the length of the shorter array; a sketch of this variant follows below.
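A rough sketch of the binary-search variant, again with squared distance and purely illustrative names:

import java.util.Arrays;

public class NearestByBinarySearch {

    // For each element of the longer array, find the nearest value in the
    // sorted shorter array via binary search and accumulate (diff)^2.
    static long minDistanceSum(int[] longer, int[] shorterSorted) {
        long sum = 0;
        for (int e : longer) {
            int pos = Arrays.binarySearch(shorterSorted, e);
            if (pos < 0) {
                pos = -pos - 1;  // insertion point of e in the sorted array
            }
            long best = Long.MAX_VALUE;
            // The nearest value sits at the insertion point or just before it.
            if (pos < shorterSorted.length) {
                best = Math.min(best, Math.abs((long) shorterSorted[pos] - e));
            }
            if (pos > 0) {
                best = Math.min(best, Math.abs((long) shorterSorted[pos - 1] - e));
            }
            sum += best * best;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] shorter = {1, 5};
        Arrays.sort(shorter);
        // Example from the question: expected 1 + 1 + 25 = 27
        System.out.println(minDistanceSum(new int[]{10, 4, 2}, shorter));
    }
}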
