How to efficiently find similar documents - Java

I have lots of documents that I have clustered using a clustering algorithm. In the clustering algorithm, each document may belong to more than one cluster. I've created a table storing the document-cluster assignments and another one which stores the cluster-document info. When I look for the list of documents similar to a given document (let's say d_i), I first retrieve the list of clusters to which it belongs (from the document-cluster table), and then for each cluster c_j in that list I retrieve the list of documents which belong to c_j from the cluster-document table. There is more than one c_j, so obviously there will be multiple lists. Each list has many documents, and there may be overlaps among these lists.
In the next phase, in order to find the documents most similar to d_i, I rank the candidate documents by the number of clusters they have in common with d_i.
My question is about this last phase. A naive solution is to create a sorted HashMap of sorts which has the document as the key and the number of common clusters as the value. However, as each list might contain many, many documents, this may not be the best solution. Is there any other way to rank the similar items? Any preprocessing or ..?

Assuming that the number of arrays is relatively small compared to the number of elements (and in particular, that the number of arrays is o(log n)), you can do it with a modification of bucket sort:
Let m be the number of arrays
create a list containing m buckets buckets[], where each bucket[i] is a hash set
for each array arr:
    for each element x in arr:
        find if x is already in any bucket; if so, let that bucket's id be i:
            remove x from bucket i
            i <- i + 1
        if no such bucket exists, set i = 1
        add x to bucket i
for each bucket i = m, m-1, ..., 1 in descending order:
    for each element x in bucket[i]:
        yield x
The above runs in O(m^2*n):
Iterating over each array
Iterating over all elements in each array
Finding the relevant bucket.
Note that the last step can be done in O(1) by also keeping a map from element to bucket_id (backed by a hash table), so we can improve the total to O(m*n).
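To make this concrete, here is a rough Java sketch of the bucket approach with the element -> bucket_id map. The class and method names are mine, and documents are assumed to be identified by String ids; it also assumes each list contains a given document at most once.

import java.util.*;

class BucketRanker {
    // buckets.get(i) holds documents seen in exactly i+1 lists so far.
    static List<String> rankByOverlap(List<List<String>> clusterLists) {
        int m = clusterLists.size();
        List<Set<String>> buckets = new ArrayList<>();
        for (int i = 0; i < m; i++) {
            buckets.add(new HashSet<>());
        }
        Map<String, Integer> bucketOf = new HashMap<>(); // element -> bucket index, the O(1) lookup

        for (List<String> list : clusterLists) {
            for (String doc : list) {
                Integer i = bucketOf.get(doc);
                if (i == null) {
                    buckets.get(0).add(doc);        // first time we see this document
                    bucketOf.put(doc, 0);
                } else {
                    buckets.get(i).remove(doc);     // promote it to the next bucket
                    buckets.get(i + 1).add(doc);
                    bucketOf.put(doc, i + 1);
                }
            }
        }

        // Emit documents from the highest-overlap bucket down to the lowest.
        List<String> ranked = new ArrayList<>();
        for (int i = m - 1; i >= 0; i--) {
            ranked.addAll(buckets.get(i));
        }
        return ranked;
    }
}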
An alternative is to use a hash map as a histogram that maps each element to its number of occurrences, and then sort an array containing all the elements based on the histogram. The benefit of this approach is that it can be distributed very nicely with map-reduce:
map(partial list of elements l):
    for each element x:
        emit(x, '1')
reduce(x, list<number>):
    s = sum{list}
    emit(x, s)
combine(x, list<number>):
    s = sum{list} // or size{list} for a combiner
    emit(x, s)
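For a single machine, the same histogram idea is only a few lines of plain Java. As above, this is a sketch with assumed String document ids, not the poster's code.

import java.util.*;

class HistogramRanker {
    static List<String> rankByOverlap(List<List<String>> clusterLists) {
        // Histogram: document id -> number of cluster lists it appears in.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> list : clusterLists) {
            for (String doc : list) {
                counts.merge(doc, 1, Integer::sum);
            }
        }
        // Sort all candidate documents by their count, highest overlap first.
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> counts.get(b) - counts.get(a));
        return ranked;
    }
}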

Related

How to implement a HashMap data structure in Java?

I want to implement a HashMap data structure, but I can't quite figure out what to do with the underlying array structure.
If I'm not mistaken, in a HashMap each key is hashed and converted into an integer which is used as the array index. Search time is O(1) because of direct indexing.
Let's say K is the key and V is the value. We can create an array of type V of size n, and the value will reside at the index produced by the hash(K) function. However, hash(K) doesn't produce consecutive indices, and Java's arrays aren't sparse. To solve this we could create a very large array to hold the elements, but that won't be efficient; it would hold a lot of NULL elements.
One solution is to store the elements in consecutive order, but then to find an element we would have to search the entire array and check each element's key, which is linear time. I want to achieve direct access.
Thanks in advance.
Borrowed from the Wikipedia article on hash tables, you can use a smaller array for underlying storage by taking the hash modulo the array size like so:
hash = hashfunc(key)
index = hash % array_size
This is a good solution because you can keep the underlying array relatively dense, you don't have to modify the hash function, and it does not affect the time complexity.
You can look at the HashMap source code for answers to all of these problems.
To spare you from going through a fairly large amount of code: the solution to your problem is to use modulo. You can choose n=64, then store an element x with h(x) mod 64 = 2 in the second position of the array, and so on.
If two elements have the same hash modulo n, then you store them next to each other (usually in a linked list, or a tree once a bucket grows large). Another solution would be to increase n.
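To make the modulo idea concrete, here is a toy sketch of a chained hash map. It is illustrative only (no resizing, no null keys) and is nothing like the real java.util.HashMap internals; all names are mine.

import java.util.LinkedList;

public class SimpleHashMap<K, V> {

    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry<K, V>>[] table;

    @SuppressWarnings("unchecked")
    public SimpleHashMap(int capacity) {
        table = new LinkedList[capacity];           // e.g. capacity = 64
    }

    private int indexFor(K key) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), table.length);   // hash % array_size
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        if (table[i] == null) table[i] = new LinkedList<>();
        for (Entry<K, V> e : table[i]) {
            if (e.key.equals(key)) { e.value = value; return; }  // overwrite existing key
        }
        table[i].add(new Entry<>(key, value));      // collision: just chain it in this bucket
    }

    public V get(K key) {
        int i = indexFor(key);
        if (table[i] == null) return null;
        for (Entry<K, V> e : table[i]) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}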

Why is a Hash Table with linked lists considered to have constant time complexity?

In my COMP class last night we learned about hashing and how it generally works when trying to find an element x in a hash table.
Our scenario was that we have a dataset of 1000 elements inside our table and we want to know if x is contained within that table.
Our professor drew up a Java array of 100 and said that to store these 1000 elements that each position of the array would contain a pointer to a Linked List where we would keep our elements.
Assuming the hashing function perfectly mapped each of the 1000 elements to a value between 0 and 99 and stored the element at the position in the array, there would be 1000/100 = 10 elements contained within each linked list.
Now to know whether x is in the table, we simply hash x, find its hash value, look up that slot in the array, and iterate over our linked list to check whether x is in the table.
My professor concluded by saying that the expected complexity of finding whether x is in the table is O(10) which is really just O(1). I cannot understand how this is the case. In my mind, if the dataset is N and the array size is n then it takes on average N/n steps to find x in the table. Isn't this by definition not constant time, because if we scale up the data set the time will still increase?
I've looked through Stack Overflow and online and everyone says hashing is expected time complexity of O(1) with some caveats. I have read people discuss chaining to reduce these caveats. Maybe I am missing something fundamental about determining time complexity.
TLDR: Why does it take O(1) time to lookup a value in a hash table when it seems to still be determined by how large your dataset is (therefore a function of N, therefore not constant).
In my mind, if the dataset is N and the array size is n then it takes on average N/n steps to find x in the table.
This is a misconception, as hashing simply requires you to calculate the correct bucket (in this case, array index) that the object should be stored in. This calculation will not become any more complex if the size of the data set changes.
These caveats that you speak of are most likely hash collisions: where multiple objects share the same hashCode; these can be reduced (though not eliminated entirely) with a better hash function.
The complexity of a hashed collection for lookups is O(1) because the size of lists (or in Java's case, red-black trees) for each bucket is not dependent on N. Worst-case performance for HashMap if you have a very bad hash function is O(log N), but as the Javadocs point out, you get O(1) performance "assuming the hash function disperses the elements properly among the buckets". With proper dispersion the size of each bucket's collection is more-or-less fixed, and also small enough that constant factors generally overwhelm the polynomial factors.
There are multiple issues here, so I will address them one by one:
Worst case analysis vs amortized analysis:
Worst case analysis refers to the absolute worst case scenario that your algorithm can be given, in terms of running time. As an example, if I am given an array of unordered elements and told to find an element in it, the best case scenario is when the element is at index [0]; the worst possible input is when the element is at the end of the array, in which case, if my data set has n elements, I run n times before finding the element. In the average case, however, the element is somewhere in the array, so I will run n-k steps (where k is the number of elements after the element I am looking for in the array).
Worst case analysis of Hashtables:
There is only one kind of "hash table" that has guaranteed constant-time O(1) access to its elements: arrays. (And even then it's not strictly true, due to paging and the way OSes handle memory.) The worst possible case that I could give you for a hash table is a data set where every element hashes to the same index. So, for example, if every single element hashes to index 1, then due to collisions the worst-case running time for accessing a value is O(n). This is unavoidable; hash tables always have this behaviour.
Average and best case scenarios for hash tables:
You will rarely be given a set that gives you the worst possible case scenario. In general you can expect objects to be hashed to different indices in your hash table; ideally the hash function spreads things out so that collisions are rare.
In the specific example your teacher gave you, if 2 things get hashed to the same index, they get put in a linked list. So this is more or less how the table got constructed:
get element E
use the hashing function hash(E) to find the index i in the hash table
add E to the linked list in hashTable[i].
repeat for all the elements in the data set
So now, let's say I want to find whether element E is in the table. Then:
do hash(E) to find the index i where E is potentially hashed
go to hashTable[i] and iterate through the linked list (up to 10 iterations)
If E is found, then E is in the Hash table, if E is not found, then E is not in the table
The reason we can guarantee that E is not in the table if we can't find it is that, if E were in the table, it would have been hashed to hashTable[i], so it HAS to be there if it's in the table at all.
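In Java, the lookup your professor described might look roughly like the following sketch. The array-of-linked-lists layout and the names are assumptions, not from the lecture.

import java.util.LinkedList;

// Sketch of the lookup steps: hash E, go to hashTable[i], walk the chain.
// With 1000 elements spread over 100 buckets, each chain holds ~10 entries,
// so the walk is bounded by the load factor, not by the data-set size.
public class ChainedLookup {
    static <E> boolean contains(LinkedList<E>[] hashTable, E element) {
        int i = Math.floorMod(element.hashCode(), hashTable.length); // hash(E) -> index i
        LinkedList<E> chain = hashTable[i];
        if (chain == null) return false;
        for (E candidate : chain) {            // at most ~10 iterations on average
            if (candidate.equals(element)) return true;
        }
        return false;                          // if E were in the table it would be in this chain
    }
}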

fastest Java collection for string lookup?

I have a Java class which contains two Strings, for example the name of a person and the name of the group.
I also have a list of groups (about 10) and a list of persons (about 100). The list of my data objects is larger; it can exceed 10,000 items.
Now I would like to search through my data objects such that I find all objects having a person from the person list and a group from the group list.
My question is: what is the best data structure for the person and group list?
I could use an ArrayList and simply iterate until I find a match, but that is obviously inefficient. A HashSet or HashMap would be much better.
Are there even more efficient ways to solve this? Please advise.
Every data structure has pros and cons.
A Map is used to retrieve data in O(1) if you have an access key.
A List is used to maintain an order between elements, but accessing an element by key is not possible and you need to loop over the whole list, which takes O(n).
A good data-structure for storing and lookup strings is a Trie:
It's essentially a tree structure which uses characters or substrings to denote paths to follow.
Advantages over hash-maps (quote from Wikipedia):
Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of a search string), compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.
There are no collisions of different keys in a trie.
Buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.
There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
A trie can provide an alphabetical ordering of the entries by key.
I agree with #Davide's answer. If we want fast lookup as well as to maintain order, then we can go for the LinkedHashMap implementation of Map.
By using it, we can have both things:
Data retrieval in O(1), if we have the access key.
We can maintain the insertion order, so while iterating we will get the data in the same order as it was inserted.
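A tiny usage example (the sample data is made up) showing both properties of LinkedHashMap:

import java.util.LinkedHashMap;
import java.util.Map;

public class LinkedHashMapDemo {
    public static void main(String[] args) {
        Map<String, String> personToGroup = new LinkedHashMap<>();
        personToGroup.put("Alice", "admins");   // hypothetical sample data
        personToGroup.put("Bob", "users");
        personToGroup.put("Carol", "users");

        System.out.println(personToGroup.get("Bob"));  // O(1) lookup -> users
        // Iteration order matches insertion order: Alice, Bob, Carol
        personToGroup.forEach((person, group) -> System.out.println(person + " -> " + group));
    }
}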
Depending on the scenario (if you have the data before receiving the lists of groups/people), preprocessing the data would save you time.
Comparing the data to the groups/people lists will require at least 10,000+ lookups. Comparing the groups/people lists to the data will require at most 10*100 = 1,000 lookups, and fewer if you compare against each group one at a time (10+100 = 110 lookups).
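As a sketch of that idea, assuming a hypothetical DataObject class with getPerson() and getGroup() accessors (not from the original post): put the two small lists into HashSets and make a single pass over the data, so each data object costs two O(1) lookups.

import java.util.*;

class DataObject {
    private final String person;
    private final String group;
    DataObject(String person, String group) { this.person = person; this.group = group; }
    String getPerson() { return person; }
    String getGroup() { return group; }
}

class Finder {
    static List<DataObject> findMatches(List<DataObject> data,
                                        Collection<String> persons,
                                        Collection<String> groups) {
        Set<String> personSet = new HashSet<>(persons); // ~100 entries
        Set<String> groupSet = new HashSet<>(groups);   // ~10 entries
        List<DataObject> matches = new ArrayList<>();
        for (DataObject d : data) {                     // ~10,000 data objects, one pass
            if (personSet.contains(d.getPerson()) && groupSet.contains(d.getGroup())) {
                matches.add(d);
            }
        }
        return matches;
    }
}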

Hash Table implement iterate and findMin

I have coded a standard Hash Table class in java. It has a large array of buckets, and to insert, retrieve or delete elements, I simply calculate the hash of the element and look at the appropriate index in the array to get the right bucket.
However, I would like to implement some sort of iterator. Is there another way than looping through all the indices in the array and ignoring those that are empty? My hash table might contain hundreds of empty entries and only a few elements that have been hashed and inserted. Is there an O(n) way to iterate instead of O(size of table) when n << size of table?
To implement findMin, I could simply save the smallest element each time I insert a new one, but I want to use the iterator approach.
Thanks!
You can maintain a linked list of the map entries, like LinkedHashMap does in the standard library.
Or you can make your hash table ensure that the capacity is always at most kn, for some suitable value of k. This will ensure iteration is linear in n.
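Here is a rough sketch of the first suggestion for a hand-written table (class and field names are mine): every insert also appends the entry to an extra insertion-order chain, so iteration and findMin visit only the n stored elements. Removal is omitted; supporting it efficiently would need a doubly linked chain, which is what LinkedHashMap actually uses.

import java.util.Iterator;
import java.util.NoSuchElementException;

public class ChainedHashSet<E extends Comparable<E>> implements Iterable<E> {

    private static final class Node<E> {
        final E value;
        Node<E> nextInBucket;   // collision chain inside one bucket
        Node<E> nextInserted;   // global insertion-order chain
        Node(E value) { this.value = value; }
    }

    private final Node<E>[] buckets;
    private Node<E> head, tail;       // ends of the insertion-order chain
    private int size;

    @SuppressWarnings("unchecked")
    public ChainedHashSet(int capacity) { buckets = (Node<E>[]) new Node[capacity]; }

    public void add(E value) {
        int i = Math.floorMod(value.hashCode(), buckets.length);
        for (Node<E> n = buckets[i]; n != null; n = n.nextInBucket)
            if (n.value.equals(value)) return;          // already present
        Node<E> node = new Node<>(value);
        node.nextInBucket = buckets[i];
        buckets[i] = node;
        if (head == null) head = node; else tail.nextInserted = node;
        tail = node;
        size++;
    }

    // O(n) iteration over stored elements only, skipping empty buckets entirely.
    @Override
    public Iterator<E> iterator() {
        return new Iterator<E>() {
            private Node<E> cur = head;
            public boolean hasNext() { return cur != null; }
            public E next() {
                if (cur == null) throw new NoSuchElementException();
                E v = cur.value; cur = cur.nextInserted; return v;
            }
        };
    }

    // findMin via the iterator: O(n), independent of the table's capacity.
    public E findMin() {
        E min = null;
        for (E e : this) if (min == null || e.compareTo(min) < 0) min = e;
        return min;
    }

    public int size() { return size; }
}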
You could store a sorted list of the non-empty buckets, and insert a bucket's id into the list (if it's not already there) when you insert something in the hash table.
But maybe it's not too expensive to search through a few hundred empty buckets, if it's not buried too deep inside a loop. A little inefficiency might be better than a more complex design.
If order is important to you, you should consider using a binary search tree (a left-leaning red-black tree, for example) or a skip list to implement your dictionary. They are better suited to the job in these cases.
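If you can use the standard library instead of your own table, that ordered-dictionary suggestion is essentially a one-liner with TreeSet (a red-black tree). A quick illustration with made-up values:

import java.util.TreeSet;

public class TreeSetDemo {
    public static void main(String[] args) {
        TreeSet<Integer> set = new TreeSet<>();
        set.add(42); set.add(7); set.add(19);
        System.out.println(set.first());               // findMin -> 7, in O(log n)
        for (int x : set) System.out.print(x + " ");   // iterates 7 19 42 in sorted order
    }
}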

ArrayList Searching Multiple Words

int index = Collections.binarySearch(myList, SearchWord);
System.out.println(myList.get(index));
Actually, I stored 1 million words in an ArrayList, and now I need to search for a particular word by a key. The result is not a single word; it may contain multiple words.
For example, suppose I type "A"; the output is [Aarhus, Aaron, Ababa, ...]. The result depends on the search word. How can I do it, and which sorting algorithm is best in Collections?
Options:
If you want to stick with the ArrayList, sort it before you search. Then find the first key that matches your search criteria and iterate from it onward until you find a key that does not match. Collect all matching keys into some buffer structure. Bingo, you have your answer (see the sketch after this list).
Change the data structure to a tree. Either:
a simple binary tree - you get all keys sorted automatically. Navigate the tree depth-first until you find a key that does not match.
a fancy trie structure. That way you get all keys sorted automatically, plus a significant performance boost due to efficient storage. The rest is the same: navigate the tree, collect the matching keys.
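Here is a sketch of the first option (the sample words come from the question's example; the class and method names are mine): sort once, binary-search for the prefix's insertion point, then collect entries while they still start with the prefix.

import java.util.*;

public class PrefixSearch {
    public static List<String> startingWith(List<String> sortedWords, String prefix) {
        int pos = Collections.binarySearch(sortedWords, prefix);
        // binarySearch returns (-(insertionPoint) - 1) when the prefix itself is
        // not an element; either way, start scanning at the insertion point.
        int start = pos >= 0 ? pos : -pos - 1;
        List<String> result = new ArrayList<>();
        for (int i = start; i < sortedWords.size() && sortedWords.get(i).startsWith(prefix); i++) {
            result.add(sortedWords.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>(Arrays.asList("Aarhus", "Aaron", "Ababa", "Back", "Bath"));
        Collections.sort(words);                       // sort once, search many times
        System.out.println(startingWith(words, "A"));  // [Aarhus, Aaron, Ababa]
    }
}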
