How does this implementation of binary search work? - java

% java BinarySearch 1.txt < 2.txt
If I have two text files (1.txt and 2.txt), where 2.txt contains values not in 1.txt, how does the binary search work in giving us these values? If the arguments to BinarySearch are a key and a sorted array, I don't see how this applies.
Here is the code for the binary search:
import java.util.Arrays;

public class BinarySearch {

    // precondition: array a[] is sorted
    public static int rank(int key, int[] a) {
        int lo = 0;
        int hi = a.length - 1;
        while (lo <= hi) {
            // Key is in a[lo..hi] or not present.
            int mid = lo + (hi - lo) / 2;
            if      (key < a[mid]) hi = mid - 1;
            else if (key > a[mid]) lo = mid + 1;
            else                   return mid;
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] whitelist = In.readInts(args[0]);
        Arrays.sort(whitelist);

        // read key; print if not in whitelist
        while (!StdIn.isEmpty()) {
            int key = StdIn.readInt();
            if (rank(key, whitelist) == -1)
                StdOut.println(key);
        }
    }
}
According to Wikipedia, and from what I'd understood: A binary search or half-interval search algorithm finds the position of a specified value (the input "key") within a sorted array.
So how is it working to find uncommon values in two text files?

As I understand the question, you want to know how this program works when it (correctly) determines an entry in 2.txt is NOT in 1.txt. That has a pretty simple answer.
This algorithm sorts the array whitelist. It initializes the lo pointer to point to element 0 and the hi pointer to point to element whitelist.length-1, which is the last element in whitelist. The array segment is the whole array for the first iteration. The array must be ordered or sorted for this to work.
For each successive iteration, if the value is not found in the middle of the current array segment, the logic determines whether the value has to be in the half-segment above the middle or the half-segment below the middle. That half-segment, excluding the old middle element, becomes the new search segment for the next iteration. The algorithm adjusts the hi and lo pointers to close in, one half of the remaining segment of the array at a time, on where the searched for value has to be, if it is in the array.
Eventually, for a search value not in the array, hi and lo (and therefore mid) will converge to the same single element and it will be the last segment of the array searched, a segment of just one element. If that element doesn't have the search value, then, depending on the search value and that element's value either hi will become mid - 1 or lo will become mid + 1. Either way, the while continuation condition will become false because lo <= hi is no longer true. The new remaining search segment now has a negative size. This can be interpreted as meaning that if a return doesn't occur before the while terminates, then the search didn't find the value in any previous segment and there is no remaining segment to search. Therefore, the search value can't be in the array.
The implementation given in this question works. I've tested it using the Princeton.edu stdlib that contains the In and StdIn classes used here. I've compiled and run it from a command line, using a stdin pipe to pipe in the second text file. I don't think I would implement this application like this except as a demonstration of binary search methods, perhaps for a class or for examining some techniques.
Here is some further background on why a binary search is used. The reason to use a binary search is to obtain a worst-case cost of about 2*log2(n) compares, with an average of about 1.5*log2(n) compares. A binary search for a value not in the array always hits the worst case of about 2*log2(n) compares.
A binary search is vastly superior to a linear search that just starts at one end of the array and examines every element until it finds a match or reaches the end of the array. The average linear search may take about n/2 compares, depending on the distribution of values in the array. A linear search for a value not in the array always has the worst case of n compares.
In a binary search, each pair of compares eliminates half the possibilities. An array of 1,024 entries can be searched in a maximum of 20 compares. Compare that to a maximum of 1,024 for a linear search. Squaring the size of the searched array only doubles the number of compares for a binary search: an array with 1,048,576 entries can be searched with a maximum of 40 compares, versus a linear search maximum of 1,048,576.
The basic binary search algorithm given in the question can be very useful with objects that inherit from a sorted or ordered collection and where you have to implement your own compare and search method to overload the inherited methods. As long as you have a compare that determines less, greater and equal amongst objects, and the collection is ordered or sorted according to that compare, you can use this basic binary search algorithm to search the collection.
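To make that last point concrete, here is a minimal sketch (not part of the original program) of the same rank() logic written against a Comparator, so it works for any object type whose ordering you define yourself. The class name and sample data are made up for illustration.
import java.util.Comparator;

public class GenericRank {
    // Same halving logic as rank(), but over any object type ordered by a Comparator.
    // precondition: a[] is sorted according to cmp
    public static <T> int rank(T key, T[] a, Comparator<? super T> cmp) {
        int lo = 0;
        int hi = a.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int c = cmp.compare(key, a[mid]);
            if      (c < 0) hi = mid - 1;
            else if (c > 0) lo = mid + 1;
            else            return mid;
        }
        return -1;   // not present
    }

    public static void main(String[] args) {
        String[] sorted = { "ant", "bee", "cat", "dog" };
        System.out.println(rank("cat", sorted, Comparator.naturalOrder()));  // prints 2
        System.out.println(rank("fox", sorted, Comparator.naturalOrder()));  // prints -1
    }
}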

while (!StdIn.isEmpty()) {            // while the input file (or standard input) isn't empty
    int key = StdIn.readInt();        // get the next integer
    if (rank(key, whitelist) == -1)   // use binary search to look for that integer in the whitelist
        StdOut.println(key);          // print it when it's not found
}
The code executes N binary searches, where N is the number of integers in the standard-input file.
The total complexity is O(n log n) + O(m log n), where n is the size of the whitelist (sorting it costs O(n log n)) and m is the size of the other file (each of its m keys costs one O(log n) binary search). This works well if the whitelist is much smaller than the other file. If not, it would probably be a better idea to sort both files and compare them using something like the merge step of merge sort.
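For illustration, here is a rough sketch of that merge-style pass, assuming both arrays have already been sorted; it prints every value of the second array that does not appear in the first, in O(n + m) after sorting. The names and sample data are made up.
import java.util.Arrays;

public class MergeCompare {
    // Prints every value in b[] that does not occur in a[].
    // precondition: both arrays are sorted (e.g. with Arrays.sort)
    static void printMissing(int[] a, int[] b) {
        int i = 0;
        for (int j = 0; j < b.length; j++) {
            // advance i past all whitelist entries smaller than b[j]
            while (i < a.length && a[i] < b[j]) i++;
            if (i == a.length || a[i] != b[j])
                System.out.println(b[j]);   // not in the whitelist
        }
    }

    public static void main(String[] args) {
        int[] whitelist = { 10, 20, 30 };
        int[] keys      = { 5, 10, 25, 30, 40 };
        Arrays.sort(whitelist);
        Arrays.sort(keys);
        printMissing(whitelist, keys);      // prints 5, 25, 40
    }
}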

I think building a hash table will be better than a modified merge sort for comparing large files that contain only ints. All you have to do is read the first file (which the program is already doing) and, while reading, put the ints into some sort of hash table. Then read the next file one int at a time, which the loop in main is doing, compute the hash of the int, and check whether the table contains a value for that hash. I have assumed a perfect hash table, so you might need to modify this in case of collisions.
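As a sketch of that idea, here is roughly what the original main() could look like if the whitelist were kept in Java's built-in HashSet instead of a sorted array; the In, StdIn and StdOut helpers from the Princeton stdlib are kept from the original program, and collision handling is left to HashSet.
import java.util.HashSet;

public class HashWhitelist {
    public static void main(String[] args) {
        // build the whitelist as a hash set: expected O(1) per insert and lookup
        HashSet<Integer> whitelist = new HashSet<Integer>();
        for (int w : In.readInts(args[0]))
            whitelist.add(w);

        // print every key from standard input that is not in the whitelist
        while (!StdIn.isEmpty()) {
            int key = StdIn.readInt();
            if (!whitelist.contains(key))
                StdOut.println(key);
        }
    }
}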

Related

Reduce the time complexity of the code to find duplicates in an Array from N*N

I was recently asked in an interview to write code that determines whether an array of integers contains duplicates. Simple as it was, I confidently told him that I would iterate over the elements and add each one to a new list if the list doesn't already contain the element; if it does, I return true, otherwise false.
the code would be like this
// complexity is N*N
// requires: import java.util.ArrayList; import java.util.List;
public static boolean findIfArrayHasDuplicates(int[] array) {
    List<Integer> intList = new ArrayList<Integer>();
    for (int var : array) {
        if (intList.contains(var)) {   // linear scan of everything added so far
            return true;
        } else {
            intList.add(var);
        }
    }
    return false;
}
He asked me to calculate the time complexity of the code I had written.
I answered:
N for the iteration of the loop
N(N+1)/2 for checking whether each element already exists in the new list
N for adding the elements to the list
Total: N + N + N*N/2 + N/2, which in O() notation, as N tends to infinity, simplifies to O(N^2).
He went on to ask me if there is any better way. I answered: add the elements to a set and compare the sizes; if the size of the set is less than that of the array, it contains duplicates. He asked what the complexity of that is, and guess what, it's still O(N^2), because the code that adds elements to the set has to first check whether each element is already in the set.
How can I reduce the complexity from O(N^2), using as much memory as needed?
Any ideas how this can be done?
he went on to ask me if there is any better way i answered add the elements to a set and compare the size if the size of set is less than that of the array it contains duplicates, asked what is the complexity of it guess what it's still N*N because the code that adds elements to the set will have to first see if it's in the set already
That's wrong. If you are adding the elements to a HashSet, it takes expected O(1) time to add each element (which includes checking if the element is already present), since all you have to do is compute the hashCode to locate the bin that may contain the element (takes constant time), and then search the elements stored in that bin (which also takes expected constant time, assuming the average number of elements in each bin is bound by a constant).
Therefore the total running time is O(N), and there's nothing to improve (you can't find duplicates in less than O(N)).
I think it would be useful if you looked at the basic working mechanism of a HashSet. A HashSet is internally an array, but data accesses, such as "checking if an element exists" or "adding/deleting an element", have time complexity O(1), because it uses a mapping mechanism to map an object to the index storing it. For example, if you have a HashSet and an integer and you call hashSet.contains(integer), the program will first take the integer and calculate its hash code, and then use the mapping mechanism (which differs from implementation to implementation) to find the index storing it. Say we have a hash code of 4, and we map to index 4 using the simplest mapping mechanism; then we check whether the 4th slot of the underlying array holds the integer. If it does, hashSet.contains(integer) will return true, otherwise false.
The complexity of the code provided is O(N^2). However, the code given below has complexity O(N). It uses a HashSet, which requires expected O(1) operations to insert and search.
// requires: import java.util.HashSet;
public static boolean findIfArrayHasDuplicates(int[] array) {
    HashSet<Integer> set = new HashSet<Integer>();
    for (int index = 0; index < array.length; index++) {
        // add() returns false if the value was already in the set
        if (!set.add(array[index]))
            return true;
    }
    return false;
}

Efficient way to store multiple ranges of numbers for future searching

I have a text file full of IP addresses ranges. I use ip2long to convert the addresses to longs so that I have an easy way to check if a given address is within the range. However, I'm looking for an efficient way to store these ranges and then search to see if an IP address exists within any of the ranges.
The current method I'm thinking of is to create an object that has the low end and the high end of a range with a function to check if the value is within range. I would store these objects in a list and check each one. However, I feel this might be a bit inefficient and can get slow as the list increases.
Is there a better way than what I'm thinking?
One of the following data structures may help you:
Segment Tree
From Wikipedia (Implementation):
Is a tree data structure for storing intervals, or segments. It allows querying which of the stored segments contain a given point.
Interval Tree
From Wikipedia (Implementation):
Is a tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point.
Range Tree
From Wikipedia (Implementation):
Is an ordered tree data structure to hold a list of points. It allows all points within a given range to be reported efficiently.
Assume the ranges do not overlap, otherwise you could combine them into one range.
Then create an array whose entries are, in increasing order, begin1, end1, begin2, end2, ...,
where each begin_i is inclusive in its range, and each end_i is just after the range.
Now do a binary search and:
int pos = ... .binarySearch ...
boolean found = pos >= 0;
if (!found) {
    pos = ~pos;
}
boolean atBegin = pos % 2 == 0;
boolean insideRange = (found && atBegin) || (!found && !atBegin);
//Equivalent: boolean insideRange = found == atBegin;
The lookup test is O(log N). The creation of the initial array is much more complex.
Java binarySearch delivers the index when found, and ~index (ones complement, < 0) when not found.
Addendum: I think the above can be "smartly" compressed into
boolean insideRange = (Arrays.binarySearch(...) & 1) == 0;
though an explanatory comment would definitely be needed. I leave that to the reader.
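Here is a small self-contained sketch of that scheme, with arbitrary long values standing in for ip2long results; the boundaries array holds the begin/end pairs described above (begins inclusive, ends exclusive), and the sample ranges are made up.
import java.util.Arrays;

public class RangeLookup {
    // boundaries[2k] is the inclusive start of range k,
    // boundaries[2k+1] is just past the end of range k
    static boolean insideAnyRange(long[] boundaries, long ip) {
        int pos = Arrays.binarySearch(boundaries, ip);
        boolean found = pos >= 0;
        if (!found) pos = ~pos;          // recover the insertion point
        boolean atBegin = pos % 2 == 0;
        return found == atBegin;
        // compressed form: return (Arrays.binarySearch(boundaries, ip) & 1) == 0;
    }

    public static void main(String[] args) {
        // two non-overlapping ranges: [100, 200) and [500, 600)
        long[] boundaries = { 100, 200, 500, 600 };
        System.out.println(insideAnyRange(boundaries, 150)); // true
        System.out.println(insideAnyRange(boundaries, 200)); // false (end is exclusive)
        System.out.println(insideAnyRange(boundaries, 500)); // true
        System.out.println(insideAnyRange(boundaries, 700)); // false
    }
}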

Fast and efficient computation on arrays

I want to count the number of occurrences of a particular phrase in a document, for example "stackoverflow forums". Suppose D represents the set of documents containing both terms.
Now, suppose I have the following data structure:
A[numTerms][numMatchedDocuments][numOccurInADocument]
where numMatchedDocuments is the size of D and numOccurInADocument is the number of times a particular term occurs in a particular document. For example:
A[stackoverflow][document1][occurrence1] = 3;
means the term "stackoverflow" occurs in document "document1" and its first occurrence is at position 3.
Then I pick the term that occurs the least and loop over all of its positions to check whether "forums" occurs at position + 1 relative to the current "stackoverflow" position. In other words, if I find "forums" at position 4, then that is a phrase and I've found a match for it.
The matching is straightforward per document and runs reasonably fast, but when the number of documents exceeds 2,000,000 it gets very slow. I've distributed it over cores and it gets faster, of course, but I wonder whether there is an algorithmically better way of doing this.
thanks,
Pseudo-code:
boolean docPhrase = true;
int numOfTerms = 2;                                   // 0 for "stackoverflow" and 1 for "forums"
for (int d = 0; d < D.size(); d++) {                  // D is the set of matched documents
    int minId = getTheLeastOccurringTerm();
    for (int i = 0; i < A[minId][d].length; i++) {    // for every position of the least-occurring term
        for (int t = 0; t < numOfTerms; t++) {        // for every term
            int id = BinarySearch(A[t][d], A[minId][d][i] - minId + t);
            if (id < 0) docPhrase = false;
        }
    }
}
As I mentioned in comments, Suffix Array can solve this sort of problem. I answered a similar question ( Fastest way to search a list of names in C# ) with a simple c# implementation of a Suffix Array.
The basic idea is that you have an array of index pairs that point to a document index and a position within that document. The index pair represents the string that starts at that point in the document and continues to the end of the document. But the actual documents and their contents exist only once in your original store. The Suffix Array is just an array of these index pairs, with a pair for every position in every document. You then sort the Suffix Array in the order of the text the pairs point to. Once sorted, you can very quickly find any phrase among any of the documents by doing a simple Binary Search on the Suffix Array. Constructing (mainly sorting) the Suffix Array can be time-consuming. But once constructed, it is very fast to search on. It's fairly easy on memory since the actual document contents only exist once.
It would be trivial to extend it to returning counts of phrase matches within each document.
This is a little different from the classic description of a Suffix Array, where the Suffix Array usually operates over one single, very large string. But the changes needed to make it work for an array of strings/documents are not that large, although they can increase the amount of memory consumed by the Suffix Array, depending on the maximum number of documents, the maximum document length, and how you encode the index pairs.
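To illustrate the index-pair idea, here is a compact and deliberately naive Java sketch of a suffix array built over several small documents; a real implementation would use a smarter construction than sorting full substrings, but the binary-search side is the same. All names and sample data are made up.
import java.util.ArrayList;
import java.util.List;

public class DocumentSuffixArray {
    private final String[] docs;
    private final List<int[]> suffixes = new ArrayList<>();   // each entry is {docIndex, offset}

    public DocumentSuffixArray(String[] docs) {
        this.docs = docs;
        for (int d = 0; d < docs.length; d++)
            for (int p = 0; p < docs[d].length(); p++)
                suffixes.add(new int[] { d, p });
        // naive construction: sort the index pairs by the text they point to
        suffixes.sort((a, b) -> suffixAt(a).compareTo(suffixAt(b)));
    }

    private String suffixAt(int[] pair) {
        return docs[pair[0]].substring(pair[1]);
    }

    // binary search for the first suffix that is >= key
    private int lowerBound(String key) {
        int lo = 0, hi = suffixes.size();
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (suffixAt(suffixes.get(mid)).compareTo(key) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // number of occurrences of phrase across all documents
    public int count(String phrase) {
        // every suffix starting with phrase sorts between phrase and phrase + '\uffff'
        return lowerBound(phrase + '\uffff') - lowerBound(phrase);
    }

    public static void main(String[] args) {
        String[] docs = { "stackoverflow forums are busy", "forums on stackoverflow" };
        DocumentSuffixArray sa = new DocumentSuffixArray(docs);
        System.out.println(sa.count("stackoverflow"));        // prints 2
        System.out.println(sa.count("stackoverflow forums")); // prints 1
    }
}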

Returning 10 most frequently used words in a document in O(n)

How can I design an algorithm that returns the 10 most frequently used words in a document in O(n) time, if additional space can be used?
I can parse the document and place the words in a hash map with their counts. But then I would have to sort the values to get the most frequent ones. Also, I would need a mapping from the values back to the keys, which cannot be maintained since values may repeat.
So how can I solve this?
Here's a simple algorithm:
Read one word at a time through the document. O(n)
Build a HashTable using each word. O(n)
Use the word as the key. O(1)
Use the number of times you've seen this word as the value. O(1)
(e.g. If you are adding the key to the hashtable, then value is 1; if you already have the key in the hashtable, increase its associated value by 1) O(1)
Create a pair of arrays of size 10 (i.e. String words[10] / int count[10], or use a < Pair >), use this pair to keep track of the 10 most frequent words and their word counts in the next step. O(1)
Iterate through the completed HashTable once: O(n)
If the current word has a higher word count than an entry in the array pair, replace that particular entry and shift everything down 1 slot. O(1)
Output the pair of arrays. O(1)
O(n) Run-time.
O(n) Storage for the HashTable + Arrays
(Side note: You can just think of a HashTable as a dictionary: a way to store key:value pairs where keys are unique. Technically, Java's HashMap is unsynchronized, while Hashtable is synchronized.)
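Here is one possible sketch of those steps in Java, using a HashMap for the counts and a fixed-size array of 10 slots for the running top list; the method and variable names are made up for illustration, and the per-entry work on the top list is bounded by a constant (10), so the whole pass stays O(n).
import java.util.HashMap;
import java.util.Map;

public class TopTenWords {
    public static String[] topTen(String[] words) {
        // steps 1-2: count every word with a hash map, O(n)
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words)
            counts.merge(w, 1, Integer::sum);

        // steps 3-4: one pass over the map, keeping the 10 best seen so far
        String[] top = new String[10];
        int[] best = new int[10];
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int c = e.getValue();
            for (int i = 0; i < 10; i++) {
                if (c > best[i]) {
                    // shift weaker entries down one slot and insert here
                    for (int j = 9; j > i; j--) { best[j] = best[j - 1]; top[j] = top[j - 1]; }
                    best[i] = c;
                    top[i] = e.getKey();
                    break;
                }
            }
        }
        return top;
    }

    public static void main(String[] args) {
        String[] doc = "the cat sat on the mat the cat".split(" ");
        String[] top = topTen(doc);
        System.out.println(top[0] + ", " + top[1]);   // prints "the, cat"
    }
}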
It may be done in O(n) if you use the correct data structure.
Consider a Node, consisting of 2 things:
A counter (initially set to 0).
An array of 255 (or whatever number of characters) pointers to Node. All the pointers are initially set to NULL.
Create a root node. Define a "current" Node pointer, set it to root node initially.
Then walk through all the characters of the document and do the following:
If the next character is not a space, pick the appropriate pointer from the array of the current node. If it is NULL, allocate it. The current Node pointer is updated.
If it is a space (or whatever word delimiter), increment the counter of the "current" Node. Then reset the "current" Node pointer to point to the root node.
This builds a trie in O(n). Every element (both internal nodes and leaves) denotes a specific word, together with its counter.
Then traverse the tree to find the node with the largest counter. That is also O(n), since the number of elements in the tree is not bigger than O(n).
Update:
The last step is not mandatory. Actually the most common word may be updated during the character processing.
The following is pseudo-code:
struct Node
{
    size_t m_Counter;
    Node* m_ppNext[255];
    Node* m_pPrev;

    Node(Node* pPrev) : m_Counter(0)
    {
        m_pPrev = pPrev;
        memset(m_ppNext, 0, sizeof(m_ppNext));
    }

    ~Node()
    {
        for (int i = 0; i < _countof(m_ppNext); i++)
            if (m_ppNext[i])
                delete m_ppNext[i];
    }
};
Node root(NULL);
Node* pPos = &root;
Node* pBest = &root;
char c;

while (0 != (c = GetNextDocumentCharacter()))
{
    if (c == ' ')
    {
        if (pPos != &root)
        {
            pPos->m_Counter++;
            if (pBest->m_Counter < pPos->m_Counter)
                pBest = pPos;
            pPos = &root;
        }
    }
    else
    {
        Node*& pNext = pPos->m_ppNext[c - 1];
        if (!pNext)
            pNext = new Node(pPos);
        pPos = pNext;
    }
}

// pBest points to the most common word. Using pBest->m_pPrev we iterate in reverse order through its characters.
The fastest approach is to use a radix tree. You can store the count of words at the leaf of the radix tree. Keep a separate list of the 10 most frequent words and their occurrence count along with a variable that stores the threshold value necessary to get into this list. Update this list as items are added to the tree.
I would use an ArrayList and a HashTable.
Here is the algorithm I am thinking of:
Loop through all the words in the document:
    if (HashTable.contains(word))
        increment count for that word in the HashTable;
    else
        ArrayList.add(word);
        HashTable.add(word);
        word count in HashTable = 1;
After looping through the whole document,
Loop through ArrayList<word>:
    Retrieve the word count for that word from the HashTable;
    Keep a list of the top 10 words;
The running time should be O(n) to construct the HashTable and ArrayList.
Making the top 10 list should be O(m), where m is the number of unique words.
O(n+m) where n>>m --> O(n)
Maintaining a map of (word, count) will be O(n).
Once the map is built, iterate over the keys and retrieve the ten most common keys.
O(n) + O(n)
--
But I'm not exactly happy with this solution because of the extra amount of memory needed.

How to find repeated sequences of events

I'm trying to find an efficient algorithm for identifying a recurring sequence of characters. Let's say the sequence must be a minimum of 3 characters, and only the maximum-length sequence should be returned. The dataset could potentially be thousands of characters. Also, I only want to know about the sequence if it's repeated, let's say, 3 times.
As an example:
ASHEKBSHEKCSHEDSHEK
"SHEK" occurs 3 times and would be identified. "SHE" occurs 4 times, but isn't identified since "SHEK" is the maximum length sequence that contains that sequence.
Also, no "seed" sequence is fed to the algorithm, it must find them autonomously.
Thanks in advance,
j
Try to create a suffix array for the string.
Online builder: http://allisons.org/ll/AlgDS/Strings/Suffix/
Check the beginnings of consecutive lines in the suffix array for matches.
Looks like Rabin-Karp Wiki Entry
If you consider that there are on the order of n(n+1)/2 possible substrings, and you aren't looking for simply a match but for the substring with the most matches, I think your algorithm will have a terrible theoretical complexity if it is to be correct and complete.
However, you might get some practical speed using a Trie. The algorithm would go something like this:
For each offset into the string...
    For each sub-string length...
        Insert the sub-string into the trie. Each node in the trie has a data value (a "count" integer) that you increment each time you visit the node.
Once you've built up the trie to model your data, delete all the sub-trees whose root counts are below some threshold (3 in your case).
Those remaining paths should be few enough in number for you to efficiently sort-and-pick the ones you want.
I suggest this as a starting point because the Trie is built to manipulate common prefixes and, as a side-effect, will compress your dataset.
A personal choice I would make would be to identify the location of the sub-strings as a separate process after identifying the ones I want. Otherwise you are going to store every substring location, and that will explode your memory. Your computation is already pretty complex.
Hope this makes some sense! Good luck!
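For what it's worth, here is a rough Java sketch of that trie idea for modest input sizes (all names are made up for illustration): it inserts every suffix, so every substring is a path from the root, counts visits per node, prunes branches that repeat too rarely, and reports the deepest surviving node. Construction is O(n^2) nodes in the worst case, so this is only practical for inputs of a few thousand characters, which matches the question.
import java.util.HashMap;
import java.util.Map;

public class RepeatedSequenceFinder {
    static class Node {
        int count;
        Map<Character, Node> next = new HashMap<>();
    }

    // Longest substring of s that occurs at least minRepeats times
    // and is at least minLen characters long, or null if there is none.
    static String findRepeated(String s, int minLen, int minRepeats) {
        Node root = new Node();
        // insert every suffix; each trie node then counts how often
        // the substring it spells out occurs in s
        for (int i = 0; i < s.length(); i++) {
            Node cur = root;
            for (int j = i; j < s.length(); j++) {
                cur = cur.next.computeIfAbsent(s.charAt(j), k -> new Node());
                cur.count++;
            }
        }
        return deepest(root, new StringBuilder(), minLen, minRepeats, null);
    }

    private static String deepest(Node node, StringBuilder path, int minLen, int minRepeats, String best) {
        if (path.length() >= minLen && node.count >= minRepeats
                && (best == null || path.length() > best.length()))
            best = path.toString();
        for (Map.Entry<Character, Node> e : node.next.entrySet()) {
            if (e.getValue().count >= minRepeats) {   // prune branches that repeat too rarely
                path.append(e.getKey());
                best = deepest(e.getValue(), path, minLen, minRepeats, best);
                path.deleteCharAt(path.length() - 1);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(findRepeated("ASHEKBSHEKCSHEDSHEK", 3, 3));   // prints SHEK
    }
}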
Consider the following algorithm, where:
str is the string of events
T(i) is the suffix tree for the substring str(0..i).
T(i+1) is quickly obtained from T(i), for example using this algorithm
For each character position i in the input string str, traverse a path starting at the root of T(i), along edges labeled with successive characters from the input, beginning from position i + 1. This path determines a repeating string. If the path is longer than the previously found paths, record the new maximum length and the position i + 1.
Update the suffix tree with str[i+1] and repeat for the next position.
Something like this pseudocode:
max.len = 0
max.off = -1

T = update_suffix_tree(nil, str[0])
for i = 1 to len(str)
    r = root(T)
    j = i + 1
    while j < len(str) and r.child(str[j]) != nil
        r = r.child(str[j])
        ++j
    if j - i - 1 > max.len
        max.len = j - i - 1
        max.off = i + 1
    T = update_suffix_tree(T, str[i+1])
In the k-th iteration, the inner while loop executes at most n - k times and the suffix tree update is O(k), hence the loop body's complexity is O(n). The body is executed n - 1 times, therefore the whole algorithm's complexity is O(n^2).
