I would like to know the fastest way / algorithm to check for the existence of a word in a String array. For example, if I have a String array with 10,000 elements, I would like to know whether it contains the word "Human". I can sort the array, no problem.
However, binary search (Arrays.binarySearch()) is not allowed. Other collection types like HashSet, HashMap and ArrayList are not allowed too.
Is there any proven algorithm for this? Or any other method? The way of searching should be really really fast.
The fastest you can sort will still cost O(n log n).
So if you are looking for a particular word in unordered data, just scan the array with a single for loop; that costs O(n).
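As a minimal sketch of that linear scan (the method and parameter names here are just placeholders):

    // Linear scan: O(n) comparisons in the worst case, no sorting or extra collections needed.
    static boolean containsWord(String[] words, String target) {
        for (String word : words) {
            if (target.equals(word)) {
                return true;
            }
        }
        return false;
    }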
For the fastest performance you have to use hashing.
You can use a rolling hash.
It keeps the number of collisions low.
hash = s[0]*base^(n-1) + s[1]*base^(n-2) + ... + s[n-1]
where base is a prime number, say 31.
You also need to take the result modulo a prime number, so the integer range is not exceeded.
Time complexity: O(number of characters), treating multiplication and modulo as O(1) operations.
A very good explanation is given here: Fast implementation of Rolling hash
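As a rough illustration (not the implementation from the linked answer), a polynomial hash with base 31 and an assumed prime modulus could look like this; on a hash match you would still confirm with equals() to rule out collisions:

    // Polynomial hash: hash = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], taken modulo a prime.
    static long polyHash(String s) {
        final long BASE = 31;
        final long MOD = 1_000_000_007L; // assumed large prime to keep values in integer range
        long hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = (hash * BASE + s.charAt(i)) % MOD;
        }
        return hash;
    }

You would precompute polyHash("Human") once and compare it against the hash of each array element, falling back to equals() only when the hashes match.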
Build a trie out of the array. It can be built in linear time (assuming a constant size alphabet). Then you can query in linear time as well (time proportional to the query word length). Both preprocessing and query time are asymptotically optimal.
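A minimal trie sketch, assuming ASCII input (the node layout and method names are illustrative):

    // One child slot per ASCII character. Building over all words costs O(total characters);
    // a lookup costs O(length of the query word).
    class Trie {
        private final Trie[] children = new Trie[128];
        private boolean isWord;

        void insert(String word) {
            Trie node = this;
            for (int i = 0; i < word.length(); i++) {
                int c = word.charAt(i);
                if (node.children[c] == null) {
                    node.children[c] = new Trie();
                }
                node = node.children[c];
            }
            node.isWord = true;
        }

        boolean contains(String word) {
            Trie node = this;
            for (int i = 0; i < word.length(); i++) {
                Trie next = node.children[word.charAt(i)];
                if (next == null) {
                    return false;
                }
                node = next;
            }
            return node.isWord;
        }
    }

Insert all 10,000 strings once; after that, each query such as contains("Human") costs time proportional to the query word only.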
Related
I've seen some interesting claims on SO re Java hashmaps and their O(1) lookup time. Can someone explain why this is so? Unless these hashmaps are vastly different from any of the hashing algorithms I was brought up on, there must always exist a dataset that contains collisions.
In which case, the lookup would be O(n) rather than O(1).
Can someone explain whether they are O(1) and, if so, how they achieve this?
A particular feature of a HashMap is that, unlike, say, balanced trees, its behavior is probabilistic. In these cases it's usually most helpful to talk about complexity in terms of the probability that a worst-case event will occur. For a hash map, that of course is the case of a collision with respect to how full the map happens to be. A collision is pretty easy to estimate.
p(collision) = n / capacity
So a hash map with even a modest number of elements is pretty likely to experience at least one collision. Big O notation allows us to do something more compelling. Observe that for any arbitrary, fixed constant k:
O(n) = O(k * n)
We can use this feature to improve the performance of the hash map. We could instead think about the probability of at most 2 collisions.
p(collision x 2) = (n / capacity)^2
This is much lower. Since the cost of handling one extra collision is irrelevant to Big O performance, we've found a way to improve performance without actually changing the algorithm! We can generalize this to
p(collision x k) = (n / capacity)^k
And now we can disregard some arbitrary number of collisions and end up with vanishingly tiny likelihood of more collisions than we are accounting for. You could get the probability to an arbitrarily tiny level by choosing the correct k, all without altering the actual implementation of the algorithm.
We talk about this by saying that the hash map has O(1) access with high probability.
You seem to mix up worst-case behaviour with average-case (expected) runtime. The former is indeed O(n) for hash tables in general (i.e. not using a perfect hashing) but this is rarely relevant in practice.
Any dependable hash table implementation, coupled with a half decent hash, has a retrieval performance of O(1) with a very small factor (2, in fact) in the expected case, within a very narrow margin of variance.
In Java, how does HashMap work?
It uses hashCode to locate the corresponding bucket [inside the bucket container model].
Each bucket is a LinkedList (or a Balanced Red-Black Binary Tree under some conditions starting from Java 8) of items residing in that bucket.
The items are scanned one by one, using equals for comparison.
When adding more items, the HashMap is resized (doubling the size) once a certain load percentage is reached.
So, sometimes it will have to compare against a few items, but generally, it's much closer to O(1) than O(n) / O(log n).
For practical purposes, that's all you should need to know.
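As a rough mental model of those lookup steps (a sketch only, not the actual JDK implementation; the bucket array type here is purely illustrative):

    import java.util.List;
    import java.util.Map;

    // 1. pick a bucket from the hash code, 2. scan that bucket, 3. compare keys with equals.
    static <K, V> V conceptualGet(List<Map.Entry<K, V>>[] buckets, K key) {
        int index = (key.hashCode() & 0x7fffffff) % buckets.length;
        List<Map.Entry<K, V>> bucket = buckets[index];
        if (bucket == null) {
            return null; // empty bucket: the key is absent
        }
        for (Map.Entry<K, V> entry : bucket) {
            if (entry.getKey().equals(key)) {
                return entry.getValue();
            }
        }
        return null;
    }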
Remember that O(1) does not mean that each lookup only examines a single item - it means that the average number of items checked remains constant w.r.t. the number of items in the container. So if it takes on average 4 comparisons to find an item in a container with 100 items, it should also take an average of 4 comparisons to find an item in a container with 10,000 items, and for any other number of items (there's always a bit of variance, especially around the points at which the hash table rehashes, and when there's a very small number of items).
So collisions don't prevent the container from having O(1) operations, as long as the average number of keys per bucket remains within a fixed bound.
I know this is an old question, but there's actually a new answer to it.
You're right that a hash map isn't really O(1), strictly speaking, because as the number of elements gets arbitrarily large, eventually you will not be able to search in constant time (and O-notation is defined in terms of numbers that can get arbitrarily large).
But it doesn't follow that the real time complexity is O(n)--because there's no rule that says that the buckets have to be implemented as a linear list.
In fact, Java 8 implements the buckets as TreeMaps once they exceed a threshold, which makes the actual time O(log n).
O(1+n/k) where k is the number of buckets.
If implementation sets k = n/alpha then it is O(1+alpha) = O(1) since alpha is a constant.
If the number of buckets (call it b) is held constant (the usual case), then lookup is actually O(n).
As n gets large, the number of elements in each bucket averages n/b. If collision resolution is done in one of the usual ways (linked list for example), then lookup is O(n/b) = O(n).
The O notation is about what happens when n gets larger and larger. It can be misleading when applied to certain algorithms, and hash tables are a case in point. We choose the number of buckets based on how many elements we're expecting to deal with. When n is about the same size as b, then lookup is roughly constant-time, but we can't call it O(1) because O is defined in terms of a limit as n → ∞.
Elements inside the HashMap are stored as an array of linked lists (nodes); each linked list in the array represents a bucket for the unique hash value of one or more keys.
While adding an entry in the HashMap, the hashcode of the key is used to determine the location of the bucket in the array, something like:
location = (arraylength - 1) & keyhashcode
Here the & represents bitwise AND operator.
For example: 100 & "ABC".hashCode() = 64 (location of the bucket for the key "ABC")
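A simplified illustration of that formula (the real HashMap also spreads the high bits of the hash code before this masking step, so treat this as a sketch, not the JDK source):

    // Bucket location as described above: (arraylength - 1) & keyhashcode.
    static int bucketLocation(Object key, int arrayLength) {
        return (arrayLength - 1) & key.hashCode();
    }

    // Matching the example: bucketLocation("ABC", 101) == 100 & "ABC".hashCode() == 64.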
During the get operation it uses the same approach to determine the bucket location for the key. In the best case each key has a unique hashcode and ends up in a unique bucket; in this case the get method spends time only on determining the bucket location and retrieving the value, which is constant, O(1).
In the worst case, all the keys have the same hashcode and are stored in the same bucket; this results in traversing the entire list, which leads to O(n).
In the case of Java 8, the linked-list bucket is replaced with a balanced tree if its size grows to more than 8, which reduces the worst-case search cost to O(log n).
We've established that the standard description of hash table lookups being O(1) refers to the average-case expected time, not the strict worst-case performance. For a hash table resolving collisions with chaining (like Java's hashmap) this is technically O(1+α) with a good hash function, where α is the table's load factor. Still constant as long as the number of objects you're storing is no more than a constant factor larger than the table size.
It's also been explained that strictly speaking it's possible to construct input that requires O(n) lookups for any deterministic hash function. But it's also interesting to consider the worst-case expected time, which is different than average search time. Using chaining this is O(1 + the length of the longest chain), for example Θ(log n / log log n) when α=1.
If you're interested in theoretical ways to achieve constant time expected worst-case lookups, you can read about dynamic perfect hashing which resolves collisions recursively with another hash table!
It is O(1) only if your hashing function is very good. The Java hash table implementation does not protect against bad hash functions.
Whether you need to grow the table when you add items or not is not relevant to the question because it is about lookup time.
This basically goes for most hash table implementations in most programming languages, as the algorithm itself doesn't really change.
If there are no collisions present in the table, you only have to do a single lookup, so the running time is O(1). If there are collisions present, you have to do more than one lookup, which degrades the performance towards O(n).
It depends on the algorithm you choose to avoid collisions. If your implementation uses separate chaining, then the worst-case scenario happens when every data element hashes to the same value (a poor choice of hash function, for example). In that case, data lookup is no different from a linear search on a linked list, i.e. O(n). However, the probability of that happening is negligible, and the best and average cases for lookups remain constant, i.e. O(1).
Only in the theoretical case where hashcodes are always different and the bucket for every hash code is also different will O(1) hold exactly. Otherwise, it is of constant order, i.e. as the hashmap grows, its order of search remains constant.
Academics aside, from a practical perspective, HashMaps should be accepted as having an inconsequential performance impact (unless your profiler tells you otherwise.)
Of course the performance of the hashmap will depend on the quality of the hashCode() function for the given object. However, if the function is implemented such that the possibility of collisions is very low, it will have very good performance (this is not strictly O(1) in every possible case, but it is in most cases).
For example, the default implementation in the Oracle JRE is to use a random number (which is stored in the object instance so that it doesn't change - but it also disables biased locking, but that's another discussion), so the chance of collisions is very low.
With reference to the question, I found that the accepted answer used the Java Collections API to get the index. My question is: given that there are so many other methods to solve the given problem, which would be the optimal solution?
Use two loops
Use sorting and binary search
Use sorting and merging
Use hashing
Use Collections api
Using two loops will take O(n^2) time.
Using sorting and binary search will take O(n log n) time.
Using sorting and merging will take O(n log n) time.
Using hashing will take O(k * n) time, with some other overhead and additional space.
Using the Collections API will take O(n^2) time, as it uses a naive algorithm under the hood.
Along with the ways mentioned above, you can do it optimally by using the Knuth–Morris–Pratt algorithm, with linear O(n + m) time complexity, where n and m are the lengths of the two arrays.
The KMP algorithm is basically a pattern-matching algorithm (finding the starting position of a needle in a haystack) that works on character strings, but you can easily use it for an integer array.
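A hedged sketch of KMP adapted to int arrays (the names are illustrative); it returns the starting index of needle inside haystack, or -1 if it does not occur:

    // O(n + m): build the failure (longest-proper-prefix-suffix) table, then scan the haystack
    // without ever moving the haystack pointer backwards.
    static int indexOf(int[] haystack, int[] needle) {
        if (needle.length == 0) {
            return 0;
        }

        // lps[i] = length of the longest proper prefix of needle[0..i] that is also a suffix of it.
        int[] lps = new int[needle.length];
        for (int i = 1, len = 0; i < needle.length; ) {
            if (needle[i] == needle[len]) {
                lps[i++] = ++len;
            } else if (len > 0) {
                len = lps[len - 1];
            } else {
                lps[i++] = 0;
            }
        }

        for (int i = 0, j = 0; i < haystack.length; ) {
            if (haystack[i] == needle[j]) {
                i++;
                j++;
                if (j == needle.length) {
                    return i - j; // full match found
                }
            } else if (j > 0) {
                j = lps[j - 1]; // fall back in the needle, keep the haystack position
            } else {
                i++;
            }
        }
        return -1;
    }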
You can do some benchmark test for all those implementations and choose which one is efficient enough to suit your requirement.
I have N numbers in an ArrayList. To get the indexOf, the ArrayList will have to iterate at most N times, so the complexity is O(N). Is that correct?
Yes, complexity is O(N). From the Java API documentation:
The size, isEmpty, get, set, iterator, and listIterator operations run in constant time. The add operation runs in amortized constant time, that is, adding n elements requires O(n) time. All of the other operations run in linear time (roughly speaking). The constant factor is low compared to that for the LinkedList implementation.
Yes it's O(n) as it needs to iterate through every item in the list in the worst case.
The only way to achieve better than this is to have some sort of structure to the list. The most typical example being looking through a sorted list using binary search in O(log n) time.
Yes, that is correct. The order is based off the worst case.
100%, it needs to iterate through the list to find the correct index.
It is true. The best case is 1 step, so O(1); the average case is N/2 steps, so O(N); and the worst case is N steps, so O(N).
In the worst case you find the element at the very last position, which takes N steps, that is, O(N). In the best case the item you are searching for is the very first one, so the complexity is O(1). The average length is of the average number of steps. If we do not have further context, then this is how one can make the calculations:
avg = (1 + 2 + ... + n) / n = (n * (n + 1) / 2) / n = (n + 1) / 2
If n -> infinity, then adding a positive constant and dividing by a positive constant has no effect, we still have infinity, so it is O(n).
However if you have a large finite data to work with, then you might want to calculate the exact average value as above.
Also, you might have a context there which could aid you to get further accuracy in your calculations.
Example:
Let's consider the example where your array is ordered by usage frequency, descending. If your call of indexOf counts as a usage, then the most probable item is the first one, then the second, and so on. If you have the exact usage frequency for each item, then you will be able to calculate a probable wait time.
An ArrayList is an Array with more features. So the order of complexity for operations done to an ArrayList is the same as for an Array.
Why do we use hashing for search? What are the advantages of using hashing over a binary search tree?
Hashing is generally a constant time operation whereas a Binary Tree has a logarithmic time complexity.
Because a hash is calculated not based on the number of items in the collection but on the item being searched for, the size of the collection has no bearing on the time it takes to find an item. However most hashing algorithms will have collisions which then increases the time complexity so it's very unlikely to get a perfect constant time lookup.
With a binary tree, you have to do up to log2N comparisons before the item can be found.
Wikipedia explains it well:
http://en.wikipedia.org/wiki/Hash_table#Features
Summary: Inserts are generally slow, reads are faster than trees.
As for Java: any time you have some key/value pair that you read a lot and write not very often, and everything easily fits into RAM, use a HashTable for quick read accesses and incredible ease of code maintenance.
Hashing means using some function or algorithm to map object data to some representative integer value. This so-called hash code (or simply hash) can then be used as a way to narrow down our search when looking for the item in the map.

If you need an algorithm that is fast for looking up the information you need, then the HashTable is the most suitable to use, as it simply generates a hash of your key object and uses that to access the target data - it is O(1). The others are O(N) (a linked list of size N - you have to iterate through the list one at a time, an average of N/2 times) and O(log N) (a binary tree - you halve the search space with each iteration, but only if the tree is balanced, so this depends on your implementation; an unbalanced tree can have significantly worse performance).
Hash tables are best for searching (=) if you have few inserts and a uniform slot distribution. The time complexity is O(n+k) - linear.
They are not a good idea if you want to do comparison operations (<, >).
I want to write a java program that searches through a cipher text and returns a frequency count of the characters in the cipher, for example the cipher:
"jshddllpkeldldwgbdpked" will have a result like this:
2 letter occurrences:
pk = 2, ke = 2, ld = 2
3 letter occurrences:
pke = 2.
Is there any algorithm that allows me to do this as efficiently as possible?
The map strategy is a good one, but I'd go for HashMap<String, Integer>, since it's tuples of characters being counted.
Iterating over the characters in the ciphertext, you can save the last X characters and that will give you a map over all occurrences of substrings of length X+1.
The usual approach would be to use some kind of map to map your characters to their counts. You can use a HashMap<Character, Integer> for example. You can then iterate through your ciphertext, character-wise and either put the character into the map with a count of 1 (if it doesn't yet exist) or increment its count.
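Combining the two answers above, a sketch that counts every substring of a chosen length with a HashMap<String, Integer> (the method name is illustrative):

    import java.util.HashMap;
    import java.util.Map;

    // Slides a window of length `len` over the ciphertext and counts each substring it sees.
    static Map<String, Integer> countNgrams(String cipher, int len) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + len <= cipher.length(); i++) {
            String gram = cipher.substring(i, i + len);
            counts.merge(gram, 1, Integer::sum); // insert with count 1, or increment the existing count
        }
        return counts;
    }

Filtering the result of countNgrams("jshddllpkeldldwgbdpked", 2) for counts of at least 2 reproduces pairs such as pk = 2 and ke = 2 from the question.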
You could store the n-grams in a trie, reversing the normal order so the last character in the n-gram is at the top of the trie. Each node in the trie stores a character count. Loop over the string, keeping track of the last N characters (as Buhb suggests). Each time through the outer loop, you traverse the trie, using the last N characters to pick the path, starting with the last character and ending with the Nth-to-last, incrementing the counter of each node you visit.
To print the n-gram frequencies, perform a breadth-first traversal of the trie.
Overall performance left as an exercise.
Either have an array with a cell for each possible value (easy if the ciphertext is all lowercase characters - 26 of them - harder if not), or go for a Map where you pass in the character and increment its value. The array is quicker but less flexible.
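For the single-character case, the array variant mentioned here could look like this (assuming the ciphertext contains only lowercase a-z):

    // One counter per letter 'a'..'z'.
    static int[] countLetters(String cipher) {
        int[] counts = new int[26];
        for (int i = 0; i < cipher.length(); i++) {
            counts[cipher.charAt(i) - 'a']++;
        }
        return counts;
    }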
If the set of lengths of sequences you need is fixed, the obvious algorithm takes a linear number of counting operations (say, looking up a counter in a hashtable and incrementing it).
When you say "as efficiently as possible", do you propose to spend a lot of effort for a meagre constant-factor improvement, to search hopelessly for a sublinear algorithm, or do you not understand algorithm complexity classes at all?
You can use a hash or a graph (thanks to outis, I now know its proper name: such graphs are called "tries"). The hash will be slower, the trie will be faster. The hash will use less memory; the trie will take more in a bad implementation.
You cannot get it done using a plain array, since it will take a HUGE amount of memory if your maximum character-sequence length is equal to your text length and the text is long enough. If you limit the sequence length, it will take something like ([number of letters]^[max sequence length]) * 4 bytes, which is (52^4) * 4 ≈ 28 MB of memory for sequences of up to 4 lower/upper-case letters. If a limited sequence length is OK for you and this memory amount is acceptable, then the algorithm will be pretty easy for sequences of <= 4 letters.
You could start by looking for the largest possible repeatable sequence first, then work your way down from there. For example, if the string is 10 characters, the largest repeatable sequence that could occur would be 5 letters long, so first look for 5-letter sequences, then 4-letter ones, and so on until you reach 2. This should reduce the number of iterations in your program.
I don't have an answer in mind for this, but I feel this is the exact same algorithm as the one used by compression algorithms that create compressed files with the dictionary approach.
If I am not wrong, in this approach a dictionary is used in the following manner:
data:
abccccabaccabcaaaaabcaaabbbbbccccaaabcbbbbabbabab
parse 1:
key: *
value: abc
new data:
*cccabacc*aaaa*aaabbbbbccccaa*bbbbabbabab
Just an educated guess: I think (not sure here) that the standard "zip" file format uses this approach, so I suggest you look at those algorithms.