Which data structure can be used for storing a set of integers such that each of the following operations can be done in O(log N) time, where N is the number of elements?
deletion of the smallest element
insertion of an element if it is not already present in the set
PICK ONE OF THE CHOICES
A heap can be used, but not a balanced binary search tree
A balanced binary search tree can be used, but not a heap
Both balanced binary search tree and heap can be used
Neither balanced binary search tree nor heap can be used
I think it is the second one, "A balanced binary search tree can be used, but not a heap", because the worst-case complexity of insertion and search in a balanced search tree is O(log N).
And we cannot use a heap because, even in a binary heap, which is the fastest kind, the worst case for finding an element is O(N).
A balanced binary search tree can be used, but not a heap
Because:
In a balanced binary search tree, the smallest element is the leftmost node, so finding it means following left children through at most O(log N) nodes, and deleting it (including rebalancing) is also O(log N).
When inserting an element, all you do is traverse until you find the position (visiting at most O(log N) nodes) and add the new element as a left or right child, plus any rebalancing, which is still O(log N).
But in a heap, checking whether an element is already present requires scanning up to N nodes, because the heap property says nothing about which subtree an element is in.
Checking whether the element already exists in a balanced binary search tree costs nothing extra: it falls out of the same O(log N) traversal used for insertion.
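For concreteness, here is a minimal Java sketch using the standard library's TreeSet (a red-black tree, i.e. a balanced BST), which supports both required operations in O(log N):

import java.util.TreeSet;

public class LogNSetDemo {
    public static void main(String[] args) {
        TreeSet<Integer> set = new TreeSet<>(); // red-black tree: a balanced BST
        set.add(42);                            // insertion: O(log N)
        boolean added = set.add(42);            // false: the duplicate is rejected
                                                // by the same O(log N) descent
        int smallest = set.pollFirst();         // delete-min: O(log N)
        System.out.println(added + " " + smallest); // prints "false 42"
    }
}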
I have a set of numbers, [1, 3, 4, 5, 7]. I want to select a number from the set with a probability proportional to its value:
Number | Probability | %
-------|-------------|----
1      | 1/20        | 5
3      | 3/20        | 15
4      | 4/20        | 20
5      | 5/20        | 25
7      | 7/20        | 35
However, I want to be able to both update and query this set in less than O(n). Are there any data structures that would help me achieve that?
Preferably in Java, if such a structure already exists.
You can get O(log n) amortized querying and updating (including removing/inserting elements) using a Binary Indexed Tree, also known as a Fenwick tree. The idea is to use dynamic resizing, which is the same trick used in variable-size arrays and hash tables to get amortized constant time appends. This also implies that you should be able to get O(log n) worst-case bounds using the method from dynamic arrays of rebuilding a second array on the side, but this makes the code significantly longer.
First, we know that given a list of the partial sums of arr, and a random integer in [0, sum(arr)], we can do this in O(log n) time with a binary search. Specifically, if our random integer is r, we want the index of the rightmost partial sum less than or equal to r.
Now, we'll use the technique from this post on Fenwick trees to maintain and query the partial sums. That post's setting is slightly different from yours: they have a fixed set of n keys whose weights can be updated, without new insertions or deletions.
A Fenwick tree is an array that allows you to answer queries about partial sums of a 'base' array in O(log n) time per query, and can be built in O(n) time. In particular, you can
Find the index of the rightmost partial sum of arr less than or equal to r,
Set arr[i] to arr[i]+c for any integer c,
both in O(log n) time.
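A minimal Java sketch of such a Fenwick tree (the class and method names here are illustrative, not a standard API), including the weighted pick:

import java.util.concurrent.ThreadLocalRandom;

class FenwickTree {
    private final long[] tree; // 1-indexed; tree[i] covers a range of arr ending at i
    private final int n;

    FenwickTree(int n) {
        this.n = n;
        this.tree = new long[n + 1];
    }

    // arr[i] += c, in O(log n). i is 0-indexed.
    void add(int i, long c) {
        for (i++; i <= n; i += i & -i) tree[i] += c;
    }

    // Sum of all weights, in O(log n).
    long total() {
        long s = 0;
        for (int i = n; i > 0; i -= i & -i) s += tree[i];
        return s;
    }

    // Largest pos such that arr[0] + ... + arr[pos-1] <= r, in O(log n),
    // assuming 0 <= r < total(). With r drawn uniformly from [0, total()),
    // pos is exactly the index selected with probability proportional to its weight.
    int search(long r) {
        int pos = 0;
        for (int pw = Integer.highestOneBit(n); pw != 0; pw >>= 1) {
            if (pos + pw <= n && tree[pos + pw] <= r) {
                pos += pw;
                r -= tree[pos];
            }
        }
        return pos;
    }

    public static void main(String[] args) {
        long[] weights = {1, 3, 4, 5, 7};
        FenwickTree ft = new FenwickTree(weights.length);
        for (int i = 0; i < weights.length; i++) ft.add(i, weights[i]);
        long r = ThreadLocalRandom.current().nextLong(ft.total()); // uniform in [0, 20)
        System.out.println(ft.search(r)); // prints i with probability weights[i]/20
    }
}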
Start by appending n zeros to arr (it is now half full), and build its Fenwick tree. We can treat 'removing' an element as setting its weight to 0. Inserting an element is done by taking the zero after the rightmost nonzero element in arr as the new element's spot. The removed elements and new elements may eventually cause our array to fill up: if we reach 75% capacity, rebuild our array and Fenwick tree, doubling the array size (pad with zeros on the right) and deleting all the zero-weight elements. If we reach 25% capacity, shrink the array to half size, rebuilding the Fenwick tree as well.
You'll need to maintain arr constantly to be able to rebuild, so all updates must be done on arr and the Fenwick tree. You'll also need a hashmap from array indices to your keys for random selection.
The good part is that you don't need to modify the Fenwick tree internals at all: given a Fenwick tree implementation in Java that supports initialization, array updates and the binary search, you can treat it as a black box. This stops being true if you want worst-case time guarantees: then, you'll need to copy the internal state of the Fenwick tree, piece by piece, which has some complications.
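A rough sketch of that bookkeeping, built on the FenwickTree sketch above (names and thresholds are illustrative; the rebuild here is the simple O(n log n) one, which still gives O(log n) amortized per operation because rebuilds happen only every Θ(n) operations):

class ResizableSampler {
    private long[] arr;   // weights; 0 marks a removed slot
    private int next;     // first free slot after the rightmost used one
    private FenwickTree ft;

    ResizableSampler(int initialCapacity) {
        arr = new long[initialCapacity];
        ft = new FenwickTree(initialCapacity);
    }

    int insert(long weight) {
        if (next >= arr.length * 3 / 4) rebuild(arr.length * 2); // 75% full: grow
        arr[next] = weight;
        ft.add(next, weight);
        return next++;
    }

    void remove(int i) {
        ft.add(i, -arr[i]); // 'removal' = setting the weight to 0
        arr[i] = 0;
        // a live-element counter and a 25% threshold would trigger
        // rebuild(arr.length / 2) here; omitted for brevity
    }

    private void rebuild(int newSize) {
        long[] packed = new long[newSize];
        int k = 0;
        for (long w : arr) if (w != 0) packed[k++] = w; // drop zero-weight slots
        arr = packed;
        next = k;
        ft = new FenwickTree(newSize);
        for (int i = 0; i < k; i++) ft.add(i, arr[i]);
        // note: rebuilding shifts indices, so the index-to-key hashmap
        // mentioned above must be rebuilt here as well
    }
}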
At what array size is it better to use Sequential Search over Binary Search (which requires the data to be sorted first) in these specific situations? The first case is when all the values of the array are random numbers, not sorted. The second case is when the values of the array are sorted numerically, either least to greatest or greatest to least. For the searches, assume you are only trying to find one number in the array.
Case 1: Random numbers
Case 2: Already sorted
The Sequential Search algorithm has a worst-case running time of O(n) and does not depend on whether the data is sorted or not.
The Binary Search algorithm has a worst-case running time of O(log n); however, in order to use the algorithm, the data must be sorted. If the data is not sorted, sorting it first will take O(n log n) time.
Therefore:
Case 1: When the data is not sorted, a Sequential Search is more time efficient, as it takes O(n) time. A Binary Search would require the data to be sorted first in O(n log n) and then searched in O(log n), so the total time complexity would be O(n log n) + O(log n) = O(n log n).
Case 2: When the data is already sorted, a Binary Search is more time efficient, as it takes only O(log n) time, while the Sequential Search still takes O(n) time.
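A quick Java illustration of the two cases (sequentialSearch is a hypothetical helper written for this example; Arrays.binarySearch is the standard library call):

import java.util.Arrays;

public class SearchCases {
    // Case 1 (unsorted data): O(n) scan, no preprocessing needed.
    static int sequentialSearch(int[] a, int key) {
        for (int i = 0; i < a.length; i++)
            if (a[i] == key) return i;
        return -1; // not found
    }

    public static void main(String[] args) {
        int[] unsorted = {7, 3, 5, 1, 4};
        System.out.println(sequentialSearch(unsorted, 5)); // prints 2

        // Case 2 (already sorted): O(log n) per lookup.
        int[] sorted = {1, 3, 4, 5, 7};
        System.out.println(Arrays.binarySearch(sorted, 5)); // prints 3
    }
}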
Once the data is sorted, binary search is always better. But binary search requires the array to be sorted first.
Why is contains(Object) O(log n) in TreeMap but O(n) in PriorityQueue, given that PriorityQueue uses a binary heap (a special kind of binary tree) internally? They both use a tree, yet PriorityQueue's contains is O(n).
Not all binary trees support O(log(N)) searches. Balanced binary search trees support that, but the tree underlying a PriorityQueue is a binary heap, not a binary search tree. With a binary search tree, you can tell which subtree to search in at each step. With a binary heap, the heap invariant isn't enough to determine where to look for an element.
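A side-by-side sketch of the two calls (the complexities in the comments reflect how the standard implementations behave):

import java.util.PriorityQueue;
import java.util.TreeMap;

public class ContainsDemo {
    public static void main(String[] args) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        TreeMap<Integer, Integer> bst = new TreeMap<>(); // red-black tree
        for (int i = 0; i < 1_000_000; i++) {
            heap.add(i);
            bst.put(i, i);
        }
        // Linear scan of the backing array: O(n), because the heap
        // invariant cannot tell it which subtree to skip.
        System.out.println(heap.contains(999_999));
        // Guided descent through the search tree: O(log n).
        System.out.println(bst.containsKey(999_999));
    }
}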
This is the implementation of add in a binary search tree, from BST Add:
private IntTreeNode add(IntTreeNode root, int value) {
    if (root == null) {
        // Reached an empty spot: this is where the new value belongs.
        root = new IntTreeNode(value);
    } else if (value <= root.data) {
        root.left = add(root.left, value);   // smaller (or equal) values go left
    } else {
        root.right = add(root.right, value); // larger values go right
    }
    return root;
}
I understand why this runs in O(log n). Here's how I analyze it: we have a tree of size n, and I ask how many cuts in half will reduce this tree down to a size of 1. That gives the equation n(1/2)^x = 1, where the 1/2 represents each halving. Solving for x yields x = log2(n), which is where the log n comes from.
Here is a lecture slide from Heap that discusses the runtime for an unbalanced binary search tree.
My question is: even if the binary search tree is unbalanced, wouldn't the same strategy work for analyzing the runtime of add? How many cuts do you have to make? Wouldn't the runtime still be O(log n), not O(n)? If not, can someone show the math of why it would be O(n)?
With an unbalanced tree:
1
\
2
\
3
\
4
\
5
\
...
Your intuition of cutting the tree in half with each operation no longer applies. This unbalanced tree is the worst case of an unbalanced binary search tree. To search for 10 at the bottom of such a chain, you must make 10 operations, one for each element in the tree. That is why a search operation on an unbalanced binary search tree is O(n): this degenerate tree is equivalent to a linked list. Each operation doesn't cut off half the tree; it only rules out the one node you've already visited.
That is why specialized versions of binary search trees, such as red-black trees and AVL trees, are important: they keep the tree balanced well enough that all operations (search, insert, delete) are still O(log n).
The O(n) situation in a BST happens when you have either the minimum or the maximum at the root, effectively turning your BST into a linked list. Suppose you added the elements 1, 2, 3, 4, 5: the resulting BST is a linked list, because every element has only a right child. Adding 6 would then have to go right at every single node, passing through all the elements, hence making the asymptotic complexity of add O(n).
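A short driver that reproduces this, reusing the add method shown above (IntTreeNode is assumed to be a simple node class with data, left, and right fields):

IntTreeNode root = null;
for (int v = 1; v <= 5; v++) {
    root = add(root, v); // each call walks the entire right spine before appending
}
// root is now the chain 1 -> 2 -> 3 -> 4 -> 5: every node has only a right
// child, so the next call, add(root, 6), visits all 5 existing nodes: O(n).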
I am confused by the performance analysis of binarySearch from the Collections class.
It says:
If the specified list does not implement the RandomAccess interface
and is large, this method will do an iterator-based binary search that
performs O(n) link traversals and O(log n) element comparisons.
I am not sure how to interpret this O(n) + O(log n).
I mean, isn't it worse than simply traversing the linked list and comparing? We still get only O(n).
So what does this statement mean about performance? As phrased, I cannot see the difference from a plain linear search in the linked list.
What am I misunderstanding here?
First of all, you must understand that without the RandomAccess interface, binarySearch cannot simply access a random element of the list; instead, it has to walk there with an iterator. That introduces an O(n) cost. When the collection implements RandomAccess, the cost of each element access is O(1) and can be ignored as far as asymptotic complexity is concerned.
Because O(n) is greater than O(log n), it will always take precedence over O(log n) and dominate the complexity. In this case binarySearch has the same asymptotic complexity as a simple linear search. So what is the advantage?
Linear search performs O(n) comparisons, as opposed to O(log n) comparisons with binarySearch without random access. This is especially important when the constant factor on the O(log n) term is high; in plain English, when a single comparison has a very high cost compared to advancing the iterator. That is quite a common scenario, so limiting the number of comparisons is beneficial. Profit!
Binary search is not suited for linked lists. The algorithm is supposed to benefit from a sorted collection with random access (like a plain array), where it can quickly jump from one element to another, splitting the remaining search space in two on each iteration (hence the O(log N) time complexity).
For a linked list, there is a modified version which iterates through all elements (and needs to go through 2n elements in the worst case), but instead of comparing every element, it "probes" the list at specified positions only (hence doing a lower number of comparisons compared to a linear search).
Since a comparison is usually more costly than plain pointer iteration, the total time should be lower. That is why the log N part is emphasized separately.
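A minimal sketch of that probing strategy, mirroring the idea behind the JDK's iterator-based search (simplified for illustration; not the actual java.util.Collections source):

import java.util.List;
import java.util.ListIterator;

public class IteratorBinarySearch {
    // Walks the iterator forward or backward to `index` and returns that
    // element. These moves are the O(n) link traversals; no comparisons here.
    private static <T> T get(ListIterator<? extends T> it, int index) {
        T obj = null;
        int pos = it.nextIndex();
        if (pos <= index) {
            do { obj = it.next(); } while (pos++ < index);
        } else {
            do { obj = it.previous(); } while (--pos > index);
        }
        return obj;
    }

    // O(n) link traversals overall, but only O(log n) comparisons.
    static <T extends Comparable<? super T>> int search(List<? extends T> list, T key) {
        int low = 0, high = list.size() - 1;
        ListIterator<? extends T> it = list.listIterator();
        while (low <= high) {
            int mid = (low + high) >>> 1;
            int cmp = get(it, mid).compareTo(key); // the expensive operation
            if (cmp < 0) low = mid + 1;
            else if (cmp > 0) high = mid - 1;
            else return mid; // found
        }
        return -(low + 1); // not found; same convention as Collections.binarySearch
    }
}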