I would like to know the time complexity of the statement below (in Java 8):
list.stream().collect(groupingBy(...));
Any idea?
There is no general answer to that question, as the time complexity depends on all of the operations involved. Since the stream has to be processed entirely, there is a base time complexity of O(n) that has to be multiplied by the costs of all operations done per element. This assumes that the cost of the iteration itself is not worse than O(n), which is the case for most stream sources.
So, assuming there are no intermediate operations that affect the time complexity, groupingBy has to evaluate the classification function for each element. That evaluation should be independent of the other elements, so it does not affect the time complexity (regardless of how expensive it is, as the O(…) time complexity only tells us how the time scales with large numbers of stream elements). Then, the element is inserted into a map, which might depend on the number of elements already contained. Without a custom Map supplier, the map's type is unspecified, hence no statement can be made here.
In practice, it’s reasonable to assume that the result will be some sort of hashing map with a net O(1) lookup complexity by default. So we have a net time complexity of O(n) for the grouping. Then, we have the downstream collector.
The default downstream collector is toList(), which produces an unspecified List type, so again, we can’t say anything about the costs of adding elements to it.
The current implementation produces an ArrayList, which has to perform copy operations when the capacity is exceeded, but since the capacity is raised by a factor each time, there is still a net complexity of O(n) for adding n elements. It’s reasonable to assume that future changes to the toList() implementation won’t make the costs worse than what we have today. So the time complexity of a default groupingBy collection is likely O(n).
If we use a custom Map supplier and a custom downstream collector, the complexity depends on the ratio between the number of groups and the number of elements per group. The worst case would be the worse of the two, the map's lookup cost and the downstream collector's per-element processing cost (times the number of elements), as we could have one group containing all items, or each item in its own group.
But usually, you can predict the bias of a particular grouping operation, so you would want to work out the time complexity of that particular operation instead of relying on a statement about all grouping operations in general.
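To make this concrete, here is a small sketch (the element type and the classifier are just placeholders) contrasting the default form with one where the map type and the downstream collector are spelled out, so the per-element costs discussed above become visible:

import java.util.*;
import java.util.stream.*;
import static java.util.stream.Collectors.*;

public class GroupingByDemo {
    public static void main(String[] args) {
        List<String> list = Arrays.asList("apple", "avocado", "banana", "blueberry", "cherry");

        // Default form: unspecified Map and List implementations,
        // in practice a HashMap of ArrayLists, i.e. O(n) overall.
        Map<Character, List<String>> byDefault =
            list.stream().collect(groupingBy(s -> s.charAt(0)));

        // Explicit form: here each cost is visible.
        // TreeMap lookups are O(log g) for g groups, counting() is O(1) per element,
        // so the whole grouping is O(n log g).
        TreeMap<Character, Long> counts =
            list.stream().collect(groupingBy(s -> s.charAt(0), TreeMap::new, counting()));

        System.out.println(byDefault);
        System.out.println(counts);
    }
}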
I was reading about the data structures behind various text editors and had a couple of questions.
If I understand this correctly, searching/traversing a Piece Table is O(n), as you need to go through the linked list one by one, whereas a Rope only requires O(log n) time, as it is structured as a balanced binary tree.
Then how come this Stack Overflow answer and this IBM article claim that "Insertion become nearly constant time operations" and that "in most data-processing routines, sequential access to a rope's characters is what's required, in which case a rope iterator can provide amortized O(1) access speed"? Am I missing something here? Shouldn't they all be O(log n), since they need to find the position of the leaf first?
All feedback/answers are welcome!
Cheers!
You asked two questions:
Q1. How can "insertion become nearly constant time operations"?
A1. Because some people believe that O(log n) is "nearly constant time". You might disagree with those people.
Q2. How can "sequential access to a rope's characters is what's required, in which case a rope iterator can provide amortized O(1) access speed"?
A2. Sequential access is different from search. You posited "searching/traversing on Piece Table is O(n) as you need to go through the linked list one by one, whereas Rope only requires O(logn) time", but that statement lumps together two operations (searching and traversing) that have different access costs.
Traversing a data structure always takes at least Ω(n) time, because you have to visit every element. Search, on the other hand, can be O(1) for some data structures.
Now that we have that straightened out, let's examine the statement you question: what does "sequential access ... amortized O(1)" mean? It means that you can search for a given location and then traverse sequentially after it in O(1) amortized time. These bounds on balanced search trees are typically O(log n + k), where k is the number of items traversed. If k is Ω(log n), this is O(k), which is O(1) per item.
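To make that concrete with a quick calculation (under the usual O(log n + k) bound quoted above): reading k consecutive characters after a single search costs at most c1*log n + c2*k for some constants c1 and c2. Whenever k >= log n, this total is at most (c1 + c2)*k, i.e. a constant amount of work per character read, which is exactly what "amortized O(1) access" means here.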
For the add method of ArrayList, the Java API states:
The add operation runs in amortized constant time, that is, adding n elements requires O(n) time.
I wonder if it is the same time complexity, linear, when using the add method of a LinkedList.
This depends on where you're adding. E.g. if in an ArrayList you add to the front of the list, the implementation will have to shift all items every time, so adding n elements will run in quadratic time.
Similarly for the linked list: the implementation in the JDK keeps a pointer to the head and the tail. If you keep appending to the tail, or prepending in front of the head, the operation will run in linear time for n elements. If you insert at a different place, the implementation has to search the linked list for the right position, which may give you a worse runtime. Again, this depends on the insertion position; you'll get the worst time complexity when inserting in the middle of the list, as the maximum number of elements has to be traversed to find the insertion point.
The actual complexity depends on whether your insertion position is constant (e.g. always at the 10th position), or a function of the number of items in the list (or some arbitrary search on it). The first one will give you O(n) with a slightly worse constant factor, the latter O(n^2).
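As a rough illustration of the position effect (this is a toy measurement, not a proper benchmark; n is arbitrary):

import java.util.*;

public class InsertPositionDemo {
    public static void main(String[] args) {
        int n = 50_000;

        List<Integer> appendList = new ArrayList<>();
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) appendList.add(i);      // amortized O(1) per add, O(n) total
        long appendNanos = System.nanoTime() - t0;

        List<Integer> prependList = new ArrayList<>();
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) prependList.add(0, i);  // O(n) per add (shifts everything), O(n^2) total
        long prependNanos = System.nanoTime() - t1;

        System.out.printf("append:  %d ms%n", appendNanos / 1_000_000);
        System.out.printf("prepend: %d ms%n", prependNanos / 1_000_000);
    }
}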
In most cases, ArrayList outperforms LinkedList on the add() method, as it simply stores a reference at the end of its backing array and increments a counter.
If the backing array is not large enough, though, ArrayList grows it, allocating a new array and copying the contents. That's slower than adding a new element to a LinkedList, but if you keep adding elements, it only happens O(log(N)) times.
When we talk about "amortized" complexity, we take the average time per operation over a whole sequence of operations.
So, answering your question, it's not the same complexity: it's much faster (though still O(1)) in most cases, and much slower (O(N)) sometimes. What's better for you is better checked with a profiler.
If you mean the add(E) method (not the add(int, E) method), the answer is yes: the time complexity of adding a single element to a LinkedList is constant (adding n elements requires O(n) time).
As Martin Probst indicates, with different positions you get different complexities, but the add(E) operation will always append the element to the tail, resulting in a constant-time operation.
I have a gigantic data set which I have to store in a collection, and I need to find out whether there are any duplicates in it.
The data size could be more than 1 million entries. I know I can store more elements in an ArrayList compared to a Map.
My questions are:
Is searching for a key in a Map faster than searching a sorted ArrayList?
Is searching for a key in a HashMap faster than in a TreeMap?
Purely in terms of the space required to store n elements, which would be more efficient: a TreeMap or a HashMap?
1) Yes. A linear search of an ArrayList (contains()/indexOf()) is O(n) on average; even a binary search of a sorted one is O(log n). The performance of key lookups in a Map depends on the specific implementation. You could write an implementation of Map that is O(n) or worse if you really wanted to, but all the implementations in the standard library are faster than O(n).
2) Yes. HashMap is O(1) on average for simple key lookups. TreeMap is O(log(n)).
Class HashMap<K,V>
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets.
Class TreeMap<K,V>
This implementation provides guaranteed log(n) time cost for the containsKey, get, put and remove operations. Algorithms are adaptations of those in Cormen, Leiserson, and Rivest's Introduction to Algorithms.
3) The space requirements will be O(n) in both cases. I'd guess the TreeMap requires slightly more space, but only by a constant factor.
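Since the underlying goal in the question is duplicate detection, here is a minimal sketch of the usual approach, assuming the entries fit in memory and have sensible hashCode/equals (findDuplicate is just an illustrative name):

import java.util.*;

public class DuplicateCheck {
    // Returns the first duplicate found, or Optional.empty() if all entries are distinct.
    // HashSet.add() returns false when the element is already present, so the whole
    // scan is O(n) on average, versus O(n log n) for sort-then-scan on a list.
    static <T> Optional<T> findDuplicate(Collection<T> data) {
        Set<T> seen = new HashSet<>(data.size() * 2);   // presize to limit rehashing
        for (T item : data) {
            if (!seen.add(item)) {
                return Optional.of(item);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(findDuplicate(Arrays.asList("a", "b", "c", "b"))); // Optional[b]
    }
}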
It depends on the type of Map you're using.
A HashMap has a constant-time average lookup (O(1)), while a TreeMap's average lookup time is based on the depth of the tree (O(log(n))), so a HashMap is faster.
The difference is probably moot. Both data structures carry some constant per-element overhead by design, and both exhibit O(n) space complexity.
I just did some benchmark testing of lookup performance between a HashMap and a sorted ArrayList. The answer is that the HashMap is much faster as the size increases; I am talking about 10x, 20x, 30x faster. I ran tests with 1 million entries using a sorted ArrayList and a HashMap: the ArrayList get and add operations took seconds to complete, whereas the HashMap get and put only took around 50 ms.
Here is what I found or observed:
For the sorted ArrayList, you have to sort it first to be able to search it efficiently (with binarySearch, for example). In practice you don't just have a static list (the list will change via add or remove), so with that in mind you also need to change the add and get methods to do a "binary" operation to stay efficient (like binarySearch). Even with these binary operations, add and get become slower and slower as the list grows.
The HashMap, on the other hand, does not show much change in put and get times as it grows. The problem with the HashMap is memory overhead. If you can live with that, then go with the HashMap.
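To illustrate what such a "binary" add operation on a sorted ArrayList looks like (a sketch, not the benchmark code used above):

import java.util.*;

public class SortedListInsert {
    // Inserts value into an already-sorted list, keeping it sorted.
    // The search is O(log n), but the insertion still shifts elements: O(n) per insert.
    static void sortedInsert(List<Integer> sorted, int value) {
        int idx = Collections.binarySearch(sorted, value);
        if (idx < 0) {
            idx = -idx - 1;          // binarySearch returns (-(insertion point) - 1) when absent
        }
        sorted.add(idx, value);
    }

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>(Arrays.asList(1, 3, 5, 7));
        sortedInsert(list, 4);
        System.out.println(list);    // [1, 3, 4, 5, 7]
    }
}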
I need a Java data structure that has:
fast (O(1)) insertion
fast removal
fast (O(1)) max() function
What's the best data structure to use?
HashMap would almost work, but using java.util.Collections.max() is at least O(n) in the size of the map. TreeMap's insertion and removal are too slow.
Any thoughts?
O(1) insertion and O(1) max() are mutually exclusive once you also require fast removal.
An O(1)-insertion collection won't have O(1) max, as the collection is unsorted. An O(1)-max collection has to be kept ordered, and thus the insert is O(n). You'll have to bite the bullet and choose between the two. In both cases, however, the removal should be equally fast.
If you can live with slow removal, you could keep a variable holding the current highest element and compare against it on each insert; max and insert would then be O(1). Removal would then be O(n), though, as you have to find a new highest element whenever the removed element was the highest.
If you can have O(log n) insertion and removal, you can have O(1) max value with a TreeSet or a PriorityQueue. O(log n) is pretty good for most applications.
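For instance, a minimal sketch of that O(log n) route (the values are arbitrary):

import java.util.*;

public class MaxTrackingDemo {
    public static void main(String[] args) {
        // Max-heap: offer() and poll() are O(log n), peek() (the max) is O(1).
        PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Comparator.reverseOrder());
        maxHeap.offer(5);
        maxHeap.offer(42);
        maxHeap.offer(17);
        System.out.println(maxHeap.peek());   // 42
        // Note: remove(Object) on PriorityQueue is O(n); poll() removes the max in O(log n).

        // TreeSet: add() and remove() are O(log n); last() returns the max
        // (a short walk down the right spine of the tree).
        TreeSet<Integer> set = new TreeSet<>(Arrays.asList(5, 42, 17));
        System.out.println(set.last());       // 42
        set.remove(42);                       // O(log n) removal of an arbitrary element
        System.out.println(set.last());       // 17
    }
}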
If you accept that O(log n) is still "fast" even though it isn't "fast (O(1))", then some kind of heap-based priority queue will do it. See the comparison table for the different heaps you might use.
Note that Java's library PriorityQueue isn't very exciting: it only guarantees O(n) remove(Object).
For heap-based queues "remove" can be implemented as "decreaseKey" followed by "removeMin", provided that you reserve a "negative infinity" value for the purpose. And since it's the max you want, invert all mentions of "min" to "max" and "decrease" to "increase" when reading the article...
You cannot have O(1) removal + insertion + max.
Proof:
Assume you could, and call this data structure D.
Given an array A:
1. insert all elements of A into D.
2. create an empty linked list L.
3. while D is not empty:
3.1. x <- D.max(); D.delete(x); -- all O(1) by assumption
3.2. L.insert_first(x) -- O(1)
4. return L
Here we have built a sorting algorithm that runs in O(n), which is impossible: comparison-based sorting is known to be Omega(n log n). Contradiction! Thus, D cannot exist.
I'm very skeptical that TreeMap's log(n) insertion and deletion are too slow: log(n) time is practically constant with respect to most real applications. Even with 1,000,000,000 elements in your tree, if it's well balanced you will only perform log2(1,000,000,000) ≈ 30 comparisons per insertion or removal, which is comparable to the work any hash function would do.
Such a data structure would be awesome and, as far as I know, doesn't exist. Others have pointed this out.
But you can go further, if you don't mind making all of this a bit more complex.
If you can "waste" some memory and some programming effort, you can use several data structures at the same time, combining the pros of each one.
For example, I needed a sorted data structure but wanted to have O(1) lookups ("is the element X in the collection?"), not O(log n). I combined a TreeMap with a HashMap (which is not really O(1), but it is almost, when it's not too full and the hashing function is good) and I got really good results.
For your specific case, I would go for a dynamic combination between a HashMap and a custom helper data structure. I have in mind something rather complex (hash map + variable-length priority queue), but I'll go for a simple example. Just keep all the stuff in the HashMap, and then use a special field (currentMax) that only contains the max element in the map. When you insert() into your combined data structure, if the element you're going to insert is greater than the current max, then you do currentMax <- elementGoingToInsert (and you insert it into the HashMap).
When you remove an element from your combined data structure, you check if it is equal to the currentMax and if it is, you remove it from the map (that's normal) and you have to find the new max (in O(n)). So you do currentMax <- findMaxInCollection().
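A minimal sketch of that combined structure (the class name MaxTrackingSet is made up, and it stores plain elements in a HashSet rather than key/value pairs):

import java.util.*;

// Sketch of the idea above: O(1) insert and max, O(n) removal only when the max itself is removed.
class MaxTrackingSet<E extends Comparable<E>> {
    private final Set<E> elements = new HashSet<>();
    private E currentMax;                       // null while the set is empty

    void insert(E e) {                          // O(1) expected
        if (elements.add(e) && (currentMax == null || e.compareTo(currentMax) > 0)) {
            currentMax = e;
        }
    }

    E max() {                                   // O(1)
        return currentMax;
    }

    void remove(E e) {                          // O(1) expected, O(n) if e was the max
        if (elements.remove(e) && e.equals(currentMax)) {
            currentMax = elements.stream().max(Comparator.naturalOrder()).orElse(null);
        }
    }
}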
If the max doesn't change very frequently, that's damn good, believe me.
However, don't take anything for granted. You have to struggle a bit to find the best combination of different data structures. Do your tests, learn how frequently the max changes. Data structures aren't easy, and you can make a difference if you really work at combining them instead of looking for a magic one that doesn't exist. :)
Cheers
Here's a degenerate answer. I noted that you hadn't specified what you consider "fast" for deletion; if O(n) is fast then the following will work. Make a class that wraps a HashSet; maintain a reference to the maximum element upon insertion. This gives the two constant time operations. For deletion, if the element you deleted is the maximum, you have to iterate through the set to find the maximum of the remaining elements.
This may sound like it's a silly answer, but in some practical situations (a generalization of) this idea could actually be useful. For example, you can still maintain the five highest values in constant time upon insertion, and whenever you delete an element that happens to occur in that set you remove it from your list-of-five, turning it into a list-of-four etcetera; when you add an element that falls in that range, you can extend it back to five. If you typically add elements much more frequently than you delete them, then it may be very rare that you need to provide a maximum when your list-of-maxima is empty, and you can restore the list of five highest elements in linear time in that case.
As already explained: for the general case, no. However, if your range of values is limited, you can use a counting-sort-like algorithm to get O(1) insertion, and on top of that a linked list for moving the max pointer, thus achieving O(1) max and removal.
It seems they both let you retrieve the minimum, which is what I need for Prim's algorithm, and both force me to remove and reinsert a key to update its value. Is there any advantage to using one over the other, not just for this example, but generally speaking?
Generally speaking, it is less work to track only the minimum element, using a heap.
A tree is more organized, and it requires more computation to maintain that organization. But if you need to access any key, and not just the minimum, a heap will not suffice, and the extra overhead of the tree is justified.
There are 2 differences I would like to point out (and this may be more relevant to Difference between PriorityQueue and TreeSet in Java? as that question is deemed a dup of this question).
(1) PriorityQueue can have duplicates, whereas TreeSet can NOT have dups. So in a TreeSet, if your comparator deems 2 elements equal, TreeSet will keep only one of those 2 elements and throw away the other one.
(2) A TreeSet iterator traverses the collection in sorted order, whereas a PriorityQueue iterator does NOT traverse in sorted order. For a PriorityQueue, if you want to get the items in sorted order, you have to destroy the queue by calling remove() repeatedly.
I am assuming that the discussion is limited to Java's implementation of these data structures.
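A small demo of both points, assuming Java's standard PriorityQueue and TreeSet (the values are arbitrary):

import java.util.*;

public class QueueVsTreeSetDemo {
    public static void main(String[] args) {
        PriorityQueue<Integer> pq = new PriorityQueue<>(Arrays.asList(5, 1, 3, 1));
        TreeSet<Integer> ts = new TreeSet<>(Arrays.asList(5, 1, 3, 1));

        System.out.println(ts);                 // [1, 3, 5]  -- duplicate 1 dropped, iteration is sorted
        System.out.println(pq);                 // e.g. [1, 1, 3, 5] -- heap order, NOT guaranteed sorted

        // Draining the queue is the only way to see its elements in priority order:
        while (!pq.isEmpty()) {
            System.out.print(pq.poll() + " ");  // 1 1 3 5
        }
        System.out.println();
    }
}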
Totally agree with Erickson that a priority queue only gives you the minimum/maximum element.
In addition, because the priority queue is less powerful in maintaining the total order of the data, it has an advantage in some special cases. If you want to track the biggest M elements in an array of N, the time complexity would be O(N log M) and the space complexity would be O(M). If you do it in a map instead, the time complexity is O(N log N) and the space is O(N). This is quite fundamental, and in some cases we must use a priority queue, for example when M is just a constant like 10.
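A minimal sketch of that "biggest M elements out of N" pattern, using a size-bounded min-heap (the method name topM, the value of M, and the input values are arbitrary):

import java.util.*;

public class TopM {
    // Keeps the M largest values seen so far; each element costs O(log M), so the whole pass is O(N log M).
    static List<Integer> topM(int[] values, int m) {
        PriorityQueue<Integer> minHeap = new PriorityQueue<>();    // natural order: smallest on top
        for (int v : values) {
            minHeap.offer(v);
            if (minHeap.size() > m) {
                minHeap.poll();                                    // drop the smallest of the M+1
            }
        }
        return new ArrayList<>(minHeap);                           // the M largest, in heap order
    }

    public static void main(String[] args) {
        System.out.println(topM(new int[]{7, 2, 9, 4, 11, 5}, 3)); // contains 7, 9, 11 (heap order, not sorted)
    }
}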
A rule of thumb:
TreeMap maintains all elements in order. (So, intuitively, it takes more time to construct and maintain it.)
PriorityQueue only guarantees min or max. It's less expensive but less powerful.
It all depends on what you want to achieve. Here are the main points to consider before you choose one over the other.
PriorityQueue allows duplicates (i.e., elements with the same priority), while TreeMap doesn't.
Insertion into a PriorityQueue can cost O(n) when it has to grow its backing array, while insertion into a TreeMap is O(log n) (it is based on a Red-Black tree).
PriorityQueue is backed by an array, while in a TreeMap the nodes are linked to each other, so the contains method of PriorityQueue takes O(n) time while TreeMap's containsKey takes O(log n) time.
One of the differences is that remove(Object) and contains(Object) are linear O(N) in a normal heap based PriorityQueue (like Oracle's), but O(log(N)) for a TreeSet/Map.
So if you have a large number of elements and do a lot of remove(Object) or contains(Object), then a TreeSet/Map may be faster.
I may be late to this answer but still.
They have their own use-cases, in which either one of them is a clear winner.
For example:
1: https://leetcode.com/problems/my-calendar-i - TreeMap is the one you are looking for here.
2: https://leetcode.com/problems/top-k-frequent-words - you don't need the overhead of keys and values here.
So my answer would be: look at the use case and see whether it can be done without keys and values. If yes, go for a PriorityQueue; otherwise move to a TreeMap.
It depends on how you implement your priority queue. According to Cormen's book (2nd ed.), the fastest result is with a Fibonacci heap.
I find TreeMap to be useful when there is a need to do something like:
find the minimal/least key that is greater than or equal to some value, using ceilingKey()
find the maximum/greatest key that is less than or equal to some value, using floorKey()
If the above is not required, and it's mostly about having a quick option to retrieve the min/max - PriorityQueue might be preferred.
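For example (a small sketch; the keys and values are arbitrary):

import java.util.*;

public class NavigableLookupDemo {
    public static void main(String[] args) {
        TreeMap<Integer, String> bookings = new TreeMap<>();
        bookings.put(10, "a");
        bookings.put(20, "b");
        bookings.put(30, "c");

        System.out.println(bookings.ceilingKey(15)); // 20   -- least key >= 15, O(log n)
        System.out.println(bookings.floorKey(15));   // 10   -- greatest key <= 15, O(log n)
        System.out.println(bookings.ceilingKey(31)); // null -- no key >= 31
    }
}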
Their difference in time complexity is stated clearly in Erickson's answer.
On space complexity: although a heap and a TreeMap both take O(n) space, building them in actual programs takes different amounts of space and effort.
Say you have an array of numbers: you can build a heap in place in O(n) time with constant extra space. If you build a TreeMap from the given array, you need O(n log n) time and O(n) extra space to accomplish that.
One more thing to take into consideration: PriorityQueue offers an API (peek()) that returns the max/min value without removing it in O(1) time, while for a TreeMap retrieving the first key will still cost you O(log n).
This can be a clear advantage in read-only scenarios where you are only interested in the value at the top.
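For instance (a trivial sketch with arbitrary values):

import java.util.*;

public class PeekVsFirstKey {
    public static void main(String[] args) {
        PriorityQueue<Integer> pq = new PriorityQueue<>(Arrays.asList(3, 1, 2));
        System.out.println(pq.peek());        // 1 -- O(1), nothing is removed

        TreeMap<Integer, String> tm = new TreeMap<>();
        tm.put(3, "c"); tm.put(1, "a"); tm.put(2, "b");
        System.out.println(tm.firstKey());    // 1 -- O(log n) walk to the leftmost node
    }
}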