time range check with bit array and bit mask - java

I have a Java collection of <String username, ArrayList loginTimes>. For example, one entry might look like ["smith", [2012-10-2 08:04:23, 2012-10-4 06:34:21]]. The times have one-second resolution. I am looking to output a list of usernames for all users that have at least two logins that are more than 24 hrs but less than 7 days apart.
There is a simple O(n^2) way to do this where for a given user you compare each login time to every other login time and check to see if they match the required conditions. There are also a few O(nlogn) methods, such as storing the loginTimes as a binary search tree, and for each login time (N of them), look through the tree (log N) to see if there is another login time to match the requirements.
My understanding is that there is also a solution (O(n) or better?) where you create a bit array (BitSet) from the login times, and use some sort of a mask to check for the required conditions (at least two login times 24 hrs apart but less than 7 days apart). Anybody know how this could be achieved? Or other possible efficient (O(n) or better) solutions?

You can do it in O(M * N log N), where M is the number of users (the size of the collection) and N the average length of loginTimes (it's an ArrayList):
For every user in the collection do:
1- sort the list loginTimes. This is an O(N log N) task
2- scan the sorted list and check whether your constraints apply. This can be done in O(N) time.
So, for every user the total cost is O(N log N) + O(N) => O(N log N)
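A minimal sketch of that sort-and-scan, assuming the login times are held as java.time.Instant values (the class and method names here are made up for illustration):

import java.time.Duration;
import java.time.Instant;
import java.util.Collections;
import java.util.List;

class LoginWindowCheck {
    // true if some pair of logins is more than 24 hours but less than 7 days apart
    static boolean hasQualifyingPair(List<Instant> loginTimes) {
        Collections.sort(loginTimes);                               // O(N log N)
        int j = 0;
        for (int i = 0; i < loginTimes.size(); i++) {               // two-pointer scan, O(N) total
            if (j <= i) j = i + 1;
            // advance j to the first login more than 24 hours after login i;
            // j only moves forward across the whole loop, so the scan stays linear
            while (j < loginTimes.size()
                    && !loginTimes.get(j).isAfter(loginTimes.get(i).plus(Duration.ofHours(24)))) {
                j++;
            }
            // that first candidate is the only one to test: if it is already 7 or more
            // days away, every later login is even further away
            if (j < loginTimes.size()
                    && loginTimes.get(j).isBefore(loginTimes.get(i).plus(Duration.ofDays(7)))) {
                return true;
            }
        }
        return false;
    }
}

Running this once per user and collecting the usernames for which it returns true keeps the per-user cost at O(N log N).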

Related

Why is a Hash Table with linked lists considered to have constant time complexity?

In my COMP class last night we learned about hashing and how it generally works when trying to find an element x in a hash table.
Our scenario was that we have a dataset of 1000 elements inside our table and we want to know if x is contained within that table.
Our professor drew up a Java array of 100 and said that to store these 1000 elements that each position of the array would contain a pointer to a Linked List where we would keep our elements.
Assuming the hashing function perfectly mapped each of the 1000 elements to a value between 0 and 99 and stored the element at the position in the array, there would be 1000/100 = 10 elements contained within each linked list.
Now to know whether x is in the table, we simply hash x, find its hash value, look up that slot in the array and iterate over our linked list to check whether x is in the table.
My professor concluded by saying that the expected complexity of finding whether x is in the table is O(10) which is really just O(1). I cannot understand how this is the case. In my mind, if the dataset is N and the array size is n then it takes on average N/n steps to find x in the table. Isn't this by definition not constant time, because if we scale up the data set the time will still increase?
I've looked through Stack Overflow and online and everyone says hashing is expected time complexity of O(1) with some caveats. I have read people discuss chaining to reduce these caveats. Maybe I am missing something fundamental about determining time complexity.
TLDR: Why does it take O(1) time to lookup a value in a hash table when it seems to still be determined by how large your dataset is (therefore a function of N, therefore not constant).
In my mind, if the dataset is N and the array size is n then it takes on average N/n steps to find x in the table.
This is a misconception, as hashing simply requires you to calculate the correct bucket (in this case, array index) that the object should be stored in. This calculation will not become any more complex if the size of the data set changes.
These caveats that you speak of are most likely hash collisions: where multiple objects share the same hashCode (or at least the same bucket); these can be reduced, though not entirely prevented, with a better hash function.
The complexity of a hashed collection for lookups is O(1) because the size of lists (or in Java's case, red-black trees) for each bucket is not dependent on N. Worst-case performance for HashMap if you have a very bad hash function is O(log N), but as the Javadocs point out, you get O(1) performance "assuming the hash function disperses the elements properly among the buckets". With proper dispersion the size of each bucket's collection is more-or-less fixed, and also small enough that constant factors generally overwhelm the polynomial factors.
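As a rough illustration (this is not HashMap's actual code, which also mixes the high hash bits, but the cost shape is the same), finding the bucket is a fixed handful of operations no matter how many entries the table holds:

int bucketIndexFor(Object key, int tableLength) {
    int h = key.hashCode();                  // one call, independent of N
    return (h & 0x7fffffff) % tableLength;   // one mask and one modulo, also independent of N
}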
There are multiple issues here, so I will address them one by one:
Worst case analysis vs average case analysis:
Worst case analysis refers to the absolute worst scenario that your algorithm can be given, in terms of running time. As an example, if I am given an array of unordered elements and told to find an element in it, my best case scenario is when the element is at index [0]. The worst possible input is when the element is at the end of the array, in which case, if my data set has n elements, I run n times before finding the element. In the average case, however, the element is somewhere in the middle of the array, so I will run n-k steps (where k is the number of elements after the element I am looking for in the array).
Worst case analysis of Hashtables:
There is only one kind of hash table that has guaranteed constant time O(1) access to its elements: an array. (And even then it's not strictly true, due to paging and the way OSes handle memory.) The worst possible case that I could give you for a hash table is a data set where every element hashes to the same index. So for example, if every single element hashes to index 1, then due to collisions the worst case running time for accessing a value is O(n). This is unavoidable; hash tables always have this behaviour.
Average and best case scenario of hashtables:
You will rarely be given a set that gives you the worst possible case scenario. In general you can expect objects to be hashed to different indexes in your hashtable. Ideally the hash function hashes things in a very spread out manner so that objects get hashed to different indexes in the hash table.
In the specific example your teacher gave you, if 2 things get hashed to the same index, they get put in a linked list. So this is more or less how the table got constructed:
get element E
use the hashing function hash(E) to find the index i in the hash table
add E to the linked list at hashTable[i].
repeat for all the elements in the data set
So now, let's say I want to find whether element E is on the table. Then:
do hash(E) to find the index i where E is potentially hashed
go to hashTable[i] and iterate through the linked list (up to 10 iterations)
If E is found, then E is in the hash table; if E is not found, then E is not in the table.
The reason we can guarantee that E is not in the table if we can't find it there is that, if E were in the table, it would have been hashed to hashTable[i], so that is the only place it could be.
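A minimal sketch of the chained table described above, with a fixed number of buckets as in the professor's example (real implementations resize and mix the hash bits more carefully):

import java.util.LinkedList;

class ChainedTable<E> {
    private final LinkedList<E>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedTable(int bucketCount) {
        buckets = new LinkedList[bucketCount];
        for (int i = 0; i < bucketCount; i++) buckets[i] = new LinkedList<>();
    }

    private int indexOf(Object e) {
        return (e.hashCode() & 0x7fffffff) % buckets.length;    // hash once, constant time
    }

    void add(E e) {
        buckets[indexOf(e)].add(e);                 // append to that bucket's list
    }

    boolean contains(E e) {
        return buckets[indexOf(e)].contains(e);     // scan a single bucket only
    }
}

With 1000 elements, 100 buckets and a well-dispersing hashCode, contains scans about 10 nodes, which is the O(10) = O(1) figure from the lecture.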

Better algorithmic approach to showing trends of data per week

Suppose I have a list of projects with start date and end date. I also have a range of weeks, which varies (could be over months, years, etc)
I would like to display a graph showing 4 values per week:
projects started
projects closed
total projects started
total projects closed
I could loop over the range of weekly values, and for each week iterate through my list of projects and calculate values for each of these 4 trends per week. This would have algorithmic complexity O(nm), where n is the length of the list of weeks and m is the length of the projects list. That's not so great.
Is there a more efficient approach, and if so, what would it be?
If it's pertinent, I'm coding in Java
While what user yurib has said is true, there is a more efficient solution. Keep two arrays in memory, projects_started and projects_ended, both of size 52. Loop through your list of projects and for each project increment the corresponding value in both arrays. Something like:
projects_started[projects[i].start_week]++;
projects_ended[projects[i].end_week]++;
After the loop you have all the data you need to make a graph. Complexity is O(m).
EDIT: okay, so the maximum number of weeks can vary apparently, but as long as it's smaller than some ludicrous number (say, more than a million) this algorithm still works. Just replace 52 with n. Time complexity is O(m), space complexity is O(n).
EDIT: in order to determine the value of total projects started and ended you have to iterate through the two arrays that you now have and just add up the values. You could do this while populating the graph:
for (int i = 0; i < n; i++)
{
    total_started_in_this_week += projects_started[i];
    total_ended_in_this_week += projects_ended[i];
    // add new item to the graph
}
I'm not sure what the difference between "project" and "total" is, but here's a simple O(n log n) way to calculate the number of projects started and closed in each week:
For each project, add its start and end points to a list.
Sort the list in increasing order.
Walk through the list, pulling out time points until you hit a time point that occurs in a later week. At this point, "projects started" is the total number of start points you have hit, and "projects ended" is the total number of end points you have hit: report these counters, and reset them both to zero. Then continue on to process the next week.
Incidentally, if there are some weeks in which no projects start or end, this procedure will skip over them. If you want to report these weeks as "0, 0" totals, then whenever you output a week that has some nonzero total, make sure you first output as many "0, 0" weeks as it takes to fill in the gap since the last nonzero-total week. (This is easy to do just by setting a lastNonzeroWeek variable each time you output a nonzero-total week.)
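A minimal sketch of that sweep, assuming "week" means ISO week-of-year and that all dates fall within a single year (the Event type is made up for illustration):

import java.time.LocalDate;
import java.time.temporal.IsoFields;
import java.util.Comparator;
import java.util.List;

class WeeklySweep {
    // one boundary point of a project: either its start date or its end date
    record Event(LocalDate date, boolean isStart) {}

    static void report(List<Event> events) {
        events.sort(Comparator.comparing(Event::date));           // O(n log n)
        int started = 0, ended = 0;
        Integer currentWeek = null;
        for (Event e : events) {                                  // single O(n) walk
            int week = e.date().get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
            if (currentWeek != null && week != currentWeek) {
                System.out.printf("week %d: started=%d, ended=%d%n", currentWeek, started, ended);
                started = 0;
                ended = 0;                                        // gap weeks would be emitted here as "0, 0"
            }
            currentWeek = week;
            if (e.isStart()) started++; else ended++;
        }
        if (currentWeek != null) {
            System.out.printf("week %d: started=%d, ended=%d%n", currentWeek, started, ended);
        }
    }
}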
First of all, I guess that actually performance won't be an issue; this looks like a case of "premature optimization". You should first do it, then do it right, then do it fast.
I suggest you use maps, which will make your code more readable and outsources implementation details (like performance).
Create a HashMap from int (representing the week number) to Set<Project>, then iterate over your projects and for each one, put it into the map at the right place. After that, iterate over the map's key set (= all non-empty weeks) and do your processing for each one.
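A minimal sketch of that grouping; the Project record and its week fields are hypothetical stand-ins for your own types:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class WeeklyBuckets {
    record Project(String name, int startWeek, int endWeek) {}

    static Map<Integer, Set<Project>> groupByStartWeek(Iterable<Project> projects) {
        Map<Integer, Set<Project>> byWeek = new HashMap<>();
        for (Project p : projects) {
            // create the week's bucket on first use, then add the project to it
            byWeek.computeIfAbsent(p.startWeek(), w -> new HashSet<>()).add(p);
        }
        return byWeek;   // iterate over keySet() (the non-empty weeks) for the per-week processing
    }
}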

Stream of numbers and best space complexity to find n/2th element

I was trying to solve this problem where a stream of numbers of length not more than M will be given. You don't know the exact length of the stream but are sure that it won't exceed M. At the end of the stream, you have to tell the N/2th element of the stream, given that N elements came in the stream. What would be the best space complexity with which you can solve this problem?
my solution:
I think we can take a queue of size m/2, push two elements, then pop one element, and keep going until the stream is over. The n/2th element will then be at the head of the queue. Time complexity will be at least O(n) for any approach, but for this one the space complexity is m/2. Is there any better solution?
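A minimal sketch of the queue idea from the question, assuming the stream arrives as an Iterator; the exact rounding convention for "the n/2th element" can be adjusted, here it returns the element at position ceil(n/2):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

class StreamMiddle {
    static <T> T middle(Iterator<T> stream) {
        Deque<T> buffer = new ArrayDeque<>();
        long count = 0;
        while (stream.hasNext()) {
            buffer.addLast(stream.next());
            count++;
            // after every second element beyond the first, drop one from the front,
            // so the buffer never holds much more than half of what has been seen
            if (count > 1 && count % 2 == 1) {
                buffer.pollFirst();
            }
        }
        return buffer.peekFirst();   // null if the stream was empty
    }

    public static void main(String[] args) {
        System.out.println(middle(List.of(10, 20, 30, 40, 50).iterator()));   // prints 30
    }
}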
I hope it is obvious that you will need at least N/2 memory (unless you can re-iterate through your stream, reading the same data again). Your algorithm uses M/2; given that N is upper-bounded by M, it might look like it doesn't matter which you choose, since N can go up to M.
But it doesn't have to. If N is much smaller than M (for example N=5 and M=1,000,000), then you would waste a lot of resources.
I would recommend some dynamically growing array structure, something like ArrayList, though that is not good for removing the first element.
Conclusion: You can have O(N) both time and memory complexity, and you can't get any better.
Friendly edit regarding ArrayList: adding an element to an ArrayList is in "amortized constant time", so adding N items is O(N) in time. Removing them, however, is linear (per JavaDoc) so you can definitely get O(N) in time and space but ONLY IF you don't remove anything. If you do remove, you get O(N) in space (O(N/2) = O(N)), but your time complexity goes up.
Do you know the "tortoise and hare" algorithm? Start with two pointers to the beginning of the input. Then at each step advance the hare two elements and the tortoise one element. When the hare reaches the end of the input the tortoise is at the midpoint. This is O(n) time, since it visits each element of the input once, and O(1) space, since it keeps exactly two pointers regardless of the size of the input.
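A minimal sketch of that idea, assuming the input can be traversed by two independent iterators (for example a List or another re-iterable collection); for a strict one-pass stream the elements between the two pointers would still have to be buffered somewhere:

import java.util.Iterator;
import java.util.List;

class TortoiseHare {
    static <T> T midpoint(Iterable<T> input) {
        Iterator<T> tortoise = input.iterator();
        Iterator<T> hare = input.iterator();
        T mid = null;
        while (hare.hasNext()) {
            hare.next();                      // the hare takes two steps...
            if (hare.hasNext()) hare.next();
            mid = tortoise.next();            // ...the tortoise takes one
        }
        return mid;                           // null for empty input
    }

    public static void main(String[] args) {
        System.out.println(midpoint(List.of(1, 2, 3, 4, 5)));   // prints 3
    }
}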

Data Structure to Find Next Greatest Value Below Arbitrary Number

I'm trying to find a data structure to use in my Java project. What I'm trying to do is get the next greatest value below an arbitrary number from a set of numbers, or be notified if no such number exists.
Example 1)
My Arbitrary number is 7.0.
{3.1, 6.0, 7.13131313, 8.0}
The number I'd need to get from this set would be 6.0.
Example 2)
My arbitrary number is 1.0.
{2.0, 3.5555, 999.0}
A next highest number doesn't exist in the set, so I'd need to know it doesn't exist.
The best I can think of is stepping through an array, comparing as I go, and going back one step once I pass my arbitrary number. In the worst case, though, my time complexity would be O(n). Is there a better way?
If you can pre-process the list of values, then you can sort the list (O(N log N) time) and perform a binary search, which will take O(log N) for each value you want an answer for. Otherwise you can't do better than O(N).
You need to sort the numbers first.
Then you can do a simple binary search whose compare function is modified to your needs. At every point, check whether the element is bigger than the input; if so, search the left half, otherwise the right half. At the end, your modified binary search should be able to give you both the immediately bigger and the immediately smaller number, with which you can solve your problem easily. Complexity is O(lg n).
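A minimal sketch of that modified binary search over a pre-sorted array (only the immediately smaller neighbour is returned here; the immediately bigger one is found symmetrically):

class PredecessorSearch {
    // returns the index of the largest element strictly less than target, or -1 if none exists;
    // 'values' must be sorted in ascending order
    static int indexOfPredecessor(double[] values, double target) {
        int lo = 0, hi = values.length - 1, result = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] < target) {
                result = mid;        // candidate; a larger one may still exist to the right
                lo = mid + 1;
            } else {
                hi = mid - 1;        // element is >= target, look left
            }
        }
        return result;
    }
}

For the question's first example, indexOfPredecessor(new double[] {3.1, 6.0, 7.13131313, 8.0}, 7.0) returns index 1 (the value 6.0); for the second, indexOfPredecessor(new double[] {2.0, 3.5555, 999.0}, 1.0) returns -1.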
I suggest that you look at either TreeSet.floor(...) or TreeSet.lower(...). One of those should satisfy your requirements, and both have O(logN) complexity ... assuming that you have already built the TreeSet instance.
If you only have a sorted array and don't want the overhead of building a TreeSet, then a custom binary search is the probably the best bet.
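For example, with the sets from the question (lower returns the greatest element strictly less than the argument, or null if there is none):

import java.util.List;
import java.util.TreeSet;

class TreeSetLowerDemo {
    public static void main(String[] args) {
        TreeSet<Double> set1 = new TreeSet<>(List.of(3.1, 6.0, 7.13131313, 8.0));
        System.out.println(set1.lower(7.0));    // prints 6.0
        TreeSet<Double> set2 = new TreeSet<>(List.of(2.0, 3.5555, 999.0));
        System.out.println(set2.lower(1.0));    // prints null: no such number exists
    }
}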
Both of your example sets look sorted...
If that is the case, then you would need a binary search...
If it's not the case, then you would need to visit every element exactly once, so it would take time n.

Java Big-O performance

I have a question about the performance of my class project.
I have about 5000 game objects formed from reading a text file. I have a TreeMap (called supertree) that holds TreeMaps as its nodes (mini treemaps, I guess). These nodes/mini treemaps are action, strategy, adventure, sports, gametitle, etc. Basically game genres, and these mini trees hold the game objects. So the supertree itself will hold probably 8 nodes/treemaps.
When I insert a game object, it will determine which mini tree it should go into and put it there. For example, if I insert the game Super Mario World, it will check which genre it is and see that it's adventure, so Super Mario World would be inserted into the adventure tree.
So my question is: what would the performance be if I list all the action games, given that a TreeMap get is O(log n)?
First at the super tree it will look for the Action Node/Treemap, which will take O(log n).
Then once inside the Action treemap, it will do a get for all elements, which would be O(n log n), correct?
So the total performance would be log n * (n log n), correct? Which is worse than O(n).
[edit]
Hopefully this clarified my post a bit.
While the get on the supermap depends on the number of categories (O(log n_categories) for a TreeMap), going through the inner map (using an iterator) is O(n_games). If n_categories has an upper bound of, say, 10 (because the number of categories doesn't change when adding new games), you can treat the supermap lookup as O(1).
Since the submaps can have at most n_games entries (when all belong to the same category), listing all games of type action thus gives you O(n_games). Don't forget that in order to iterate over all entries you don't have to call get() each time. That would be like reading through a book and, instead of turning the page to get from page 100 to 101, counting from the beginning up to 101...
EDIT: Since the above paragraph, stating that if the number of categories is fixed one can assume the category lookup to be O(1), seems to be hard to accept, let me say that even if you insist category lookup is O(log n_categories), that still gives O(n_games), since the category lookup has to be done only once. Then you iterate over the result, which is O(n_games). This leads to O(n_games + log n_categories) = O(n_games).
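A minimal sketch of that access pattern; the Game record and the genre strings are hypothetical stand-ins for the project's own classes:

import java.util.Map;
import java.util.TreeMap;

class GameCatalog {
    record Game(String title, String genre) {}

    static void add(TreeMap<String, TreeMap<String, Game>> tree, Game g) {
        tree.computeIfAbsent(g.genre(), k -> new TreeMap<>()).put(g.title(), g);
    }

    public static void main(String[] args) {
        TreeMap<String, TreeMap<String, Game>> supertree = new TreeMap<>();
        add(supertree, new Game("Super Mario World", "adventure"));
        add(supertree, new Game("Doom", "action"));
        add(supertree, new Game("Quake", "action"));

        // one lookup on the outer map (at most ~8 genres, effectively constant),
        // then a single linear pass over the inner map
        TreeMap<String, Game> actionGames = supertree.get("action");
        if (actionGames != null) {
            for (Map.Entry<String, Game> e : actionGames.entrySet()) {
                System.out.println(e.getKey());          // no per-element get() needed
            }
        }
    }
}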
Okay, first thing, your big-O isn't going to change depending on language; that's why people use big-O (asymptotic) notation.
Now, think about your whole algorithm. You take your outer tree and get each element, which is indeed O(n0 lg n0). For each of those nodes, you have O(n1 lg n1). The lg n's differ by only a constant, so they can be combined, and you get O(n0 * n1 * lg n), or O(n^2 lg n).
A couple of comments regarding the OP's analysis:
I'm assuming you have already constructed your treemaps/sets and are just extracting elements from the finished (preprocessed) in-memory representation.
Let's say n is the number of genres. Let's say m is the max number of games per genre.
The complexity of getting the right 'genre map' is O(lg n) (a single get for the supertree). The complexity of iterating over the games in that genre depends on how you do it:
for (GameRef g : submap.keySet()) {
    // do something with submap.get(g)
}
This code yields O(m) 'get' operations of O(lg m) complexity each, so that's O(m lg(m)).
If you do this:
for (Map.Entry e : submap.entrySet()) {
    // do something with e.getValue()
}
then the complexity is O(m) loop iterations with constant (O(1)) time access to the value.
Using the second map iteration method, your total complexity is O(lg(n) + m)
Er, you were right until the last paragraph.
Your total complexity is O(n log n), log n to look up the type and n to list all the values in that type.
If you're talking about listing everything, it's definitely not O(n^2 log n), since getting all the values in your tree is linear. It would be O(n^2).
Doing the same thing with a flat list would be O(n log n), so you're definitely losing performance (not to mention memory) by using a tree for this.
