I'm trying to find a data structure to use in my Java project. What I'm trying to do is get the next greatest value below an arbitrary number from a set of numbers, or be notified if no such number exists.
Example 1)
My Arbitrary number is 7.0.
{3.1, 6.0, 7.13131313, 8.0}
The number I'd need to get from this set would be 6.0.
Example 2)
My arbitrary number is 1.0.
{2.0, 3.5555, 999.0}
No number below 1.0 exists in the set, so I'd need to know that it doesn't exist.
The best I can think of is indexing and comparing through an array, and going back 1 step once I go over my arbitrary number. In worst case scenarios though my time complexity would be O(n). Is there a better way?
If you can pre-process the list of values, then you can sort the list (O(N log N) time) and perform a binary search, which takes O(log N) for each value you want an answer for. Otherwise you can't do better than O(N).
You need to sort the numbers first.
Then you can do a simple binary search with the comparison modified to suit your need. At each step, check whether the element is bigger than the input; if so, search the left side, otherwise search the right side. When it finishes, your modified binary search should be able to provide the immediately larger and smaller numbers, with which you can solve your problem easily. Complexity is O(lg n).
I suggest that you look at either TreeSet.floor(...) or TreeSet.lower(...). One of those should satisfy your requirements, and both have O(logN) complexity ... assuming that you have already built the TreeSet instance.
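As a minimal sketch of the TreeSet suggestion, using the two example sets from the question: TreeSet.lower(e) returns the greatest element strictly less than e, or null if there is none, while TreeSet.floor(e) would also accept an element equal to e. The class and method names here (LowerLookup, greatestBelow) are made up for illustration.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical helper class, named only for this example.
final class LowerLookup {
    // Returns the greatest element strictly less than target,
    // or null if the set contains no such element.
    static Double greatestBelow(TreeSet<Double> values, double target) {
        return values.lower(target);
    }

    public static void main(String[] args) {
        TreeSet<Double> first = new TreeSet<>(Arrays.asList(3.1, 6.0, 7.13131313, 8.0));
        System.out.println(greatestBelow(first, 7.0));  // prints 6.0

        TreeSet<Double> second = new TreeSet<>(Arrays.asList(2.0, 3.5555, 999.0));
        System.out.println(greatestBelow(second, 1.0)); // prints null
    }
}
```

Building the TreeSet costs O(N log N) once; each lookup afterwards is O(log N).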
If you only have a sorted array and don't want the overhead of building a TreeSet, then a custom binary search is probably the best bet.
Both of your example sets look sorted ...
If that is the case, then you would need a binary search...
If it's not the case, then you would need to visit every element exactly once, so it would take O(n) time.
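One way to write such a custom binary search, as a sketch that assumes the array is already sorted in ascending order. It returns an index rather than a value so that -1 can signal "no such number". The names SortedArrayLookup and indexOfGreatestBelow are invented for this example.

```java
// Hypothetical helper class, named only for this example.
final class SortedArrayLookup {
    // Returns the index of the greatest element strictly less than target,
    // or -1 if every element is >= target.
    // Assumes 'sorted' is in ascending order.
    static int indexOfGreatestBelow(double[] sorted, double target) {
        int lo = 0;
        int hi = sorted.length; // search window is [lo, hi)
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] < target) {
                lo = mid + 1;   // mid qualifies; look for a later one
            } else {
                hi = mid;       // mid is too large; discard it and the right side
            }
        }
        // lo is now the number of elements strictly less than target.
        return lo - 1;
    }

    public static void main(String[] args) {
        double[] first = {3.1, 6.0, 7.13131313, 8.0};
        System.out.println(indexOfGreatestBelow(first, 7.0));  // prints 1 (the 6.0)

        double[] second = {2.0, 3.5555, 999.0};
        System.out.println(indexOfGreatestBelow(second, 1.0)); // prints -1
    }
}
```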
More specifically, suppose I have an array with duplicates:
{3,2,3,4,2,2,1,4}
I want to have a data structure that supports search and remove the first occurrence of some value faster than O(n), say if the value is 4, then it becomes:
{3,2,3,2,2,1,4}
I also need to iterate the list from head according to the same order. Other operations like get(index) or insert are not needed.
You can use O(n) time to record the original data(say it's an int[]) in your data structure, I just need the later search and remove faster than O(n).
"Search and remove" is considered as ONE operation as shown above.
If I have to make it myself, I would use a LinkedList to store the data, and HashMap to map every key to a list of all occurrence of nodes together with their previous and next ones.
Is it a right approach? Are there any better choices already there in Java?
The data structure you describe, essentially a hybrid linked list and map, I think is the most efficient way of handling your stated problem. You'll have to keep track of the nodes yourself, since Java's LinkedList doesn't provide access to the actual nodes. The AbstractSequentialList may be helpful here.
The index structure you'll need is a map from an element value to the appearances of that element in the list. I recommend a hash table from hashCode % modulus to a linked list of (value, list of main-list nodes).
Note that this approach is still O(n) in the worst case, when you have universal hash collisions; this applies whether you use open or closed hashing. In the average case it should be something closer to O(ln(n)), but I'm not prepared to prove that.
Consider also whether the overhead of keeping track of all of this is really worth the gains. Unless you've actually profiled running code and determined that a LinkedList is causing problems because remove is O(n), stick with that until you do.
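For reference, here is a rough sketch of the hybrid described above: a hand-rolled doubly linked list plus a HashMap from each value to a queue of its nodes in list order. It leans on java.util.HashMap instead of a hand-built hash table, which is a simplification of this example and not something the answer prescribes. The class name FirstOccurrenceList and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class, named only for this example.
final class FirstOccurrenceList {
    private static final class Node {
        final int value;
        Node prev, next;
        Node(int value) { this.value = value; }
    }

    // Sentinel nodes so unlinking never has to special-case the ends.
    private final Node head = new Node(0);
    private final Node tail = new Node(0);
    // value -> its nodes, in the order they appear in the list
    private final Map<Integer, Deque<Node>> index = new HashMap<>();

    // O(n) setup: append every element and record its node in the index.
    FirstOccurrenceList(int[] data) {
        head.next = tail;
        tail.prev = head;
        for (int v : data) {
            Node n = new Node(v);
            n.prev = tail.prev;
            n.next = tail;
            tail.prev.next = n;
            tail.prev = n;
            index.computeIfAbsent(v, k -> new ArrayDeque<>()).addLast(n);
        }
    }

    // Removes the first occurrence of value; returns false if it is absent.
    boolean removeFirst(int value) {
        Deque<Node> nodes = index.get(value);
        if (nodes == null) return false;
        Node n = nodes.pollFirst();          // earliest remaining occurrence
        if (nodes.isEmpty()) index.remove(value);
        n.prev.next = n.next;                // unlink in constant time
        n.next.prev = n.prev;
        return true;
    }

    // Iterates from the head in the original order.
    List<Integer> toList() {
        List<Integer> out = new ArrayList<>();
        for (Node n = head.next; n != tail; n = n.next) out.add(n.value);
        return out;
    }
}
```

With the array from the question, new FirstOccurrenceList(new int[] {3,2,3,4,2,2,1,4}) followed by removeFirst(4) leaves 3,2,3,2,2,1,4 when iterated from the head.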
Since your requirement is that the first occurrence of the element should be removed and the remaining occurrences retained, there would be no way to do it faster than O(n) as you would definitely have to move through to the end of the list to find out if there is another occurrence. There is no standard api from Oracle in the java package that does this.
I have about 200 lists of (index, float) pairs and I want to calculate the mean across them. I know the approach with time complexity O(first array size + ... + last array size). Is there a solution that calculates the mean with better time complexity?
There is no possible way to calculate a mean of N independent items with time complexity less than O(n): since you have to visit every item at least once to calculate the total.
If you want to beat O(n) complexity, then you will need to do something special, e.g.:
Use pre-computed sums for sub-lists
Exploit known dependencies in the data (e.g. certain elements being equal)
Of course, complexity does not always equate directly to speed. If you want to do it fast then there are plenty of other techniques (using concurrency or parallelism for example). But they will still be O(n) in complexity terms.
You approach it by divide and conquer. For that you can use ExecutorService
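A possible shape for this, as a sketch only: each list is summed in its own task on a fixed thread pool, and the partial sums are combined at the end. The total work is still O(n); only the wall-clock time may improve. The class name ParallelMean, the float[] representation of each list, and the threads parameter are assumptions of this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical class, named only for this example.
final class ParallelMean {
    // Sums each list in a separate task, then combines the partial sums.
    static double mean(List<float[]> lists, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> partials = new ArrayList<>();
            for (float[] list : lists) {
                Callable<Double> task = () -> {
                    double sum = 0.0;
                    for (float f : list) sum += f;
                    return sum;
                };
                partials.add(pool.submit(task));
            }
            double total = 0.0;
            long count = 0;
            for (int i = 0; i < lists.size(); i++) {
                total += partials.get(i).get(); // blocks until that task finishes
                count += lists.get(i).length;
            }
            return count == 0 ? Double.NaN : total / count;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while summing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a summing task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

For around 200 small lists the thread overhead may well outweigh the gain, so this is worth measuring before adopting.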
Depending on how you read in your lists/arrays, you could add together the floats while reading them into memory. Dividing the value by the size of the arrays is cheap. Therefore you don't need to process the values twice.
If you use a Collection class for storing the values, you could extend e.g. ArrayList and override the add() method to update a sum field and provide a getMean() method.
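A small sketch of the running-sum idea. It uses a standalone accumulator instead of subclassing ArrayList, because a subclass would also have to keep the sum consistent across addAll, remove, set and the other mutators; that substitution is a choice made for this example. The class name RunningMean is hypothetical, and getMean() follows the method name suggested above.

```java
// Hypothetical class, named only for this example.
final class RunningMean {
    private double sum;  // accumulated in double to limit rounding error
    private long count;

    // Call once per value while reading the lists into memory.
    void add(float value) {
        sum += value;
        count++;
    }

    // O(1): no second pass over the data is needed.
    double getMean() {
        return count == 0 ? Double.NaN : sum / count;
    }
}
```

Reading still touches every value once, so the overall cost stays O(n); what is saved is the separate pass to compute the mean.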
No, there is no better way than including every element in the calculation (which is O(first Array size + ... + last Array size)), at least not for the problem as stated. If the lists were to have some special properties or you want to recalculate the mean repeatedly after adding, removing or changing elements or lists, it would be a different story.
An informal proof:
Assume you managed to calculate the mean by skipping one element. Logically you'd get to the same mean by changing the skipped element to any other value (since we skipped it, its value doesn't matter). But, in this case, the mean should be different. Thus you must use every element in the calculation.
I've come across an interesting problem which I would love to get some input on.
I have a program that generates a set of numbers (based on some predefined conditions). Each set contains up to 6 integers ranging from 1 to 100, which do not have to be unique.
I would like to somehow store every set that is created so that I can quickly check if a certain set with the exact same numbers (order doesn't matter) has previously been generated.
Speed is a priority in this case as there might be up to 100k sets stored before the program stops (maybe more, but most of the time probably less)! Would anyone have any recommendations as to what data structures I should use and how I should approach this problem?
What I have currently is this:
Sort each set before storing it into a HashSet of Strings. The string is simply each number in the sorted set with some separator.
For example, the set {4, 23, 67, 67, 71} would get encoded as the string "4-23-67-67-71" and stored into the HashSet. Then for every new set generated, sort it, encode it and check if it exists in the HashSet.
Thanks!
If you break it into pieces, it seems to me that
creating a set (generate 6 numbers, sort, stringify) runs in O(1)
checking if this string exists in the HashSet is O(1)
inserting into the HashSet is O(1)
You do this n times, which gives you O(n).
This is already optimal, as you have to touch every element once anyway :)
You might run into problems depending on the range of your random numbers.
E.g. assume you generate only numbers between one and one; then there's obviously only one possible outcome ("1-1-1-1-1-1") and you'll have only collisions from there on. However, as long as the number of possible sequences is much larger than the number of elements you generate, I don't see a problem.
One tip: if you know the number of generated elements beforehand, it would be wise to initialize the HashSet with a matching initial capacity (i.e. new HashSet<String>( 100000 ) );
P.S. Now with other answers popping up, I'd like to note that while there may be room for improvement on a microscopic level (i.e. using language-specific tricks), your overall approach can't be improved.
Create a class SetOfIntegers
Implement a hashCode() method that will generate reasonably unique hash values
Use HashMap to store your elements like put(hashValue,instance)
Use containsKey(hashValue) to check if the same hashValue already present
This way you will avoid sorting and conversion/formatting of your sets.
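A sketch of what such a SetOfIntegers class might look like. It departs from the steps above in two ways, both choices of this example: it still sorts a private copy of the (at most 6) numbers so that order doesn't matter, and it implements equals() alongside hashCode() and stores instances in a HashSet, since two different sets can share a hash value and a lookup keyed on the hash alone would confuse them.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the SetOfIntegers class named above; details are assumptions.
final class SetOfIntegers {
    private final int[] sorted; // canonical form: ascending, duplicates kept

    SetOfIntegers(int... numbers) {
        sorted = numbers.clone();
        Arrays.sort(sorted);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof SetOfIntegers
                && Arrays.equals(sorted, ((SetOfIntegers) o).sorted);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(sorted);
    }

    public static void main(String[] args) {
        Set<SetOfIntegers> seen = new HashSet<>();
        // add() returns false when an equal set was already stored.
        System.out.println(seen.add(new SetOfIntegers(4, 23, 67, 67, 71))); // prints true
        System.out.println(seen.add(new SetOfIntegers(67, 4, 71, 23, 67))); // prints false
    }
}
```

This avoids building and hashing strings, which is the conversion cost the steps above aim to remove.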
Just use a java.util.BitSet for each set, adding integers to the set with the set(int bitIndex) method. You don't have to sort anything. Check a HashMap for an already existing BitSet before adding a new BitSet to it; it will be really very fast. Don't ever use sorting of values and toString for that purpose if speed is important.
I was looking at the source code of the sort() method of the java.util.ArrayList on grepcode. They seem to use insertion sort on small arrays (of size < 7) and merge sort on large arrays. I was just wondering if that makes a lot of difference given that they use insertion sort only for arrays of size < 7. The difference in running time will be hardly noticeable on modern machines.
I have read this in Cormen:
Although merge sort runs in O(n*logn) worst-case time and insertion sort runs in O(n*n) worst-case time, the constant factors in insertion sort can make it faster in practice for small problem sizes on many machines. Thus, it makes sense to coarsen the leaves of the recursion by using insertion sort within merge sort when subproblems become sufficiently small.
If I were designing a sorting algorithm for some component I required, I would consider using insertion sort for greater sizes (maybe up to size < 100) before the difference in running time, as compared to merge sort, becomes evident.
My question is what is the analysis behind arriving at size < 7?
The difference in running time will be hardly noticeable on modern machines.
How long it takes to sort small arrays becomes very important when you realize that the overall sorting algorithm is recursive, and the small array sort is effectively the base case of that recursion.
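To make the "base case of the recursion" point concrete, here is an illustrative top-down merge sort that hands subarrays shorter than a cutoff to insertion sort. It is not the JDK's code; the class name HybridMergeSort and the CUTOFF constant are assumptions of this sketch, with 7 taken from the question. Every call to sort() on a large array ends in many insertion sorts of tiny ranges, which is why their speed matters.

```java
// Illustrative only; not the JDK implementation.
final class HybridMergeSort {
    static final int CUTOFF = 7; // value from the question; tune by benchmarking

    static void sort(int[] a) {
        sort(a, new int[a.length], 0, a.length);
    }

    // Sorts the half-open range a[lo, hi).
    private static void sort(int[] a, int[] tmp, int lo, int hi) {
        if (hi - lo < CUTOFF) {
            insertionSort(a, lo, hi); // base case of the recursion
            return;
        }
        int mid = (lo + hi) >>> 1;
        sort(a, tmp, lo, mid);
        sort(a, tmp, mid, hi);
        merge(a, tmp, lo, mid, hi);
    }

    private static void insertionSort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i < hi; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= lo && a[j] > key) {
                a[j + 1] = a[j]; // shift larger elements one slot right
                j--;
            }
            a[j + 1] = key;
        }
    }

    // Merges the sorted ranges a[lo, mid) and a[mid, hi) using tmp as scratch.
    private static void merge(int[] a, int[] tmp, int lo, int mid, int hi) {
        System.arraycopy(a, lo, tmp, lo, hi - lo);
        int i = lo;
        int j = mid;
        for (int k = lo; k < hi; k++) {
            if (i < mid && (j >= hi || tmp[i] <= tmp[j])) {
                a[k] = tmp[i++]; // take from the left run (<= keeps it stable)
            } else {
                a[k] = tmp[j++];
            }
        }
    }
}
```

Changing CUTOFF and timing the result on representative data is the kind of benchmarking that would settle the threshold.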
I don't have any inside info on how the number seven got chosen. However, I'd be very surprised if that wasn't done as the result of benchmarking the competing algorithms on small arrays, and choosing the optimal algorithm and threshold based on that.
P.S. It is worth pointing out that Java7 uses Timsort by default.
I am posting this for people who visit this thread in future and documenting my own research. I stumbled across this excellent link in my quest to find the answer to the mystery of choosing 7:
Tim Peters’s description of the algorithm
You should read the section titled "Computing minrun".
To give a gist, minrun is the cutoff size of the array below which the algorithm should start using insertion sort. Hence, we will always have sorted arrays of size "minrun" on which we will need to run merge operation to sort the entire array.
In java.util.ArrayList.sort(), "minrun" is chosen to be 7, but as far as my understanding of the above document goes, it busts that myth and shows that it should be near powers of 2 and less than 256 and more than 8. Quoting from the document:
At 256 the data-movement cost in binary insertion sort clearly hurt, and at 8 the increase in the number of function calls clearly hurt. Picking some power of 2 is important here, so that the merges end up perfectly balanced (see next section).
The point which I am making is that "minrun" can be any power of 2 (or near power of 2) less than 64, without hindering the performance of TimSort.
http://en.wikipedia.org/wiki/Timsort
"Timsort is a hybrid sorting algorithm, derived from merge sort and insertion sort, designed to perform well on many kinds of real-world data... The algorithm finds subsets of the data that are already ordered, and uses the subsets to sort the data more efficiently. This is done by merging an identified subset, called a run, with existing runs until certain criteria are fulfilled."
About number 7:
"... Also, it is seen that galloping is beneficial only when the initial element is not one of the first seven elements of the other run. This also results in MIN_GALLOP being set to 7. To avoid the drawbacks of galloping mode, the merging functions adjust the value of min-gallop. If the element is from the array currently under consideration (that is, the array which has been returning the elements consecutively for a while), the value of min-gallop is reduced by one. Otherwise, the value is incremented by one, thus discouraging entry back to galloping mode. When this is done, in the case of random data, the value of min-gallop becomes so large, that the entry back to galloping mode never takes place.
In the case where merge-hi is used (that is, merging is done right-to-left), galloping needs to start from the right end of the data, that is the last element. Galloping from the beginning also gives the required results, but makes more comparisons than required. Thus, the algorithm for galloping includes the use of a variable which gives the index at which galloping should begin. Thus the algorithm can enter galloping mode at any index and continue thereon as mentioned above, as in, it will check at the next index which is offset by 1, 3, 7,...., (2^k - 1).. and so on from the current index. In the case of merge-hi, the offsets to the index will be -1, -3, -7,...."
I have two requirements:
1. I have random values in a List/Array and I need to find the 3 max values.
2. I have a pool of values that gets updated periodically, maybe every 5 seconds. Every time after the update, I need to find the 3 max values from the pool.
I thought of using Math.max thrice on the list, but I don't think it is a very optimized approach.
Won't any sorting mechanism be costly, since I am bothered about only the top 3 max values? Why sort all of them?
Please suggest the best way to do it in Java.
Sort the list, get the 3 max values. If you don't want the expense of the sort, iterate and maintain the n largest values.
Maintain the pool as a sorted collection.
Update: FYI Guava has an Ordering class with a greatestOf method to get the n max elements in a collection. You might want to check out the implementation.
Ordering.greatestOf
Traverse the list once, keeping an ordered array of three largest elements seen so far. This is trivial to update whenever you see a new element, and instantly gives you the answer you're looking for.
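A minimal sketch of that single pass, assuming int values. It returns fewer than three values when the input is shorter, and treats repeated values as separate entries. The names TopThree and topThree are made up for this example.

```java
import java.util.Arrays;

// Hypothetical class, named only for this example.
final class TopThree {
    // Returns up to three largest values in descending order, in one pass.
    static int[] topThree(int[] values) {
        int[] top = new int[3]; // top[0] >= top[1] >= top[2] for the filled part
        int size = 0;           // how many slots of 'top' are filled
        for (int v : values) {
            if (size == 3 && v <= top[2]) {
                continue;       // not larger than the current third largest
            }
            // Start at the first free slot, or overwrite the smallest entry.
            int i = size < 3 ? size++ : 2;
            while (i > 0 && top[i - 1] < v) {
                top[i] = top[i - 1]; // shift smaller entries down
                i--;
            }
            top[i] = v;
        }
        return Arrays.copyOf(top, size);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(topThree(new int[] {4, 9, 1, 7, 9, 3}))); // prints [9, 9, 7]
    }
}
```

Each element costs at most a few comparisons, so the pass is O(n) with no sorting of the full list.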
A priority queue should be the data structure you need in this case.
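One way to apply a priority queue here, sketched under the assumption that only the k largest values need to be kept: java.util.PriorityQueue is a min-heap by default, so holding at most k elements keeps the smallest of the current top k at the head, ready to be evicted. The class name TopKHeap and method largest are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical class, named only for this example.
final class TopKHeap {
    // Returns the k largest values in descending order.
    static List<Integer> largest(Iterable<Integer> values, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
        for (int v : values) {
            if (heap.size() < k) {
                heap.add(v);
            } else if (k > 0 && v > heap.peek()) {
                heap.poll();  // drop the smallest of the current top k
                heap.add(v);
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

For k = 3 this does the same job as three hand-kept variables; the heap pays off mainly when k is larger.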
First, it would be wise to never say again, "I dont think it as a very optimized approach." You will not know which part of your code is slowing you down until you put a profiler on it.
Second, the easiest way to do what you're trying to do -- and what will be most clear to someone later if they are trying to see what your code does -- is to use Collections.sort() and pick off the last three elements. Then anyone who sees the code will know, "oh, this code takes the three largest elements." There is so much value in clear code that it will likely outweigh any optimization that you might have done. It will also keep you from writing bugs, like giving a natural meaning to what happens when someone puts the same number into the list twice, or giving a useful error message when there are only two elements in the list.
Third, if you really get data which is so large that O(n log n) operations is too slow, you should rewrite the data structure which holds the data in the first place -- java.util.NavigableSet for example offers a .descendingIterator() method which you can probe for its first three elements, those would be the three maximum numbers. If you really want, a Heap data structure can be used, and you can pull off the top 3 elements with something like one comparison, at the cost of making adding an O(log n) procedure.
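As a sketch of the NavigableSet suggestion, assuming the pool can live in a TreeSet: adds are O(log n) and the three largest values come straight off descendingIterator(). Note that a set collapses duplicates, so a value added twice is reported once; whether that is acceptable depends on the pool. The class name SortedPool is made up for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical class, named only for this example.
final class SortedPool {
    private final NavigableSet<Integer> pool = new TreeSet<>();

    // O(log n) per update; duplicates are ignored by the set.
    void add(int value) {
        pool.add(value);
    }

    // Reads at most three elements from the high end of the set.
    List<Integer> topThree() {
        List<Integer> top = new ArrayList<>(3);
        Iterator<Integer> it = pool.descendingIterator();
        while (it.hasNext() && top.size() < 3) {
            top.add(it.next());
        }
        return top;
    }
}
```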