picking without replacement in java

I often* find myself in need of a data structure which has the following properties:
can be initialized with an array of n objects in O(n).
one can obtain a random element in O(1); after this operation the picked element is removed from the structure (without replacement).
one can undo p 'picking without replacement' operations in O(p).
one can remove a specific object (e.g. by id) from the structure in O(log(n)).
one can obtain an array of the objects currently in the structure in O(n).
The complexity (or even possibility) of other actions (e.g. insert) does not matter. Besides the complexity, it should also be efficient for small values of n.
Can anyone give me guidelines on implementing such a structure? I have currently implemented a structure with all of the above properties, except that picking an element takes O(d), with d the number of past picks (since I explicitly check whether it is 'not yet picked'). I can come up with structures allowing picking in O(1), but those have higher complexities for at least one of the other operations.
BTW:
note that O(1) above implies that the complexity is independent of the number of earlier-picked elements and of the total number of elements.
*in Monte Carlo algorithms (iterative picks of p random elements from a 'set' of n elements).

HashMap has complexity O(1) both for insertion and removal.
You specify a lot of operations, but all of them are nothing other than insertion, removal, and traversal:
can be initialized with an array of n objects in O(n).
n * O(1) insertion. HashMap is fine
one can obtain a random element in O(1), after this operation the picked element is removed from the structure (without replacement)
This is the only operation that requires O(n).
one can undo p 'picking without replacement' operations in O(p)
It's an insertion operation: O(1).
one can remove a specific object (eg by id) from the structure in O(log(n))
O(1).
one can obtain an array of the objects currently in the structure in O(n).
You can traverse a HashMap in O(n).
EDIT:
example of picking a random element in O(n):
HashMap map = ...;                       // the map holding your elements
int randomIndex = new Random().nextInt(map.size());
Collection collection = map.values();
Object[] values = collection.toArray();  // copying the values is what makes this O(n)
Object randomElement = values[randomIndex];

Ok, same answer as 0verbose with a simple fix to get the O(1) random lookup. Create an array which stores the same n objects. Now, in the HashMap, store the pairs (object, array index). For example, say your objects (strings for simplicity) are:
{"abc" , "def", "ghi"}
Create a list:
List<String> array = new ArrayList<String>(Arrays.asList("abc", "def", "ghi"));
Create a HashMap map with the following values:
Map<String, Integer> map = new HashMap<String, Integer>();
for (int i = 0; i < array.size(); i++)
{
    map.put(array.get(i), i);
}
O(1) random lookup is easily achieved by picking any index in the array. The only complication that arises is when you delete an object. For that, do:
Find the object in the map to get its array index; let's call this index i (map.get(object)) - O(1)
Swap array.get(i) with array.get(array.size() - 1) (the last element in the list). Reduce the size of the list by 1 (since there is one less element now) - O(1)
Update the index of the new object at position i of the list in the map (map.put(array.get(i), i)) - O(1)
Hope this helps.

Here's my analysis of using Collections.shuffle() on an ArrayList:
✔ can be initialized with an array of n objects in O(n).
Yes, although the cost is amortized unless n is known in advance.
✔ one can obtain a random element in O(1), after this operation the picked element is removed from the structure, without replacement.
Yes, choose the last element in the shuffled array; replace the array with a subList() of the remaining elements.
✔ one can undo p 'picking without replacement' operations in O(p).
Yes, append the element to the end of this list via add().
❍ one can remove a specific object (eg by id) from the structure in O(log(n)).
No, it looks like O(n).
✔ one can obtain an array of the objects currently in the structure in O(n).
Yes, using toArray() looks reasonable.
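A minimal sketch of this shuffle-based approach (the class and method names below are mine, not from the question):
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Shuffle once, then treat the end of the list as the pool of unpicked elements.
class ShuffledPicker<T> {
    private final List<T> items;

    ShuffledPicker(List<T> source) {       // O(n)
        items = new ArrayList<T>(source);
        Collections.shuffle(items);
    }

    T pick() {                             // O(1): remove the last element
        return items.remove(items.size() - 1);
    }

    void undoPick(T item) {                // O(1) amortized: append it again
        items.add(item);
    }

    List<T> remaining() {                  // O(n)
        return new ArrayList<T>(items);
    }
}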

How about an array (or ArrayList) that's divided into "picked" and "unpicked"? You keep track of where the boundary is, and to pick, you generate a random index below the boundary, then (since you don't care about order), swap the item at that index with the last unpicked item, and decrement the boundary. To unpick, you just increment the boundary.
Update: Forgot about O(log(n)) removal. Not that hard, though, just a little memory-expensive, if you keep a HashMap of IDs to indices.
If you poke around online you'll find various IndexedHashSet implementations that all work on more or less this principle: an array or ArrayList plus a HashMap.
(I'd love to see a more elegant solution, though, if one exists.)
Update 2: Hmm... or does the actual removal become O(n) again, if you have to either recopy the arrays or shift them around?
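A rough sketch of this boundary idea (names are mine; the ID-to-index HashMap needed for removal is only hinted at in a comment):
import java.util.Random;

// Indices [0, boundary) are "unpicked", [boundary, length) are "picked".
class BoundaryPicker<T> {
    private final Object[] items;
    private int boundary;                 // number of unpicked elements
    private final Random rnd = new Random();

    BoundaryPicker(T[] source) {          // O(n)
        items = source.clone();
        boundary = items.length;
    }

    @SuppressWarnings("unchecked")
    T pick() {                            // O(1)
        int i = rnd.nextInt(boundary);
        T picked = (T) items[i];
        items[i] = items[boundary - 1];   // swap the picked item past the boundary
        items[boundary - 1] = picked;
        boundary--;
        return picked;
    }

    void undoPick() {                     // O(1): the most recently picked item becomes unpicked again
        boundary++;
    }

    // Removing a specific object would additionally need a HashMap from IDs to
    // current indices, updated on every swap, as described above.
}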


ArrayList: insertion vs insertion at specified element

Consider an ArrayList. Internally it is not full, and the number of elements inserted so far is known. The elements are not sorted.
Choose the operations listed below that are fast regardless of the number of elements contained in the ArrayList. (In other words, they take only a few instructions to execute.)
Insertion
Insertion at a given index
Getting the data from a specified index
Finding the maximum value in an array of integers (not necessarily sorted)
Deletion at the given index
Replacing an element at a specified index
Searching for a specific element
I chose Insertion at specified index, Getting the data from a specified index, and Replacing an element, but the answer key says Insertion. As I understand it, inserting into an ArrayList requires the following elements to shift one position to the right. If we did this at the beginning of the list, we would have O(n) time complexity. However, if we did it at the end, it would be O(1).
My question comes down to: (1) what, if any, difference is there between insertion and insertion at specified index, and (2) given this particular time complexity for insertion, why is it considered "fast"?
First take a look at these two methods defined in java.util.ArrayList
public boolean add(E e) {
    ensureCapacityInternal(size + 1);  // Increments modCount!!
    elementData[size++] = e;
    return true;
}

public void add(int index, E element) {
    rangeCheckForAdd(index);
    ensureCapacityInternal(size + 1);  // Increments modCount!!
    System.arraycopy(elementData, index, elementData, index + 1,
                     size - index);
    elementData[index] = element;
    size++;
}
Now if you look at the first method (just adding an element), it only ensures there is sufficient capacity and appends the element to the end of the list.
So if there is sufficient capacity, this operation requires O(1) time; otherwise it requires copying all elements into a larger array, and the time complexity grows to O(n).
In the second method, when you specify an index, all the elements after that index are shifted, which definitely takes more time than the former.
For the first question the answer is this:
Insertion at a specified index i takes O(n), since all the elements following i have to be shifted one position to the right.
On the other hand, simple insertion as implemented in Java (ArrayList.add()) takes only (amortized) O(1), because the element is appended to the end of the array, so no shift is required.
For the second question, it is obvious why simple insertion is fast: no extra operation is needed, so time is constant.
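For illustration, the two overloads being compared (the values are arbitrary):
List<Integer> list = new ArrayList<Integer>();
list.add(42);      // append: amortized O(1), no shifting
list.add(0, 7);    // insert at index 0: O(n), shifts every existing element one position to the right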
An ArrayList internally is nothing but an array, which uses Arrays.copyOf to create a new array with increased size upon add, with the original content intact.
So for insertion, whether you do a simple add (which adds the data at the end of the array) or add at, say, the first (0th) index, it will still be faster than most data structures, keeping in mind their simplicity.
The only difference is that a simple add requires no shifting, while adding at an index requires shifting the following elements one position to the right; similarly for delete. That uses System.arraycopy to copy one part of the array over another with the index and data adjusted.
So, yes, simple insertion is faster than indexed insertion.
(1) what, if any, difference is there between insertion and insertion at specified index and
An ArrayList stores its elements consecutively. Adding to the end of the ArrayList does not require the list to be altered in any way except for appending the new element. Thus, this operation is O(1), taking constant time, which is favorable when performing an action repeatedly on a data structure.
Adding an element at an index, however, requires the ArrayList to make room for the element. How is that done? Every element following the inserted element has to be moved one step to make room for the new insertion. Your index can be anything between the first element and the nth element (inclusive). This operation is thus O(1) at best and O(n) at worst, where n is the size of the list. For large lists, O(n) takes significantly longer than O(1).
(2) given this particular time complexity for insertion why is it considered "fast"
It is considered fast because it is O(1), or constant time. If the operation truly takes only one step, it is as fast as it can possibly be. Other small constants are also regarded as fast and are usually also written as O(1), where the "1" does not strictly mean a single operation, but rather that the number of operations does not depend on the size of something else; in your example, the size of the ArrayList. Constant time complexity can involve large constants as well, but in general it is regarded as the fastest possible time complexity. To put this into context, an O(1) operation takes roughly 1 * k steps on an ArrayList with 1000 elements, while an O(n) operation takes roughly 1000 * k steps, where k is some constant.
Big-O notation is used as a metric to measure how many operations an action or a whole program will execute when it is run.
For more information about big O-notation:
What is a plain English explanation of "Big O" notation?

(Java) data structure for fast insertion, deletion, and RANDOM SELECTION

I need a data structure that supports the following operations in O(1):
myList.add(Item)
myList.remove(Item.ID) ==> It actually requires random access
myList.getRandomElement() (with equal probability)
--(Please note that getRandomElement() does not mean random access, it just means: "Give me one of the items at random, with equal probability")
Note that my items are unique, so I don't care if a List or Set is used.
I checked some java data structures, but it seems that none of them is the solution:
HashSet supports 1,2 in O(1), but it cannot give me a random element in O(1). I need to call mySet.iterator().next() to select a random element, which takes O(n).
ArrayList does 1,3 in O(1), but it needs to do a linear search to find the element I want to delete, so that takes O(n).
Any suggestions? Please tell me which functions I should call.
If java does not have such data structure, which algorithm should I use for such purpose?
You can use a combination of a HashMap and an ArrayList, if memory permits, as follows:
Store the numbers in an ArrayList arr as they come.
Use the HashMap to maintain the mapping arr[i] => i (value to index).
To generate a random element, select a random index in the ArrayList.
Deleting num:
look up i for num in the HashMap
swap arr[i] with arr[arr.size()-1]
remove num from the HashMap
update the HashMap so that arr[i] (the swapped-in value) maps to i
arr.remove(arr.size()-1)
All operations are O(1), at the cost of O(n) extra space.
You can use a HashMap (of ID to array index) in conjunction with an array (or ArrayList).
add could be done in O(1) by simply adding to the array and adding the ID and index to the HashMap.
remove could be done in O(1) by doing a lookup (and removal) from the HashMap to find the index, then move the last index in the array to that index, update that element's index in the HashMap and decreasing the array size by one.
getRandomElement could be done in O(1) by returning a random element from the array.
Example:
Array: [5,3,2,4]
HashMap: [5->0, 3->1, 2->2, 4->3]
To remove 3:
Look up (and remove) key 3 in the HashMap (giving 3->1)
Swap 3 and the last element, 4, in the array
Update 4's index in the HashMap to 1
Decrease the size of the array by 1
Array: [5,4,2]
HashMap: [5->0, 2->2, 4->1]
To add 6:
Simply add it to the array and HashMap
Array: [5,4,2,6]
HashMap: [5->0, 2->2, 4->1, 6->3]
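A compact sketch of this array-plus-HashMap structure (the class name and generics are mine):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// The list holds the elements; the map holds each element's current index in the list.
class RandomizedSet<T> {
    private final List<T> items = new ArrayList<T>();
    private final Map<T, Integer> indexOf = new HashMap<T, Integer>();
    private final Random rnd = new Random();

    void add(T item) {                      // amortized O(1)
        if (indexOf.containsKey(item)) return;
        indexOf.put(item, items.size());
        items.add(item);
    }

    void remove(T item) {                   // O(1): swap with the last element, then drop the last
        Integer i = indexOf.remove(item);
        if (i == null) return;
        T last = items.remove(items.size() - 1);
        if (i < items.size()) {             // the removed item was not the last one
            items.set(i, last);
            indexOf.put(last, i);
        }
    }

    T getRandomElement() {                  // O(1), uniform over the current elements
        return items.get(rnd.nextInt(items.size()));
    }
}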

How to reduce the complexity of the searching-in-two-lists algorithm?

I have to find common items in two lists. I cannot sort them; order is important. I have to find how many elements from secondList occur in firstList. It currently looks like this:
int[] firstList;
int[] secondList;
int iterator=0;
for (int i : firstList) {
    while (i <= secondList[iterator] /* two conditions more */) {
        iterator++;
        // some actions
    }
}
The complexity of this algorithm is n x n. I am trying to reduce the complexity of this operation, but I don't know how to compare the elements in a different way. Any advice?
EDIT:
Example: A=5,4,3,2,3 B=1,2,3
We look for pairs B[i], A[j].
Condition:
when B[i] < A[j]: j++
when B[i] >= A[j]: return the pair B[i], A[j-1]
The next iteration then goes through list A only up to element j-1 (i.e. for(int z=0; z<j-1; z++)).
I'm not sure, did I make myself clear?
Duplicates are allowed.
My approach would be - put all the elements from the first array in a HashSet and then do an iteration over the second array. This reduces the complexity to the sum of the lengths of the two arrays. It has the downside of taking additional memory, but unless you use more memory I don't think you can improve your brute force solution.
EDIT: to avoid further dispute on the matter: if you are allowed to have duplicates in the first array and you actually care how many times an element in the second array matches an element in the first one, use a HashMultiset (e.g. Guava's).
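Without pulling in Guava, the counting variant can be sketched with a plain HashMap (using the example arrays from the question):
import java.util.HashMap;
import java.util.Map;

int[] firstList = {5, 4, 3, 2, 3};
int[] secondList = {1, 2, 3};

// Count the occurrences of each value in the first array: O(n)
Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
for (int value : firstList) {
    Integer c = counts.get(value);
    counts.put(value, c == null ? 1 : c + 1);
}

// For each element of the second array, look up how often it occurs: O(m)
int matches = 0;
for (int value : secondList) {
    Integer c = counts.get(value);
    if (c != null) {
        matches += c;   // or trigger the "some actions" once per occurrence
    }
}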
Put all the items of the first list in a set
For each item of the second list, test if it's in the set.
Solved in less than n x n !
Edit to please fge :)
Instead of a set, you can use a map with the item as the key and its number of occurrences as the value.
Then for each item of the second list, if it exists in the map, execute your action once per occurrence in the first list (the map entry's value).
import java.util.*;

int[] firstList;
int[] secondList;

// Arrays.asList does not box an int[], so copy the values manually
Set<Integer> hs = new HashSet<Integer>();
for (int value : firstList) {
    hs.add(value);
}

Set<Integer> result = new HashSet<Integer>();
for (int value : secondList) {
    if (hs.contains(value)) {
        result.add(value);
    }
}
result will contain the required common elements.
Algorithm complexity: O(n).
Just because the order is important doesn't mean that you cannot sort either list (or both). It only means you will have to copy first before you can sort anything. Of course, copying requires additional memory and sorting requires additional processing time... yet I guess all solutions that are better than O(n^2) will require additional memory and processing time (also true for the suggested HashSet solutions - adding all values to a HashSet costs additional memory and processing time).
Sorting both lists is possible in O(n * log n) time, finding common elements once the lists are sorted is possible in O(n) time. Whether it will be faster than your native O(n^2) approach depends on the size of the lists. In the end only testing different approaches can tell you which approach is fastest (and those tests should use realistic list sizes as to be expected in your final code).
Big-O notation tells you nothing about absolute speed, only about relative speed. E.g. if you have two algorithms to calculate a value from an input set of elements, one O(1) and the other O(n), this doesn't mean that the O(1) solution is always faster. This is a big misconception about Big-O notation! It only means that if the number of input elements doubles, the O(1) solution will still take approx. the same amount of time while the O(n) solution will take approx. twice as much time as before. So there is no doubt that by constantly increasing the number of input elements there must be a point where the O(1) solution becomes faster than the O(n) solution, yet for a very small set of elements the O(1) solution may in fact be slower than the O(n) solution.
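A sketch of the copy-and-sort approach (O(n log n) for the sorts plus an O(n) merge-style scan; the loop counts matching pairs, which may or may not match your exact conditions):
import java.util.Arrays;

int[] firstList = {5, 4, 3, 2, 3};   // example data from the question
int[] secondList = {1, 2, 3};

int[] a = firstList.clone();
int[] b = secondList.clone();
Arrays.sort(a);                      // O(n log n)
Arrays.sort(b);                      // O(n log n)

// Merge-style scan over the two sorted copies: O(n)
int i = 0, j = 0, common = 0;
while (i < a.length && j < b.length) {
    if (a[i] < b[j]) {
        i++;
    } else if (a[i] > b[j]) {
        j++;
    } else {                         // a[i] == b[j]: a common element
        common++;
        i++;
        j++;
    }
}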
OK, so this solution will work if there are no duplicates in either the first or the second array. As the question does not say, we cannot be sure.
First, build a LinkedHashSet<Integer> out of the first array, and a HashSet<Integer> out of the second array.
Second, retain in the first set only elements that are in the second set.
Third, iterate over the first set and proceed:
// A LinkedHashSet retains insertion order
Set<Integer> first = new LinkedHashSet<Integer>(Arrays.asList(firstArray));
// A HashSet does not, but we don't care
Set<Integer> second = new HashSet<Integer>(Arrays.asList(secondArray));
// Retain in first only what is in second
first.retainAll(second);
// Iterate (note: firstArray and secondArray must be Integer[], not int[], for Arrays.asList to work element-wise)
for (int i : first)
    doSomething();

Which is the appropriate data structure?

I need a Java data structure that has:
fast (O(1)) insertion
fast removal
fast (O(1)) max() function
What's the best data structure to use?
HashMap would almost work, but using java.util.Collections.max() is at least O(n) in the size of the map. TreeMap's insertion and removal are too slow.
Any thoughts?
O(1) insertion and O(1) max() are mutually exclusive once you also require fast removal.
A collection with O(1) insertion won't give you O(1) max, because the collection is unordered. A collection with O(1) max has to be kept ordered, so insertion can no longer be O(1). You'll have to bite the bullet and choose between the two. In both cases, however, removal should be equally fast.
If you can live with slow removal, you could keep a variable holding the current highest element and compare against it on insert; max and insert are then O(1). Removal becomes O(n), though, since you have to find a new highest element whenever the removed element was the highest.
If you can have O(log n) insertion and removal, you can have O(1) max value with a TreeSet or a PriorityQueue. O(log n) is pretty good for most applications.
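For example, with a TreeSet of integers:
import java.util.TreeSet;

TreeSet<Integer> set = new TreeSet<Integer>();
set.add(42);               // O(log n)
set.add(7);                // O(log n)
int max = set.last();      // largest element: a cheap walk down the tree, O(log n) at worst
set.remove(42);            // O(log n)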
If you accept that O(log n) is still "fast" even though it isn't "fast (O(1))", then some kinds of heap-based priority queue will do it. See the comparison table for different heaps you might use.
Note that Java's library PriorityQueue isn't very exciting, it only guarantees O(n) remove(Object).
For heap-based queues "remove" can be implemented as "decreaseKey" followed by "removeMin", provided that you reserve a "negative infinity" value for the purpose. And since it's the max you want, invert all mentions of "min" to "max" and "decrease" to "increase" when reading the article...
You cannot have O(1) removal + insertion + max.
Proof:
Assume you could, and call this data structure D.
Given an array A:
1. insert all elements of A into D.
2. create an empty linked list L.
3. while D is not empty:
3.1. x <- D.max(); D.delete(x); -- all O(1) by assumption
3.2. L.insert_first(x) -- O(1)
4. return L
We have just built an O(n) sorting algorithm, but comparison-based sorting is known to be Ω(n log n). Contradiction! Thus D cannot exist.
I'm very skeptical that TreeMap's log(n) insertion and deletion are too slow; log(n) time is practically constant for most real applications. Even with 1,000,000,000 elements in your tree, if it's well balanced you will only perform log2(1,000,000,000) ≈ 30 comparisons per insertion or removal, which is comparable to the cost of evaluating a hash function.
Such a data structure would be awesome and, as far as I know, doesn't exist. Others have pointed this out.
But you can go further, if you don't mind making all of this a bit more complex.
If you can "waste" some memory and some programming effort, you can use several data structures at the same time, combining the pros of each one.
For example, I needed a sorted data structure but wanted O(1) lookups ("is the element X in the collection?"), not O(log n). I combined a TreeMap with a HashMap (which is not strictly O(1), but close to it when it's not too full and the hash function is good) and I got really good results.
For your specific case, I would go for a dynamic combination of a HashMap and a custom helper data structure. I have in mind something more complex (hash map + variable-length priority queue), but I'll go for a simple example. Just keep all the elements in the HashMap, and then use a special field (currentMax) that only contains the max element in the map. When you insert() into your combined data structure, if the element you're going to insert is greater than the current max, then you do currentMax <- elementGoingToInsert (and you insert it in the HashMap).
When you remove an element from your combined data structure, you check if it is equal to the currentMax and if it is, you remove it from the map (that's normal) and you have to find the new max (in O(n)). So you do currentMax <- findMaxInCollection().
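A rough sketch of that combination, using a HashSet of integers for simplicity (the answer above talks about a HashMap, but the idea is the same; all names are mine):
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// A set plus a cached maximum, as described above.
class MaxTrackingSet {
    private final Set<Integer> elements = new HashSet<Integer>();
    private Integer currentMax = null;

    void insert(int x) {                        // O(1)
        elements.add(x);
        if (currentMax == null || x > currentMax) {
            currentMax = x;
        }
    }

    Integer max() {                             // O(1)
        return currentMax;
    }

    void remove(int x) {                        // O(1), unless x was the max
        elements.remove(x);
        if (currentMax != null && currentMax == x) {
            // The max was removed: find the new max in O(n)
            currentMax = elements.isEmpty() ? null : Collections.max(elements);
        }
    }
}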
If the max doesn't change very frequently, that's damn good, believe me.
However, don't take anything for granted. You have to struggle a bit to find the best combination between different data structures. Do your tests, learn how frequently max changes. Data structures aren't easy, and you can make a difference if you really work combining them instead of finding a magic one, that doesn't exist. :)
Cheers
Here's a degenerate answer. I noted that you hadn't specified what you consider "fast" for deletion; if O(n) is fast then the following will work. Make a class that wraps a HashSet; maintain a reference to the maximum element upon insertion. This gives the two constant time operations. For deletion, if the element you deleted is the maximum, you have to iterate through the set to find the maximum of the remaining elements.
This may sound like it's a silly answer, but in some practical situations (a generalization of) this idea could actually be useful. For example, you can still maintain the five highest values in constant time upon insertion, and whenever you delete an element that happens to occur in that set you remove it from your list-of-five, turning it into a list-of-four etcetera; when you add an element that falls in that range, you can extend it back to five. If you typically add elements much more frequently than you delete them, then it may be very rare that you need to provide a maximum when your list-of-maxima is empty, and you can restore the list of five highest elements in linear time in that case.
As already explained: for the general case, no. However, if your range of values is limited, you can use a counting-sort-like algorithm to get O(1) insertion, and on top of that a linked list for moving the max pointer, thus achieving O(1) max and removal.

Extracting a given number of the highest values in a List

I'm seeking to display a fixed number of items on a web page according to their respective weight (represented by an Integer). The List where these items are found can be of virtually any size.
The first solution that comes to mind is to do a Collections.sort() and to get the items one by one by going through the List. Is there a more elegant solution though that could be used to prepare, say, the top eight items?
Just go for Collections.sort(..). It is efficient enough.
This algorithm offers guaranteed n log(n) performance.
You can try to implement something more efficient for your concrete case if you know some distinctive properties of your list, but that would not be justified. Furthermore, if your list comes from a database, for example, you can LIMIT it & order it there instead of in code.
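For example, to take the top eight items by weight (assuming an Item class with a hypothetical getWeight() accessor returning its Integer weight):
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// items is the (possibly large) List<Item> from the question
Collections.sort(items, new Comparator<Item>() {       // O(n log n)
    public int compare(Item a, Item b) {
        return b.getWeight().compareTo(a.getWeight());  // descending by weight
    }
});
List<Item> topEight = items.subList(0, Math.min(8, items.size()));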
Your options:
Do a linear search, maintaining the top N weights found along the way. This should be quicker than sorting a lengthy list if, for some reason, you can't reuse the sorting results between page displays (e.g. the list is changing quickly).
UPDATE: I stand corrected on the linear search necessarily being better than sorting. See Wikipedia article "Selection_algorithm - Selecting k smallest or largest elements" for better selection algorithms.
Manually maintain a List (the original one or a parallel one) sorted in weight order. You can use methods like Collections.binarySearch() to determine where to insert each new item.
Maintain a List (the original one or a parallel one) sorted in weight order by calling Collections.sort() after each modification, batch modifications, or just before display (possibly maintaining a modification flag to avoid sorting an already sorted list).
Use a data structure that maintains sorted weight-order for you: priority queue, tree set, etc. You could also create your own data structure.
Manually maintain a second (possibly weight-ordered) data structure of the top N items. This data structure is updated anytime the original data structure is modified. You could create your own data structure to wrap the original list and this "top N cache" together.
You could use a max-heap.
If your data originates from a database, put an index on that column and use ORDER BY and TOP or LIMIT to fetch only the records you need to display.
Or a priority queue.
using dollar:
List<Integer> topTen = $(list).sort().slice(10).toList();
without using dollar you should sort() it using Collections.sort(), then get the first n items using list.subList(0, n).
Since you say the list of items from which to extract these top N may be of any size, and so may be large I assume, I'd augment the simple sort() answers above (which are entirely appropriate for reasonably-sized input) by suggesting most of the work here is finding the top N -- then sorting those N is trivial. That is:
Queue<Integer> topN = new PriorityQueue<Integer>(n);
for (Integer item : input) {
    if (topN.size() < n) {
        topN.add(item);
    } else if (item > topN.peek()) {
        topN.add(item);
        topN.poll();
    }
}
List<Integer> result = new ArrayList<Integer>(n);
result.addAll(topN);
Collections.sort(result, Collections.reverseOrder());
The heap here (a min-heap) is at least bounded in size. There's no real need to make a heap out of all your items.
No, not really. At least not using Java's built-in methods.
There are clever ways to get the highest (or lowest) N number of items from a list quicker than an O(n*log(n)) operation, but that will require you to code this solution by hand. If the number of items stays relatively low (not more than a couple of hundred), sorting it using Collections.sort() and then grabbing the top N numbers is the way to go IMO.
Depends on how many. Let's define n as the total number of keys, and m as the number you wish to display.
Sorting the entire thing: O(nlogn)
Scanning the array each time for the next highest number: O(n*m)
So the question is - What's the relation between n to m?
If m < log n, scanning will be more efficient.
Otherwise, m >= log n, which means sorting will be better. (For the edge case of m = log n it doesn't actually matter, but sorting also gives you the benefit of, well, a sorted array, which is always nice.)
If the size of the list is N, and the number of items to be retrieved is K, you need to call Heapify on the list, which converts the list (which has to be indexable, e.g. an array) into a priority queue. (See heapify function in http://en.wikipedia.org/wiki/Heapsort)
Removing the item at the top of the heap (the max item) takes O(lg N) time. So your overall time would be:
O(N + k lg N)
which is better than O (N lg N) assuming k is much smaller than N.
If keeping a sorted array or using a different data structure is not an option, you could try something like the following. The O time is similar to sorting the large array but in practice this should be more efficient.
small_array = big_array.slice( number_of_items_to_find );
small_array.sort();
least_found_value = small_array.get(0).value;
for ( item in big_array ) { // needs to skip first few items
    if ( item.value > least_found_value ) {
        small_array.remove(0);
        small_array.insert_sorted(item);
        least_found_value = small_array.get(0).value;
    }
}
small_array could be an Object[] and the inner loop could be done with swapping instead of actually removing and inserting into an array.
