Get List common value count - java

I have two ArrayList<Long> instances, each holding about 5,00,000 (500,000) elements. I tried a for loop that calls list.contains(object), but it takes too much time. I also tried splitting one list and comparing in multiple threads, but that gave no real improvement.
I need the number of elements that are the same in both lists.
Is there an optimized way?

Let l1 be the size of the first list and l2 the size of the second. In Big O notation, the contains() loop runs in O(l1*l2).
Another approach is to insert one list into a HashSet and then, for each element of the other list, test whether it exists in the HashSet. This costs roughly 2*l1+l2 operations, which is O(l1+l2).
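A minimal sketch of the HashSet approach, assuming the goal is to count how many elements of the second list also occur in the first. The class and method names (CommonCount, countCommon) are illustrative and do not come from the question.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CommonCount {
    // Counts the elements of l2 that also occur in l1 (O(l1 + l2) on average).
    static int countCommon(List<Long> l1, List<Long> l2) {
        Set<Long> seen = new HashSet<Long>(l1); // one pass over l1
        int count = 0;
        for (Long value : l2) {                 // one pass over l2
            if (seen.contains(value)) {         // expected O(1) lookup
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Long> a = Arrays.asList(1L, 2L, 3L, 4L);
        List<Long> b = Arrays.asList(3L, 4L, 5L);
        System.out.println(countCommon(a, b)); // prints 2
    }
}
```

Because HashSet relies on equals() and hashCode(), the Long values are compared by value here, not by reference.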

Have you considered putting your elements into a HashSet instead? This would make the lookups much faster. This would of course only work if you don't have duplicates.
If you have duplicates, you could construct a HashMap that has the value as the key and the count as the value.
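One way the HashMap variant could look, as a sketch. It assumes that each occurrence may be matched at most once (a multiset intersection), which is only one possible reading of "common elements" when duplicates exist. The names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CommonCountWithDuplicates {
    // Pairs each occurrence at most once: [1, 1, 2] vs [1, 1, 1] gives 2.
    static int countCommon(List<Long> l1, List<Long> l2) {
        Map<Long, Integer> counts = new HashMap<Long, Integer>();
        for (Long value : l1) {
            Integer c = counts.get(value);
            counts.put(value, c == null ? 1 : c + 1);
        }
        int common = 0;
        for (Long value : l2) {
            Integer c = counts.get(value);
            if (c != null && c > 0) {
                common++;
                counts.put(value, c - 1); // consume one occurrence from l1
            }
        }
        return common;
    }
}
```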

General mechanism would be to sort both lists and then iterate the sorted lists looking for matches.

A list isn't an efficient data structure for searching when you have many elements; you need a data structure that supports faster lookups.
For example, a tree or a hash map!

Let us assume that list one has m elements and list two has n elements, with m > n. If the elements are not numerically ordered (and it seems they are not), the total number of comparison steps, that is, the cost of the method, is on the order of m×n - n^2/2. In this case the cost factor is about 50000×49999.
Keeping both lists ordered will be the optimal solution. If the lists are ordered, the cost of comparing them will be on the order of m. In this case that is about 50000. This optimal result is achieved when both lists are iterated with two cursors. This method can be represented in code as follows:
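int i = 0, j = 0;
int count = 0;
while (i < list1.size() && j < list2.size()) {
    // compareTo compares values; == on Long objects would compare references
    int cmp = list1.get(i).compareTo(list2.get(j));
    if (cmp == 0) {
        count++;
        i++; // only i advances, so repeated values in list1 are each counted
    } else if (cmp < 0) {
        i++;
    } else {
        j++;
    }
}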
If it is possible for you to keep the lists ordered all the time, this method will make a difference. Also, I don't think it is possible to split and compare unless the lists are ordered.
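If the lists are not already ordered, a self-contained version might sort copies first and then run the two-cursor loop. This is a sketch only; it keeps the same counting rule as the loop above (each element of the first list that occurs in the second is counted), and the class name is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SortedCommonCount {
    // Counts the elements of list1 that also occur in list2.
    static int countCommon(List<Long> list1, List<Long> list2) {
        // Sort copies so the callers' lists keep their original order.
        List<Long> a = new ArrayList<Long>(list1);
        List<Long> b = new ArrayList<Long>(list2);
        Collections.sort(a);
        Collections.sort(b);
        int i = 0, j = 0, count = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {
                count++;
                i++;
            } else if (cmp < 0) {
                i++;
            } else {
                j++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Long> x = Arrays.asList(5L, 3L, 3L, 9L);
        List<Long> y = Arrays.asList(3L, 9L, 7L);
        System.out.println(countCommon(x, y)); // prints 3
    }
}
```

The sorting step dominates at O(n log n); the merge-style scan afterwards is linear in the combined length.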

Related

How reduce the complexity of the searching in two lists algorithm?

I have to find some common items in two lists. I cannot sort them, because order is important. I have to find how many elements from secondList occur in firstList. Right now it looks like this:
int[] firstList;
int[] secondList;
int iterator=0;
for(int i:firstList){
while(i <= secondList[iterator]/* two conditions more */){
iterator++;
//some actions
}
}
The complexity of this algorithm is n x n. I am trying to reduce it, but I don't know how to compare the elements in a different way. Any advice?
EDIT:
Example: A=5,4,3,2,3 B=1,2,3
We look for pairs B[i],A[j]
Condition:
when
B[i] < A[j]
j++
when
B[i] >= A[j]
return B[i],A[j-1]
the next iteration goes through the list A up to element j-1 (meaning for(int z=0;z<j-1;z++))
I'm not sure whether I made myself clear.
Duplicates are allowed.
My approach would be - put all the elements from the first array in a HashSet and then do an iteration over the second array. This reduces the complexity to the sum of the lengths of the two arrays. It has the downside of taking additional memory, but unless you use more memory I don't think you can improve your brute force solution.
EDIT: to avoid further dispute on the matter: if you are allowed to have duplicates in the first array and you actually care how many times an element in the second array matches an element in the first one, use HashMultiSet.
Put all the items of the first list in a set
For each item of the second list, test if it's in the set.
Solved in less than n x n !
Edit to please fge :)
Instead of a set, you can use a map with the item as key and the number of occurrences as value.
Then, for each item of the second list, if it exists in the map, execute your action once per occurrence in the first list (the map entry's value).
import java.util.*;
int[] firstList;
int[] secondList;
// Arrays.asList(int[]) would yield a one-element List<int[]>, so box each value explicitly
Set<Integer> hs = new HashSet<Integer>();
for (int value : firstList) {
    hs.add(value);
}
Set<Integer> result = new HashSet<Integer>();
for (int value : secondList) {
    if (hs.contains(value)) {
        result.add(value);
    }
}
result will contain the required common elements.
Algorithm complexity: O(n) on average.
Just because the order is important doesn't mean that you cannot sort either list (or both). It only means you will have to copy first before you can sort anything. Of course, copying requires additional memory and sorting requires additional processing time... yet I guess all solutions that are better than O(n^2) will require additional memory and processing time (also true for the suggested HashSet solutions - adding all values to a HashSet costs additional memory and processing time).
Sorting both lists is possible in O(n * log n) time, and finding the common elements once the lists are sorted is possible in O(n) time. Whether this will be faster than your naive O(n^2) approach depends on the size of the lists. In the end, only testing different approaches can tell you which one is fastest (and those tests should use realistic list sizes, as expected in your final code).
Big-O notation tells you nothing about absolute speed; it only tells you how running time grows with input size. E.g. if you have two algorithms to calculate a value from an input set of elements, one O(1) and the other O(n), this doesn't mean that the O(1) solution is always faster. This is a big misconception about Big-O notation! It only means that if the number of input elements doubles, the O(1) solution will still take approximately the same amount of time, while the O(n) solution will take approximately twice as much time as before. So there is no doubt that, by constantly increasing the number of input elements, there must be a point where the O(1) solution becomes faster than the O(n) solution, yet for a very small set of elements the O(1) solution may in fact be slower than the O(n) solution.
OK, so this solution will work if there are no duplicates in either the first or second array. As the question does not tell, we cannot be sure.
First, build a LinkedHashSet<Integer> out of the first array, and a HashSet<Integer> out of the second array.
Second, retain in the first set only elements that are in the second set.
Third, iterate over the first set and proceed:
// A LinkedHashSet retains insertion order
// (firstArray and secondArray are assumed to be Integer[]; Arrays.asList does not box an int[])
Set<Integer> first = new LinkedHashSet<Integer>(Arrays.asList(firstArray));
// A HashSet does not but we don't care
Set<Integer> second = new HashSet<Integer>(Arrays.asList(secondArray));
// Retain in first only what is in second
first.retainAll(second);
// Iterate
for (int i : first) {
    doSomething();
}

Better methods to compare corresponding values in multiple arrays in Java

I have an ArrayList containing M sorted lists. Each list in the ArrayList has the same size N. I want to compare the first (N-1) corresponding values of each list with those of the others and find the lists that share the same first (N-1) values. Intuitively, it can be done with two for-loops, but the complexity could be as high as M*N*N. I was wondering whether there are better algorithms for this. By the way, M may be a very large number, while N tends to be small.
Sorry, I might not have been clear. The final output I want is the pairs of lists that have the same first (N-1) values.
Use a good hashing algorithm to calculate a hash code of the N-1 items in each row. Organize rows by their hash code, and do a full compare only when the hash codes match.
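A sketch of the hashing idea, assuming each row is a non-empty List<Integer> and that rows are not modified while the map is in use. It leans on List.hashCode() and List.equals() instead of a hand-written hash; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PrefixPairsByHash {
    // Returns index pairs {i, j} with i < j for rows sharing the same first N-1 values.
    static List<int[]> findPairs(List<List<Integer>> rows) {
        Map<List<Integer>, List<Integer>> buckets =
                new HashMap<List<Integer>, List<Integer>>();
        List<int[]> pairs = new ArrayList<int[]>();
        for (int i = 0; i < rows.size(); i++) {
            List<Integer> row = rows.get(i);
            // The key is a view of the first N-1 items. HashMap checks hash codes
            // first and falls back to equals() (the full compare) only on a match.
            List<Integer> prefix = row.subList(0, row.size() - 1);
            List<Integer> bucket = buckets.get(prefix);
            if (bucket == null) {
                bucket = new ArrayList<Integer>();
                buckets.put(prefix, bucket);
            }
            for (int j : bucket) {
                pairs.add(new int[] {j, i}); // pair with every earlier row in the bucket
            }
            bucket.add(i);
        }
        return pairs;
    }
}
```

Grouping costs about O(M * N) on average, but the number of reported pairs can still be quadratic in M when many rows share a prefix.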
Sort the list of lists.
Sorting them is O(N M LOG M) (assuming that a comparison is O(N)).
If you do this in a radix sort approach, it should actually be more on the lines of O(N * M) or even O(M LOG M) total (assuming the lists are not identical).
Then lists with the same prefix must be adjacent in this sorted list.
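The sorting idea could be sketched as follows, using a comparison sort (Collections.sort) rather than the radix sort mentioned above. It assumes all rows have the same size; the names are illustrative only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class PrefixGroupsBySort {
    // Lexicographic comparison of the first N-1 values of two rows of equal size.
    static int comparePrefix(List<Integer> a, List<Integer> b) {
        for (int k = 0; k < a.size() - 1; k++) {
            int cmp = a.get(k).compareTo(b.get(k));
            if (cmp != 0) {
                return cmp;
            }
        }
        return 0;
    }

    // Sorts a copy of the rows, then cuts it into runs of equal prefixes.
    // A run of size 1 means that row has no partner.
    static List<List<List<Integer>>> groupByPrefix(List<List<Integer>> rows) {
        List<List<Integer>> sorted = new ArrayList<List<Integer>>(rows);
        Collections.sort(sorted, new Comparator<List<Integer>>() {
            public int compare(List<Integer> a, List<Integer> b) {
                return comparePrefix(a, b);
            }
        });
        List<List<List<Integer>>> groups = new ArrayList<List<List<Integer>>>();
        for (int i = 0; i < sorted.size(); i++) {
            if (i == 0 || comparePrefix(sorted.get(i - 1), sorted.get(i)) != 0) {
                groups.add(new ArrayList<List<Integer>>()); // prefix changed: new group
            }
            groups.get(groups.size() - 1).add(sorted.get(i));
        }
        return groups;
    }
}
```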
Assuming that you are trying to reimplement APRIORI: yes, do keep a sorted list of candidate itemsets. This is exactly what Apriori-Gen needs for building the next round candidates. Keeping them organized as a sorted tree is quite neat, as this is also fast when scanning the database for counting itemsets.

Best way to write this program

I have a general programming question, that I have happened to use Java to answer. This is the question:
Given an array of ints write a program to find out how many numbers that are not unique are in the array. (e.g. in {2,3,2,5,6,1,3} 2 numbers (2 and 3) are not unique). How many operations does your program perform (in O notation)?
This is my solution.
int counter = 0;
for(int i=0;i<theArray.length-1;i++){
for(int j=i+1;j<theArray.length;j++){
if(theArray[i]==theArray[j]){
counter++;
break; //go to next i since we know it isn't unique we dont need to keep comparing it.
}
}
}
return counter;
Now, In my code every element is being compared with every other element so there are about n(n-1)/2 operations. Giving O(n^2). Please tell me if you think my code is incorrect/inefficient or my O expression is wrong.
Why not use a Map as in the following example:
// NOTE: Map keys and values must be objects, not primitives. If theArray is an int[],
// autoboxing converts each int to an Integer automatically.
Map<Integer, Integer> elementCount = new HashMap<Integer, Integer>();
for (int i = 0; i < theArray.length; i++) {
    if (elementCount.containsKey(theArray[i])) {
        elementCount.put(theArray[i], elementCount.get(theArray[i]) + 1);
    } else {
        elementCount.put(theArray[i], 1);
    }
}
List<Integer> moreThanOne = new ArrayList<Integer>();
for (Integer key : elementCount.keySet()) {
    if (elementCount.get(key) > 1) {
        moreThanOne.add(key);
    }
}
// do whatever you want with the moreThanOne list
Notice that this method requires iterating through the list twice (I'm sure there's a way to do it iterating once). It iterates once through theArray, and then implicitly again as it iterates through the key set of elementCount, which if no two elements are the same, will be exactly as large. However, iterating through the same list twice serially is still O(n) instead of O(n^2), and thus has much better asymptotic running time.
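The answer suspects that a single-pass variant exists. One possible sketch uses two sets instead of a count map and relies on Set.add() returning false for a value that is already present. The class name is made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

class NonUniqueCounter {
    // Returns how many distinct values occur more than once in the array.
    static int countNonUnique(int[] theArray) {
        Set<Integer> seen = new HashSet<Integer>();
        Set<Integer> repeated = new HashSet<Integer>();
        for (int value : theArray) {
            // add() returns false when the value was already in the set
            if (!seen.add(value)) {
                repeated.add(value);
            }
        }
        return repeated.size();
    }

    public static void main(String[] args) {
        System.out.println(countNonUnique(new int[] {2, 3, 2, 5, 6, 1, 3})); // prints 2
    }
}
```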
Your code doesn't do what you want. If you run it using the array {2, 2, 2, 2}, you'll find that it returns 3 instead of 1. You'll have to find a way to make sure that the counting is never repeated.
However, your Big O expression is correct as a worst-case analysis, since every element might be compared with every other element.
Your analysis is correct, but you could easily get it down to O(n) time. Try using a HashMap<Integer,Integer> to store previously seen values as you iterate through the array (the key is the number that you've seen, the value is the number of times you've seen it). Each time you try to add an integer to the hashmap, check to see if it's already there. If it is, just increment that integer's counter. Then, at the end, loop through the map and count the number of keys whose corresponding value is higher than 1.
First, your approach is what I would call "brute force", and it is indeed O(n^2) in the worst case. It's also incorrectly implemented, since numbers that repeat n times are counted n-1 times.
Setting that aside, there are a number of ways to approach the problem. The first (which a number of answers have suggested) is to iterate over the array, using a map to keep track of how many times each element has been seen. Assuming the map uses a hash table for the underlying storage, the average-case complexity should be O(n), since gets and inserts on the map should be O(1) on average, and you only need to iterate over the list and the map once each. Note that this is still O(n^2) in the worst case, since there's no guarantee that the hashing will produce constant-time results.
Another approach would be to simply sort the array first, and then iterate the sorted array looking for duplicates. This approach is entirely dependent on the sort method chosen, and can be anywhere from O(n^2) (for a naive bubble sort) to O(n log n) worst case (for a merge sort) to O(n log n) average-though-likely case (for a quicksort.)
That's the best you can do with the sorting approach assuming arbitrary objects in the array. Since your example involves integers, though, you can do much better by using radix sort, which will have worst-case complexity of O(dn), where d is essentially constant (since it maxes out at 10 decimal digits for 32-bit integers.)
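The sort-then-scan approach might look like the sketch below. It uses Arrays.sort on a copy rather than a radix sort, and the class name is hypothetical.

```java
import java.util.Arrays;

class NonUniqueBySorting {
    // Returns how many distinct values occur more than once in the array.
    static int countNonUnique(int[] theArray) {
        int[] sorted = theArray.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int count = 0;
        for (int i = 1; i < sorted.length; i++) {
            // Count a value once, at its second occurrence in the sorted run.
            boolean sameAsPrevious = sorted[i] == sorted[i - 1];
            boolean firstRepeat = i == 1 || sorted[i - 1] != sorted[i - 2];
            if (sameAsPrevious && firstRepeat) {
                count++;
            }
        }
        return count;
    }
}
```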
Finally, if you know that the elements are integers, and that their magnitude isn't too large, you can improve the map-based solution by using an array of size ElementMax, which would guarantee O(n) worst-case complexity, with the trade-off of requiring 4*ElementMax additional bytes of memory.
I think your time complexity of O(n^2) is correct.
If space is not an issue, you can keep a counting array of 256 entries (one per 8-bit character code) and fill it with the counts. For example:
// Java initializes int arrays to 0, so no explicit initialization is needed.
// Assumes every value in theArray lies in the range 0..255.
// Runs in O(n+m), where n is the length of theArray and m is the length of array.
int[] array = new int[256];
for (int i = 0; i < theArray.length; i++)
    array[theArray[i]] = array[theArray[i]] + 1;
for (int i = 0; i < array.length; i++)
    if (array[i] > 1)
        System.out.print(i);
As others have said, an O(n) solution is quite possible using a hash. In Perl:
my @data = (2,3,2,5,6,1,3);
my %count;
$count{$_}++ for @data;
my $n = grep $_ > 1, values %count;
print "$n numbers are not unique\n";
OUTPUT
2 numbers are not unique

picking without replacement in java

I often* find myself in need of a data structure which has the following properties:
can be initialized with an array of n objects in O(n).
one can obtain a random element in O(1), after this operation the picked element is removed from the structure (without replacement).
one can undo p 'picking without replacement' operations in O(p).
one can remove a specific object (eg by id) from the structure in O(log(n)).
one can obtain an array of the objects currently in the structure in O(n).
the complexity (or even possibility) of other actions (eg insert) does not matter. Besides the complexity it should also be efficient for small numbers of n.
Can anyone give me guidelines on implementing such a structure? I currently implemented a structure having all above properties, except the picking of the element takes O(d) with d the number of past picks (since I explicitly check whether it is 'not yet picked'). I can figure out structures allowing picking in O(1), but these have higher complexities on at least one of the other operations.
BTW:
note that O(1) above implies that the complexity is independent from #earlier picked elements and independent from total #elements.
*in monte carlo algorithms (iterative picks of p random elements from a 'set' of n elements).
HashMap has complexity O(1) both for insertion and removal.
You specify a lot of operations, but all of them are nothing more than insertion, removal and traversal:
can be initialized with an array of n objects in O(n).
n * O(1) insertions. HashMap is fine.
one can obtain a random element in O(1), after this operation the picked element is removed from the structure (without replacement).
This is the only operation that requires O(n).
one can undo p 'picking without replacement' operations in O(p).
Each undo is an insertion: O(1).
one can remove a specific object (eg by id) from the structure in O(log(n)).
O(1).
one can obtain an array of the objects currently in the structure in O(n).
You can traverse a HashMap in O(n).
EDIT:
example of picking up a random element in O(n):
HashMap map ....
int randomIntFromZeroToYouHashMapSize = ...
Collection collection = map.values();
Object[] values = collection.toArray();
values[randomIntFromZeroToYouHashMapSize];
OK, same answer as 0verbose, with a simple fix to get the O(1) random lookup. Create an array which stores the same n objects. Now, in the HashMap, store the pairs (object, index of that object in the array). For example, say your objects (strings for simplicity) are:
{"abc" , "def", "ghi"}
Create a
List<String> array = new ArrayList<String>(Arrays.asList("abc", "def", "ghi"));
Create a HashMap map with the following values:
for (int i = 0; i < array.size(); i++)
{
    map.put(array.get(i), i);
}
O(1) random lookup is easily achieved by picking any index in the array. The only complication that arises is when you delete an object. For that, do:
Find the object in the map and get its array index. Let's call this index i (i = map.get(object)), then remove the object from the map - O(1)
Swap array[i] with array[size of array - 1] (the last element in the array). Reduce the size of the array by 1 (since there is one less object now) - O(1)
Update the index of the object that is now in position i of the array in the map (map.put(array[i], i)) - O(1)
I apologize for the mix of java and cpp notation, hope this helps
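The array-plus-map idea can be sketched in plain Java as follows. This is an illustration under assumptions: the objects are distinct and implement equals()/hashCode() consistently, pick() is only called on a non-empty bag, and the class name RandomBag is invented here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class RandomBag<T> {
    private final List<T> items = new ArrayList<T>();
    private final Map<T, Integer> indexOf = new HashMap<T, Integer>();
    private final Random random = new Random();

    RandomBag(List<T> initial) { // O(n) on average
        for (T item : initial) {
            indexOf.put(item, items.size());
            items.add(item);
        }
    }

    // Removes a specific object in O(1) on average by moving the last item into its slot.
    boolean remove(T item) {
        Integer i = indexOf.remove(item);
        if (i == null) {
            return false;
        }
        T last = items.remove(items.size() - 1);
        if (i < items.size()) { // the removed item was not the last one
            items.set(i, last);
            indexOf.put(last, i);
        }
        return true;
    }

    // Picks a random element and removes it (picking without replacement).
    T pick() {
        T item = items.get(random.nextInt(items.size()));
        remove(item);
        return item;
    }

    int size() {
        return items.size();
    }
}
```

Removing from the end of an ArrayList avoids shifting elements, which is what keeps the deletion constant-time.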
Here's my analysis of using Collections.shuffle() on an ArrayList:
✔ can be initialized with an array of n objects in O(n).
Yes, although the cost is amortized unless n is known in advance.
✔ one can obtain a random element in O(1), after this operation the picked element is removed from the structure, without replacement.
Yes, choose the last element in the shuffled array; replace the array with a subList() of the remaining elements.
✔ one can undo p 'picking without replacement' operations in O(p).
Yes, append the element to the end of this list via add().
❍ one can remove a specific object (eg by id) from the structure in O(log(n)).
No, it looks like O(n).
✔ one can obtain an array of the objects currently in the structure in O(n).
Yes, using toArray() looks reasonable.
How about an array (or ArrayList) that's divided into "picked" and "unpicked"? You keep track of where the boundary is, and to pick, you generate a random index below the boundary, then (since you don't care about order), swap the item at that index with the last unpicked item, and decrement the boundary. To unpick, you just increment the boundary.
Update: Forgot about O(log(n)) removal. Not that hard, though, just a little memory-expensive, if you keep a HashMap of IDs to indices.
If you poke around on line you'll find various IndexedHashSet implementations that all work on more or less this principle -- an array or ArrayList plus a HashMap.
(I'd love to see a more elegant solution, though, if one exists.)
Update 2: Hmm... or does the actual removal become O(n) again, if you have to either recopy the arrays or shift them around?
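A sketch of the picked/unpicked boundary, covering only initialization, picking and undo; removal by id is left out. The class and method names are hypothetical, and pick() assumes at least one unpicked item remains.

```java
import java.util.Random;

class PickWithoutReplacement<T> {
    // items[0 .. unpicked-1] are still available, items[unpicked ..] were picked
    private final T[] items;
    private int unpicked;
    private final Random random = new Random();

    PickWithoutReplacement(T[] source) {
        items = source.clone(); // O(n) initialization
        unpicked = items.length;
    }

    // O(1): choose a random available item and swap it just behind the boundary.
    T pick() {
        int r = random.nextInt(unpicked);
        T chosen = items[r];
        items[r] = items[unpicked - 1];
        items[unpicked - 1] = chosen;
        unpicked--;
        return chosen;
    }

    // Makes the last p picks available again by moving the boundary back.
    void undo(int p) {
        unpicked = Math.min(items.length, unpicked + p);
    }

    int remaining() {
        return unpicked;
    }
}
```

No element is ever copied or shifted, so picks never pay for earlier picks; the most recent pick always sits directly at the boundary, which is why undo only needs to move it.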

Extracting a given number of the highest values in a List

I'm seeking to display a fixed number of items on a web page according to their respective weight (represented by an Integer). The List where these items are found can be of virtually any size.
The first solution that comes to mind is to do a Collections.sort() and to get the items one by one by going through the List. Is there a more elegant solution though that could be used to prepare, say, the top eight items?
Just go for Collections.sort(..). It is efficient enough.
This algorithm offers guaranteed n log(n) performance.
You can try to implement something more efficient for your concrete case if you know some distinctive properties of your list, but that would not be justified. Furthermore, if your list comes from a database, for example, you can LIMIT it & order it there instead of in code.
Your options:
Do a linear search, maintaining the top N weights found along the way. This should be quicker than sorting a lengthy list if, for some reason, you can't reuse the sorting results between page displays (e.g. the list is changing quickly).
UPDATE: I stand corrected on the linear search necessarily being better than sorting. See Wikipedia article "Selection_algorithm - Selecting k smallest or largest elements" for better selection algorithms.
Manually maintain a List (the original one or a parallel one) sorted in weight order. You can use methods like Collections.binarySearch() to determine where to insert each new item.
Maintain a List (the original one or a parallel one) sorted in weight order by calling Collections.sort() after each modification, batch modifications, or just before display (possibly maintaining a modification flag to avoid sorting an already sorted list).
Use a data structure that maintains sorted weight-order for you: priority queue, tree set, etc. You could also create your own data structure.
Manually maintain a second (possibly weight-ordered) data structure of the top N items. This data structure is updated anytime the original data structure is modified. You could create your own data structure to wrap the original list and this "top N cache" together.
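The option of manually maintaining a sorted list with Collections.binarySearch() could be sketched like this. It stores plain Integer weights for brevity, whereas real items would carry a weight field; the class name is made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SortedWeights {
    private final List<Integer> weights = new ArrayList<Integer>(); // ascending order

    // Inserts while keeping the list sorted: O(log n) search plus O(n) shifting.
    void insert(int weight) {
        int pos = Collections.binarySearch(weights, Integer.valueOf(weight));
        if (pos < 0) {
            pos = -pos - 1; // binarySearch returns -(insertion point) - 1 when absent
        }
        weights.add(pos, weight);
    }

    // Returns up to n of the largest weights, highest first.
    List<Integer> top(int n) {
        List<Integer> result = new ArrayList<Integer>();
        for (int i = weights.size() - 1; i >= 0 && result.size() < n; i--) {
            result.add(weights.get(i));
        }
        return result;
    }
}
```

Reading the top eight items then costs only eight list accesses, at the price of an O(n) insertion whenever the data changes.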
You could use a max-heap.
If your data originates from a database, put an index on that column and use ORDER BY and TOP or LIMIT to fetch only the records you need to display.
Or a priority queue.
using dollar:
List<Integer> topTen = $(list).sort().slice(10).toList();
Without using dollar, you should sort the list using Collections.sort(), then get the first n items using list.subList(0, n).
Since you say the list of items from which to extract these top N may be of any size, and so may be large I assume, I'd augment the simple sort() answers above (which are entirely appropriate for reasonably-sized input) by suggesting most of the work here is finding the top N -- then sorting those N is trivial. That is:
Queue<Integer> topN = new PriorityQueue<Integer>(n);
for (Integer item : input) {
if (topN.size() < n) {
topN.add(item);
} else if (item > topN.peek()) {
topN.add(item);
topN.poll();
}
}
List<Integer> result = new ArrayList<Integer>(n);
result.addAll(topN);
Collections.sort(result, Collections.reverseOrder());
The heap here (a min-heap) is at least bounded in size. There's no real need to make a heap out of all your items.
No, not really. At least not using Java's built-in methods.
There are clever ways to get the highest (or lowest) N number of items from a list quicker than an O(n*log(n)) operation, but that will require you to code this solution by hand. If the number of items stays relatively low (not more than a couple of hundred), sorting it using Collections.sort() and then grabbing the top N numbers is the way to go IMO.
Depends on how many. Let's define n as the total number of keys, and m as the number you wish to display.
Sorting the entire thing: O(n log n)
Scanning the array each time for the next highest number: O(n*m)
So the question is: what's the relation between n and m?
If m < log n, scanning will be more efficient.
Otherwise, m >= log n, which means sorting will be better. (For the edge case of m = log n it doesn't actually matter, but sorting will also give you the benefit of, well, sorting the array, which is always nice.)
If the size of the list is N, and the number of items to be retrieved is K, you need to call Heapify on the list, which converts the list (which has to be indexable, e.g. an array) into a priority queue. (See heapify function in http://en.wikipedia.org/wiki/Heapsort)
Retrieving an item on the top of the heap (the max item) takes O (lg N) time. So your overall time would be:
O(N + k lg N)
which is better than O (N lg N) assuming k is much smaller than N.
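A hand-written sketch of the heapify-then-extract idea on an int array. It implements the bottom-up heap construction directly, since this illustration does not depend on any library heap; the class and method names are hypothetical.

```java
class TopKByHeap {
    // Restores the max-heap property for the subtree rooted at i within heap[0 .. size-1].
    private static void siftDown(int[] heap, int i, int size) {
        while (true) {
            int left = 2 * i + 1;
            int right = left + 1;
            int largest = i;
            if (left < size && heap[left] > heap[largest]) largest = left;
            if (right < size && heap[right] > heap[largest]) largest = right;
            if (largest == i) {
                return;
            }
            int tmp = heap[i];
            heap[i] = heap[largest];
            heap[largest] = tmp;
            i = largest;
        }
    }

    // Returns the k largest values in descending order: O(N) heapify + O(k lg N) removals.
    static int[] topK(int[] values, int k) {
        int[] heap = values.clone();
        int size = heap.length;
        for (int i = size / 2 - 1; i >= 0; i--) { // bottom-up heapify
            siftDown(heap, i, size);
        }
        int[] result = new int[Math.min(k, size)];
        for (int n = 0; n < result.length; n++) {
            result[n] = heap[0];    // current maximum
            heap[0] = heap[--size]; // move the last leaf to the root
            siftDown(heap, 0, size);
        }
        return result;
    }
}
```

For example, topK(new int[] {3, 8, 1, 9, 4, 7}, 3) yields {9, 8, 7}.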
If keeping a sorted array or using a different data structure is not an option, you could try something like the following. The O time is similar to sorting the large array but in practice this should be more efficient.
small_array = big_array.slice( number_of_items_to_find );
small_array.sort();
least_found_value = small_array.get(0).value;
for ( item in big_array ) { // needs to skip first few items
if ( item.value > least_found_value ) {
small_array.remove(0);
small_array.insert_sorted(item);
least_found_value = small_array.get(0).value;
}
}
small_array could be an Object[] and the inner loop could be done with swapping instead of actually removing and inserting into an array.
