I have 5 sets which has numeric values. I am interested in finding the intersection of all 5 sets.
Right now, I am thinking of the following
Do a Collections.sort() on all 5 sets
Find the shortest set and do a
shortestSet.retainAll(otherSet);
on all the other sets.
Is there a more efficient way of doing this ?
Your solution looks right to me, if we understand that when you write Collections.sort() you are sorting the list of sets, according to their sizes. The rationale would be that, if we are going to use set1.retainAll(set2) (and if the sets are HashSets) each intersection running time should be roughly linear on the number of elements of set1. So it makes sense to start with the smallest one.
Your solution is fine. There would be no need to sort the numbers before calling the retainAll method though.
Try to use
static <E> Sets.SetView<E> intersection(Set<E> set1, Set<?> set2)
In Google Guava
Related
I'm writing a method that finds intersection of given arrays.
Since I had to iterate both arrays, I was thinking of using
if(arr2.length > arr1.length){
intersection(arr2, arr1);
}
The reason I came up with this idea was that it seemed to me to be the better way to reduce the length of code for handling arrays that have different length.
As a newbie to programming, I'm wondering if there is any other suggestion.
Put your array in a list:
List a = List.asArray(arr1);
(ref: https://docs.oracle.com/javase/7/docs/api/java/util/Arrays.html#asList(T...)
And use the suggestion here to find the intersection:
How to do union, intersect, difference and reverse data in java
For a problem I'm designing in Java, I'm given a list of dates and winning lottery numbers. I'm supposed to do things with them, and spit them back out in order. I decided to choose a LinkedHashMap to solve it, the Date containing the date, and int[] containing the array of winning numbers.
Thing is, when I run the .values() function, I'm noticing the numbers are no longer ordered (by insertion). The code I'm running is:
for(int i = 0; i < 30; i++){ //testing first 30 to see if ordered
System.out.println(Arrays.toString((int [])(winningNumbers.values().toArray()[i])));
}
Can anyone see what exactly I'm doing wrong? Tempted almost to just use .get() and iterate through the dates, since the dates do go in some order, but that might make using a LinkedHashMap moot. Might as well be using a 2-D ArrayList[][] in that case. Thanks in advance!
EDIT: Adding code for further examination!
Lottery Class: http://pastebin.com/9ezF5U7e
Text file: http://pastebin.com/iD8jm7f8
To test, call checkOldLTNums(). Just pass it any int[] array, it doesn't utilize it, at least relevant to this problem. The output is different from the first lines in the .txt, which is organized. Thanks!
EDIT2:
I think I figured out why it fails. I used a smaller .txt file, and the code worked perfectly. It might be that it isn't wise to load 1900 entries of stuff into memory. I suppose it's better to just load individual lines and compare them instead of grabbing everything at once. Is this logic sound? Any advice going from here would be useful.
The values() method will return the Map's values in the same order they were inserted, that's what ordered means, don't confuse it with sorted, that means a different thing - that the items follow the ordering defined by the value's Comparator or its natural ordering if no comparator was provided.
Perhaps you'd be better off using a SortedMap that keeps the items sorted by key. To summarise: LinkedHashMap is ordered and TreeMap is sorted.
Try using a TreeMap instead. You should get the values in ascending key order that way.
I can't believe I missed this. I did some more troubleshooting, and realized some of the Date keys were replacing others. After doing some testing, I finally realized that the formatting I was using was off: "dd/MM/yyyy" instead of "MM/dd/yyyy". That's what I get for copy-pasting, haha. Thanks to all who helped!
I need to efficiently find the ratio of (intersection size / union size) for pairs of Lists of strings. The lists are small (mostly about 3 to 10 items), but I have a huge number of them (~300K) and have to do this on every pair, so I need this actual computation to be as efficient as possible. The strings themselves are short unicode strings -- averaging around 5-10 unicode characters.
The accepted answer here Efficiently compute Intersection of two Sets in Java? looked extremely helpful but (likely because my sets are small (?)) I haven't gotten much improvement by using the approach suggested in the accepted answer.
Here's what I have so far:
protected double uuEdgeWeight(UVertex u1, UVertex u2) {
Set<String> u1Tokens = new HashSet<String>(u1.getTokenlist());
List<String> u2Tokens = u2.getTokenlist();
int intersection = 0;
int union = u1Tokens.size();
for (String s:u2Tokens) {
if (u1Tokens.contains(s)) {
intersection++;
} else {
union++;
}
}
return ((double) intersection / union);
My question is, is there anything I can do to improve this, given that I'm working with Strings which may be more time consuming to check equality than other data types.
I think because I'm comparing multiple u2's against the same u1, I could get some improvement by doing the cloning of u2 into a HashSet outside of the loop (which isn't shown -- meaning I'd pass in the HashSet instead of the object from which I could pull the list and then clone into a set)
Anything else I can do to squeak out even a small improvement here?
Thanks in advance!
Update
I've updated the numeric specifics of my problem above. Also, due to the nature of the data, most (90%?) of the intersections are going to be empty. My initial attempt at this used the clone the set and then retainAll the items in the other set approach to find the intersection, and then shortcuts out before doing the clone and addAll to find the union. That was about as efficient as the code posted above, presumably because of the trade of between it being a slower algorithm overall versus being able to shortcut out a lot of the time. So, I'm thinking about ways to take advantage of the infrequency of overlapping sets, and would appreciate any suggestions in that regard.
Thanks in advance!
You would get a large improvement by moving the HashSet outside of the loop.
If the HashSet really has only got a few entries in it then you are probably actually just as fast to use an Array - since traversing an array is much simpler/faster. I'm not sure where the threshold would lie but I'd measure both - and be sure that you do the measurements correctly. (i.e. warm up loops before timed loops, etc).
One thing to try might be using a sorted array for the things to compare against. Scan until you go past current and you can immediately abort the search. That will improve processor branch prediction and reduce the number of comparisons a bit.
If you want to optimize for this function (not sure if it actually works in your context) you could assign each unique String an Int value, when the String is added to the UVertex set that Int as a bit in a BitSet.
This function should then become a set.or(otherset) and a set.and(otherset). Depending on the number of unique Strings that could be efficient.
Simple question here -- mostly about APIs.
I want to iterate through an array in random order.
It is easy enough to:
fill a List with the numbers 0 to N
shuffle the List with Collections.shuffle
Use this shuffled list to guide my array iteration.
However, I was wondering if step 1 (generating the list of numbers from 0 to N) exists somewhere in prewritten code.
For instance, could it be a convenience method in guava's XYZ class??
The closest thing in Guava would be
ContiguousSet.create(Range.closedOpen(0, n), DiscreteDomains.integers())
...but, frankly, it is probably more readable just to write the for loop yourself.
Noting specifically your emphasis on 'quick', I can't imagine there'd be much quicker than
List<Integer> = new ArrayList<Integer>(range);
and then iterating and populating each entry. Note that I set the capacity in order to avoid having the list resize under the covers.
You may want to check out the Apache Commons which among many other usefull functions, implement the nextPermutation method in RandomDataGenerator class
This is obviously something much bigger then a method of populating the List or array, but commons are really powerfull libraries, which give much more good methods for mathematical computations.
Java doesn't allow you to auto-populate your values. See this question for the ways to populate an array in java
"Creating an array of numbers without looping?"
If you skip step 1 and just do the shuffeling immediately I think you will have the fastest solution.
int range = 1000;
List<Integer> arr = new ArrayList<Integer>(range);
for(int i=0;i<range;i++) {
arr.add((int)(Math.random()*i), new Integer(i));
}
I have a variable number of ArrayList's that I need to find the intersection of. A realistic cap on the number of sets of strings is probably around 35 but could be more. I don't want any code, just ideas on what could be efficient. I have an implementation that I'm about to start coding but want to hear some other ideas.
Currently, just thinking about my solution, it looks like I should have an asymptotic run-time of Θ(n2).
Thanks for any help!
tshred
Edit: To clarify, I really just want to know is there a faster way to do it. Faster than Θ(n2).
Set.retainAll() is how you find the intersection of two sets. If you use HashSet, then converting your ArrayLists to Sets and using retainAll() in a loop over all of them is actually O(n).
The accepted answer is just fine; as an update : since Java 8 there is a slightly more efficient way to find the intersection of two Sets.
Set<String> intersection = set1.stream()
.filter(set2::contains)
.collect(Collectors.toSet());
The reason it is slightly more efficient is because the original approach had to add elements of set1 it then had to remove again if they weren't in set2. This approach only adds to the result set what needs to be in there.
Strictly speaking you could do this pre Java 8 as well, but without Streams the code would have been quite a bit more laborious.
If both sets differ considerably in size, you would prefer streaming over the smaller one.
There is also the static method Sets.intersection(set1, set2) in Google Guava that returns an unmodifiable view of the intersection of two sets.
One more idea - if your arrays/sets are different sizes, it makes sense to begin with the smallest.
The best option would be to use HashSet to store the contents of these lists instead of ArrayList. If you can do that, you can create a temporary HashSet to which you add the elements to be intersected (use the putAll(..) method). Do tempSet.retainAll(storedSet) and tempSet will contain the intersection.
Sort them (n lg n) and then do binary searches (lg n).
You can use single HashSet. It's add() method returns false when the object is alredy in set. adding objects from the lists and marking counts of false return values will give you union in the set + data for histogram (and the objects that have count+1 equal to list count are your intersection). If you throw the counts to TreeSet, you can detect empty intersection early.
In case that is required the state if 2 set has intersection, I use the next snippet on Java 8+ versions code:
set1.stream().anyMatch(set2::contains)