Efficiently finding the intersection of a variable number of sets of strings - java

I have a variable number of ArrayLists that I need to find the intersection of. A realistic cap on the number of sets of strings is probably around 35, but it could be more. I don't want any code, just ideas on what could be efficient. I have an implementation that I'm about to start coding but want to hear some other ideas.
Currently, just thinking about my solution, it looks like I should have an asymptotic run-time of Θ(n²).
Thanks for any help!
tshred
Edit: To clarify, I really just want to know whether there is a faster way to do it, faster than Θ(n²).

Set.retainAll() is how you find the intersection of two sets. If you use HashSet, then converting your ArrayLists to Sets and using retainAll() in a loop over all of them is actually O(n).
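A minimal sketch of that loop, assuming the input arrives as a list of lists of strings. The class and method names (IntersectAll, intersectAll) are made up for illustration and are not from the question:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class IntersectAll {
    // Intersects any number of lists. Each retainAll pass walks the running
    // result and does a hash lookup per element, so the expected total cost
    // is linear in the combined size of the inputs.
    static Set<String> intersectAll(List<List<String>> lists) {
        if (lists.isEmpty()) {
            return new HashSet<>();
        }
        Set<String> result = new HashSet<>(lists.get(0));
        for (List<String> list : lists.subList(1, lists.size())) {
            result.retainAll(new HashSet<>(list));
            if (result.isEmpty()) {
                break; // nothing left to intersect
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> lists = List.of(
                List.of("a", "b", "c"),
                List.of("b", "c", "d"),
                List.of("c", "b", "e"));
        System.out.println(intersectAll(lists));
    }
}
```

Wrapping each list in a HashSet before calling retainAll matters: passing the ArrayList directly would make every contains() check a linear scan and bring back the quadratic behaviour.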

The accepted answer is just fine. As an update: since Java 8 there is a slightly more efficient way to find the intersection of two Sets.
Set<String> intersection = set1.stream()
.filter(set2::contains)
.collect(Collectors.toSet());
The reason it is slightly more efficient is because the original approach had to add elements of set1 it then had to remove again if they weren't in set2. This approach only adds to the result set what needs to be in there.
Strictly speaking you could do this pre Java 8 as well, but without Streams the code would have been quite a bit more laborious.
If both sets differ considerably in size, you would prefer streaming over the smaller one.
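The last point can be sketched like this: pick the smaller set to stream over and probe the larger one. The helper name intersect and the class StreamIntersection are assumptions for illustration only:

```java
import java.util.Set;
import java.util.stream.Collectors;

final class StreamIntersection {
    // Streams over the smaller set and keeps only elements found in the larger one.
    static Set<String> intersect(Set<String> a, Set<String> b) {
        Set<String> smaller = a.size() <= b.size() ? a : b;
        Set<String> larger = (smaller == a) ? b : a;
        return smaller.stream()
                .filter(larger::contains)
                .collect(Collectors.toSet());
    }
}
```

With hash-based sets, the work is then proportional to the size of the smaller set rather than the larger one.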

There is also the static method Sets.intersection(set1, set2) in Google Guava that returns an unmodifiable view of the intersection of two sets.

One more idea - if your arrays/sets are different sizes, it makes sense to begin with the smallest.

The best option would be to use HashSet to store the contents of these lists instead of ArrayList. If you can do that, you can create a temporary HashSet to which you add the elements to be intersected (use the addAll(..) method). Do tempSet.retainAll(storedSet) and tempSet will contain the intersection.

Sort them (n lg n) and then do binary searches (lg n).

You can use a single HashSet. Its add() method returns false when the object is already in the set. Adding the objects from all the lists and counting the false return values gives you the union in the set plus the data for a histogram (the objects whose count + 1 equals the number of lists are your intersection). This assumes no list contains the same string twice. If you put the counts in a TreeSet, you can detect an empty intersection early.
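A rough sketch of the counting idea, with one substitution: it keeps the counts in a HashMap instead of deriving them from add()'s return value, and it deduplicates each list first so that repeated strings inside one list are counted once. The names here (CountingIntersection, intersect) are hypothetical:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CountingIntersection {
    static Set<String> intersect(List<List<String>> lists) {
        // How many lists each string appears in.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> list : lists) {
            for (String s : new HashSet<>(list)) { // count each string once per list
                counts.merge(s, 1, Integer::sum);
            }
        }
        // A string is in the intersection if every list contained it.
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() == lists.size()) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

The key set of the map is the union, so this variant is mainly attractive when both union and intersection are needed from one pass.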

If you only need to know whether two sets intersect at all, I use the following snippet on Java 8+:
set1.stream().anyMatch(set2::contains)

Related

Java efficiency (Calling method)

I'm writing a method that finds intersection of given arrays.
Since I had to iterate both arrays, I was thinking of using
if(arr2.length > arr1.length){
intersection(arr2, arr1);
}
The reason I came up with this idea was that it seemed to me to be the better way to reduce the length of code for handling arrays that have different length.
As a newbie to programming, I'm wondering if there is any other suggestion.
Put your array in a list:
List a = Arrays.asList(arr1);
(ref: https://docs.oracle.com/javase/7/docs/api/java/util/Arrays.html#asList(T...))
And use the suggestion here to find the intersection:
How to do union, intersect, difference and reverse data in java
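Putting those two steps together might look like the sketch below. It assumes String[] inputs, since the question does not state the element type, and the class name ArrayIntersection is invented for the example:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

final class ArrayIntersection {
    // Returns the elements of arr1 that also occur in arr2, in arr1's order.
    static Set<String> intersection(String[] arr1, String[] arr2) {
        Set<String> result = new LinkedHashSet<>(Arrays.asList(arr1));
        result.retainAll(new HashSet<>(Arrays.asList(arr2)));
        return result;
    }
}
```

Because the sets handle membership, there is no need to swap the arguments based on array length for correctness; the call works the same either way round.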

How to compare 2 Lists in Java

I am trying to compare 2 lists to each other. Both lists have tens of thousands of entries.
My idea so far has been to use 2 ArrayLists and comparing them element by element. However I have been told that comparing too much can corrupt eclipse. No idea if this is true. Though better safe than sorry.
If you know any tips on comparing tens of thousands of Strings, please let me know. Thanks.
You have to compare each individual element in both arrays. Try sorting the arrays and then using a for loop to run through them.
What does "corrupt eclipse" mean?
However, no: if you have two ArrayLists and you want to know whether they are equal, you have to compare the elements until you find a difference or reach the end of a list. The complexity is O(n), linear, which is the best you can get without some preprocessing (which itself would be O(n) in the best case).
If you are comparing 2 numbers, the expected output can be -
-> first greater than second
-> first less than second
-> first equal to second
But, what do you mean by comparing 2 lists? Are you planning to compare the length of both the lists? If so, you get the size of both the lists and compare them.
If you want to compare every element in the list with the every other element, it would mean, you are trying to sort the list. If you are planning to do the same thing with 2 lists, it might mean- you are trying to merge-sort both the lists. Meaning, you are creating 1 list containing all the elements of both the lists in sorted order.
This following video can help you for understanding merge-sort.
https://www.youtube.com/watch?v=EeQ8pwjQxTM
This code would have an average time complexity of O(n log n). And for the lists with tens of thousands of elements, the algorithm would have considerable time complexity, but eclipse wouldn't corrupt. Worst case, if your code is not written properly, memory leaks can cause a problem in the JVM.
I hope my answer helps you. I might help you in a better way if your question is clearer and more specific.

Efficient Intersection and Union of Lists of Strings

I need to efficiently find the ratio of (intersection size / union size) for pairs of Lists of strings. The lists are small (mostly about 3 to 10 items), but I have a huge number of them (~300K) and have to do this on every pair, so I need this actual computation to be as efficient as possible. The strings themselves are short unicode strings -- averaging around 5-10 unicode characters.
The accepted answer here Efficiently compute Intersection of two Sets in Java? looked extremely helpful but (likely because my sets are small (?)) I haven't gotten much improvement by using the approach suggested in the accepted answer.
Here's what I have so far:
protected double uuEdgeWeight(UVertex u1, UVertex u2) {
Set<String> u1Tokens = new HashSet<String>(u1.getTokenlist());
List<String> u2Tokens = u2.getTokenlist();
int intersection = 0;
int union = u1Tokens.size();
for (String s:u2Tokens) {
if (u1Tokens.contains(s)) {
intersection++;
} else {
union++;
}
}
return ((double) intersection / union);
}
My question is, is there anything I can do to improve this, given that I'm working with Strings which may be more time consuming to check equality than other data types.
I think because I'm comparing multiple u2's against the same u1, I could get some improvement by doing the cloning of u2 into a HashSet outside of the loop (which isn't shown -- meaning I'd pass in the HashSet instead of the object from which I could pull the list and then clone into a set)
Anything else I can do to squeak out even a small improvement here?
Thanks in advance!
Update
I've updated the numeric specifics of my problem above. Also, due to the nature of the data, most (90%?) of the intersections are going to be empty. My initial attempt at this cloned the set and then called retainAll with the other set to find the intersection, and then shortcut out before doing the clone and addAll to find the union. That was about as efficient as the code posted above, presumably because of the trade-off between it being a slower algorithm overall and being able to shortcut out a lot of the time. So, I'm thinking about ways to take advantage of the infrequency of overlapping sets, and would appreciate any suggestions in that regard.
Thanks in advance!
You would get a large improvement by moving the HashSet outside of the loop.
If the HashSet really has only got a few entries in it then you are probably actually just as fast to use an Array - since traversing an array is much simpler/faster. I'm not sure where the threshold would lie but I'd measure both - and be sure that you do the measurements correctly. (i.e. warm up loops before timed loops, etc).
One thing to try might be using a sorted array for the things to compare against. Scan until you go past current and you can immediately abort the search. That will improve processor branch prediction and reduce the number of comparisons a bit.
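One way to read the sorted-array suggestion is a two-pointer merge over two sorted token arrays. This is only a sketch: it assumes each vertex's tokens are sorted once up front and contain no duplicates, and SortedJaccard is a made-up name:

```java
final class SortedJaccard {
    // Both arrays must be sorted (natural String order) and free of duplicates.
    static double jaccard(String[] a, String[] b) {
        int i = 0, j = 0, intersection = 0;
        while (i < a.length && j < b.length) {
            int cmp = a[i].compareTo(b[j]);
            if (cmp == 0) {        // same token in both arrays
                intersection++;
                i++;
                j++;
            } else if (cmp < 0) {  // a[i] cannot match anything later in b
                i++;
            } else {               // b[j] cannot match anything later in a
                j++;
            }
        }
        int union = a.length + b.length - intersection;
        return union == 0 ? 0.0 : (double) intersection / union;
    }
}
```

Whether this beats the HashSet version for 3 to 10 items is something to measure rather than assume.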
If you want to optimize this function (I'm not sure whether it works in your context), you could assign each unique String an int value and, when a String is added to a UVertex, set that int as a bit in a BitSet.
This function then becomes a set.or(otherset) and a set.and(otherset). Depending on the number of unique Strings, that could be efficient.
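A hedged sketch of the BitSet idea follows. The encoder class, its id map, and the method names are all assumptions for illustration; the question's UVertex class is not modelled here:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BitSetJaccard {
    // Assigns each distinct string the next free integer id.
    private final Map<String, Integer> ids = new HashMap<>();

    // Encodes a token list as a BitSet with one bit per distinct token id.
    BitSet encode(List<String> tokens) {
        BitSet bits = new BitSet();
        for (String t : tokens) {
            Integer id = ids.get(t);
            if (id == null) {
                id = ids.size();
                ids.put(t, id);
            }
            bits.set(id);
        }
        return bits;
    }

    // intersection size / union size, computed with and() and or() on copies.
    static double jaccard(BitSet a, BitSet b) {
        BitSet and = (BitSet) a.clone();
        and.and(b);
        BitSet or = (BitSet) a.clone();
        or.or(b);
        int union = or.cardinality();
        return union == 0 ? 0.0 : (double) and.cardinality() / union;
    }
}
```

Since most pairs are said to have an empty intersection, a call to a.intersects(b) before cloning could skip the bulk of the work for those pairs. With a very large vocabulary the BitSets become long and sparse, which may cancel the benefit.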

Quick call to generate array containing number from 0 to N

Simple question here -- mostly about APIs.
I want to iterate through an array in random order.
It is easy enough to:
1. Fill a List with the numbers 0 to N.
2. Shuffle the List with Collections.shuffle.
3. Use this shuffled list to guide my array iteration.
However, I was wondering if step 1 (generating the list of numbers from 0 to N) exists somewhere in prewritten code.
For instance, could it be a convenience method in guava's XYZ class??
The closest thing in Guava would be
ContiguousSet.create(Range.closedOpen(0, n), DiscreteDomains.integers())
...but, frankly, it is probably more readable just to write the for loop yourself.
Noting specifically your emphasis on 'quick', I can't imagine there'd be much quicker than
List<Integer> list = new ArrayList<Integer>(range);
and then iterating and populating each entry. Note that I set the capacity in order to avoid having the list resize under the covers.
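For completeness, the populate-then-shuffle approach can be sketched as follows. It uses only java.util, and shuffledRange is a hypothetical helper name rather than an existing library method:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ShuffledIndices {
    // Returns the integers 0 (inclusive) to n (exclusive) in random order.
    static List<Integer> shuffledRange(int n) {
        List<Integer> indices = new ArrayList<>(n); // pre-sized to avoid resizing
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices);
        return indices;
    }

    public static void main(String[] args) {
        String[] data = {"a", "b", "c", "d"};
        for (int i : shuffledRange(data.length)) {
            System.out.println(data[i]); // visits every element once, in random order
        }
    }
}
```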
You may want to check out Apache Commons, which, among many other useful functions, implements the nextPermutation method in the RandomDataGenerator class.
This is obviously something much bigger than a method for populating a List or array, but the Commons libraries are really powerful and offer many more good methods for mathematical computations.
Java doesn't allow you to auto-populate your values. See this question for the ways to populate an array in java
"Creating an array of numbers without looping?"
If you skip step 1 and just do the shuffling immediately, I think you will have the fastest solution.
int range = 1000;
List<Integer> arr = new ArrayList<Integer>(range);
for(int i=0;i<range;i++) {
arr.add((int)(Math.random()*(i+1)), i); // index in [0, i], so the new value can also land at the end
}

Efficient way of finding intersection of 5 Sets

I have 5 sets which have numeric values. I am interested in finding the intersection of all 5 sets.
Right now, I am thinking of the following
Do a Collections.sort() on all 5 sets
Find the shortest set and do a
shortestSet.retainAll(otherSet);
on all the other sets.
Is there a more efficient way of doing this ?
Your solution looks right to me, if we understand that when you write Collections.sort() you are sorting the list of sets, according to their sizes. The rationale would be that, if we are going to use set1.retainAll(set2) (and if the sets are HashSets) each intersection running time should be roughly linear on the number of elements of set1. So it makes sense to start with the smallest one.
Your solution is fine. There would be no need to sort the numbers before calling the retainAll method though.
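The smallest-first strategy discussed in these answers could be sketched as below. It orders the sets by size and copies the smallest so the inputs are not modified; the class and method names are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SmallestFirstIntersection {
    static <T> Set<T> intersect(List<Set<T>> sets) {
        if (sets.isEmpty()) {
            return new HashSet<>();
        }
        // Order the sets by size (not their contents), smallest first.
        List<Set<T>> bySize = new ArrayList<>(sets);
        bySize.sort(Comparator.comparingInt(Set::size));

        // Start from a copy of the smallest set and shrink it further.
        Set<T> result = new HashSet<>(bySize.get(0));
        for (int i = 1; i < bySize.size() && !result.isEmpty(); i++) {
            result.retainAll(bySize.get(i));
        }
        return result;
    }
}
```

If the inputs are HashSets, each retainAll call costs roughly the current size of result, which is why starting small helps.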
Try to use
static <E> Sets.SetView<E> intersection(Set<E> set1, Set<?> set2)
In Google Guava
