I'm writing a method that finds the intersection of two given arrays.
Since I had to iterate both arrays, I was thinking of using
if(arr2.length > arr1.length){
intersection(arr2, arr1);
}
The reason I came up with this idea was that it seemed like a good way to reduce the amount of code needed to handle arrays of different lengths.
As a newbie to programming, I'm wondering if there is any other suggestion.
Put your array in a list:
List a = Arrays.asList(arr1);
(ref: https://docs.oracle.com/javase/7/docs/api/java/util/Arrays.html#asList(T...))
And use the suggestion here to find the intersection:
How to do union, intersect, difference and reverse data in java
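A minimal sketch of that suggestion, assuming object arrays (Arrays.asList does not unpack primitive arrays such as int[] into their elements). The class and method names are made up for illustration. Because retainAll works the same way whichever array is longer, no length check or argument swap is needed:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not taken from the linked answer.
final class ArrayIntersection {
    // Returns the distinct elements present in both arrays,
    // in the order they first appear in arr1.
    static <T> Set<T> intersection(T[] arr1, T[] arr2) {
        // Arrays.asList is a fixed-size view, so copy it into a mutable set first.
        Set<T> result = new LinkedHashSet<>(Arrays.asList(arr1));
        // Wrapping arr2 in a HashSet keeps each membership test cheap.
        result.retainAll(new HashSet<>(Arrays.asList(arr2)));
        return result;
    }
}
```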
What is the best way (in terms of big-O) to search for and replace an element in a multidimensional unsorted array, retaining the structure and not transforming it into another data structure?
I'm preferably looking for a solution in Java without using any language specific libraries.
If you have an unsorted collection/array, you won't reliably do better than O(n), because you cannot predict where the target element is; in the worst case you will have to visit every element.
This is why collections that remain sorted exist.
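For illustration, a plain linear scan over a jagged int[][] could look like the sketch below. The element type and the names are assumptions, since the question does not show its data. It uses no library calls and replaces the element in place, so the structure is retained:

```java
// Hypothetical example; the question does not specify the element type.
final class GridReplace {
    // Replaces the first occurrence of target and reports whether one was found.
    // Worst case visits every element once: O(n) for n elements in total.
    static boolean replaceFirst(int[][] grid, int target, int replacement) {
        for (int[] row : grid) {
            for (int i = 0; i < row.length; i++) {
                if (row[i] == target) {
                    row[i] = replacement; // in place, so the array shape is unchanged
                    return true;
                }
            }
        }
        return false;
    }
}
```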
I need to efficiently find the ratio of (intersection size / union size) for pairs of Lists of strings. The lists are small (mostly about 3 to 10 items), but I have a huge number of them (~300K) and have to do this on every pair, so I need this actual computation to be as efficient as possible. The strings themselves are short unicode strings -- averaging around 5-10 unicode characters.
The accepted answer here Efficiently compute Intersection of two Sets in Java? looked extremely helpful but (likely because my sets are small (?)) I haven't gotten much improvement by using the approach suggested in the accepted answer.
Here's what I have so far:
protected double uuEdgeWeight(UVertex u1, UVertex u2) {
Set<String> u1Tokens = new HashSet<String>(u1.getTokenlist());
List<String> u2Tokens = u2.getTokenlist();
int intersection = 0;
int union = u1Tokens.size();
for (String s:u2Tokens) {
if (u1Tokens.contains(s)) {
intersection++;
} else {
union++;
}
}
return ((double) intersection / union);
}
My question is, is there anything I can do to improve this, given that I'm working with Strings which may be more time consuming to check equality than other data types.
I think because I'm comparing multiple u2's against the same u1, I could get some improvement by doing the cloning of u2 into a HashSet outside of the loop (which isn't shown -- meaning I'd pass in the HashSet instead of the object from which I could pull the list and then clone into a set)
Anything else I can do to squeak out even a small improvement here?
Thanks in advance!
Update
I've updated the numeric specifics of my problem above. Also, due to the nature of the data, most (90%?) of the intersections are going to be empty. My initial attempt cloned one set and called retainAll with the other to find the intersection, and then shortcut out before doing the clone and addAll to find the union. That was about as efficient as the code posted above, presumably because of the trade-off between it being a slower algorithm overall and being able to shortcut out a lot of the time. So, I'm thinking about ways to take advantage of the infrequency of overlapping sets, and would appreciate any suggestions in that regard.
Thanks in advance!
You would get a large improvement by moving the HashSet outside of the loop.
If the HashSet really has only got a few entries in it then you are probably actually just as fast to use an Array - since traversing an array is much simpler/faster. I'm not sure where the threshold would lie but I'd measure both - and be sure that you do the measurements correctly. (i.e. warm up loops before timed loops, etc).
One thing to try might be using a sorted array for the things to compare against. Scan until you go past current and you can immediately abort the search. That will improve processor branch prediction and reduce the number of comparisons a bit.
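A merge-style variant of the sorted-array idea, as a rough sketch: it assumes each token list has been deduplicated and sorted once up front (not shown), and the class name is made up. Each pair is then handled in a single pass with no hashing:

```java
// Hypothetical sketch; both inputs must be sorted and free of duplicates.
final class SortedJaccard {
    static double jaccard(String[] a, String[] b) {
        int i = 0, j = 0, intersection = 0;
        // Walk both arrays in step; stop as soon as either one is exhausted.
        while (i < a.length && j < b.length) {
            int cmp = a[i].compareTo(b[j]);
            if (cmp == 0) {
                intersection++;
                i++;
                j++;
            } else if (cmp < 0) {
                i++; // a[i] is smaller than everything left in b
            } else {
                j++; // b[j] is smaller than everything left in a
            }
        }
        int union = a.length + b.length - intersection;
        return union == 0 ? 0.0 : (double) intersection / union;
    }
}
```

Whether this beats the HashSet version for lists of 3 to 10 items is something to measure rather than assume.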
If you want to optimize this function (not sure if it actually works in your context), you could assign each unique String an int value; when the String is added to the UVertex, set the bit at that index in a BitSet.
This function should then become a set.or(otherset) and a set.and(otherset). Depending on the number of unique Strings that could be efficient.
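A sketch of the BitSet idea, under the assumption that every token list can be encoded once before the pairwise comparisons start. The class and its id map are hypothetical. BitSet.intersects gives a cheap early exit for the common case where nothing overlaps:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical encoder; not part of the original UVertex class.
final class BitSetJaccard {
    private final Map<String, Integer> ids = new HashMap<>();

    // Encodes a token list, giving each previously unseen string the next free index.
    BitSet encode(List<String> tokens) {
        BitSet bits = new BitSet();
        for (String t : tokens) {
            Integer id = ids.get(t);
            if (id == null) {
                id = ids.size();
                ids.put(t, id);
            }
            bits.set(id);
        }
        return bits;
    }

    static double jaccard(BitSet a, BitSet b) {
        if (!a.intersects(b)) {
            return 0.0; // empty intersection, no cloning needed
        }
        BitSet and = (BitSet) a.clone(); // and()/or() mutate, so work on copies
        and.and(b);
        BitSet or = (BitSet) a.clone();
        or.or(b);
        return (double) and.cardinality() / or.cardinality();
    }
}
```

With a very large vocabulary the bit sets become long and sparse, so this is worth benchmarking against the set-based version.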
I have a String array
String myArray [] = {"user1", "doc2", "doc5", "user2", "doc3", "doc6", "doc8", "user3", "doc 10" };
The meaning is that user1 has doc2 and doc5 ... user3 has only doc10 etc
I want to transform this array into a two dimensional jagged
String myArray2 [] [] = { {"user1", "doc2", "doc5"} , {"user2", "doc3", "doc6", "doc8"} , {"user3", "doc 10"} };
How can I do this most effectively? ( I have a logic that works by encountering an element that has a substring "user" and creating a new element of the jagged array. But I am sure my algorithm is far from the most efficient one)
A few things:
First, it sounds like you already have an algorithm that works. What is the problem? In cases like this there is no "most effective". There is only effective (it produces correct results), and not effective (it produces incorrect results). You have created an algorithm that satisfies your requirements. You are done.
Second, what do you mean by "most efficient"? Are your specific performance requirements not being met? If not, profile, identify the bottleneck, and optimize there. Is this part of the code slowing down your software to the point that an improvement would be noticeable? If not, your performance requirements are satisfied, and you are done.
Or, by "most efficient", do you mean "fewer lines of code"? Why? Is your code clear? Is it easy to maintain and can a reader easily see what its intentions are? If not, consider adding descriptive comments. If so, then you have nothing to gain and you are done.
If your algorithm is not behaving as intended, then you may post your specific problem, as well as how your actual and expected outputs differ. However, it sounds like you don't have any problems at all here. I recommend moving on to the next development task.
All that said, the way you are doing it is reasonable. Your task is essentially parsing tokens. Go through the array, find user strings, associate following values with that user, store data. There isn't much more you can do.
There is very little that you can do to make it efficient: you do need to create all the arrays that go into the jagged array, and you also need to do all the copying. There are no opportunities for saving much CPU time here.
You can avoid resizing arrays by first counting how many sub-arrays the result is going to have, creating the result array, and then scanning the array from the beginning again for the next position of "userX", calling System.arraycopy to copy the elements into the sub-arrays of the output array.
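The two-pass approach could be sketched like this. It assumes group markers are recognised with startsWith("user") (the question mentions a substring test, so adjust as needed) and the class name is made up:

```java
// Hypothetical sketch of the count-then-copy approach.
final class JaggedSplit {
    static String[][] split(String[] flat) {
        // Pass 1: count the "user" markers so the outer array is sized once.
        int groups = 0;
        for (String s : flat) {
            if (s.startsWith("user")) {
                groups++;
            }
        }
        String[][] result = new String[groups][];

        // Pass 2: copy each run, from one marker up to the next, in a single call.
        int g = 0;
        int start = -1; // index of the current marker; -1 until the first one is seen
        for (int i = 0; i <= flat.length; i++) {
            boolean boundary = (i == flat.length) || flat[i].startsWith("user");
            if (!boundary) {
                continue;
            }
            if (start >= 0) {
                result[g] = new String[i - start];
                System.arraycopy(flat, start, result[g], 0, i - start);
                g++;
            }
            start = i; // entries before the first marker are skipped
        }
        return result;
    }
}
```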
I have 5 sets that contain numeric values. I am interested in finding the intersection of all 5 sets.
Right now, I am thinking of the following
Do a Collections.sort() on all 5 sets
Find the shortest set and do a
shortestSet.retainAll(otherSet);
on all the other sets.
Is there a more efficient way of doing this ?
Your solution looks right to me, if we understand that when you write Collections.sort() you are sorting the list of sets, according to their sizes. The rationale would be that, if we are going to use set1.retainAll(set2) (and if the sets are HashSets) each intersection running time should be roughly linear on the number of elements of set1. So it makes sense to start with the smallest one.
Your solution is fine. There would be no need to sort the numbers before calling the retainAll method though.
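As a rough sketch of the approach discussed here: order the sets by size, copy the smallest, and retain against the rest. The class and method names are made up, and the copy is there so the caller's sets are left unmodified, which the question does not require:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for intersecting any number of sets.
final class MultiSetIntersection {
    static <T> Set<T> intersectAll(List<? extends Set<T>> sets) {
        if (sets.isEmpty()) {
            return new HashSet<>();
        }
        // Order the sets by size; the elements inside each set are never sorted.
        List<Set<T>> bySize = new ArrayList<>(sets);
        bySize.sort(Comparator.comparingInt(Set::size));

        // Start from a copy of the smallest set so the inputs stay untouched.
        Set<T> result = new HashSet<>(bySize.get(0));
        // Stop early once nothing is left to intersect.
        for (int i = 1; i < bySize.size() && !result.isEmpty(); i++) {
            result.retainAll(bySize.get(i));
        }
        return result;
    }
}
```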
Try to use
static <E> Sets.SetView<E> intersection(Set<E> set1, Set<?> set2)
In Google Guava
I have a variable number of ArrayList's that I need to find the intersection of. A realistic cap on the number of sets of strings is probably around 35 but could be more. I don't want any code, just ideas on what could be efficient. I have an implementation that I'm about to start coding but want to hear some other ideas.
Currently, just thinking about my solution, it looks like I should have an asymptotic run-time of Θ(n²).
Thanks for any help!
tshred
Edit: To clarify, I really just want to know is there a faster way to do it. Faster than Θ(n²).
Set.retainAll() is how you find the intersection of two sets. If you use HashSet, then converting your ArrayLists to Sets and using retainAll() in a loop over all of them is actually O(n) in the total number of elements (expected time, assuming a reasonable hash function).
The accepted answer is just fine. As an update: since Java 8 there is a slightly more efficient way to find the intersection of two Sets.
Set<String> intersection = set1.stream()
.filter(set2::contains)
.collect(Collectors.toSet());
The reason it is slightly more efficient is that the original approach had to add elements of set1 that it then had to remove again if they weren't in set2. This approach only adds to the result set what needs to be in there.
Strictly speaking you could do this pre Java 8 as well, but without Streams the code would have been quite a bit more laborious.
If both sets differ considerably in size, you would prefer streaming over the smaller one.
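One way to apply that, as a small sketch with a made-up helper name: pick the smaller set to stream and test membership against the larger one.

```java
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper wrapping the stream-based intersection.
final class SmallerFirst {
    static <T> Set<T> intersection(Set<T> a, Set<T> b) {
        // Stream the smaller set so fewer contains() calls are made.
        Set<T> small = a.size() <= b.size() ? a : b;
        Set<T> large = (small == a) ? b : a;
        return small.stream()
                .filter(large::contains)
                .collect(Collectors.toSet());
    }
}
```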
There is also the static method Sets.intersection(set1, set2) in Google Guava that returns an unmodifiable view of the intersection of two sets.
One more idea - if your arrays/sets are different sizes, it makes sense to begin with the smallest.
The best option would be to use HashSet to store the contents of these lists instead of ArrayList. If you can do that, you can create a temporary HashSet to which you add the elements to be intersected (use the addAll(..) method). Do tempSet.retainAll(storedSet) and tempSet will contain the intersection.
Sort them (n lg n) and then do binary searches (lg n).
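For two lists, the sort-then-search idea might look like the sketch below (names are hypothetical). It sorts a copy of one list and binary-searches it for each element of the other; duplicates in the first list would show up more than once in the result:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of sort + binary search for two lists of strings.
final class SortSearchIntersection {
    static List<String> intersection(List<String> a, List<String> b) {
        // Sort a copy of b: O(m log m). The caller's list is not reordered.
        String[] sorted = b.toArray(new String[0]);
        Arrays.sort(sorted);

        // Look up each element of a: O(log m) per lookup.
        List<String> out = new ArrayList<>();
        for (String s : a) {
            if (Arrays.binarySearch(sorted, s) >= 0) {
                out.add(s);
            }
        }
        return out;
    }
}
```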
You can use a single HashSet. Its add() method returns false when the object is already in the set. Adding objects from the lists and counting the false return values will give you the union in the set plus data for a histogram (the objects whose count+1 equals the number of lists are your intersection). If you throw the counts into a TreeSet, you can detect an empty intersection early.
If you only need to know whether two sets intersect at all, I use the following snippet on Java 8+:
set1.stream().anyMatch(set2::contains)