I'm trying to write a method that will compute all permutations of a power set where order matters. I believe these are called "arrangements." What I mean by this is:
{a} -> {{a}, {}}
{a,b} -> {{a,b}, {b,a}, {a}, {b}, {}}
{a,b,c} -> {{a,b,c}, {a,c,b}, {b,a,c}, {b,c,a}, {c,a,b}, {c,b,a}, {a,b}, {a,c}, {b,a}, {b,c}, {c,a}, {c,b}, {a}, {b}, {c}, {}}
etc. My impression is that, given a set S, I should generate every permutation of every subset of S, i.e. of every element of the power set of S. So first generate the power set, then map a permutation function onto each of its sets.
The problem is that this is immensely complex -- something like O(∑n!/k!) with k=0..n.
I'm wondering if there are any existing algorithms that do this sort of thing very efficiently (perhaps a parallel implementation). Or perhaps even if a parallel powerset algorithm exists and a parallel permutation algorithm exists, I can combine the two.
Thoughts?
The Guava library from Google contains methods for permuting collections.
See the Javadoc of the class com.google.common.collect.Collections2.
To do this, you first generate the combinations for 1 to n slots, where n is the number of elements in the original set. For example, if you have 3 elements, then you will have:
C( 3, 3 ) = 1 combination (a b c)
C( 3, 2 ) = 3 combinations (a b) (a c) (b c)
C( 3, 1 ) = 3 combinations (a) (b) (c)
Now, you generate the permutations for each combination.
There are well known algorithms to calculate permutations and combinations. For example, Knuth's "The Art of Computer Programming", volume 4A, Sections 7.2.1.2 and 7.2.1.3, explain exactly how to construct the relevant algorithms.
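As a rough illustration, here is a minimal Java sketch that enumerates all arrangements of a small list. It is not the combinations-then-permutations scheme described above: it builds each arrangement directly by extending a prefix with every unused element, which yields the same result set. The class and method names (Arrangements, all, extend) are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class Arrangements {
    /** Returns every ordered selection of 0..n distinct elements of items. */
    static <T> List<List<T>> all(List<T> items) {
        List<List<T>> out = new ArrayList<>();
        extend(items, new boolean[items.size()], new ArrayList<>(), out);
        return out;
    }

    private static <T> void extend(List<T> items, boolean[] used,
                                   List<T> prefix, List<List<T>> out) {
        out.add(new ArrayList<>(prefix));      // every prefix is itself an arrangement
        for (int i = 0; i < items.size(); i++) {
            if (used[i]) continue;             // each element may appear at most once
            used[i] = true;
            prefix.add(items.get(i));
            extend(items, used, prefix, out);
            prefix.remove(prefix.size() - 1);  // backtrack
            used[i] = false;
        }
    }

    public static void main(String[] args) {
        // Prints the 16 arrangements of {a, b, c}, including the empty one.
        System.out.println(all(Arrays.asList("a", "b", "c")));
    }
}
```

The output size still grows as the sum of n!/k! for k = 0..n, so this only helps with simplicity, not with the inherent cost.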
I have a project in which I need to compare two text documents and find the similarity rate between every pair of sentences, as well as the overall similarity of the texts.
I applied some transformations to the texts, such as lowercasing all words, deleting duplicate words, and deleting punctuation except full stops. After these operations, I had 2 ArrayLists that contain the sentences with their words separated. It looks like
[["hello","world"],["welcome","here"]]
Then I sorted every sentence alphabetically. After all this, I compare the words one by one with a linear search, but if the word I'm currently looking at is bigger than the one I'm searching for (by the ASCII value of the first character, like world > burger), I skip the remaining part and jump to the next word. It may seem complicated, but what I need is an answer to "Is there any faster, more efficient common algorithm, like Boyer-Moore, hashing or others?". I'm not asking for any piece of code, just some theoretical advice. Thank you.
EDIT:
I should have mentioned the main purpose of the project. It is actually a kind of plagiarism detector. There are two text files, main.txt and sub.txt. The program compares them and gives an output like this:
Output:
Similarity rate of two texts is: %X
{The most similar sentence}
{The most similar 2nd sentence}
{The most similar 3rd sentence}
{The most similar 4th sentence}
{The most similar 5th sentence}
So I need to find the similarity rate of sub.txt to main.txt. I thought that I need to compare all the sentences in the two files with each other.
For instance, if main.txt has 10 sentences and sub.txt has 5 sentences, there will be 50 comparisons, and 50 similarity rates will be calculated and stored.
Finally, I sort the similarity rates and print the 5 most similar sentences. Actually, I've finished the project, but it's not efficient. It has 4 nested for loops and compares the same words over and over; the complexity is something like O(n^4) (maybe not that much), which is really huge in the worst case. I found the Levenshtein distance and cosine similarity algorithms, but I'm not sure about them. Thanks for any suggestions!
EDIT2:
In my case, the similarity between 2 sentences is defined like this:
main_sentence:"Hello dude how are you doing?"
sub_sentence:"Hello i'm fine dude."
Since the intersection is 2 words ["hello","dude"]
The similarity is: (number of intersecting words)*100/(number of words in the main sentence)
For this case it's: 2*100/6 = 33.3%
As a suggestion, even if this is not a complete answer to your issue: comparing Strings is usually a "heavy" operation (even if you first check their length, which is in fact one of the first things the equals() method already does when comparing Strings).
What I suggest is the following: create a dummy hashCode()-like mapping. It won't be a real hashCode(), but a number associated with the order in which each word was read by your code. Something like a cryptographic method, but much simpler.
Note that String.hashCode() on its own is not a substitute: it does return the same value for the word "Hello" in both documents, but two different words can also share a hash code, so it does not identify a word uniquely.
Data "Warming" - PreConversion
Imagine you have a shared HashMap<String,Integer> (myMap), whose key is a String and whose value is an Integer. Note that hashing String keys shorter than 10 characters (which words in English usually are) is very fast in Java's HashMap. Without any check, just put each word with its counter value:
myMap.put(yourString, ++counter);
Let's say you have 2 documents:
1.txt- Welcome mate what are you doing here
2.txt- Mate I was here before are you dumb
I assume you already lowercased all words, and removed duplicates.
You start reading the first document and assigning each word to a number. The map would look like:
KEY VALUE
welcome 1
mate 2
what 3
are 4
you 5
doing 6
here 7
Now with the second document. If a key is repeated, the put() method will update its value. So:
KEY VALUE
welcome 1
mate 8
what 3
are 13
you 14
doing 6
here 11
I 9
was 10
before 12
dumb 15
Once complete, you create another HashMap<Integer,String> (reverseMap), with the keys and values swapped:
KEY VALUE
1 welcome
8 mate
3 what
13 are
14 you
6 doing
11 here
9 I
10 was
12 before
15 dumb
You convert both documents into a List of Integers, so they look like:
1.txt- Welcome mate what are you doing here
2.txt- Mate I was here before are you dumb
to:
listOne - [1, 8, 3, 13, 14, 6, 11]
listTwo - [8, 9, 10, 11, 12, 13, 14, 15]
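The conversion above can be sketched in a few lines of Java. This is only an illustration under the same assumptions as the tables (words already lowercased and de-duplicated per document, and put() deliberately overwriting the ID of a repeated word); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WordIds {
    /** Assigns a running counter to every word read; a repeated word keeps its latest counter. */
    static Map<String, Integer> buildIds(List<List<String>> docs) {
        Map<String, Integer> ids = new HashMap<>();
        int counter = 0;
        for (List<String> doc : docs) {
            for (String word : doc) {
                ids.put(word, ++counter);   // overwrites the ID of a word seen before
            }
        }
        return ids;
    }

    /** Replaces each word of a document by its final ID. */
    static List<Integer> encode(List<String> doc, Map<String, Integer> ids) {
        List<Integer> out = new ArrayList<>();
        for (String word : doc) {
            out.add(ids.get(word));
        }
        return out;
    }
}
```

Both documents have to be fed to buildIds before either of them is encoded, because the ID of a word such as "mate" changes while the second document is being read.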
Duplicate words, positions and sequences
To find the words duplicated in both documents:
First, create a copy of one of the lists, for example listTwo, and call it listDuplicates, as that will be its purpose. Integer objects are immutable, so copying the references is enough and no deep clone is needed:
List<Integer> listDuplicates = new ArrayList<>(listTwo);
Call retainAll:
listDuplicates.retainAll(listOne);
The result would be:
listDuplicates- [8,11,13,14]
So, out of listOne.size()+listTwo.size() = 15 words read from the 2 documents, there are 11 distinct words: 4 appear in both documents and 7 appear in only one of them.
In order to get the converted values, just call:
for (Integer i : listDuplicates)
System.out.println(reverseMap.get(i)); // mate , here, are, you
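For larger documents, one possible refinement of the retainAll step is to pass a HashSet as the argument, so that each membership check is a hash lookup instead of a scan of listOne. The sketch below assumes the listOne, listTwo and reverseMap structures described above; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

final class SharedWords {
    /** IDs of listTwo that also occur in listOne, in listTwo's order. */
    static List<Integer> sharedIds(List<Integer> listOne, List<Integer> listTwo) {
        List<Integer> duplicates = new ArrayList<>(listTwo);
        // retainAll calls contains() on its argument; a HashSet answers that in O(1) on average.
        duplicates.retainAll(new HashSet<>(listOne));
        return duplicates;
    }

    /** Translates IDs back to words through the reverse map. */
    static List<String> toWords(List<Integer> ids, Map<Integer, String> reverseMap) {
        List<String> words = new ArrayList<>();
        for (Integer id : ids) {
            words.add(reverseMap.get(id));
        }
        return words;
    }
}
```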
Now that the duplicates are identified, listOne and listTwo can also be used to:
Identify the position of each duplicate in each list, so that we can get the difference between the positions of consecutive duplicates. The first duplicate is given the value -1 by convention, as it is the first one and has no previous duplicate to diff against; this does not necessarily mean it is adjacent to any other (it is just the first duplicate).
If the next element also has a diff of -1 in both documents, it directly follows the previous duplicate; here that would mean [8] and [11] are consecutive as well:
       doc1  doc2  difDoc1    difDoc2
[8]    2     1     -1 (0-1)   -1 (0-1)
[11]   7     4     -5 (2-7)   -3 (1-4)
[13]   4     6      3 (7-4)   -2 (4-6)
[14]   5     7     -1 (4-5)   -1 (6-7)
In this case, the distance between [14] and its previous duplicate (the diff between [13] and [14]) is the same in both documents: -1. That means they are not only duplicates, but also placed consecutively in both documents.
Hence, we've found not only duplicate words, but also a duplicate sequence of two words in those lines:
[13][14]--are you
The same mechanism (identifying a diff of -1 for the same word in both documents) also helps to find a complete duplicate sequence of 2 or more words. If all the duplicates show a diff of -1 in both documents, we've found a complete duplicate line.
This example shows it more clearly:
doc1- "here i am" [4,5,6]
doc2- "here i am" [4,5,6]
listDuplicates - [4,5,6]
       doc1  doc2  difDoc1    difDoc2
[4]    1     1     -1 (0-1)   -1 (0-1)
[5]    2     2     -1 (1-2)   -1 (1-2)
[6]    3     3     -1 (2-3)   -1 (2-3)
All the diffs are -1 for the same variable in both documents -> all duplicates are next to each other in both documents --> The sentence is exactly the same in both documents. So, this time, we've found a complete duplicate line of 3 words.
[4][5][6] -- here i am
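If only the longest shared run of consecutive words is needed, a different and fairly standard technique can replace the diff table: the dynamic-programming solution to the longest common substring problem, applied to the two ID lists. The sketch below is that swapped-in technique, not the diff-table method described above, and its names are made up.

```java
import java.util.ArrayList;
import java.util.List;

final class CommonRun {
    /** Longest run of IDs that appears contiguously in both lists (first one found on ties). */
    static List<Integer> longestCommonRun(List<Integer> a, List<Integer> b) {
        // len[i][j] = length of the common run ending at a[i-1] and b[j-1]
        int[][] len = new int[a.size() + 1][b.size() + 1];
        int best = 0;
        int endInA = 0;
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    if (len[i][j] > best) {
                        best = len[i][j];
                        endInA = i;
                    }
                }
            }
        }
        return new ArrayList<>(a.subList(endInA - best, endInA));
    }
}
```

This costs time and memory proportional to the product of the two list lengths, which is acceptable for sentence-sized inputs but would need a more economical variant for whole documents.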
Apart from this duplicate sequence search, the difference table is also helpful for calculating statistics such as the variance or median of the duplicate words' diffs, in order to get some kind of "similarity" factor (a basic indicator of how equal the documents are; by no means definitive, but somewhat helpful).
Unique values - helpful as an indicator of non-equality?
Similar mechanisms would be used to get those unique values. For example, by removing the duplicates from the reverseMap:
for (Integer i: listDuplicates)
reverseMap.remove(i);
Now the reverseMap only contains unique values. reverseMap.size() = 7
KEY VALUE
1 welcome
3 what
6 doing
9 I
10 was
12 before
15 dumb
In order to get the unique words:
reverseMap.values() = {welcome,what,doing,I,was,before,dumb}
If you need to know which unique words are from which document, you could use the reverseMap (as the Lists may be altered after you execute methods such as retainAll on them):
Count the number of words from the 1st document. This time, 7.
If the key of the reverseMap is <=7, that unique word comes from the 1st document. {welcome,what,doing}
If the key is >7, that unique word comes from the 2nd document. {I,was,before,dumb}
The uniqueness factor could be another indicator, this time a negative one (as we are searching for similarities here). It could still be really helpful.
equals and hashCode - avoid
String.hashCode() is computed from the characters, so two equal words always return the same hash code. It cannot replace the ID map, though, because different words may collide, so an equals() check is still needed after a hash match. The String.equals() method works by comparing the chars (after checking the length, as you do manually), which adds up when it is repeated many times over big documents:
public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String) anObject;
        int n = value.length;
        if (n == anotherString.value.length) {
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = 0;
            while (n-- != 0) {
                if (v1[i] != v2[i])
                    return false;
                i++;
            }
            return true;
        }
    }
    return false;
}
My opinion is to avoid repeated String comparisons as much as possible. In particular, the == operator should never be used to compare words. Hash codes are not the problem:
String one = "hello";
String two = new String("hello");
one.hashCode() == two.hashCode() --> true
one.equals(two) --> true
one == two --> false
The == operator compares references, not contents. It only returns true when both variables point to the same String object, which happens for interned strings such as compile-time literals:
String three = "hello";
one == three --> true
But words read from a file at run time are not interned unless you call intern() explicitly, so == will normally return false for them even if the Strings hold the same value (as usual, because it works by comparing their memory addresses).
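The behavior of hashCode(), equals() and == for two Strings with the same contents is easy to check directly. The small demo below relies only on the documented contract that String.hashCode() is computed from the characters; the class name is arbitrary.

```java
final class StringIdentityDemo {
    public static void main(String[] args) {
        String literal = "hello";
        String built = new String("hello");   // a distinct object with the same contents

        System.out.println(literal.hashCode() == built.hashCode()); // true: content-based hash
        System.out.println(literal.equals(built));                  // true: same characters
        System.out.println(literal == built);                       // false: different objects
        System.out.println(literal == built.intern());              // true: intern() returns the pooled instance

        // Different words can still share a hash code, so a hash alone does not identify a word.
        System.out.println("Aa".hashCode() == "BB".hashCode());     // true: both hash to 2112
    }
}
```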
The essential technique is to see this as a multi-stage process. The key is that you're not trying to compare every document with every other document, but rather, you have a first pass that identifies small clusters of likely matches in essentially a one-pass process:
(1) Index or cluster the documents in a way that will allow candidate matches to be identified;
(2) Identify candidate documents that may be a match based on those indexes/clusters;
(3) For each cluster or index match, have a scoring algorithm that scores the similarity of a given pair of documents.
There are a number of ways to solve (1) and (3), depending on the nature and number of the documents. Options to consider:
For certain datasets, (1) could be as simple as indexing on unusual words/combinations of words.
For more complex documents and/or larger datasets, you will need to do something sometimes called 'dimension reduction': rather than clustering on shared combinations of words, you'll need to cluster on combinations of shared features, where each feature is identified by a set of words. Look at a feature extraction technique sometimes referred to as "latent semantic indexing" -- essentially, you model the documents mathematically as a matrix of "words per feature" multiplied by "feature per document", and then by factorising the matrix you arrive at an approximation of a set of features, along with a weighted list of which words make up each feature
Then, once you have a means of identifying a set of words/features to index on, you need some kind of indexing function that will mean that candidate document matches have identical/similar index keys. Look at cosine similarity and "locality-sensitive hashing" such as SimHash.
Then for (3), given a small set of candidate documents (or documents that cluster together in your hashing system), you need a similarity metric. Again, which method is appropriate depends on your data, but conceptually, one way you could see this is as "for each sentence in document X, find the most similar sentence in document Y and score its similarity; obtain a 'plagiarism score' that is the sum of these values". There are various ways to define a 'similarity score' between two strings: e.g. longest common subsequence, edit distance, number of common word pairs/sequences...
As you can probably imagine from all of this, there's no single algorithm that will hand you exactly what you need on a plate. (That's why entire companies and research departments are dedicated to this problem...) But hopefully the above will give you some pointers.
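For step (3), one very simple scoring function is the word-overlap percentage defined in the question: shared words times 100, divided by the number of words in the main sentence. A minimal sketch with hash sets follows; the tokenization rule (lowercase, split on anything that is not a letter or apostrophe) and all names are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class SentenceSimilarity {
    /** Lowercased distinct words of a sentence; punctuation other than apostrophes is dropped. */
    static Set<String> words(String sentence) {
        Set<String> out = new HashSet<>();
        for (String w : sentence.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    /** Shared words * 100 / number of distinct words in the main sentence. */
    static double similarity(String mainSentence, String subSentence) {
        Set<String> mainWords = words(mainSentence);
        if (mainWords.isEmpty()) {
            return 0.0;
        }
        Set<String> common = words(subSentence);
        common.retainAll(mainWords);   // set intersection, one hash lookup per word
        return common.size() * 100.0 / mainWords.size();
    }
}
```

With sets, comparing two sentences costs time proportional to their word counts, so the nested word-by-word loops are no longer needed; the sentence-by-sentence comparison across documents remains.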
I have 3 dynamic sets of elements on one side. e.g.
a = [101,102,104] //possible values 101 to 115
b = [201,202] //possible values 201 to 210
c = [301,302,303,304] //possible values 301 to 305
I generate all combinations of these 3 sets e.g.
setA = ["101|201|301,303", "101,104|202|304", "101,104|202|301,304", ...]
a,b,c are out of picture at this point. Now I want to match all elements of setA against another set setB which has only one element from each category. e.g.
setB = ["101|202|304","104|202|301" ,"102|202|303", ...]
There's an n-to-m mapping between setA and setB, i.e. one combination from setA can have multiple matches in setB and vice versa.
Matching criteria: for any element of setB (e.g. "101|202|304"), if all of its parts (101, 202, 304) are contained in some combination of setA (e.g. "101,104|202|304", "101,104|202|301,304"), then consider it a match. So in this example "101|202|304" is said to match both "101,104|202|304" and "101,104|202|301,304".
Currently I have an O(n^2) time and O(n) space algorithm, but I am really looking for improvements, as this calculation repeats for many such sets. (It's actually a reducer task of a Hadoop map-reduce job, where I generate all combinations of dimensions and aggregate the measures that qualify for a given combination.) Any framework-level optimization is welcome too, e.g. breaking the job down into multiple jobs.
Go through B and pick out all the first elements you have, turning them back into a set. For each element of that set, make a map from that element to everything in B that starts with that element. Now you have a map: firstElement -> subsetOfBStartingWithThat.
Now do the same for the subset and second elements, etc. until you have a series of maps
firstElement -> secondElement -> thirdElement -> ... -> entry in B.
Now you run through each entry in A, and use the maps to tell whether anything is there. If yes, add it to a set. If no, leave it empty. Use this to build a map from elements of A to sets of elements of B.
Then reverse the process by making a map from B to sets of A by iterating through your A -> B map and adding the pair in the opposite orientation.
You need O(m) space to create the B lookup map, and you'll spend O(m+n) time doing the scanning, since each map lookup takes constant time on average. Building the final lookup sets will take space (and time) proportional to m * n/2^k, where k is the number of separate sets (3 in your case). There's no way to avoid that: this is actually how many links there are. (To see why, note that each element of each source set can be viewed as a bit that is either on or off, and you require that the bit be on. That happens only 1/2^k of the time, which is 1/8 in your case.)
So you're pretty much stuck at an n^2 step. It's inherent in the problem unless you don't need to be comprehensive. If not, you can use the scheme I outlined above to find a match much less expensively.
A slightly different solution from Rex Kerr's. First, I created a mapB from the setB elements. Each entire element is a key, and the associated values are the different measures in my use case.
Then I iterated over each element of setA. I created 3 sublists from each element and iterated over the sublists so as to form a key (3 nested for loops). I looked up that key in mapB. If it was found, I counted the mapB values towards this combination. So at the end of all iterations I have the aggregated values from setB that qualify for a given combination of setA. That's it. Let me know if anyone wants me to go into more detail on this.
PS - my jobs are running 4 times faster with these changes (2 hours, down from 7 hours).
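The lookup described above might look roughly like the following sketch. It assumes every setA element has exactly three '|'-separated parts with comma-separated values, represents setB as a plain Set of keys (the real job keeps measures as map values), and uses invented names throughout.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ComboMatcher {
    /** Maps each setA combination to the setB keys it contains. */
    static Map<String, List<String>> match(List<String> setA, Set<String> setB) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String combo : setA) {
            String[] parts = combo.split("\\|");   // assumed: exactly three categories
            List<String> hits = new ArrayList<>();
            for (String x : parts[0].split(",")) {
                for (String y : parts[1].split(",")) {
                    for (String z : parts[2].split(",")) {
                        String key = x + "|" + y + "|" + z;
                        if (setB.contains(key)) {   // hash lookup instead of scanning setB
                            hits.add(key);
                        }
                    }
                }
            }
            result.put(combo, hits);
        }
        return result;
    }
}
```

The work per setA element is the product of its three sublist sizes, independent of the size of setB, which is presumably where the speed-up comes from when setB is large.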
The question is not clear, but if I understood correctly, I would solve it this way:
setA does not exist for me.
a, b, and c are out of the picture too.
I first pick a common element in "setB" (202 in your example), and for the rest of elements (101, 102, 104, 201, 301, 302, etc.) I would iterate 4 states for each of them:
0 = it's not in setB
1 = it's in first "entry" of setB
2 = it's in second "entry" of setB
3 = it's in both entries of setB
I am assuming setB has 2 "entries" always.
Is there any standard interface or approach usable in collections/streams (max, sort) for the situation where one might need to compare on multiple sides/objects at once?
The signature could be something like
compare(T... toCompare)
instead of
compare(T object1, T object2)
What I would like is an implementation that works with the comparison operations in the Java APIs. But from what I saw, I think I am forced to stick to pairwise comparisons.
UPDATE: Practical example: I'd like to have a Comparator implementation, usable by Collections.max()/Stream.max(), that lets me make multi-sided comparisons instead of pairwise ones (i.e. one that accepts multiple T in the compare method). The max function would return the element that wins a custom comparison of it against ALL the others, not the winner of n one-on-one battles.
UPDATE2: More specific example:
I have (Pineapple, Pizza, Yogurt), and max returns the item for which my custom 1 -> n comparison yields the biggest quotient. This quotient could be something like degreeOfYumie. Say Pineapple is yummier than Pizza+Yogurt, Pizza is as yummy as Pineapple+Yogurt, and Yogurt is as yummy as Pizza+Pineapple. So the winner is Pineapple. If I did that pairwise, all the ingredients would be equally yummy. Is there any mechanism for implementing a comparator/comparable like that? Perhaps a "sortable" interface that works on collections, streams and queues?
There is no need for a specialized interface. If you have a Comparator that conforms to the specification, it will be transitive and allow comparing multiple objects. To get the maximum out of three or more elements, simply use, e.g.
Stream.of(42, 8, 17).max(Comparator.naturalOrder())
      .ifPresent(System.out::println);
// or
Stream.of("foo", "BAR", "Baz").max(String::compareToIgnoreCase)
      .ifPresent(System.out::println);
If you are interested in the index of the max element, you can do it like this:
List<String> list = Arrays.asList("foo", "BAR", "z", "Baz");
int index = IntStream.range(0, list.size()).boxed()
    .max(Comparator.comparing(list::get, String.CASE_INSENSITIVE_ORDER))
    .orElseThrow(() -> new IllegalStateException("empty list"));
Regarding your updated question…
You said you want to establish an ordering based on the quotient of an element’s property and the remaining elements. Let’s think this through
Suppose we have the positive numerical values a, b and c and want to establish an ordering based on a/(b+c), b/(a+c) and c/(a+b).
Then we can transform the terms by expanding the quotients to a common denominator D = (a+b)(b+c)(a+c):
a(a+c)(a+b)/D    b(b+c)(b+a)/D    c(c+b)(c+a)/D
Since common denominators have no effect on the ordering we can elide them and after expanding the products we get the terms:
a³+a²b+a²c+abc b³+b²a+b²c+abc c³+c²a+c²b+abc
Here we can elide the common summand abc as it has no effect on the ordering.
a³+a²b+a²c b³+b²a+b²c c³+c²a+c²b
then factor out again
a²(a+b+c) b²(a+b+c) c²(a+b+c)
to see that we have a common factor which we can elide as it doesn’t affect the ordering so we finally get
a² b² c²
What does this result tell us? Simply that the quotients are ordered like a², b² and c², and, since the values are positive, like a, b and c themselves. So there is no need to implement a quotient based comparator when we can prove it to have the same outcome as a simple comparator based on the original values a, b and c.
(The picture would be different if negative values were allowed, but since allowing negative values would create the possibility of getting zero as a denominator, they are outside this use case anyway.)
It should be emphasized that any other result for a particular comparator would prove that that comparator is unusable for standard Comparator use cases. If the combined values of all other elements had an effect on the resulting order, in other words, adding another element to the relation would change the ordering, how should an operation like adding an element to a TreeSet or inserting it at the right position of a sorted list work?
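The equivalence can also be checked numerically. The sketch below computes the "one against the rest" maximum with an ordinary two-argument comparator on the key v / (total - v), which is increasing in v for positive values below the total, and compares it with the plain maximum. It assumes positive values and at least two elements; the names are hypothetical.

```java
import java.util.Comparator;
import java.util.List;

final class OneVsRest {
    /** Element with the largest quotient value / (sum of all other values). */
    static double maxByQuotient(List<Double> values) {
        double total = values.stream().mapToDouble(Double::doubleValue).sum();
        return values.stream()
                .max(Comparator.comparingDouble(v -> v / (total - v)))
                .orElseThrow(() -> new IllegalArgumentException("empty list"));
    }

    /** Element with the largest value under the natural ordering. */
    static double maxByValue(List<Double> values) {
        return values.stream()
                .max(Comparator.naturalOrder())
                .orElseThrow(() -> new IllegalArgumentException("empty list"));
    }
}
```

Because the total is computed once up front, the quotient key depends on a single element, so it still fits the standard Comparator contract.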
The problem with comparing multiple objects at once is what to return.
A Java comparator returns a negative integer if the first object is "smaller" than the second one, 0 if they are equal, and a positive integer if the first one is the "bigger" one.
If you compare more than two objects, an integer wouldn't suffice to describe the difference between said objects.
If you have a normal Comparable<T> you can combine it any way you want. From being able to compare two things you can build anything (see different sorting algorithms, which usually only need a < implementation).
For example, here's a naive one for "you could say if it's bigger, equal or smaller than ANY of the objects":
@SafeVarargs
static <T extends Comparable<T>> int compare(T... toCompare) {
    if (toCompare.length < 2) {
        throw new IllegalArgumentException("Nothing to compare"); // or return something
    }
    T first = toCompare[0];
    // Local variables must be initialized before they are incremented.
    int smallerCount = 0;
    int equalCount = 0;
    int biggerCount = 0;
    for (int i = 1, n = toCompare.length; i < n; ++i) {
        int compare = first.compareTo(toCompare[i]);
        if (compare == 0) {
            equalCount++;
        } else if (compare < 0) {
            smallerCount++;
        } else {
            biggerCount++;
        }
    }
    // someCombinationOf is a placeholder: how to merge the three counts is the open question.
    return someCombinationOf(smallerCount, equalCount, biggerCount);
}
However, I couldn't figure out a proper way of combining them. Take the sequence (3, 5, 3, 1): 3 is smaller than 5, equal to 3 and bigger than 1, so all counts are 1, and all of your "it's bigger, equal or smaller than ANY" conditions are true at the same time. You could, however, return the counts as an object if it helps to defer the combination of counts to a later point in time.
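Returning the counts as an object could look like the sketch below. The ComparisonCounts class and its of method are invented for illustration; how the three counts should finally be interpreted is left to the caller.

```java
final class ComparisonCounts {
    final int smaller;  // how many of the others the first element is smaller than
    final int equal;    // how many of the others it is equal to
    final int bigger;   // how many of the others it is bigger than

    ComparisonCounts(int smaller, int equal, int bigger) {
        this.smaller = smaller;
        this.equal = equal;
        this.bigger = bigger;
    }

    /** Compares first against each of the others and tallies the outcomes. */
    @SafeVarargs
    static <T extends Comparable<T>> ComparisonCounts of(T first, T... others) {
        int smaller = 0;
        int equal = 0;
        int bigger = 0;
        for (T other : others) {
            int c = first.compareTo(other);
            if (c == 0) {
                equal++;
            } else if (c < 0) {
                smaller++;
            } else {
                bigger++;
            }
        }
        return new ComparisonCounts(smaller, equal, bigger);
    }
}
```

For the sequence (3, 5, 3, 1), ComparisonCounts.of(3, 5, 3, 1) yields one count in each category, matching the example above.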
I want to implement a genetic algorithm (I'm not sure about the language/framework yet, maybe Watchmaker) to optimize the mixing ratio of some fluids.
Each mix consists of up to 5 ingredients a, b, c, d, e, which I would model as genes with changing values. As the chromosome represents a mixing ratio, there are (at least) two additional conditions:
(1) a + b + c + d + e = 1
(2) a, b, c, d, e >= 0
I'm still in the stage of planning my project, therefore I can give no sample code, however I want to know if and how these conditions can be implemented in a genetic algorithm with a framework like Watchmaker.
[edit]
As this doesn't seem to be straightforward, some clarification:
The problem is condition (1): if each gene a, b, c, d, e is chosen randomly and independently, the probability of it being satisfied is approximately 0. I would therefore need to implement the mutation in a way where a, b, c, d, e are chosen depending on each other (see Random numbers that add to 100: Matlab as an example).
However, I don't know if this is possible and if it would be in accordance with evolutionary algorithms in general.
The first condition (a+b+c+d+e=1) can be satisfied by having shorter chromosomes, with only a, b, c, d. The e value can then be derived (in the fitness function or for later use) as e:=1-a-b-c-d. Note that e>=0 then additionally requires a+b+c+d<=1.
EDIT:
Another way to satisfy the first condition would be to normalize the values:
sum:= a+b+c+d+e
a:= a/sum;
b:= b/sum;
c:= c/sum;
d:= d/sum;
e:= e/sum;
The new sum will then be 1 (provided the original sum is greater than 0).
For the second condition (a,b,c,d,e>=0), you can add an approval phase for the new offspring chromosomes (generated by mutation and/or crossover) before throwing them into the gene pool (and allowing them to breed), and reject those that don't satisfy the condition.
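Both ideas, the normalization and the approval check, can be sketched independently of any framework. The code below is plain Java and does not use Watchmaker's API; the class name, method names and the tolerance parameter eps are assumptions of this example.

```java
final class MixChromosome {
    /** Approval check: every gene is non-negative and the genes sum to 1 within eps. */
    static boolean isFeasible(double[] genes, double eps) {
        double sum = 0;
        for (double g : genes) {
            if (g < 0) {
                return false;
            }
            sum += g;
        }
        return Math.abs(sum - 1.0) <= eps;
    }

    /** Rescales non-negative genes so that they sum to 1. */
    static double[] normalize(double[] genes) {
        double sum = 0;
        for (double g : genes) {
            if (g < 0) {
                throw new IllegalArgumentException("negative gene: " + g);
            }
            sum += g;
        }
        if (sum == 0) {
            throw new IllegalArgumentException("all genes are zero, cannot normalize");
        }
        double[] out = new double[genes.length];
        for (int i = 0; i < genes.length; i++) {
            out[i] = genes[i] / sum;
        }
        return out;
    }
}
```

A mutation or crossover operator could call normalize on each offspring and keep it only if isFeasible returns true, so that every chromosome in the gene pool is a valid mixing ratio.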
I have 2 questions:
I would like to generate the permutations of subsets, e.g. there are 20 possible amino acids and 5 positions where they can occur. What are all the permutations that can occur (as text)?
Once I have this list of permutations, certain values will be assigned to each one, and I would like to look up any given permutation at run time. The first idea that comes to mind is a look-up table, but I was wondering if there might be a better way of doing this.
You want combinations of length 5, not permutations. This is a standard problem, which can be solved with recursion. Use CombinationGenerator if you don't want to write it yourself.
Number the combinations using base 20 (not to be confused with the chemical definition of base). Use a hashtable if you'll be storing data for a limited subset of combinations, or a look-up array if you'll be storing most of them.
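The base-20 numbering can be sketched as follows, assuming that any amino acid may occur at any of the 5 positions (so there are 20^5 = 3,200,000 sequences) and using the one-letter codes of the 20 standard amino acids as digits. The class and method names are made up.

```java
final class Base20Index {
    // One-letter codes of the 20 standard amino acids; the position in this string is the digit value.
    static final String ALPHABET = "ACDEFGHIKLMNPQRSTVWY";

    /** Reads the sequence as a base-20 number, most significant residue first. */
    static int encode(String seq) {
        int index = 0;
        for (int i = 0; i < seq.length(); i++) {
            int digit = ALPHABET.indexOf(seq.charAt(i));
            if (digit < 0) {
                throw new IllegalArgumentException("unknown residue: " + seq.charAt(i));
            }
            index = index * 20 + digit;
        }
        return index;
    }

    /** Inverse of encode for sequences of the given length. */
    static String decode(int index, int length) {
        char[] out = new char[length];
        for (int i = length - 1; i >= 0; i--) {
            out[i] = ALPHABET.charAt(index % 20);
            index /= 20;
        }
        return new String(out);
    }
}
```

With this numbering, a double[3_200_000] array indexed by encode(seq) can serve as the look-up array, while a HashMap keyed by the same index suits the case where only a few sequences carry data.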