I'm trying to solve a special case of the general constraint satisfaction problem in Java.
Basically I have multiple variables, each taking discrete values, and every variable is defined by the set of all possible values it can take (think of it like an enum in Java).
I also have multiple groups of conditions (think of a condition as a system of equations on the variables where every equation is a unary constraint, i.e. of the form variable = value). The goal is to find whether there is a set of variable values that satisfies at least one condition from each group (it might satisfy multiple conditions from the same group). I will call such a set a solution. What I'm looking for is all possible solutions.
The only idea I have so far is basically brute force.
This is a concrete example so things are clearer:
s = {a,b,c}, v = {1,2,3}, n = {p,k,m}.
First condition group:
c1 = {s=a and v=2}, c2 = {s=b}.
Second condition group:
c1={n=p and v=2}.
Third condition group:
c1={s=a and n=p}, c2 = {s=c}.
In this situation, if we take (s=a,v=2,n=p): it satisfies the first condition of all three groups, and is, therefore, a solution to the problem.
(s=b,v=2,n=p) however is not a solution, because it doesn't verify any of the third group's conditions. In fact, the number of possible solutions here is 1.
Please note that the conditions within a group are not necessarily mutually exclusive.
Any insight into a way to do this more efficiently than brute force, be it a data structure or an algorithm, would be great, since I will have to solve millions of such systems, each with quite a number of variables (thirty variables tops, around 15 values each, and a hundred such conditions tops).
Edit 1: Data constraints
If N is the number of variables each problem will have, then N <= 30.
If |Vi| is the number of values a variable Vi can take, then I know that max(|Vi|) <= 15 for every variable Vi in a problem.
I also know that if C is the number of constraints per problem, then C < 100.
Lastly, I know that, statistically speaking, the number of solutions per problem will be small: most problems will have a single solution, and having more than 8 solutions happens in less than 1% of cases. For the sake of optimization, we can even assume that I'm never interested in any problem that has more than 10 solutions.
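For reference, this is roughly how I represent the problem and check a candidate assignment (a simplified sketch, not my actual code; my brute force just enumerates every full assignment and runs this check):

import java.util.List;
import java.util.Map;

// A condition is a partial assignment (variable -> required value), a group is a
// list of conditions, and a candidate solution is a full assignment of every variable.
class ConditionCheck {

    // A candidate is a solution if it satisfies at least one condition in every group.
    static boolean isSolution(Map<String, String> assignment,
                              List<List<Map<String, String>>> groups) {
        for (List<Map<String, String>> group : groups) {
            boolean groupSatisfied = false;
            for (Map<String, String> condition : group) {
                if (satisfies(assignment, condition)) {
                    groupSatisfied = true;
                    break;
                }
            }
            if (!groupSatisfied) {
                return false;
            }
        }
        return true;
    }

    // A condition holds if every variable it mentions has exactly the required value.
    static boolean satisfies(Map<String, String> assignment, Map<String, String> condition) {
        for (Map.Entry<String, String> e : condition.entrySet()) {
            if (!e.getValue().equals(assignment.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }
}

The brute force then walks the full Cartesian product of the domains (up to around 15^30 assignments) and calls isSolution on each one, which is exactly what I'd like to avoid.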
Related
I need to enumerate all bases corresponding to all extreme points of an LP with the CPLEX API in Java. Unfortunately I did not find any way to do this with CPLEX. Is there a solution?
If not, I will do this myself, but I will need to play with bases. Is there any simple way with CPLEX to enumerate all bases and check whether a basis is a feasible solution?
The short answer: no.
There is no easy way to do this. One possible, but somewhat cumbersome, approach is to encode the basis using binary variables, e.g.:
xb[i] = 1 if x[i] is basic
xb[i] = 0 if x[i] is non-basic
We need to add constraints on non-basic variables: they will be at bound. I.e. for a non-negative variable x[i] we have
xb[i]=0 => x[i]=0
(this is an indicator constraint). Furthermore we know that
sum(i,xb[i]) = m
(the number of basic variables is equal to the number of rows in the model).
Then use Cplex's solution pool to enumerate all possible feasible bases. An illustration for this approach is shown in this link. (This particular example enumerates all optimal bases, but it is not difficult to tell Cplex to enumerate all feasible bases).
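A rough sketch of this encoding in the Java API might look as follows (untested; the variable and method arguments are placeholders, and the solution-pool settings should be checked against your CPLEX version):

import ilog.concert.*;
import ilog.cplex.*;

// Sketch only: x are the structural variables of an existing model cplex,
// and m is the number of rows.
class BasisEnumerationSketch {
    static void enumerateBases(IloCplex cplex, IloNumVar[] x, int m) throws IloException {
        IloIntVar[] xb = cplex.boolVarArray(x.length);   // xb[i] = 1 iff x[i] is basic

        // Non-basic variables are forced to their (lower) bound via indicator constraints.
        for (int i = 0; i < x.length; i++) {
            cplex.add(cplex.ifThen(cplex.eq(xb[i], 0), cplex.eq(x[i], 0)));
        }

        // The number of basic variables equals the number of rows.
        cplex.addEq(cplex.sum(xb), m);

        // Let the solution pool enumerate the feasible combinations of xb, i.e. the bases.
        cplex.setParam(IloCplex.Param.MIP.Pool.Intensity, 4);
        cplex.setParam(IloCplex.Param.MIP.Limits.Populate, 1000000);
        cplex.populate();
        System.out.println("bases found: " + cplex.getSolnPoolNsolns());
    }
}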
I have two JavaRDD<Double> called rdd1 and rdd2 over which I'd like to evaluate some correlation, e.g. with Statistics.corr(). The two RDDs are generated with many transformations and actions, but at the end of the process they both have the same number of elements. I know that two conditions must be met in order to evaluate the correlation, which (as far as I understand) are related to the zip method used in the correlation function. The conditions are:
The RDDs must be split over the same number of partitions
Every partition must have the same number of elements
Moreover, according to the Spark documentation, I'm using methods over the RDDs which preserve ordering, so that the final correlation will be correct (although this wouldn't raise any exception). Now, the problem is that even if I'm able to keep the number of partitions consistent, for example with the code
JavaRDD<Double> rdd1Repartitioned = rdd1.repartition(rdd2.getNumPartitions());
what I don't know how to do (and what is giving me exceptions) is control the number of entries in every partition. I found a workaround that is working for now, which is re-initializing the two RDDs I want to correlate:
List<Double> rdd1Array = rdd1.collect();
List<Double> rdd2Array = rdd2.collect();
JavaRDD<Double> newRdd1 = sc.parallelize(rdd1Array);
JavaRDD<Double> newRdd2 = sc.parallelize(rdd2Array);
but I'm not sure this guarantees me anything about consistency. Second, it might be really expensive computationally in some situations. Is there a way to control the number of elements in each partition or, more generally, to realign the partitions in two or more RDDs? (I know more or less how the partitioning system works, and I understand that this might be complicated from the distribution point of view.)
Ok, this worked for me:
Statistics.corr(rdd1.repartition(8), rdd2.repartition(8))
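If repartitioning alone doesn't line the two RDDs up reliably, another option (a sketch, not something benchmarked against your pipeline) is to key both RDDs by element position with zipWithIndex and join them, so the pairing no longer depends on the partition layout:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.stat.Statistics;
import scala.Tuple2;

// Assumes rdd1 and rdd2 have the same number of elements and a stable order.
JavaPairRDD<Long, Double> indexed1 =
        rdd1.zipWithIndex().mapToPair(t -> new Tuple2<Long, Double>(t._2(), t._1()));
JavaPairRDD<Long, Double> indexed2 =
        rdd2.zipWithIndex().mapToPair(t -> new Tuple2<Long, Double>(t._2(), t._1()));

// Join on the position, then split the aligned pairs back into two RDDs.
JavaRDD<Tuple2<Double, Double>> pairs = indexed1.join(indexed2).values();
JavaRDD<Double> xs = pairs.map(p -> p._1());
JavaRDD<Double> ys = pairs.map(p -> p._2());
double correlation = Statistics.corr(xs, ys);

Since xs and ys are both plain maps over the same pairs RDD, they end up with identical partitioning and per-partition counts, so the zip inside corr lines up.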
I have a List<String[]> of customer records in Java (from a database). I know from manually eyeballing the data that 25%+ are duplicates.
The duplicates are far from exact though. Sometimes they have different zips, but the same name and address. Other times the address is missing completely, etc...
After a day of research, I'm still really stumped as to how to even begin to attack this problem.
What are the "terms" I should be googling for that describe this area (from a solve-this-in-Java perspective)? And I don't suppose there is a fuzzymatch.jar out there that makes it all just too easy?
I've built similar systems before for matching place information and people information. These are complex objects with many features, and figuring out whether two different objects describe the same place or person is tricky. The way to do it is to break it down to the essentials.
Here are a few things you can do:
0) If this is a one-off, load the data into OpenRefine and fix things interactively. At best this solves your problem; at minimum it will show you where your possible matches are.
1) There are several ways you can compare strings. Basically they differ in how prone they are to false positives and false negatives. A false positive is a match that shouldn't have happened; a false negative is a genuine match that was missed. String equals will not produce false positives but will miss a lot of potential matches due to slight variations. Levenshtein with a small factor is slightly better. N-grams produce a lot of matches, but many of them will be false. There are a few more algorithms; take a look at e.g. the OpenRefine code to find various ways of comparing and clustering strings. Lucene implements a lot of this stuff in its analyzer framework but is a bit of a beast to work with if you are not very familiar with its design.
2) Separate the process of comparing things from the process of deciding whether you have a match. What I did in the past was qualify my comparisons using a simple numeric score, e.g. this field matched exactly (100), that field was a partial match (75), and that field did not match at all (0). The resulting vector of qualified comparisons, e.g. (100, 75, 0, 25), can be compared to a reference vector that defines your perfect or partial match criteria. For example, if first name, last name, and street match, the two records are the same regardless of the rest of the fields. Or if phone numbers and last names match, that's a valid match too. You can encode such perfect matches as vectors and then simply compare them with your comparison vectors to determine whether you have a match, no match, or a partial match (a rough sketch follows after these points). This is sort of a manual version of what machine learning does, which is to extract vectors of features and then build a probability model of which vectors mean what from reference data. Doing it manually can work for simple problems.
3) Build up a reference data set with test cases that you know to match or not match, and evaluate your algorithm against that reference set. That way you will know when you are improving things or making things worse when you tweak, e.g., the factor that goes into Levenshtein or whatever.
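Here is a rough sketch of point 2; the field order, scores, and rules are made up for illustration:

// Each field comparison yields a score 0..100, e.g. scores = [firstName, lastName, street, phone].
// A record pair is a match if its score vector dominates any of the reference rule vectors.
class MatchDecider {

    static boolean isMatch(int[] scores) {
        int[][] rules = {
            {100, 100, 100, 0},   // first name, last name and street match exactly
            {0, 100, 0, 100},     // last name and phone number match exactly
        };
        for (int[] rule : rules) {
            if (satisfies(scores, rule)) {
                return true;
            }
        }
        return false;
    }

    // A rule is satisfied when every field scores at least what the rule requires.
    static boolean satisfies(int[] scores, int[] rule) {
        for (int i = 0; i < rule.length; i++) {
            if (scores[i] < rule[i]) {
                return false;
            }
        }
        return true;
    }
}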
Jilles' answer is great and comes from experience. I've also had to work on cleaning up large messy tables and sadly didn't know much about my options at that time (I ended up using Excel and a lot of autofilters). Wish I'd known about OpenRefine.
But if you get to the point where you have to write custom code to do this, I want to make a suggestion as to how: the columns are always the same, right? For instance, the first String is always the key, the second is the first name, the sixth is the ZIP code, the tenth is the fax number, etc.?
Assuming there's not an unreasonable number of fields, I would start with a custom record type which has each DB field as a member rather than a position in an array. Something like
class CustomerRow {
    public final String id;
    public final String firstName;
    // ...

    public CustomerRow(String[] data) {
        id = data[0];
        firstName = data[1];  // second column is the first name, per the layout above
        // ...
    }
}
You could also include some validation code in the constructor, if you know there are garbage values you always want to filter out.
(Note that you're basically doing what an ORM would do automatically, but getting started with one would probably be more work than just writing the Record type.)
Then you'd implement some Comparator<CustomerRow>s which only look at particular fields, or define equality in fuzzy terms (there's where the edit distance algorithms would come in handy), or do special sorts.
Java uses a stable sort for objects, so to sort by e.g. name, then address, then key, you would just perform each sort in turn, applying the comparators in reverse priority order (key first, then address, then name).
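For example (a sketch; the address and lastName fields are assumptions, since only id and firstName appear in the snippet above):

import java.util.Comparator;
import java.util.List;

class RowSorting {
    // Sort by name, then address, then key, relying on List.sort being stable:
    // run the passes in reverse priority order, so the name comparison wins overall.
    static void sortByNameThenAddressThenKey(List<CustomerRow> rows) {
        rows.sort(Comparator.comparing((CustomerRow r) -> r.id));
        rows.sort(Comparator.comparing((CustomerRow r) -> r.address));
        rows.sort(Comparator.comparing((CustomerRow r) -> r.lastName));
    }
}

A single pass with Comparator.comparing(...).thenComparing(...) gives the same ordering without relying on stability.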
Also if you have access to the actual database, and it's a real relational database, I'd recommend doing some of your searches as queries where possible. And if you need to go back and forth between your Java objects and the DB, then using an ORM may end up being a good option.
I need to efficiently find the ratio of (intersection size / union size) for pairs of Lists of strings. The lists are small (mostly about 3 to 10 items), but I have a huge number of them (~300K) and have to do this on every pair, so I need this actual computation to be as efficient as possible. The strings themselves are short unicode strings -- averaging around 5-10 unicode characters.
The accepted answer to "Efficiently compute Intersection of two Sets in Java?" looked extremely helpful, but (likely because my sets are small?) I haven't gotten much improvement by using the approach it suggests.
Here's what I have so far:
protected double uuEdgeWeight(UVertex u1, UVertex u2) {
    Set<String> u1Tokens = new HashSet<String>(u1.getTokenlist());
    List<String> u2Tokens = u2.getTokenlist();
    int intersection = 0;
    int union = u1Tokens.size();
    for (String s : u2Tokens) {
        if (u1Tokens.contains(s)) {
            intersection++;
        } else {
            union++;
        }
    }
    return ((double) intersection / union);
}
My question is: is there anything I can do to improve this, given that I'm working with Strings, which may be more time-consuming to check for equality than other data types?
I think that, because I'm comparing multiple u2's against the same u1, I could get some improvement by doing the cloning of u1 into a HashSet outside of the loop (which isn't shown; meaning I'd pass in the HashSet instead of the object from which I currently pull the list and then clone it into a set).
Anything else I can do to squeak out even a small improvement here?
Thanks in advance!
Update
I've updated the numeric specifics of my problem above. Also, due to the nature of the data, most (90%?) of the intersections are going to be empty. My initial attempt at this used the clone-the-set-and-retainAll approach to find the intersection, and then shortcut out before doing the clone and addAll to find the union. That was about as efficient as the code posted above, presumably because of the trade-off between it being a slower algorithm overall and being able to shortcut out a lot of the time. So, I'm thinking about ways to take advantage of the infrequency of overlapping sets, and would appreciate any suggestions in that regard.
Thanks in advance!
You would get a large improvement by moving the HashSet outside of the loop.
If the HashSet really has only got a few entries in it, then you are probably just as fast using an array, since traversing an array is much simpler/faster. I'm not sure where the threshold would lie, but I'd measure both, and be sure that you do the measurements correctly (i.e. warm-up loops before timed loops, etc.).
One thing to try might be using a sorted array for the things to compare against. Scan until you go past the current element and you can immediately abort the search. That will improve processor branch prediction and reduce the number of comparisons a bit.
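A minimal sketch of that idea, assuming each token list is kept as a sorted, de-duplicated String[] instead of a HashSet:

// Count intersection and union in a single merge-style pass over two sorted arrays.
static double jaccard(String[] a, String[] b) {
    int i = 0, j = 0, intersection = 0;
    while (i < a.length && j < b.length) {
        int cmp = a[i].compareTo(b[j]);
        if (cmp == 0) { intersection++; i++; j++; }
        else if (cmp < 0) { i++; }      // a[i] can no longer match anything at or after b[j]
        else { j++; }
    }
    int union = a.length + b.length - intersection;
    return union == 0 ? 0.0 : (double) intersection / union;
}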
If you want to optimize this function (not sure if it actually works in your context), you could assign each unique String an int value; when the String is added to the UVertex, set that int as a bit in a BitSet.
The function should then come down to a set.and(otherset) for the intersection and a set.or(otherset) for the union. Depending on the number of unique Strings, that could be efficient.
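A minimal sketch of that idea (how the int values get assigned is an assumption; here a shared dictionary hands them out on first sight):

import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Map each distinct token to a bit index once; intersection and union then become
// word-parallel BitSet operations plus cardinality counts.
class BitSetJaccard {
    private final Map<String, Integer> tokenIds = new HashMap<>();

    BitSet toBitSet(List<String> tokens) {
        BitSet bits = new BitSet();
        for (String t : tokens) {
            Integer id = tokenIds.get(t);
            if (id == null) {
                id = tokenIds.size();
                tokenIds.put(t, id);
            }
            bits.set(id);
        }
        return bits;
    }

    static double similarity(BitSet a, BitSet b) {
        BitSet and = (BitSet) a.clone();
        and.and(b);                      // intersection
        BitSet or = (BitSet) a.clone();
        or.or(b);                        // union
        int union = or.cardinality();
        return union == 0 ? 0.0 : (double) and.cardinality() / union;
    }
}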
Is there any algorithm to reduce a SAT problem?
Satisfiability is the problem of determining if the variables of a given Boolean formula can be assigned in such a way as to make the formula evaluate to TRUE. Equally important is to determine whether no such assignments exist, which would imply that the function expressed by the formula is identically FALSE for all possible variable assignments. In this latter case, we would say that the function is unsatisfiable; otherwise it is satisfiable. To emphasize the binary nature of this problem, it is frequently referred to as Boolean or propositional satisfiability. The shorthand "SAT" is also commonly used to denote it, with the implicit understanding that the function and its variables are all binary-valued.
I have used genetic algorithms to solve this, but it would be easier if the problem were reduced first.
Take a look at Reduced Ordered Binary Decision Diagrams (ROBDDs). They provide a way of compressing boolean expressions into a reduced canonical form. There's plenty of software around for performing the BDD reduction; the Wikipedia link above for ROBDDs contains a nice list of external links to other relevant packages at the bottom of the article.
You could probably do a depth-first path-tree search on the formula to identify "paths". I.e., for (ICanEat && (IHaveSandwich || IHaveBanana)), if ICanEat is false, the values in the parentheses don't matter and can be ignored. So right there you can discard some edges and nodes.
And if, while you're generating this depth-first search, the current node resolves to true, you've found your solution.
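A bare-bones sketch of that evaluation (the Node representation is made up for illustration; null stands for "not decided yet"):

import java.util.Map;

class Node {
    enum Kind { VAR, NOT, AND, OR }
    Kind kind;
    int variable;        // used when kind == VAR
    Node left, right;    // NOT uses left only

    // Evaluate under a partial assignment, short-circuiting whole subtrees:
    // an AND is false as soon as one child is false, an OR is true as soon as one child is true.
    Boolean eval(Map<Integer, Boolean> assignment) {
        switch (kind) {
            case VAR:
                return assignment.get(variable);
            case NOT: {
                Boolean v = left.eval(assignment);
                return v == null ? null : !v;
            }
            case AND: {
                Boolean l = left.eval(assignment);
                if (Boolean.FALSE.equals(l)) return false;
                Boolean r = right.eval(assignment);
                if (Boolean.FALSE.equals(r)) return false;
                return (l == null || r == null) ? null : Boolean.TRUE;
            }
            default: { // OR
                Boolean l = left.eval(assignment);
                if (Boolean.TRUE.equals(l)) return true;
                Boolean r = right.eval(assignment);
                if (Boolean.TRUE.equals(r)) return true;
                return (l == null || r == null) ? null : Boolean.FALSE;
            }
        }
    }
}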
What do you mean by "reduced", exactly? I'm going to assume you mean some sort of preprocessing beforehand, to maybe eliminate or simplify some variables or clauses first.
It all depends on how much work you want to do. Certainly you should do unit propagation until it completes. There are other, more expensive things you can do. See the pre-processing section of the march_dl page for some examples.
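For illustration, a bare-bones unit propagation pass might look like this (the clause representation is an assumption: each clause is an array of DIMACS-style signed integer literals, e.g. 3 for x3 and -3 for NOT x3):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Repeatedly pick a unit clause, fix its literal, and simplify the formula,
// until no unit clauses remain or a clause becomes empty (a conflict).
class UnitPropagation {

    // Returns the simplified clause list, or null if propagation found a conflict.
    static List<int[]> propagate(List<int[]> clauses, Map<Integer, Boolean> assignment) {
        while (true) {
            Integer unit = null;
            for (int[] clause : clauses) {
                if (clause.length == 0) return null;               // empty clause: unsatisfiable
                if (clause.length == 1) { unit = clause[0]; break; }
            }
            if (unit == null) return clauses;                      // nothing left to propagate

            assignment.put(Math.abs(unit), unit > 0);
            List<int[]> next = new ArrayList<>();
            for (int[] clause : clauses) {
                boolean satisfied = false;
                List<Integer> remaining = new ArrayList<>();
                for (int lit : clause) {
                    if (lit == unit) { satisfied = true; break; }  // clause is satisfied, drop it
                    if (lit != -unit) remaining.add(lit);          // remove the falsified literal
                }
                if (!satisfied) {
                    int[] reduced = new int[remaining.size()];
                    for (int k = 0; k < reduced.length; k++) reduced[k] = remaining.get(k);
                    next.add(reduced);
                }
            }
            clauses = next;
        }
    }
}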