How to perform an inexact compare on Java beans?

I have a large (more than 100K objects) collection of Java objects like below.
public class User {
    // declared as public in this example for brevity...
    public String first_name;
    public String last_name;
    public String ssn;
    public String email;
    public String blog_url;
    ...
}
Now, I need to search this list for an object where at least 3 (any 3 or more) attributes match those of the object being searched.
For example, if I am searching for an object that has
first_name="John",
last_name="Gault",
ssn="000-00-0000",
email="xyz#abc.com",
blog_url="http://myblog.wordpress.com"
The search should return all objects where first_name, last_name, and ssn match, or those where last_name, ssn, email, and blog_url match. Likewise, there could be other combinations.
I would like to know the best data structure/algorithm to use in this case. For an exact search, I could have used a HashSet or a binary search with a custom comparator, but I am not sure what's the most efficient way to perform this type of search.
P.S.
This is not a homework exercise.
I am not sure if the question title is appropriate. Please feel free to edit.
EDIT
Some of you have pointed out that I could use ssn (for example) for the search, as it is more or less unique. The example above is only illustrative of the real scenario. In reality, I have several objects where some of the fields are null, so I would like to search on the other fields.

I don't think that there are any specific data structures to make this kind of matching / comparison fast.
At the simple level of comparing two objects, you might implement a method like this:
public boolean closeEnough(User other) {
    int count = 0;
    // Objects.equals is null-safe; a plain equals() call would throw a
    // NullPointerException on the null fields mentioned in the EDIT.
    // (Two nulls count as a match here; skip them if that's wrong for your data.)
    count += Objects.equals(first_name, other.first_name) ? 1 : 0;
    count += Objects.equals(last_name, other.last_name) ? 1 : 0;
    count += Objects.equals(ssn, other.ssn) ? 1 : 0;
    count += Objects.equals(email, other.email) ? 1 : 0;
    ...
    return count >= 3;
}
To do a large scale search, the only way I can think of that would improve on a simple linear scan (using the method above) would be
create a series of multimaps for each of the properties,
populate them with the User records
Then each time you want to do a query:
query each multimap to get a set of possible candidates,
iterate over all of the candidates, using closeEnough() to find the real matches.
You could improve on this by treating the SSN, email address and blog URL properties differently to the name properties. Multiple users with matches on the first three properties should be a rare occurrence, compared with (say) finding multiple users called "John". The way that you have posed the question requires at least 1 of SSN, email or URL to match (to get 3 matches), so maybe you could not bother indexing the name properties at all.
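To make this concrete, here is a minimal sketch of the index-then-verify approach, assuming plain HashMaps of lists stand in for the multimaps and that only the three "rare" fields are indexed (UserIndex is a made-up name):
import java.util.*;

class UserIndex {
    private final Map<String, List<User>> bySsn = new HashMap<>();
    private final Map<String, List<User>> byEmail = new HashMap<>();
    private final Map<String, List<User>> byBlogUrl = new HashMap<>();

    void add(User u) {
        if (u.ssn != null) bySsn.computeIfAbsent(u.ssn, k -> new ArrayList<>()).add(u);
        if (u.email != null) byEmail.computeIfAbsent(u.email, k -> new ArrayList<>()).add(u);
        if (u.blog_url != null) byBlogUrl.computeIfAbsent(u.blog_url, k -> new ArrayList<>()).add(u);
    }

    List<User> search(User query) {
        // Any record with 3+ matching fields must match at least one of the
        // indexed fields, so this candidate set cannot miss a true match.
        Set<User> candidates = new LinkedHashSet<>();
        candidates.addAll(bySsn.getOrDefault(query.ssn, List.of()));
        candidates.addAll(byEmail.getOrDefault(query.email, List.of()));
        candidates.addAll(byBlogUrl.getOrDefault(query.blog_url, List.of()));

        // Verify each candidate with the full comparison.
        List<User> matches = new ArrayList<>();
        for (User candidate : candidates) {
            if (candidate.closeEnough(query)) matches.add(candidate);
        }
        return matches;
    }
}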

Basically, search for results where ANY of the attributes matches the corresponding attribute in the query. This should narrow the search space down to quite a small number of entries. From those results, look for entries that match your criteria: go through and count how many attributes match, and if the count is at least 3, you've got a match. (This counting step is relatively slow, and you wouldn't want to run it over your whole database.)
In this case, a potential optimisation would be to remove first_name and last_name from the initial filter phase, since they are much more likely to get you multiple results for a query than the other attributes (e.g. a lot of people called "John").
Since at least three attributes are required to match, removing two from the filter phase won't affect the final outcome: any record with three matching attributes must still match on at least one of the remaining indexed ones.

Just a thought: if you are searching with an SSN, you should be able to narrow it down really quickly with that, since one specific SSN is only supposed to belong to one person.

Related

Is there a way to optimize code that checks if a string matches the value of each column?

I'm using these technologies in my project: Grails, Java, and Hibernate.
Now I'm creating the backend for listing entries from the DB, and there is an option to search. The search is not specific to a column; if I type "MB102", it is possible that this will match an address or an employee code.
Now, the thing is I have 20+ columns to check per entry, and that would mean a ladder of 20+ ifs:
if (employee.birthDate.toLowerCase().contains(searchString.toLowerCase())) {
    searchEmployeeList.add(employee);
    continue;
} else if (employee.civilStatus.toLowerCase().contains(searchString.toLowerCase())) {
    searchEmployeeList.add(employee);
    continue;
} else if (employee.address.toLowerCase().contains(searchString.toLowerCase())) {
    // "address" is assumed here; the original snippet repeated civilStatus
    searchEmployeeList.add(employee);
    continue;
}
Is there a way to shorten this kind of process? I'm not lazy; I just want to know if there is already an existing function to make our lives easier. Thank you.
Disclaimer: this answer is aimed at the purpose of the question rather than the literal technical question. It shows how to significantly speed up the entire search, not the check for a specific employee.
You can build a suffix tree of all the records of all employees - and let the reference to the relevant Employee object be stored in a leaf of each suffix.
When a query is given, follow the query string down from the root. If there is some node that matches the query (i.e., some suffix starts with the query string), then all leaves of the tree reachable from that node are matches.
Run a graph traversal (BFS or DFS) from that node and collect the employees stored in those leaves.
This will give you significantly better performance as the number of employees grows.
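A real suffix tree is compressed and can be built in linear time (e.g. with Ukkonen's algorithm); the naive suffix trie below is far less memory-efficient but sketches the idea. Storing the employee set at every node trades memory for skipping the BFS/DFS collection step; Employee stands in for the question's domain class:
import java.util.*;

class Employee { /* the question's domain class */ }

class SuffixTrie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        final Set<Employee> employees = new HashSet<>();
    }

    private final Node root = new Node();

    // Insert every suffix of one searchable column value for this employee.
    void add(Employee e, String columnValue) {
        String s = columnValue.toLowerCase();
        for (int i = 0; i < s.length(); i++) {
            Node node = root;
            for (int j = i; j < s.length(); j++) {
                node = node.children.computeIfAbsent(s.charAt(j), c -> new Node());
                node.employees.add(e);
            }
        }
    }

    // Walk the query down from the root; the node we land on already knows
    // every employee reachable below it.
    Set<Employee> search(String query) {
        Node node = root;
        for (char c : query.toLowerCase().toCharArray()) {
            node = node.children.get(c);
            if (node == null) return Set.of();
        }
        return node.employees;
    }
}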

Rules on arbitrary combinations of fact attributes

I am a complete Drools noob. I have been tasked with implementing a set of rules which, in the absence of nested rules, seems very complex to me. The problem is as follows:
I have a fact called Person with attributes age, gender, income, height, weight and a few others. A person may be classified as level_1, level_2, ..., level_n based on the values of the attributes. For example,
when age < a and any value for other attributes, then classification = level_1.
when gender == female and any value for other attributes, then classification = level_2.
when age < a and gender == female and any value for other attributes, then classification = level_10.
...
So, in any rule any arbitrary combination of attributes may be used. Can anyone help me in expressing this?
The second part of the problem is that the levels are ordered, and if a person satisfies more than one rule, the highest level is chosen. The only way I can think of to order the levels is to order the rules themselves using salience, so that rules resulting in higher levels have higher salience. Is there a more elegant way of doing this?
I found a similar question here but that seems to deal with only 1 rule and the OP is probably more familiar with Drools than I am because I have not understood the solution. That talks about introducing a separate control fact but I didn't get how that works.
EDIT:
I would eventually have to create a template and supply the data using a csv. It probably does not matter for this problem, but, if it helps in any way...
The problem of assigning a discrete value to facts of a type based on attribute combinations is what I call the "classification problem" (cf. Design Patterns in Production Systems). The simple approach is to write one rule for each discrete value, with constraints separating that attribute-value combination cleanly from all others. Note that statements such as
when attribute value age < a and any value for other attributes then classify as level 1
are misleading and must not be used to derive rules, because, evidently, this isn't a correct requirement since we have
when age < a && gender == female (...) then classify as level 10
and this contradicts the former requirement, correctly written as
when age < a && gender == male then classify as level 1
Likewise, the specification for level 2 must also be completed (and it'll become evident that there is no country for old men). With this approach, a classification based on n attributes with just 2 intervals each results in 2^n rules. If the number of resulting levels is anywhere near this number, this approach is best. For implementation, a decision table is suitable.
If major subsets of the n-dimensional space should fall into the same class, a more economical solution should be used. If, for instance, all women should fall into the same level, a rule selecting all women can be written and given highest precedence; the remaining rules will have to deal with n-1 dimensions. Thus, the simplest scenario would require just n rules, one for each dimension.
It is also possible to describe other intervals in an n-dimensional space, providing the full set of values for all dimensions with each interval. Using the appropriate subset of values for each interval avoids the necessity of ordering the rules (using salience) and ensures that really all cases are handled. Of course, a "fall-through" rule firing with low priority if the level hasn't been set is only prudent.
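For the ordering question specifically, outside of DRL the "highest level wins" requirement can be made explicit by evaluating every condition and taking the maximum, rather than relying on salience. A hypothetical plain-Java sketch (Person, Gender, and the thresholds are all made up):
import java.util.*;
import java.util.function.Predicate;

enum Gender { FEMALE, MALE }
record Person(int age, Gender gender) {}

class Classifier {
    record LevelRule(int level, Predicate<Person> condition) {}

    private final List<LevelRule> rules = List.of(
        new LevelRule(1,  p -> p.age() < 30),
        new LevelRule(2,  p -> p.gender() == Gender.FEMALE),
        new LevelRule(10, p -> p.age() < 30 && p.gender() == Gender.FEMALE)
    );

    int classify(Person p) {
        return rules.stream()
                .filter(r -> r.condition().test(p))
                .mapToInt(LevelRule::level)
                .max()
                .orElse(0); // the "fall-through" default when no rule fires
    }
}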

Searching a TreeMap with concatenated values

Suppose I add an item to a defined TreeMap like:
directory.put(personsLastName + personsFirstName, " Email - " + personsEmail
+ ", Class Status - " + studentStatus);
if I try to do something like:
boolean blnStudentExists = directory.containsValue("freshman");
it will always come out false. I am wondering if this has to do with the way I am populating the map? If so, how can I find all values in the map that are students? My goal is to print just students. Thanks.
Please re-read the TreeMap Javadocs - or the generic Map interface, for that matter - and be very familiar with them for what you're trying to do here.
.containsValue() will search for specific, exact matches in the domain of values that you have inserted into your Map - nothing more, nothing less. You can't use this to search for partial strings. So if you inserted a value of abc@def.com, Class Status - Freshman, .containsValue will only return true for abc@def.com, Class Status - Freshman - not just for Freshman.
Where does this leave you?
You could write your own "search" routine that iterates through each value in the map, and performs substring matching for what you are searching for. Not efficient for large numbers of values. You will also need to worry about the potential for confusing delimiters between fields, if/as you add more.
You could create and use several parallel maps - one that maps to class statuses, another to emails, etc.
You could use a database (or an embedded database - pick your flavor), which looks to be what you're trying to create here anyway. Do you really need to reinvent the wheel?
For that matter, you don't want to be searching by your values anyway. This goes against the exact purpose of a Map - Hash, Tree, or otherwise. Searches by your keys are where any efficiencies will lie. In most implementations (including the out-of-the-box TreeMap and HashMap), searches against values have to scan the entire Map structure anyway (or at least until they can bail out after finding the first match).
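That said, if you do write the scan routine from the first option, it might look something like this (a sketch; valuesContaining is a made-up name, and it is O(n) per search, which is the whole problem):
import java.util.*;

List<String> valuesContaining(Map<String, String> directory, String needle) {
    String lower = needle.toLowerCase();
    List<String> hits = new ArrayList<>();
    for (Map.Entry<String, String> entry : directory.entrySet()) {
        if (entry.getValue().toLowerCase().contains(lower)) {
            hits.add(entry.getKey() + entry.getValue()); // name key plus " Email - ..., Class Status - ..."
        }
    }
    return hits;
}
// e.g. valuesContaining(directory, "freshman") to collect just the students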

Fuzzy Matching Duplicates in Java

I have a List<String[]> of customer records in Java (from a database). I know from manually eyeballing the data that 25%+ are duplicates.
The duplicates are far from exact though. Sometimes they have different zips, but the same name and address. Other times the address is missing completely, etc...
After a day of research, I'm still really stumped as to how to even begin to attack this problem.
What are the "terms" that I should be googling for that describe this area (from a solve-this-in-Java perspective)? And I don't suppose there is a fuzzymatch.jar out there that makes it all just too easy?
I've done similar systems before for matching place information and people information. These are complex objects with many features and figuring out whether two different objects describe the same place or person is tricky. The way to do it is to break it down to the essentials.
Here's a few things that you can do:
0) If this is a one-off, load the data into OpenRefine and fix things interactively. At best this solves your problem; at worst it will show you where your possible matches are.
1) There are several ways you can compare strings. Basically they differ in how prone they are to false positives and false negatives. A false positive is when two strings match even though they shouldn't; a false negative is when they should match but don't. String equals() will not produce false positives but will miss a lot of potential matches due to slight variations. Levenshtein with a small threshold is slightly better. N-grams produce a lot of matches, but many of them will be false. There are a few more algorithms; take a look at e.g. the OpenRefine code to find various ways of comparing and clustering strings. Lucene implements a lot of this stuff in its analyzer framework, but it is a bit of a beast to work with if you are not very familiar with its design.
2) Separate the process of comparing stuff from the process of deciding whether you have a match. What I did in the past was qualify my comparisons, using a simple numeric score: e.g. this field matched exactly (100), but that field was only a partial match (75), and that field did not match at all (0). The resulting vector of qualified comparisons, e.g. (100, 75, 0, 25), can be compared to a reference vector that defines your perfect or partial match criteria. For example, if first name, last name, and street match, the two records are the same regardless of the rest of the fields. Or if phone numbers and last names match, that's a valid match too. You can encode such perfect matches as a vector and then simply compare it with your comparison vectors to determine whether it was a match, not a match, or a partial match (see the sketch after this list). This is sort of a manual version of what machine learning does, which is to extract vectors of features and then build up a probability model of which vectors mean what from reference data. Doing it manually can work for simple problems.
3) Build up a reference data set with test cases that you know to match or not match, and evaluate your algorithm against that reference set. That way you will know when you are improving things or making them worse as you tweak e.g. the threshold that goes into Levenshtein or whatever.
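A minimal sketch of the qualified-comparison idea from point 2, assuming columns 0-3 hold first name, last name, street, and phone; a real scoreField() would use Levenshtein or the like rather than this toy prefix check:
class RecordMatcher {
    // Qualify a single field comparison: exact (100), partial (75), none (0).
    int scoreField(String a, String b) {
        if (a == null || b == null) return 0;
        if (a.equalsIgnoreCase(b)) return 100;
        if (a.toLowerCase().startsWith(b.toLowerCase())
                || b.toLowerCase().startsWith(a.toLowerCase())) return 75;
        return 0;
    }

    boolean isMatch(String[] r1, String[] r2) {
        int first  = scoreField(r1[0], r2[0]);
        int last   = scoreField(r1[1], r2[1]);
        int street = scoreField(r1[2], r2[2]);
        int phone  = scoreField(r1[3], r2[3]);
        // "first name, last name, and street match" => same record
        if (first == 100 && last == 100 && street == 100) return true;
        // "phone numbers and last names match" => a valid match too
        return phone == 100 && last == 100;
    }
}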
Jilles' answer is great and comes from experience. I've also had to work on cleaning up large messy tables and sadly didn't know much about my options at that time (I ended up using Excel and a lot of autofilters). Wish I'd known about OpenRefine.
But if you get to the point where you have to write custom code to do this, I want to make a suggestion as to how: The columns are always the same, right? For instance, the first String is always the key, the second is the First name, the sixth is the ZIP code, tenth is the fax number, etc.?
Assuming there's not an unreasonable number of fields, I would start with a custom record type which has each DB field as a member, rather than a position in an array. Something like
class CustomerRow {
    public final String id;
    public final String firstName;
    // ...
    public CustomerRow(String[] data) {
        id = data[0];
        firstName = data[1];
        // ...
    }
}
You could also include some validation code in the constructor, if you knew there to be garbage values you always want to filter out.
(Note that you're basically doing what an ORM would do automatically, but getting started with one would probably be more work than just writing the Record type.)
Then you'd implement some Comparator<CustomerRow>s which only look at particular fields, or define equality in fuzzy terms (there's where the edit distance algorithms would come in handy), or do special sorts.
Java uses a stable sort for objects, so to sort by e.g. name, then address, then key, you would just apply each sort in turn, choosing your comparators in the reverse order (key first, then address, then name).
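With Java 8 comparators you can also express the same multi-key ordering as one chained sort instead of several passes (loadRows() is just a placeholder for however the records are obtained):
import java.util.*;

List<CustomerRow> rows = loadRows();
rows.sort(Comparator
        .comparing((CustomerRow r) -> r.firstName)
        .thenComparing(r -> r.id));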
Also if you have access to the actual database, and it's a real relational database, I'd recommend doing some of your searches as queries where possible. And if you need to go back and forth between your Java objects and the DB, then using an ORM may end up being a good option.

most efficient Java data structure for searching triples of strings

Suppose I have a large list (around 10,000 entries) of string triples as such:
car noun yes
dog noun no
effect noun yes
effect verb no
Suppose I am presented with a string double - for example, (effect, verb) - and I need to quickly look in the list to see if the pair appears and, if it does, whether its value is yes or no. (For this example the double does appear and the value is "no".)
What is the best data structure in Java to store the list and the most efficient way to perform the search? I am running hundreds of thousands of these searches so speed is of the essence.
Thanks!
You might consider using a HashMap<YourDouble, String>. Searches will be O(1).
You could either create an object, YourDouble, which holds the first two values, or else append one to the other - if the values will still be unique - and use HashMap<String, String>.
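For instance (a sketch; WordPos stands in for YourDouble, and a record supplies the equals/hashCode that the O(1) lookup depends on):
import java.util.*;

record WordPos(String word, String partOfSpeech) {}

Map<WordPos, String> table = new HashMap<>();
table.put(new WordPos("effect", "noun"), "yes");
table.put(new WordPos("effect", "verb"), "no");

String value = table.get(new WordPos("effect", "verb")); // "no"; null if the pair is absent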
I would create a HashMultimap for each type of search you want, e.g. "all three", "each pair", and "each single field". When you build the list, populate all the different maps, then you can fetch from whichever map is appropriate for your query.
(The downside is that you'll need a type for at least each arity, e.g. use just String for the "single field" maps, but a Pair for the two-field maps, and a Triple for the three-field map.)
You could use a HashMap where the key is the concatenation of the first two strings, the ones which you'll use for lookups, and the value is a Boolean, representing the yes and no strings.
Alternatively, it seems the words in the second column would be fewer, since they represent categories. You could have a HashMap<String, HashMap<String, Boolean>> where you first index by e.g. "noun", "verb" etc. and then you index by e.g. "car", "dog", "effect", to get to your boolean. This would probably be more space-efficient.
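A sketch of that nested-map variant, using the sample data from the question:
import java.util.*;

Map<String, Map<String, Boolean>> byPos = new HashMap<>();
byPos.computeIfAbsent("noun", k -> new HashMap<>()).put("car", true);
byPos.computeIfAbsent("noun", k -> new HashMap<>()).put("effect", true);
byPos.computeIfAbsent("verb", k -> new HashMap<>()).put("effect", false);

Boolean answer = byPos.getOrDefault("verb", Map.of()).get("effect"); // false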
10k doesn't seem that large to me. Have you tried a DB?
The place to look for information like this is the Semantic Web. A number of projects work on Triple Stores of just this type. There's a list of implementations at the bottom of the Triple Store page.
As far as Java is concerned, your algorithms are almost certainly going to be language-independent; if you find a good algorithm implemented in C, its Java port will also be fast.
Also, what does your data set look like? Are there a lot of two-field matches, such that subject and verb are often the same? How many matches are you expecting to get? MapReduce will work well for finding one match in 10k, but won't work as well for a query that returns 8k of 10k, where the query can't be easily partitioned.
There's a query language made just for this problem too: SPARQL. The bigdata blog has some good insights, though again 10k doesn't seem that large.