Searching a TreeMap with concatenated values

Suppose I add an item to a defined TreeMap like:
directory.put(personsLastName + personsFirstName, " Email - " + personsEmail
+ ", Class Status - " + studentStatus);
If I try to do something like:
boolean blnStudentExists = directory.containsValue("freshman");
it will always come out false. I am wondering if this has to do with the way I am populating the map? If so, how can I find all values in the map that are students? My goal is to print just students. Thanks.

Please re-read the TreeMap Javadocs - or the generic Map interface, for that matter - and be very familiar with them for what you're trying to do here.
.containsValue() searches for specific, exact matches among the values that you have inserted into your Map - nothing more, nothing less. You can't use it to search for partial strings. So if you inserted a value of abc@def.com, Class Status - Freshman, .containsValue() will only return true for abc@def.com, Class Status - Freshman - not for Freshman alone. (The comparison uses equals(), so it is also case-sensitive.)
Where does this leave you?
You could write your own "search" routine that iterates through each value in the map, and performs substring matching for what you are searching for. Not efficient for large numbers of values. You will also need to worry about the potential for confusing delimiters between fields, if/as you add more.
You could create and use several parallel maps - one that maps to class statuses, another to emails, etc.
You could use a database (or an embedded database - pick your flavor) - which looks to be what you're trying to create here anyway. Do you really need to reinvent the wheel?
For that matter - you don't want to be searching by your values, anyway. This goes against the exact purpose of a Map - Hash, Tree, or otherwise. Searches by your keys are where any efficiencies will lie. In most implementations (including the out-of-box TreeMap and HashMap), searches against values have to scan the entire Map structure (or at least scan until they can bail out after finding the first match).
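The first option above (a hand-written search over the values) could look roughly like the sketch below. The class and method names are made up for illustration, and the sample entries are invented. Note that a plain substring test also matches text in other fields, such as an email address that happens to contain "freshman", which is the delimiter problem mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class DirectorySearch {
    // Returns the keys of all entries whose value contains the given text, ignoring case.
    // This is a linear scan over every entry in the map.
    static List<String> keysWhereValueContains(Map<String, String> directory, String text) {
        String needle = text.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> entry : directory.entrySet()) {
            if (entry.getValue().toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Invented sample data in the same shape as the question's values.
        Map<String, String> directory = new TreeMap<>();
        directory.put("DoeJane", " Email - jane@example.com, Class Status - Freshman");
        directory.put("RoeRichard", " Email - rich@example.com, Class Status - Faculty");

        System.out.println(directory.containsValue("Freshman"));           // false: not an exact value
        System.out.println(keysWhereValueContains(directory, "freshman")); // [DoeJane]
    }
}
```

Storing a small object with separate email and status fields as the map value, instead of one concatenated String, would avoid the substring ambiguity altogether.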

Related

Hash items in a 2d array, but only on one index

So, I have a 2d array (really, a List of Lists) that I need to squish down and remove any duplicates, but only for a specific field.
The basic layout is a list of Matches, with each Match having an ID number and a date. I need to remove all duplicates such that each ID only appears once. If an ID appears multiple times in the List of Matches, then I want to take the Match with the most recent date.
My current solution has me taking the List of Matches, adding it to a HashSet, and then converting that back to an ArrayList. However all that does is remove any exact Match duplicates, which still leaves me with the same ID appearing multiple times if they have different dates.
Set<Match> deDupedMatches = new HashSet<Match>();
deDupedMatches.addAll(originalListOfMatches);
List<Match> finalList = new ArrayList<Match>(deDupedMatches);
If my original data coming in is
{(1, 1-1-1999),(1, 2-2-1999),(1, 1-1-1999),(2, 3-3-2000)}
then what I get back is
{(1, 1-1-1999),(1, 2-2-1999),(2, 3-3-2000)}
But what I am really looking for is a solution that would give me
{(1, 2-2-1999),(2, 3-3-2000)}
I had some vague idea of hashing the original list in the same basic way, but only using the IDs. Basically I would end up with "buckets" based on the ID that I could iterate over, and any bucket that had more than one Match in it I could choose the correct one for. The thing that is hanging me up is the actual hashing. I am just not sure how or if I can get the Matches broken up in the way that I am thinking of.

If I understand your question correctly, you want one Match per distinct ID from the list, namely the one with the latest date.
Because Match is a class, a Set compares whole objects through equals() and hashCode(); it does not look at the ID field on its own, so two Matches with the same ID but different dates count as different elements.
What I would do to get around this problem is use a HashMap, which links distinct keys to values.
Keys cannot be repeated, values can.
I would do something like this while looping through:
if(map.putIfAbsent(match.getID(), match) != null &&
map.get(match.getID()).getDate() < match.getDate()){
map.replace(match.getID(),match);
}
So, while looping through your matches, this does the following:
It puts the current Match in under its ID if that ID doesn't exist yet.
.putIfAbsent returns the old value, which is null if there was none.
You then use that return value to check whether there was already an item in the map at that ID (2 birds with one stone).
After that it is safe to compare the two dates (the one in the map and the one from the iteration - the < is just an example standing in for your comparison method).
If the new one is later, then replace the current Match.
And finally, in order to get your list, you use .values().
This will remove duplicate IDs and leave only the latest ones.
Apologies for typos or code errors, this was done on a phone. Please notify me of any errors in the comments.
Java 7 does not have the .putIfAbsent and .replace functionality on Map, but they can be substituted with .containsKey and .put.
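As a minimal sketch of the same idea, Map.merge (Java 8+) can do the "insert or keep the later one" step in a single call. The Match class below is a hypothetical stand-in for the asker's class, and using LocalDate for the date field is an assumption.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MatchDeduper {
    // Hypothetical stand-in for the question's Match class.
    static final class Match {
        final int id;
        final LocalDate date;
        Match(int id, LocalDate date) { this.id = id; this.date = date; }
        @Override public String toString() { return "(" + id + ", " + date + ")"; }
    }

    // Keeps one Match per ID: the one with the latest date.
    static List<Match> latestPerId(List<Match> matches) {
        // LinkedHashMap preserves the order in which IDs were first seen.
        Map<Integer, Match> latest = new LinkedHashMap<>();
        for (Match m : matches) {
            // If the ID is absent, m is stored; otherwise the later of the two is kept.
            latest.merge(m.id, m, (old, candidate) ->
                    candidate.date.isAfter(old.date) ? candidate : old);
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        List<Match> original = Arrays.asList(
                new Match(1, LocalDate.of(1999, 1, 1)),
                new Match(1, LocalDate.of(1999, 2, 2)),
                new Match(1, LocalDate.of(1999, 1, 1)),
                new Match(2, LocalDate.of(2000, 3, 3)));
        System.out.println(latestPerId(original)); // [(1, 1999-02-02), (2, 2000-03-03)]
    }
}
```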

Java collections, search by "part of a value"

I searched many posts but I didn't find an answer. I'd like to search and access values in a collection by searching by value. My object type is DictionaryWord with two values: String word and int wordUsage (number of times the word was used). I was wondering which collection would be the fastest. If I write down "wa" I'd like it to give me e.g. 5 strings that start with these letters. Any list or set would probably be way too slow as I have 100 000 objects.
I thought about using a HashMap by making its key values String word and its values int wordUsage. I could even write my own hash() function to just give every key same value after hashing - key: "writing", hash value: "writing". Considering there are no duplicates would it be a good idea or should I look for something else?
My point is: how and what do I use to search for values that have some part of the value used in the search condition. For example writing down "tea" i find in collection values like: "tea", "teacher", "tear", "teaching" etc.

The fastest I can think of is a binary search tree. I found this to be very helpful and it should make it clear why a tree is the best option.

You probably need a prefix tree. Take a look at the Trie wiki page for further information.
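For what it's worth, the tree suggestion can be tried without writing a trie: java.util.TreeMap is a balanced binary search tree, and because all words sharing a prefix sit next to each other in sorted order, a range view finds them. This is only a sketch of that alternative, not a trie implementation; the class and method names are invented, and mapping word to usage count is taken from the question's description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class PrefixSearch {
    // Returns up to `limit` words that start with `prefix`, in alphabetical order.
    static List<String> wordsWithPrefix(NavigableMap<String, Integer> usageByWord,
                                        String prefix, int limit) {
        List<String> result = new ArrayList<>();
        // tailMap is a view starting at the first key >= prefix; words with the
        // prefix are contiguous there, so stop at the first key without it.
        for (String word : usageByWord.tailMap(prefix, true).keySet()) {
            if (!word.startsWith(prefix) || result.size() >= limit) {
                break;
            }
            result.add(word);
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented sample words and usage counts.
        NavigableMap<String, Integer> usageByWord = new TreeMap<>();
        for (String w : new String[] {"tea", "teacher", "tear", "teaching", "ten", "water"}) {
            usageByWord.put(w, 1);
        }
        System.out.println(wordsWithPrefix(usageByWord, "tea", 5)); // [tea, teacher, teaching, tear]
    }
}
```

Finding the start of the range costs O(log n), after which only the matching words are visited. A trie becomes attractive if results should be ranked by wordUsage or memory for shared prefixes matters.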

Fuzzy Matching Duplicates in Java

I have a List<String[]> of customer records in Java (from a database). I know from manually eyeballing the data that 25%+ are duplicates.
The duplicates are far from exact though. Sometimes they have different zips, but the same name and address. Other times the address is missing completely, etc...
After a day of research, I'm still really stumped as to how to even begin to attack this problem.
What are the "terms" that I should be googling for that describe this area (from a solve this in Java perspective)? And I don't suppose there is a fuzzymatch.jar out there that makes it all just too easy?

I've done similar systems before for matching place information and people information. These are complex objects with many features and figuring out whether two different objects describe the same place or person is tricky. The way to do it is to break it down to the essentials.
Here's a few things that you can do:
0) If this is a one-off, load the data into OpenRefine and fix things interactively. At best this solves your problem; at worst it will show you where your possible matches are.
1) There are several ways you can compare strings. Basically they differ in how reliable they are at avoiding false matches and missed matches. A false match (false positive) is when two strings match but shouldn't have. A missed match (false negative) is when they should match but don't. String equals will not produce false matches but will miss a lot of potential matches due to slight variations. Levenshtein with a small factor is slightly better. N-grams produce a lot of matches, but many of them will be false. There are a few more algorithms; take a look at e.g. the OpenRefine code to find various ways of comparing and clustering strings. Lucene implements a lot of this stuff in its analyzer framework but is a bit of a beast to work with if you are not very familiar with its design.
2) Separate the process of comparing stuff from the process of deciding whether you have a match. What I did in the past was qualify my comparisons using a simple numeric score, e.g. this field matched exactly (100) but that field was a partial match (75) and that field did not match at all (0). The resulting vector of qualified comparisons, e.g. (100, 75, 0, 25), can be compared to a reference vector that defines your perfect or partial match criteria. For example, if first name, last name, and street match, the two records are the same regardless of the rest of the fields. Or if phone numbers and last names match, that's a valid match too. You can encode such perfect matches as a vector and then simply compare it with your comparison vectors to determine whether it was a match, not a match, or a partial match. This is sort of a manual version of what machine learning does, which is to extract vectors of features and then build up a probability model of which vectors mean what from reference data. Doing it manually can work for simple problems.
3) Build up a reference data set with test cases that you know to match or not match and evaluate your algorithm against that reference set. That way you will know whether you are improving things or making things worse when you tweak e.g. the factor that goes into Levenshtein or whatever.
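To make points 1 and 2 a bit more concrete, here is a rough sketch of a per-field score vector built on Levenshtein distance. Everything in it is an assumption for illustration: the 0 to 100 scale derived from edit distance, the rule format (a minimum score per field), and the sample records.

```java
import java.util.Arrays;
import java.util.Locale;

class FuzzyScore {
    // Classic dynamic-programming Levenshtein edit distance, keeping two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Scores one field from 0 to 100; a missing field on either side scores 0.
    static int fieldScore(String a, String b) {
        if (a == null || b == null || a.trim().isEmpty() || b.trim().isEmpty()) return 0;
        String x = a.trim().toLowerCase(Locale.ROOT);
        String y = b.trim().toLowerCase(Locale.ROOT);
        int longest = Math.max(x.length(), y.length());
        return 100 - (100 * levenshtein(x, y)) / longest;
    }

    // Builds the comparison vector; both records must have the same column layout.
    static int[] compare(String[] r1, String[] r2) {
        int[] scores = new int[r1.length];
        for (int i = 0; i < r1.length; i++) scores[i] = fieldScore(r1[i], r2[i]);
        return scores;
    }

    // A rule holds when every field reaches the rule's minimum score (0 = don't care).
    static boolean meets(int[] scores, int[] minimums) {
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] < minimums[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumed columns: first name, last name, street, zip.
        String[] a = {"John", "Smith", "12 Main St", "90210"};
        String[] b = {"Jon", "smith", "12 Main St", ""};
        int[] scores = compare(a, b);
        System.out.println(Arrays.toString(scores));                     // [75, 100, 100, 0]
        System.out.println(meets(scores, new int[] {70, 100, 100, 0}));  // true
    }
}
```

Keeping the scoring (compare) apart from the decision (meets) means the rule vectors can be tuned against the reference data set from point 3 without touching the string comparison code.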

Jilles' answer is great and comes from experience. I've also had to work on cleaning up large messy tables and sadly didn't know much about my options at that time (I ended up using Excel and a lot of autofilters). Wish I'd known about OpenRefine.
But if you get to the point where you have to write custom code to do this, I want to make a suggestion as to how: The columns are always the same, right? For instance, the first String is always the key, the second is the First name, the sixth is the ZIP code, tenth is the fax number, etc.?
Assuming there's not an unreasonable number of fields, I would start with a custom Record type which has each DB field as member rather than a position in an array. Something like
class CustomerRow {
    public final String id;
    public final String firstName;
    // ...

    public CustomerRow(String[] data) {
        id = data[0];
        firstName = data[1];
        // ...
    }
}
You could also include some validation code in the constructor, if you knew there to be garbage values you always want to filter out.
(Note that you're basically doing what an ORM would do automatically, but getting started with one would probably be more work than just writing the Record type.)
Then you'd implement some Comparator<CustomerRow>s which only look at particular fields, or define equality in fuzzy terms (there's where the edit distance algorithms would come in handy), or do special sorts.
Java uses a stable sort for objects, so to sort by e.g. name, then address, then key, you would just do each sort, but choose your comparators in the reverse order.
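A small sketch of both ways to get a multi-field ordering: repeated stable sorts in reverse order of significance, and a single chained comparator. The Row type here is a cut-down, hypothetical version of the record type above with only three assumed fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class RowSorting {
    // Hypothetical cut-down record type with just the fields used for sorting.
    static final class Row {
        final String id, name, address;
        Row(String id, String name, String address) {
            this.id = id; this.name = name; this.address = address;
        }
    }

    static final Comparator<Row> BY_ID = Comparator.comparing(r -> r.id);
    static final Comparator<Row> BY_NAME = Comparator.comparing(r -> r.name);
    static final Comparator<Row> BY_ADDRESS = Comparator.comparing(r -> r.address);

    // Sort by name, then address, then id, using one stable sort per field.
    static List<Row> sortStepwise(List<Row> rows) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(BY_ID);       // least significant field first
        copy.sort(BY_ADDRESS);
        copy.sort(BY_NAME);     // most significant field last
        return copy;
    }

    // The same ordering expressed as a single chained comparator.
    static List<Row> sortChained(List<Row> rows) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(BY_NAME.thenComparing(BY_ADDRESS).thenComparing(BY_ID));
        return copy;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("3", "Smith", "2 Oak Ave"),
                new Row("1", "Smith", "2 Oak Ave"),
                new Row("2", "Jones", "9 Elm Rd"),
                new Row("4", "Smith", "1 Oak Ave"));
        for (Row r : sortStepwise(rows)) {
            System.out.println(r.name + " | " + r.address + " | " + r.id);
        }
    }
}
```

Both methods produce the same order. The chained form does it in one pass, while the stepwise form is handy when the sort keys are chosen at run time.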
Also if you have access to the actual database, and it's a real relational database, I'd recommend doing some of your searches as queries where possible. And if you need to go back and forth between your Java objects and the DB, then using an ORM may end up being a good option.

most efficient Java data structure for searching triples of strings

Suppose I have a large list (around 10,000 entries) of string triples as such:
car noun yes
dog noun no
effect noun yes
effect verb no
Suppose I am presented with a string double - for example, (effect, verb) - and I need to quickly look in the list to see if the pair appears and, if it does, whether its value is yes or no. (For this example the double does appear and the value is "no".)
What is the best data structure in Java to store the list and the most efficient way to perform the search? I am running hundreds of thousands of these searches so speed is of the essence.
Thanks!

You might consider using a HashMap<YourDouble, String>. Searches will be O(1).
You could either create an object, YourDouble which holds the first two values, or else append one to the other -- if values will still be unique -- and use HashMap<String, String>.
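A minimal sketch of the key-object variant follows. WordKey is a made-up name playing the role of YourDouble, and the value is stored as a Boolean instead of the "yes"/"no" String, which is an assumption. The important part is that equals() and hashCode() are both defined over the two fields, otherwise HashMap lookups with a freshly built key will not find anything.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class PairLookup {
    // Immutable two-field key; equals and hashCode cover both fields.
    static final class WordKey {
        final String word;
        final String partOfSpeech;

        WordKey(String word, String partOfSpeech) {
            this.word = word;
            this.partOfSpeech = partOfSpeech;
        }

        @Override public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof WordKey)) return false;
            WordKey other = (WordKey) o;
            return word.equals(other.word) && partOfSpeech.equals(other.partOfSpeech);
        }

        @Override public int hashCode() {
            return Objects.hash(word, partOfSpeech);
        }
    }

    // Builds the lookup table from rows of {word, partOfSpeech, "yes" or "no"}.
    static Map<WordKey, Boolean> build(String[][] triples) {
        Map<WordKey, Boolean> table = new HashMap<>();
        for (String[] t : triples) {
            table.put(new WordKey(t[0], t[1]), "yes".equals(t[2]));
        }
        return table;
    }

    public static void main(String[] args) {
        Map<WordKey, Boolean> table = build(new String[][] {
                {"car", "noun", "yes"}, {"dog", "noun", "no"},
                {"effect", "noun", "yes"}, {"effect", "verb", "no"}});
        System.out.println(table.get(new WordKey("effect", "verb"))); // false
        System.out.println(table.get(new WordKey("cat", "noun")));    // null (pair not present)
    }
}
```

If the two strings are concatenated into one String key instead, put a separator between them that cannot occur in the data; otherwise ("ab", "c") and ("a", "bc") end up with the same key.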

I would create a HashMultimap for each type of search you want, e.g. "all three", "each pair", and "each single field". When you build the list, populate all the different maps, then you can fetch from whichever map is appropriate for your query.
(The downside is that you'll need a type for at least each arity, e.g. use just String for the "single field" maps, but a Pair for the two-field maps, and a Triple for the three-field map.)

You could use a HashMap where the key is the concatenation of the first two strings, the ones which you'll use for lookups, and the value is a Boolean, representing the yes and no strings.
Alternatively, it seems the words in the second column would be fewer, since they represent categories. You could have a HashMap<String, HashMap<String, Boolean>> where you first index by e.g. "noun", "verb" etc. and then you index by e.g. "car", "dog", "effect", to get to your boolean. This would probably be more space-efficient.
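The nested-map layout might be sketched like this; the class and method names are invented for the example, and whether it actually saves space depends on the data.

```java
import java.util.HashMap;
import java.util.Map;

class NestedLookup {
    // Outer key: category ("noun", "verb", ...); inner key: the word itself.
    private final Map<String, Map<String, Boolean>> byCategory = new HashMap<>();

    void add(String word, String category, boolean flag) {
        // Create the inner map for a category the first time it is seen.
        byCategory.computeIfAbsent(category, k -> new HashMap<>()).put(word, flag);
    }

    // Returns null when the (word, category) pair is not present.
    Boolean lookup(String word, String category) {
        Map<String, Boolean> words = byCategory.get(category);
        return words == null ? null : words.get(word);
    }

    public static void main(String[] args) {
        NestedLookup table = new NestedLookup();
        table.add("car", "noun", true);
        table.add("effect", "noun", true);
        table.add("effect", "verb", false);
        System.out.println(table.lookup("effect", "verb")); // false
        System.out.println(table.lookup("car", "verb"));    // null
    }
}
```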

10k doesn't seem that large to me. Have you tried a DB?
The place to look for information like this is the Semantic Web. A number of projects work on Triple Stores of just this type. There's a list at the bottom of the Triple Store page of implementations.
As far as Java is concerned, your algorithms are almost certainly going to be language independent, and if you find a good algorithm implemented in C, its Java port will also be fast.
Also, what does your data set look like? Are there a lot of two-field matches, such that subject and verb are often the same? How many matches are you expecting to get? MapReduce will work well for finding one match in 10k but won't work as well for a query that returns 8k of 10k where the query can't be easily partitioned.
There's a query language made just for this problem too: SPARQL. The bigdata blog has some good insights, though again 10k doesn't seem that large.

Using Binary Trees to find Anagrams

I am currently trying to create a method that uses a binary tree that finds anagrams of a word inputted by the user.
If the tree does not contain any other anagram for the word (i.e., if the key was not in the tree or the only element in the associated linked list was the word provided by the user), the message "no anagram found" gets printed.
For example, if key "opst" appears in the tree with an associated linked list containing the words "spot", "pots", and "tops", and the user gave the word "spot", the program should print "pots" and "tops" (but not spot).
public boolean find(K thisKey, T thisElement){
    return find(root, thisKey, thisElement);
}

public boolean find(Node current, K thisKey, T thisElement){
    if (current == null)
        return false;
    else{
        int comp = current.key.compareTo(thisKey);
        if (comp>0)
            return find(current.left, thisKey, thisElement);
        else if(comp<0)
            return find(current.right, thisKey, thisElement);
        else{
            return current.item.find(thisElement);
        }
    }
}
While I created this method to find if the element provided is in the tree (and the associated key), I was told not to reuse this code for finding anagrams.
K is a generic type that extends Comparable and represents the Key, T is a generic type that represents an item in the list.
If extra methods I've done are required, I can edit this post, but I am absolutely lost. (Just need a pointer in the right direction)

It's a little unclear what exactly is tripping you up (beyond "I've written a nice find method but am not allowed to use it."), so I think the best thing to do is start from the top.
I think you will find that once you get your data structured in just the right way, the actual algorithms will follow relatively easily (many computer science problems share this feature.)
You have three things:
1) Many linked lists, each of which contains the set of anagrams of some set of letters. I am assuming you can generate these lists as you need to.
2) A binary tree that maps Strings (keys) to lists of anagrams generated from those strings. Again, I'm assuming that you are able to perform basic operations on these trees--adding elements, finding elements by key, etc.
3) A user-inputted String.
Insight: The anagrams of a group of letters form an equivalence class. This means that any member of an anagram list can be used as the key associated with the list. Furthermore, it means that you don't need to store in your tree multiple keys that point to the same list (provided that we are a bit clever about structuring our data; see below).
In concrete terms, there is no need to have both "spot" and "opts" as keys in the tree pointing to the same list, because once you can find the list using any anagram of "spot", you get all the anagrams of "spot".
Structuring your data cleverly: Given our insight, assume that our tree contains exactly one key for each unique set of anagrams. So "opts" maps to {"opts", "pots", "spot", etc.}. What happens if our user gives us a String that we're not using as the key for its set of anagrams? How do we figure out that if the user types "spot", we should find the list that is keyed by "opts"?
The answer is to normalize the data stored in our data structures. This is a computer-science-y way of saying that we enforce arbitrary rules about how we store the data. (Normalizing data is a useful technique that appears repeatedly in many different computer science domains.) The first rule is that we only ever have ONE key in our tree that maps to a given linked list. Second, what if we make sure that each key we actually store is predictable--that is we know that we should search for "opts" even if the user types "spot"?
There are many ways to achieve this predictability--one simple one is to make sure that the letters of every key are in alphabetical order. Then we know that every set of anagrams is keyed by one (unique!) string: the letters its members share, sorted alphabetically, such as "opst" from the question (the key itself need not be a real word). Consistently enforcing this rule makes it easy to search the tree--we know that no matter what string the user gives us, the key we want is the string formed from alphabetizing the user's input.
Putting it together: I'll provide the high-level algorithm here to make this a little more concrete.
1) Get a String from the user (hold on to this String, we'll need it later)
2) Turn this string into a search key that follows our normalization scheme
(You can do this in the constructor of your "K" class, which ensures that you will never have a non-normalized key anywhere in your program.)
3) Search the tree for that key, and get the linked list associated with it. This list contains every anagram of the user's input String.
4) Print every item in the list that isn't the user's original string (see why we kept the string handy?)
Takeaways:
Frequently, your data will have some special features that allow you to be clever. In this case it is the fact that any member of an anagram list can be the sole key we store for that list.
Normalizing your data gives you predictability and allows you to reason about it effectively. How much more difficult would the "find" algorithm be if each key could be an arbitrary member of its anagram list?
Corollary: Getting your data structures exactly right (What am I storing? How are the pieces connected? How is it represented?) will make it much easier to write your algorithms.
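The four steps might look roughly like the sketch below. It is not the asker's tree: java.util.TreeMap stands in for the hand-written binary tree and ArrayList for the linked list, purely to show the normalization step, and all names are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class AnagramIndex {
    // Normalized key (sorted letters) -> every stored word made of those letters.
    private final Map<String, List<String>> wordsByKey = new TreeMap<>();

    // Step 2: normalize a word into its key by sorting its letters.
    static String normalize(String word) {
        char[] letters = word.toLowerCase(Locale.ROOT).toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    void add(String word) {
        wordsByKey.computeIfAbsent(normalize(word), k -> new ArrayList<>()).add(word);
    }

    // Steps 3 and 4: look up the list by key, then drop the user's own word.
    List<String> anagramsOf(String word) {
        List<String> found = wordsByKey.getOrDefault(normalize(word), Collections.emptyList());
        List<String> result = new ArrayList<>(found);
        result.remove(word);
        return result;
    }

    public static void main(String[] args) {
        AnagramIndex index = new AnagramIndex();
        for (String w : new String[] {"spot", "pots", "tops", "java"}) {
            index.add(w);
        }
        System.out.println(normalize("spot"));        // opst
        System.out.println(index.anagramsOf("spot")); // [pots, tops]
        List<String> none = index.anagramsOf("java");
        System.out.println(none.isEmpty() ? "no anagram found" : none.toString()); // no anagram found
    }
}
```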

What about sorting the characters in each word and then comparing the results?
