I have a simple application that asks the user for a search term, and then performs a search for that term.
I want to give it the ability to "auto correct" the entered term. This is just for a few terms, that are easy to enter incorrectly. For example, if someone searches for "BestBuy", and what I have in my array to be searched is "Best Buy", I want to automatically convert "BestBuy" into "Best Buy" before the search. I intend to have a list of these in a text file.
What's the best way to do this? Can I have in my text file, every line be something like bestbuy, Best Buy the first item being the entered term and the second one being what it gets autocorrected into? What should I use to store this data? A hashmap?
Edit: Just to clarify, I'm not trying to make an actual auto-correct system. That's way, way beyond the scope of this project. This is just to simply replace certain inputs with "corrected" versions to match what is in the array being searched.
Autocorrect, generally has a harder solution than a hashmap, because you cannot predict the user input, so no point of making a hashmap, though you can use it as a key-existence store.
One possible way: http://en.wikipedia.org/wiki/Levenshtein_distance with the words you have in your map/dictionary, and then selecting the nearest
you could use HashMap, create a method that translate all different types of input into the correct one. You method takes input like bestbuy, best buy, BestBuy,etc and return only Best Buy. The result Best Buy will be the HashMap key.
This is bit tricky. To give a high level idea, you need to maintain a hashMap, with string without spaces as key, and the corrected value as value. when user enters a string, trim() it, check with your map' keyset, and replace with value if found matching entry.
Related
I searched many posts but I didnt find answer. I'd like to search and access values in collection by searching by value. My object type is DictionaryWord with two values: String word and int wordUsage (number of times the word was used). I was wondering which collection would be the fastest. If I write down "wa" I'd like it to give me e.g. 5 strings that start with these letters. Any list or set would be probably way too slow as I have 100 000 objects.
I thought about using a HashMap by making its key values String word and its values int wordUsage. I could even write my own hash() function to just give every key same value after hashing - key: "writing", hash value: "writing". Considering there are no duplicates would it be a good idea or should I look for something else?
My point is: how and what do I use to search for values that have some part of the value used in the search condition. For example writing down "tea" i find in collection values like: "tea", "teacher", "tear", "teaching" etc.
The fastest I can think of is a binary search tree. I found this to be very helpful and it should make it clear why a tree is the best option.
Probably, you need prefix tree. Take a look at Trie wiki page for further information.
I have a List<String[]> of customer records in Java (from a database). I know from manually eyeballing the data that 25%+ are duplicates.
The duplicates are far from exact though. Sometimes they have different zips, but the same name and address. Other times the address is missing completely, etc...
After a day of research; I'm still really stumped as to how to even begin to attack this problem?
What are the "terms" that I should be googling for that describe this area (from a solve this in Java perspective)? And I don't suppose there is fuzzymatch.jar out there that makes it all just to easy?
I've done similar systems before for matching place information and people information. These are complex objects with many features and figuring out whether two different objects describe the same place or person is tricky. The way to do it is to break it down to the essentials.
Here's a few things that you can do:
0) If this is a oneoff, load the data into openrefine and fix things interactively. Maximum this solves your problem, minimum it will show you where your possible matches are.
1) there are several ways you can compare strings. Basically they differ in how reliable they are in producing negative and false matches. A negative match is when it matches when it shouldn't have. A positive match is when it should match and does. String equals will not produce negative matches but will miss a lot of potential matches due to slight variations. Levenstein with a small factor is a slightly better. Ngrams produce a lot of matches, but many of them will be false. There are a few more algorithms, take a look at e.g. the openrefine code to find various ways of comparing and clustering strings. Lucene implements a lot of this stuff in its analyzer framework but is a bit of a beast to work with if you are not very familiar with its design.
2) Separate the process of comparing stuff from the process of deciding whether you have a match. What I did in the past was qualify my comparisons, using a simple numeric score e.g. this field matched exactly (100) but that field was a partial match (75) and that field did not match at all. The resulting vector of qualified comparisons, e.g. (100, 75,0,25) can be compared to a reference vector that defines your perfect or partial match criteria. For example if first name, last name, and street match, the two records are the same regardless of the rest of the fields. Or if phonenumbers and last names match, that's a valid match too. You can encode such perfect matches as a vector and then simply compare it with your comparison vectors to determine whether it was a match, not a match, or a partial match. This is sort of a manual version of what machine learning does which is to extract vectors of features and then build up a probability model of which vectors mean what from reference data. Doing it manually, can work for simple problems.
3) Build up a reference data set with test cases that you know to match or not match and evaluate your algorithm against that reference set. That way you will know when you are improving things or making things worse when you tweak e.g. the factor that goes into Levinstein or whatever.
Jilles' answer is great and comes from experience. I've also had to work on cleaning up large messy tables and sadly didn't know much about my options at that time (I ended up using Excel and a lot of autofilters). Wish I'd known about OpenRefine.
But if you get to the point where you have to write custom code to do this, I want to make a suggestion as to how: The columns are always the same, right? For instance, the first String is always the key, the second is the First name, the sixth is the ZIP code, tenth is the fax number, etc.?
Assuming there's not an unreasonable number of fields, I would start with a custom Record type which has each DB field as member rather than a position in an array. Something like
class CustomerRow {
public final String id;
public final String firstName;
// ...
public CustomerRow(String[] data) {
id = data[0];
// ...
}
You could also include some validation code in the constructor, if you knew there to be garbage values you always want to filter out.
(Note that you're basically doing what an ORM would do automatically, but getting started with one would probably be more work than just writing the Record type.)
Then you'd implement some Comparator<CustomerRow>s which only look at particular fields, or define equality in fuzzy terms (there's where the edit distance algorithms would come in handy), or do special sorts.
Java uses a stable sort for objects, so to sort by e.g. name, then address, then key, you would just do each sort, but choose your comparators in the reverse order.
Also if you have access to the actual database, and it's a real relational database, I'd recommend doing some of your searches as queries where possible. And if you need to go back and forth between your Java objects and the DB, then using an ORM may end up being a good option.
Suppose I have a large list (around 10,000 entries) of string triples as such:
car noun yes
dog noun no
effect noun yes
effect verb no
Suppose I am presented with a string double - for example, (effect, verb) - and I need to quickly look in the list to see if the pair appears and, if it does, whether its value is yes or no. (For this example the double does appear and the value is "no".)
What is the best data structure in Java to store the list and the most efficient way to perform the search? I am running hundreds of thousands of these searches so speed is of the essence.
Thanks!
You might consider using a HashMap<YourDouble, String>. Searches will be O(1).
You could either create an object, YourDouble which holds the first two values, or else append one to the other -- if values will still be unique -- and use HashMap<String, String>.
I would create a HashMultimap for each type of search you want, e.g. "all three", "each pair", and "each single field". When you build the list, populate all the different maps, then you can fetch from whichever map is appropriate for your query.
(The downside is that you'll need a type for at least each arity, e.g. use just String for the "single field" maps, but a Pair for the two-field maps, and a Triple for the three-field map.)
You could use a HashMap where the key is the concatenation of the first two strings, the ones which you'll use for lookups, and the value is a Boolean, representing the yes and no strings.
Alternatively, it seems the words in the second column would be fewer, since they represent categories. You could have a HashMap<String, HashMap<String, Boolean>> where you first index by e.g. "noun", "verb" etc. and then you index by e.g. "car", "dog", "effect", to get to your boolean. This would probably be more space-efficient.
10k doesn't seem that large to me. Have you tried a DB?
The place to look for information like this is the Semantic Web. A number of projects work on Triple Stores of just this type. There's a list at the bottom of the Triple Store page of implementations.
As far as java is concerned your algorithms are almost certainly going to be language dependent and if you find a good algorithm implemented in C its java port will also be fast.
Also, what's your data set look like? Are there a lot of 2 matches such that subject and verb are often the same? How many matches are you expecting to get? MapReduce will work work well for finding one match in 10k but won't work as well doing a query that returns a 8k of 10k where the query can't be easily partitioned.
There's a query language made just for this problem too: SPARQL. The bigdata blog has some good insights, though again 10k doesn't seem that large.
I have a text with about 300 - 500 words. Also i got about 200k keywords and i want to know if each of the keywords is contained in the text. A String contains ist quite slow, is there some way to preprocess the String?
I thought about using a SuffixTree but im not sure this is the best choice.
Also, are there any good librarys for this task? semanticdiscoverytoolkit for example has a suffixtree implementation but after adding the string i cant figure out how to look up if a string is contained in the tree.
Greetings,
Nico
you can try the rabin-karp string search algorithm. since you are doing mostly hash (integer) comparisons, the performance is much better than string comparisons.
compute the hash of the keyword
compute the rolling hash of the text
compare these 2 hashes. if they match, perform the actual string comparison.
advance the position by 1 character and repeat from step 2 until you reach the end of the text.
as a analogy, the rolling hash is like a "sliding window" that scrolls along the text. the hash comparison is done using the hash of the substring in the "sliding window" against the hash of the keyword.
You can use StringTokenizer to get each of the words then populate a hashmap which you check afterwards. This requires going through each list only once. Lookup times should then be very fast which is important given the amount of keywords you have.
It may be worth it to profile this method against something like Lucene.
Hellow Stack Overflow people. I'd like some suggestions regarding the following problem. I am using Java.
I have an array #1 with a number of Strings. For example, two of the strings might be: "An apple fell on Newton's head" and "Apples grow on trees".
On the other side, I have another array #2 with terms like (Fruits => Apple, Orange, Peach; Items => Pen, Book; ...). I'd call this array my "dictionary".
By comparing items from one array to the other, I need to see in which "category" the items from #1 fall into from #2. E.g. Both from #1 would fall under "Fruits".
My most important consideration is speed. I need to do those operations fast. A structure allowing constant time retrieval would be good.
I considered a Hashset with the contains() method, but it doesn't allow substrings. I also tried running regex like (apple|orange|peach|...etc) with case insensitive flag on, but I read that it will not be fast when the terms increase in number (minimum 200 to be expected). Finally, I searched, and am considering using an ArrayList with indexOf() but I don't know about its performance. I also need to know which of the terms actually matched, so in this case, it would be "Apple".
Please provide your views, ideas and suggestions on this problem.
I saw Aho-Corasick algorithm, but the keywords/terms are very likely to change often. So I don't think I can use that. Oh, I'm no expert in text mining and maths, so please elaborate on complex concepts.
Thank you, Stack Overflow people, for your time! :)
If you use a multimap from Google Collections, they have a function to invert the map (so you can start with a map like {"Fruits" => [Apple]}, and produce a map with {"Apple" => ["Fruits"]}. So you can lookup the word and find a list of categories for it, in one call to the map.
I would expect I'd want to split the strings myself and lookup the words in the map one at a time, so that I could do stemming (adjusting for different word endings) and stopword-filtering. Using the map should get good lookup times, plus it's easy to try out.
Would a suffix tree or similar data structure work for your application? It offers O(m) string lookup, where m is the length of the search string, after an O(n2)--or better with some trickery--initial setup, and, with some extra effort, you can associate arbitrary data, such as a reference to a category, with complete words in your dictionary. If you don't want to code it yourself, I believe the BioJava library includes an implementation.
You can also add strings to a suffix tree after initial setup, although the cost will still be around O(n2). That's probably not a big deal if you're adding short words.
If you have only 200 terms to look for, regexps might actually work for you. Of course the regular expression is large, but if you compile it once and just use this compiled Pattern the lookup time is probably linear in the combined length of all the strings in array#1 and I don't see how you can hope for being better than that.
So the algorithm would be: concatenate the words of array#2 which you want to look for into the regular expression, compile it, and then find the matches in array#1 .
(Regular expressions are compiled into a state machine - that is on each character of the string it just does a table lookup for the next state. If the regular expression is complicated you might have backtracking that increases the time, but your regular expression has a very simple structure.)