My requirement is to be able to match two strings that are similar but not an exact match.
For example, given the following strings
First Name
Last Name
LName
FName
The output should be First Name, FName and Last Name, LName, as each pair is a logical match. Are there any libraries that I could use to do this? I am using Java to achieve this functionality.
Thanks
Raam
You could use Apache Commons StringUtils...
http://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#getLevenshteinDistance(java.lang.CharSequence,%20java.lang.CharSequence)
But it's worth noting that this may not be the best algorithm for the specific use-case in the question - I recommend reading some of the other answers here for more ideas.
According to the example you gave, you should use a modified Levenshtein distance where the penalty for adding spaces is small and the penalty for mismatched characters is larger. This will handle matching abbreviations to the strings that were abbreviated quite well. However that's assuming that you're mainly dealing with aligning abbreviations to corresponding longer versions of the strings. You should elaborate more exactly what kind of matchings you want to perform (e.g. more examples, or some kind of high-level description) if you want a more detailed and pointed answer about what methods you can/should use.
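To make the idea concrete, here is a minimal sketch of such a weighted edit distance. The class name and the two cost values are made up for illustration, and it assumes that "adding spaces" means inserting or deleting any character (a gap) rather than only the space character; treat it as a starting point, not a tuned implementation.

```java
import java.util.Locale;

final class WeightedEditDistance {
    // Hypothetical costs, chosen only for illustration; tune them on real data.
    static final int GAP_COST = 1;      // inserting or deleting one character
    static final int MISMATCH_COST = 5; // replacing one character with a different one

    // Case-insensitive edit distance with separate gap and mismatch penalties.
    static int distance(String a, String b) {
        String s = a.toLowerCase(Locale.ROOT);
        String t = b.toLowerCase(Locale.ROOT);
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 1; i <= s.length(); i++) {
            d[i][0] = i * GAP_COST;
        }
        for (int j = 1; j <= t.length(); j++) {
            d[0][j] = j * GAP_COST;
        }
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                boolean same = s.charAt(i - 1) == t.charAt(j - 1);
                int substitute = d[i - 1][j - 1] + (same ? 0 : MISMATCH_COST);
                int delete = d[i - 1][j] + GAP_COST;
                int insert = d[i][j - 1] + GAP_COST;
                d[i][j] = Math.min(substitute, Math.min(delete, insert));
            }
        }
        return d[s.length()][t.length()];
    }
}
```

With these costs an abbreviation such as "FName" scores closer to "First Name" than to "Last Name", because every letter of the abbreviation can be aligned with the longer string using only cheap gaps.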
StringUtils is simply the best for this, as #CupawnTae already said.
Below is one simple example I came across on Stack Overflow:
public static Object getTheClosestMatch(Collection<?> collection, Object target) {
    int distance = Integer.MAX_VALUE;
    Object closest = null;
    for (Object compareObject : collection) {
        int currentDistance = StringUtils.getLevenshteinDistance(compareObject.toString(), target.toString());
        if (currentDistance < distance) {
            distance = currentDistance;
            closest = compareObject;
        }
    }
    return closest;
}
An answer to a really similar question to yours can be found here.
Also, wikipedia has an article on Approximate String Matching that can be found here. If the first link isn't what you're looking for, I would suggest reading the wikipedia article and digging through the sources to find what you need.
Sorry I can't personally be of more help to you, but I really hope that these resources can help you find what you're looking for!
The spell check algorithms use a variant of this algorithm. http://en.wikipedia.org/wiki/Levenshtein_distance. I implemented it in class for a project and it was fairly simple to do so. If you don't want to implement it yourself you can use the name to search for other libraries.
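If you do want to implement it yourself, a minimal sketch of the classic dynamic-programming version might look like the following. The class and method names are my own; it keeps only two rows of the table to save memory.

```java
final class Levenshtein {
    // Number of single-character insertions, deletions and substitutions
    // needed to turn a into b.
    static int distance(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```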
Related
I have an array list with some names inside it (first and last names). What I have to do is go through each "first name" and see how many times a character (which the user specifies) shows up at the end of every first name in the array list, and then print out the number of times that character showed up.
public int countFirstName(char c) {
    int i = 0;
    for (Name n : list) {
        if (n.getFirstName().length() - 1 == c) {
            i++;
        }
    }
    return i;
}
That is the code I have. The problem is that the counter (i) doesn't add 1 even if there is a character that matches the end of the first name.
You're comparing the index of last character in the string to the required character, instead of the last character itself, which you can access with charAt:
String firstName = n.getFirstName();
if (firstName.charAt(firstName.length() - 1) == c) {
    i++;
}
When you're setting out learning to code, there is great value in using pencil and paper, or describing your algorithm ahead of time, in the language you think in. Most people who learn a foreign language start out by assembling a sentence in their native language, translating it into the foreign language, then speaking it. Few, if any, learners of a foreign language are able to think in it natively.
Coding is no different; all your life you've been speaking English and thinking in it. Now you're aiming to learn a different pattern of thinking, syntax, key words. This task will go a lot easier if you:
work out in high level natural language what you want to do first
write down the steps in clear and simple language, like a recipe
don't try to do too much at once
Had I been a tutor marking your program, I'd have been looking for something like this:
//method to count the number of list entries ending with a particular character
public int countFirstNamesEndingWith(char lookFor) {
    //declare a variable to hold the count
    int cnt = 0;
    //iterate the list
    for (Name n : list) {
        //get the first name
        String fn = n.getFirstName();
        //get the last char of it
        char lc = fn.charAt(fn.length() - 1);
        //compare
        if (lc == lookFor) {
            cnt++;
        }
    }
    return cnt;
}
Taking the bullet points in turn:
The comments serve as a high level description of what must be done. We write them all first, before even writing a single line of code. My course penalised uncommented code, and writing them first was a handy way of getting the requirement out of the way (they're a chore, right? Not always, but..) but also it is really easy to write a logic algorithm in high level language, then translate the steps into the language you're learning. I definitely think if you'd taken this approach you wouldn't have made the error you did, as it would have been clear that the code you wrote didn't implement the algorithm you'd have described earlier.
Don't try to do too much in one line. Yes, I'm sure plenty of coders think it looks cool, or trick, or shows off what impressive coding smarts they have to pack a good 10 line algorithm into a single line of code that uses some obscure language features but one day it's highly likely that someone else is going to have to come along to maintain that code, improve it or change part of what it does - at that moment it's no longer cool, and it was never really a smart thing to do
Aominee, in their comment, actually gives us something like an example of this:
return (int)list.stream().filter(e -> e.charAt(e.length()-1)==c).count();
It's a one line implementation of a solution to your problem. Cool huh? Well, it has a bug* (for a start) but it's not the main thrust of my argument. At a more basic level: have you got any idea what it's doing? can you look at it and in 2 seconds tell me how it works?
It's quite an advanced language feature, it's trick for sure, but it might be a very poor solution because it's hard to understand, hard to maintain as a result, and does a lot while looking like a little- it only really makes sense if you're well versed in the language. This one line bundles up a facility that loops over your list, a feature that effectively has a tiny sub method that is called for every item in the list, and whose job is to calculate if the name ends with the sought char
It's a brilliant feature, a cute example and it surely has its place in production Java, but its place is probably not here, in your learning exercise
Similarly, I'd go as far as to say that this line of yours:
if (n.getFirstName().length() - 1 == c) {
Is approaching "doing too much" - I say this because it's where your logic broke down; you didn't write enough code to effectively implement the algorithm. You'd actually have to write even more code to implement this way:
if (n.getFirstName().charAt(n.getFirstName().length() - 1) == c) {
This is a right eyeful to load into your brain and understand. The accepted answer broke it down a bit by first getting the name into a temporary variable. That's a sensible optimisation. I broke it out another step by getting the last char into a temp variable. In a production system I probably wouldn't go that far, but this is your learning phase - try to minimise the number of operations each of your lines does. It will aid your understanding of your own code a great deal
If you do ever get a penchant for writing as much code as possible in as few chars, look at some code golf games here on the stack exchange network; the game is to abuse as many language features as possible to make really short, trick code.. pretty much every winner stands as a testament to condensed code that should never, ever be put into a production system maintained by normal coders who value their sanity
*the bug is it doesn't get the first name out of the Name object
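For completeness, a stream version that does pull the first name out might look roughly like this. The Name class here is a hypothetical stand-in for yours (I'm assuming it exposes getFirstName()), and the list is passed in as a parameter only so the sketch is self-contained. I've also added an empty-name guard, which the loop version above doesn't have.

```java
import java.util.List;

final class NameCounter {
    // Hypothetical stand-in for the asker's Name class.
    static final class Name {
        private final String firstName;

        Name(String firstName) {
            this.firstName = firstName;
        }

        String getFirstName() {
            return firstName;
        }
    }

    // Counts the names whose first name ends with the given character.
    static int countFirstNamesEndingWith(List<Name> list, char c) {
        return (int) list.stream()
                .map(Name::getFirstName)                 // get the first name
                .filter(fn -> !fn.isEmpty()              // skip empty names
                        && fn.charAt(fn.length() - 1) == c) // compare the last char
                .count();
    }
}
```

Even with the steps split over several lines it is still doing a lot per statement, which is rather the point being made above.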
I'm looking for a tool that would compare two text strings and return a result being in fact the indicator of their similarity (e.g. 95%). It needs to be implemented on a platform supporting Java libraries.
My best guess is that I need some fuzzy logic comparison tool that would do the fuzzy match and then return the similarity level.
I've seen some posts here related to fuzzy search but I need the exact opposite - meaning I don't want to set some parameters and have similar entries returned. Instead I have the entries on hand but need to have that similarity parameter derived from them...
Can you advise me on that? Many thanks
Apache's StringUtils has something called Levenshtein distance indicator.
http://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/StringUtils.html
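Levenshtein distance is an algorithm that measures similarity as an "edit distance": the number of single-character insertions, deletions and substitutions needed to turn one string into the other. I'm not sure whether this counts as "fuzzy", though.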
Example:
int distance = StringUtils.getLevenshteinDistance("cat", "hat");
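The distance is a count of edits, not a percentage. One common way to turn it into a similarity figure is to normalise by the length of the longer string; that convention is my assumption, not something StringUtils does for you. The sketch below carries its own small Levenshtein routine so that it runs without Commons Lang on the classpath; the class and method names are made up.

```java
final class Similarity {
    // Plain Levenshtein distance (two-row dynamic programming).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in percent: 100 for identical strings, 0 when every
    // position of the longer string has to be edited.
    static double percent(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 100.0; // two empty strings are treated as identical
        }
        return 100.0 * (maxLen - levenshtein(a, b)) / maxLen;
    }
}
```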
There is now a library that does exactly that
https://github.com/intuit/fuzzy-matcher
I am in the middle of writing some code to filter sentences into different groups.
The sentences are formed from the descriptions of incident tickets that my servicedesk have processed.
I have to filter them based on 5 categories: Laptop, Telephony, Network, Printer, Application.
An example of a description from the application category is: "Please can you install CMS on XXXX YYYYYYY laptop"
I understand that it is impossible to get this perfect. But I was wondering what the best way to tackle this is? As you can see from the example it falls into the application category but contains a keyword "laptop".
If there's any more information I can provide you with please let me know. Every little helps. Thanks
Maintain a separate list or queue for each category.
When you receive a sentence, check it for keyword occurrences and add/push it to the appropriate list/queue.
You can maintain a map which tells you which list/queue belongs to which keyword.
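A minimal sketch of that map-based routing could look like this. All names are hypothetical, and it assumes that the first registered keyword found in the sentence decides the category, so registration order matters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class KeywordRouter {
    // keyword -> category name, kept in registration order
    private final Map<String, String> categoryByKeyword = new LinkedHashMap<>();
    // category name -> sentences routed to it
    private final Map<String, List<String>> ticketsByCategory = new HashMap<>();

    void register(String keyword, String category) {
        categoryByKeyword.put(keyword.toLowerCase(Locale.ROOT), category);
        ticketsByCategory.computeIfAbsent(category, k -> new ArrayList<>());
    }

    // Adds the sentence to the list of the first registered keyword it
    // contains and returns that category, or null if no keyword matched.
    String route(String sentence) {
        String lower = sentence.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : categoryByKeyword.entrySet()) {
            if (lower.contains(entry.getKey())) {
                ticketsByCategory.get(entry.getValue()).add(sentence);
                return entry.getValue();
            }
        }
        return null;
    }

    List<String> tickets(String category) {
        return ticketsByCategory.getOrDefault(category, Collections.emptyList());
    }
}
```

Registering "install" before "laptop" sends the example sentence from the question to the Application list.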
Interesting question! As seen in your example, there can be multiple keywords within the same sentence, making it difficult to decipher which category the sentence will belong to.
In order to get around this, I would suggest possibly using a separate priority queue for each category, containing keywords for each category in order of priority.
For example, you would have a priority queue of keywords for the Application category, and (within that priority queue) "install" would be of higher priority than "laptop" or "computer", because "install" is more closely related to applications than "laptop".
In your algorithm for choosing which category a sentence is part of, I would do a round-robin search through all five priority queues until a match is found - the highest priority match out of all five categories takes the sentence. This is one possible solution I can think of.
NOTE: For this to work properly, of course it is important to pick and choose carefully which keywords go into which categories; for example, in the Laptop category, it may seem natural to have "laptop" be the highest priority keyword - however, this would cause lots of collisions because laptop will probably be a very commonly used word in sentences. You should have very specific keywords pertaining to each category, rather than having broad/surface level keywords like "laptop" (or have "laptop" be a very low priority keyword).
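Here is a rough sketch of the priority idea. To keep it short it stores each category's keywords in a plain map from keyword to priority instead of an actual PriorityQueue, and simply scans all of them for the highest-priority hit; the class name, the priority numbers and the word-splitting rule are all assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class PriorityClassifier {
    // category -> (keyword -> priority); a higher number means a higher priority
    private final Map<String, Map<String, Integer>> keywords = new LinkedHashMap<>();

    void add(String category, String keyword, int priority) {
        keywords.computeIfAbsent(category, k -> new HashMap<>())
                .put(keyword.toLowerCase(Locale.ROOT), priority);
    }

    // Returns the category that owns the highest-priority keyword found in
    // the sentence, or null if no keyword occurs. Ties go to the category
    // that was added first.
    String classify(String sentence) {
        Set<String> words = new HashSet<>(
                Arrays.asList(sentence.toLowerCase(Locale.ROOT).split("\\W+")));
        String best = null;
        int bestPriority = Integer.MIN_VALUE;
        for (Map.Entry<String, Map<String, Integer>> category : keywords.entrySet()) {
            for (Map.Entry<String, Integer> keyword : category.getValue().entrySet()) {
                if (words.contains(keyword.getKey()) && keyword.getValue() > bestPriority) {
                    bestPriority = keyword.getValue();
                    best = category.getKey();
                }
            }
        }
        return best;
    }
}
```

Giving "install" a high priority under Application and "laptop" a low one under Laptop makes the example sentence land in Application, while a sentence that only mentions the laptop still goes to Laptop.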
This is actually a machine learning problem (text categorization) that you could solve using several algorithms: support vector machines, multinomial logistic regression, naive bayes and more.
There are many libraries which will help you, here is one (java)
http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.html
Also python has a very good library:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#training-a-classifier
If you want to take this approach, you are going to need a training dataset, meaning that you need to manually label a set of documents that the algorithm will use to automatically learn which keywords are important.
Hope it helps!
If all you can do is receive these sentences and then apply some logic to them,
why not just filter them with a regex?
See for example,
Regex to find a specific word in a string in java
e.g.
List<String> laptopList = new ArrayList<String>();
for (String item : sentenceList) {
    if (item.matches(".*\\blaptop\\b.*")) {
        laptopList.add(item);
    }
}
You are looking at the keyword "laptop". But there is also the keyword "install", which primarily indicates the installation of some application.
So you can try something like
if (sentence.contains("install")) // also covers sentences that mention both "install" and "laptop"
{
applicationTickets.add(sentence);
}
else if(sentence.contains("laptop") || other conditions)
{
laptopTickets.add(sentence);
}
else if( )
..........
else if( )
..........
If you look at the code, the application category is checked first, because such a sentence can also match the Laptop keywords. If the laptop check came first, the sentence would fall into the laptop category.
You can use loops to check all the conditions. The keywords can be added to a specific list for each category.
I have a class that is doing a lot of text processing. For each string, which is anywhere from 100->2000 characters long, I am performing 30 different string replacements.
Example:
String modified;
for (int i = 0; i < num_strings; i++) {
    modified = runReplacements(strs[i]);
    //do stuff
}
public String runReplacements(String str) {
    str = str.replace("foo", "bar");
    str = str.replace("baz", "beef");
    ....
    return str;
}
'foo', 'baz', and all other "targets" are only expected to appear once and are string literals (no need for an actual regex).
As you can imagine, I am concerned about performance :)
Given this,
replaceFirst() seems a bad choice because it won't use Pattern.LITERAL and will do extra processing that isn't required.
replace() seems a bad choice because it will traverse the entire string looking for multiple instances to be replaced.
Additionally, since my replacement texts are the same every time, it seems to make sense for me to write my own code otherwise String.replaceFirst() or String.replace() will be doing a Pattern.compile every single time in the background. Thinking that I should write my own code, this is my thought:
Perform a Pattern.compile() only once for each literal replacement desired (no need to recompile every single time) (i.e. p1 - p30)
Then do the following for each pX: p1.matcher(str).replaceFirst(Matcher.quoteReplacement("desiredReplacement"));
This way I abandon ship on the first replacement (instead of traversing the entire string), and I am using literal vs. regex, and I am not doing a re-compile every single iteration.
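Roughly, the plan above would look like the sketch below. The class name and constructor are made up for illustration; the point is that both the literal patterns and the quoted replacement strings are built once and then reused for every input string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LiteralReplacer {
    private final Pattern[] patterns;    // compiled once, with Pattern.LITERAL
    private final String[] replacements; // quoted once, so '$' and '\' stay literal

    LiteralReplacer(String[] targets, String[] replacementTexts) {
        patterns = new Pattern[targets.length];
        replacements = new String[targets.length];
        for (int i = 0; i < targets.length; i++) {
            patterns[i] = Pattern.compile(targets[i], Pattern.LITERAL);
            replacements[i] = Matcher.quoteReplacement(replacementTexts[i]);
        }
    }

    // Replaces only the first occurrence of each target, in order.
    String runReplacements(String str) {
        for (int i = 0; i < patterns.length; i++) {
            str = patterns[i].matcher(str).replaceFirst(replacements[i]);
        }
        return str;
    }
}
```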
So, which is the best for performance?
So, which is the best for performance?
Measure it! ;-)
ETA: Since a two word answer sounds irretrievably snarky, I'll elaborate slightly. "Measure it and tell us..." since there may be some general rule of thumb about the performance of the various approaches you cite (good ones, all) but I'm not aware of it. And as a couple of the comments on this answer have mentioned, even so, the different approaches have a high likelihood of being swamped by the application environment. So, measure it in vivo and focus on this if it's a real issue. (And let us know how it goes...)
First, run and profile your entire application with a simple match/replace. This may show you that:
your application already runs fast enough, or
your application is spending most of its time doing something else, so optimizing the match/replace code is not worthwhile.
Assuming that you've determined that match/replace is a bottleneck, write yourself a little benchmarking application that allows you to test the performance and correctness of your candidate algorithms on representative input data. It's also a good idea to include "edge case" input data that is likely to cause problems; e.g. for the substitutions in your example, input data containing the sequence "bazoo" could be an edge case. On the performance side, make sure that you avoid the traps of Java micro-benchmarking; e.g. JVM warmup effects.
Next implement some simple alternatives and try them out. Is one of them good enough? Done!
In addition to your ideas, you could try concatenating the search terms into a single regex (e.g. "(foo|baz)" ), use Matcher.find(int) to find each occurrence, use a HashMap to lookup the replacement strings and a StringBuilder to build the output String from input string substrings and replacements. (OK, this is not entirely trivial, and it depends on Pattern/Matcher handling alternates efficiently ... which I'm not sure is the case. But that's why you should compare the candidates carefully.)
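As a rough sketch of that single-regex idea (names are my own, and it uses the no-argument Matcher.find(), which carries on from the end of the previous match, rather than find(int)):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MultiReplacer {
    private final Map<String, String> replacements;
    private final Pattern pattern;

    MultiReplacer(Map<String, String> replacements) {
        if (replacements.isEmpty()) {
            throw new IllegalArgumentException("at least one replacement is required");
        }
        this.replacements = new HashMap<>(replacements);
        // Build one alternation such as \Qfoo\E|\Qbaz\E from the literal targets.
        StringBuilder alternation = new StringBuilder();
        for (String target : this.replacements.keySet()) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(target));
        }
        pattern = Pattern.compile(alternation.toString());
    }

    // Walks the input once, copying the unmatched stretches and substituting
    // the looked-up replacement for every match.
    String replaceAll(String input) {
        Matcher matcher = pattern.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (matcher.find()) {
            out.append(input, last, matcher.start());
            out.append(replacements.get(matcher.group()));
            last = matcher.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }
}
```

Note that when one target is a prefix of another, the order of the alternatives decides which one wins, and a HashMap gives no ordering guarantee; that is exactly the kind of edge case the benchmark data should cover.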
In the (IMO unlikely) event that a simple alternative doesn't cut it, this wikipedia page has some leads which may help you to implement your own efficient match/replacer.
Isn't it frustrating when you ask a question and get a bunch of advice telling you to do a whole lot of work and figure it out for yourself?!
I say use replaceAll();
(I have no idea if it is, indeed, the most efficient, I just don't want you to feel like you wasted your money on this question and got nothing.)
[edit]
PS. After that, you might want to measure it.
[edit 2]
PPS. (and tell us what you found)
Please help me perform word clustering using the k-means algorithm in Java. From the set of documents, I get each word and its frequency count. After that I don't know how to start clustering. I have already searched Google but found no ideas. Please tell me the steps to perform word clustering. I need this urgently. Thanks in advance.
"Programming Collective Intelligence" by Toby Segaran has a wonderful chapter on how to do this. The examples are in Python, but they should be easy to port to Java.
In clustering, the most important thing is to build a method that checks how "close" together two things are. E.g. if you are interested in strings with the same length, it could look like:
int calculateDistance(String s1, String s2) {
    return Math.abs(s1.length() - s2.length());
}
Then, I'm not so sure, but it can go like this:
1. choose (possibly at random) the first k strings,
2. iterate over all strings, and relate each to its "nearest" chosen string.
Then the next step can be something like choosing the middle of every "cluster" and starting again. I don't remember it 100%, but I think it is a good way to start.
And remember, that most important is the method calculateDistance()!
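Putting those steps together, a rough sketch might look like this. Since strings have no arithmetic mean, it picks as the "middle" of each cluster the member with the smallest total distance to the others, which makes it closer to k-medoids than to textbook k-means. The class and method names (other than calculateDistance) are made up, and it takes the first k strings as starting centres so the result is repeatable.

```java
import java.util.ArrayList;
import java.util.List;

final class StringClusters {
    static int calculateDistance(String s1, String s2) {
        return Math.abs(s1.length() - s2.length());
    }

    // Assumes 1 <= k <= items.size() and maxRounds >= 1.
    static List<List<String>> cluster(List<String> items, int k, int maxRounds) {
        // Step 1: take the first k strings as the initial centres.
        List<String> centres = new ArrayList<>(items.subList(0, k));
        List<List<String>> clusters = new ArrayList<>();
        for (int round = 0; round < maxRounds; round++) {
            // Step 2: relate every string to its nearest centre.
            clusters = new ArrayList<>();
            for (int c = 0; c < k; c++) {
                clusters.add(new ArrayList<>());
            }
            for (String item : items) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (calculateDistance(item, centres.get(c))
                            < calculateDistance(item, centres.get(best))) {
                        best = c;
                    }
                }
                clusters.get(best).add(item);
            }
            // Step 3: pick the "middle" of each cluster as its new centre.
            List<String> newCentres = new ArrayList<>();
            for (int c = 0; c < k; c++) {
                List<String> members = clusters.get(c);
                newCentres.add(members.isEmpty() ? centres.get(c) : middleOf(members));
            }
            if (newCentres.equals(centres)) {
                break; // nothing moved, so the clusters are stable
            }
            centres = newCentres;
        }
        return clusters;
    }

    // The member with the smallest total distance to all other members.
    static String middleOf(List<String> members) {
        String best = members.get(0);
        long bestTotal = Long.MAX_VALUE;
        for (String candidate : members) {
            long total = 0;
            for (String other : members) {
                total += calculateDistance(candidate, other);
            }
            if (total < bestTotal) {
                bestTotal = total;
                best = candidate;
            }
        }
        return best;
    }
}
```

To cluster words by something more meaningful than length, only calculateDistance() needs to change.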