I am trying to devise an algorithm that performs error correction in names. My approach is to keep a database of the correct names, compute the edit distance between each of them and the name entered, and then suggest the 5 or 10 closest matches.
This task is significantly different from standard error correction in words, as some of the names might be replaced by initials. For instance, "Jonathan Smith" and "J. Smith" are actually quite close and could easily be considered the same name, so the edit distance should be very small, if not 0. Another challenge is that some names might be written differently while sounding the same. For instance, Shnaider and Schneider are versions of the same name written by people with different locales (there are probably better examples of this). And another case: just imagine all the possible misspellings of Jawaharlal Nehru, most of which have nothing in common with the real name. Again, most of them will probably be similar phonetically.
Obviously Lucene's error correction algorithm will not help me here as it does not handle the above cases.
So my question is: do you know any library capable of doing error correction in names? Can you propose some algorithm for handling the cases mentioned above?
I am interested in libraries in C++ or Java. As for algorithm proposals, any language or pseudocode will do.
For phonetic matching, see Soundex.
I think modifying a Levenshtein distance algorithm to treat "abbreviate to an initial" and "expand from an initial" as single-distance edits ought to be straightforward, but the details are beyond me at the moment.
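Here's a minimal sketch of that idea in Java. Instead of changing the character-level DP table itself, it tokenizes each name on whitespace and charges zero for a token that is just an initial matching the other token's first letter. The class and helper names (`NameDistance`, `isInitial`, and the positional token alignment) are my own invention, and it won't handle reordered names like "Smith, J.":

```java
public class NameDistance {

    // Plain Levenshtein distance between two strings (dynamic programming).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // True for tokens like "J." or "J".
    static boolean isInitial(String token) {
        return (token.length() == 1
                || (token.length() == 2 && token.charAt(1) == '.'))
                && Character.isLetter(token.charAt(0));
    }

    // Token-level distance: an initial matching the other token's first
    // letter costs 0, so "J." vs "Jonathan" is a free edit.
    static int tokenDistance(String s, String t) {
        if (isInitial(s) && !t.isEmpty()
                && Character.toLowerCase(s.charAt(0)) == Character.toLowerCase(t.charAt(0))) return 0;
        if (isInitial(t) && !s.isEmpty()
                && Character.toLowerCase(t.charAt(0)) == Character.toLowerCase(s.charAt(0))) return 0;
        return levenshtein(s.toLowerCase(), t.toLowerCase());
    }

    // Whole-name distance: compare token by token, charging full length
    // for unmatched trailing tokens.
    static int nameDistance(String name1, String name2) {
        String[] a = name1.trim().split("\\s+");
        String[] b = name2.trim().split("\\s+");
        int n = Math.max(a.length, b.length), total = 0;
        for (int i = 0; i < n; i++) {
            if (i >= a.length) total += b[i].length();
            else if (i >= b.length) total += a[i].length();
            else total += tokenDistance(a[i], b[i]);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(nameDistance("Jonathan Smith", "J. Smith"));       // 0
        System.out.println(nameDistance("Jonathan Smith", "Jonathan Smyth")); // 1
    }
}
```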
You might also look at Metaphone.
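Both encodings are available off the shelf in Apache Commons Codec (package org.apache.commons.codec.language), which is my suggestion rather than something named above. A quick demonstration on the Schneider/Shnaider case from the question; the expected Soundex output is computed from the standard rules, so verify it against your library version:

```java
import org.apache.commons.codec.language.Metaphone;
import org.apache.commons.codec.language.Soundex;

public class PhoneticDemo {
    public static void main(String[] args) {
        Soundex soundex = new Soundex();
        Metaphone metaphone = new Metaphone();

        // Under standard American Soundex rules both spellings should
        // collapse to the same key (S536), so a lookup table keyed on
        // the encoding treats them as one name.
        System.out.println(soundex.encode("Schneider") + " " + soundex.encode("Shnaider"));

        // Metaphone keys for the same pair, printed for comparison.
        System.out.println(metaphone.encode("Schneider") + " " + metaphone.encode("Shnaider"));
    }
}
```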
I was looking for a lightweight library that would let me feed it a bunch of words and then ask it whether a given word has any close matches.
I'm not particularly concerned with the underlying algorithm (I reckon a simple Hamming distance algorithm would probably suffice, were I to undertake the task myself).
I'm in the middle of developing a small language, and I found it nifty to make suggestions to the user when an "Undefined class" error is detected (a lot of the time it's just a misspelled word). I don't want to lose much time on the issue, though.
Thanks
Levenshtein distance is a common way of handling it. Just add all the words to a list, then brute-force iterate over it and return the word with the smallest distance. Here's one library with a Levenshtein function: http://commons.apache.org/lang/api-2.4/org/apache/commons/lang/StringUtils.html
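A minimal brute-force sketch along those lines, using getLevenshteinDistance from the StringUtils class linked above (the closest helper is just for illustration):

```java
import java.util.List;
import org.apache.commons.lang.StringUtils;

public class ClosestWord {

    // Returns the dictionary word with the smallest Levenshtein
    // distance to the input, scanning the whole list.
    static String closest(String input, List<String> dictionary) {
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (String word : dictionary) {
            int d = StringUtils.getLevenshteinDistance(input, word);
            if (d < bestDist) {
                bestDist = d;
                best = word;
            }
        }
        return best;
    }
}
```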
If you have a large number of words and you want it to run fast, then you'd want to use n-grams. Split each word into bigrams and add each (bigram, word) pair to a map. Use the map to look up the bigrams of the target word, and then iterate only through those candidates. That's probably more work than you want to do, though.
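A rough sketch of the bigram-index idea, assuming exact 2-character grams and a plain HashMap (class and method names are made up for illustration). Note that words shorter than two characters produce no bigrams and would need special handling:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BigramIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    // Split a word into overlapping 2-character substrings.
    static List<String> bigrams(String word) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= word.length(); i++) {
            grams.add(word.substring(i, i + 2));
        }
        return grams;
    }

    // Register a dictionary word under each of its bigrams.
    void add(String word) {
        for (String g : bigrams(word)) {
            index.computeIfAbsent(g, k -> new HashSet<>()).add(word);
        }
    }

    // Words sharing at least one bigram with the target: a much
    // smaller set to run the distance function over.
    Set<String> candidates(String target) {
        Set<String> result = new HashSet<>();
        for (String g : bigrams(target)) {
            result.addAll(index.getOrDefault(g, Collections.emptySet()));
        }
        return result;
    }
}
```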
Not necessarily a library, but I think this article may be really helpful. It mostly describes the general workings of a spelling corrector in Python, but it also links to a Java implementation, which you may use if that is what you are looking for specifically (note that I haven't used the Java one myself).
I want to implement an OCR system. I need my program not to make any mistakes on the letters it does choose to recognize. It doesn't matter if it cannot recognize a lot of them (i.e., high precision even with low recall is okay).
Can someone help me choose a suitable ML algorithm for this? I've been looking around and have found some confusing things. For example, I found contradicting statements about SVMs: the scikit-learn docs mention that we cannot get probability estimates for an SVM, whereas I found another post saying it is possible to do this in WEKA.
Anyway, I am looking for a machine learning algorithm that best suits this purpose. It would be great if you could suggest a library for the algorithm as well. I prefer Python-based solutions, but I am OK with Java as well.
It is possible to get probability estimates from SVMs in scikit-learn by simply setting probability=True when constructing the SVC object. The docs only warn that the probability estimates might not be very good.
The quintessential probabilistic classifier is logistic regression, so you might give that a try. Note that LR is a linear model though, unlike SVMs which can learn complicated non-linear decision boundaries by using kernels.
I've seen people using neural networks with good results, but that was already a few years ago. I asked an expert colleague and he said that nowadays people use things like nearest-neighbor classifiers.
I don't know scikit or WEKA, but any half-decent classification package should have at least k-nearest neighbors implemented. Or you can implement it yourself; it's ridiculously easy. Give that one a try: it will probably have lower precision than you want. However, you can make a slight modification where, instead of taking a simple majority vote (i.e., the most frequent class among the neighbors wins), you require a larger consensus among the neighbors to assign a class (for example, at least 50% of the neighbors must be of the same class). The larger the consensus you require, the higher your precision will be, at the expense of recall.
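A sketch of that modification in Java, assuming plain Euclidean distance and a consensus fraction between 0 and 1. classify returns null when the neighbors don't agree strongly enough, which is exactly where the precision/recall trade-off comes from. All names here are hypothetical:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConsensusKnn {

    // Classify only when at least `consensus` (e.g. 0.8) of the k
    // nearest neighbors agree; return null (abstain) otherwise.
    static String classify(double[] query, List<double[]> points,
                           List<String> labels, int k, double consensus) {
        // Sort training points by distance to the query.
        Integer[] order = new Integer[points.size()];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> dist(points.get(i), query)));

        // Count labels among the k nearest.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < order.length; i++) {
            votes.merge(labels.get(order[i]), 1, Integer::sum);
        }

        // Require the winning label to clear the consensus threshold.
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() >= consensus * k) return e.getKey();
        }
        return null; // abstain: precision over recall
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```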
I was wondering if there is a way to input a function and output its integral or derivative (depending on the user's choice). My initial step was to put the function into an array somehow, but given the many types of functions, this wasn't happening.
The idea I have now is to keep the function as a string, calculate the real integral by hand, and then ask the user what the definite integral is, but that isn't a real solution.
I've researched everywhere, and I haven't found a method that does this out of the box, nor any code that actually does it, period.
Also, I want to see if there is a way to make a GUI and plot the inputted functions on it, if that's possible too.
Thanks :)
What you're describing is known as symbolic integration. There's currently no fully general way to implement it, but there are some techniques available. One such is the Risch algorithm.
Alternatively, an easier problem than symbolic integration is symbolic differentiation -- and, if the derivative of the user's input is equivalent* to the expression which they were asked to integrate, then their integral is probably correct. (A toy differentiator along these lines is sketched after the footnotes.)
You may also want to consider using an existing CAS**, such as Mathematica, to implement this. They've already implemented most of the tools you're after.
*: Keep in mind, though, that two mathematical expressions may be equivalent without being identical, either in trivial ways (e.g., terms in a different order), more complex ones (e.g., large expressions factored differently), or fundamental ones (e.g., trig functions replaced with complex exponentials or vice versa).
**: Computer algebra system
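To make the differentiation route concrete, here is a toy differentiator over a hand-built expression tree in Java, with invented node classes covering constants, the variable x, sums, and products. Parsing the user's string into such a tree and simplifying the result are the real work, and are not shown:

```java
abstract class Expr {
    abstract Expr diff();                       // d/dx
    public abstract String toString();
}

class Const extends Expr {
    final double v;
    Const(double v) { this.v = v; }
    Expr diff() { return new Const(0); }        // c' = 0
    public String toString() { return String.valueOf(v); }
}

class Var extends Expr {                        // the variable x
    Expr diff() { return new Const(1); }        // x' = 1
    public String toString() { return "x"; }
}

class Sum extends Expr {
    final Expr a, b;
    Sum(Expr a, Expr b) { this.a = a; this.b = b; }
    Expr diff() { return new Sum(a.diff(), b.diff()); }   // (f+g)' = f' + g'
    public String toString() { return "(" + a + " + " + b + ")"; }
}

class Prod extends Expr {
    final Expr a, b;
    Prod(Expr a, Expr b) { this.a = a; this.b = b; }
    // Product rule: (fg)' = f'g + fg'
    Expr diff() { return new Sum(new Prod(a.diff(), b), new Prod(a, b.diff())); }
    public String toString() { return "(" + a + " * " + b + ")"; }
}

public class DiffDemo {
    public static void main(String[] args) {
        Expr f = new Prod(new Var(), new Var());     // x * x
        System.out.println(f + " -> " + f.diff());   // ((1.0 * x) + (x * 1.0))
    }
}
```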
Javacalculus is what you are looking for.
Good luck!
I have an application in Java/JSF and I need to do some optimization calculations, like the Excel Solver add-in does. One option is certainly to write my own solver implementation, but I'm kind of short on time, so I'm looking into existing libraries that can help me with this.
Can you recommend any libraries?
EDITED
I don't have the algorithm yet, but I know that I will have to do similar things as in Excel Solver: defining parameters, the goal, and the restrictions, and calculating the MAX/MIN revenue.
Not a complete solution, but this may get you on the right track (what you are looking for is a non-linear parametric optimizer/solver):
http://jfuzzylogic.sourceforge.net/html/index.html
I did some Googling, and I was surprised that I wasn't able to find something right away...
Here is info about Excel's specific algorithm: http://support.microsoft.com/kb/82890 (again, not a solution, but certainly interesting information for anyone who does this sort of thing).
And here's the company that actually wrote the Excel solver: http://www.solver.com/sdkplatform2.htm
Not sure what your budget is, but if time is of the essence, it may make sense to license it (not sure whether they have a Java version of their SDK or not).
And a related question at SO: Solving nonlinear equations numerically
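If your model turns out to be linear, one open-source option worth checking (my suggestion, not from the links above) is the simplex solver in Apache Commons Math. A sketch of the Excel-Solver-style setup you described -- parameters, a goal, and restrictions -- with made-up coefficients:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;

import org.apache.commons.math3.optim.MaxIter;
import org.apache.commons.math3.optim.PointValuePair;
import org.apache.commons.math3.optim.linear.LinearConstraint;
import org.apache.commons.math3.optim.linear.LinearConstraintSet;
import org.apache.commons.math3.optim.linear.LinearObjectiveFunction;
import org.apache.commons.math3.optim.linear.NonNegativeConstraint;
import org.apache.commons.math3.optim.linear.Relationship;
import org.apache.commons.math3.optim.linear.SimplexSolver;
import org.apache.commons.math3.optim.nonlinear.scalar.GoalType;

public class RevenueDemo {
    public static void main(String[] args) {
        // Goal: maximize revenue 3*x0 + 5*x1 (coefficients are made up).
        LinearObjectiveFunction revenue =
                new LinearObjectiveFunction(new double[] { 3, 5 }, 0);

        // Restrictions: x0 <= 10, x1 <= 6, x0 + x1 <= 12.
        Collection<LinearConstraint> restrictions = new ArrayList<>();
        restrictions.add(new LinearConstraint(new double[] { 1, 0 }, Relationship.LEQ, 10));
        restrictions.add(new LinearConstraint(new double[] { 0, 1 }, Relationship.LEQ, 6));
        restrictions.add(new LinearConstraint(new double[] { 1, 1 }, Relationship.LEQ, 12));

        PointValuePair max = new SimplexSolver().optimize(
                new MaxIter(100), revenue, new LinearConstraintSet(restrictions),
                GoalType.MAXIMIZE, new NonNegativeConstraint(true));

        System.out.println("x = " + Arrays.toString(max.getPoint())
                + ", revenue = " + max.getValue());
    }
}
```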
Given the source code of a program, how do I analyze it and count the function points within it?
Thanks!
You might find this tutorial on FPA of interest. Personally, I don't put much stock in this estimation method. From my perspective, it attempts to provide a precise estimate for things that have been shown repeatedly not to be precisely measurable. I much prefer planning poker or something similar that tries to group things within a similar order of magnitude and provide an estimate based on your previous estimates for similarly sized stories.
If you're doing this for a class, simply follow the rules given in the text book and crank out the answer. If you're really intending to try this as a software development estimation method, my advice is to simplify the process rather than make it more complex. I would imagine that members of the International Function Point User Group (yes, there is one), will disagree.
With a code analysis tool. If you want to write one yourself, you might want to start with cglib or ASM.
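For a feel of what ASM gives you, here is a tiny visitor that counts the methods declared in a class. Counting methods is nowhere near a function-point count, but the same visitor machinery is what you would build the real analysis on (assumes a recent ASM version, for Opcodes.ASM9):

```java
import java.io.IOException;
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class MethodCounter extends ClassVisitor {
    private int methods;

    MethodCounter() { super(Opcodes.ASM9); }

    @Override
    public MethodVisitor visitMethod(int access, String name, String desc,
                                     String signature, String[] exceptions) {
        methods++;      // one hit per declared method
        return null;    // we don't need to visit the bodies
    }

    public static void main(String[] args) throws IOException {
        MethodCounter counter = new MethodCounter();
        // Reads java.lang.String's class file from the classpath.
        new ClassReader("java.lang.String").accept(counter, 0);
        System.out.println("methods: " + counter.methods);
    }
}
```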