Java: how to validate natural language text

I'm using OCR to recognize (German) text in an image. It works well but not perfectly. Sometimes a word gets messed up. Therefore, I want to implement some sort of validation. Of course, I can just use a word list and find words that are similar to the messed-up word, but is there a way to check if the sentence is plausible with these words?
After all, my smartphone can give me good suggestions on how to complete a sentence.

You need to look for Natural Language Processing (NLP) solutions. With them, you can validate the text syntactically and lexically, either as a whole (which may be better, since some tools take the context into consideration) or phrase by phrase.
I am not an expert in the area, but this article can help you choose a tool to start with.
Also, please note: the keyboard on your cellphone is developed and maintained by specialized teams, whether at Apple, Google, or whichever other company's app you use. So please don't underestimate this task: there are dozens of research areas involved, bringing together both software engineers and linguistics specialists to achieve proper results.
Edit: well, two days later, I've just come across this link: https://medium.com/quick-code/12-best-natural-language-processing-courses-2019-updated-2a6c28aebd48
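For German specifically, one open-source option worth trying is LanguageTool, which ships a Java API plus German grammar and spelling rules. A minimal sketch, assuming the languagetool-core and German language-module jars are on the classpath (the sample text is illustrative):

import java.util.List;

import org.languagetool.JLanguageTool;
import org.languagetool.language.GermanyGerman;
import org.languagetool.rules.RuleMatch;

public class OcrValidation {
    public static void main(String[] args) throws Exception {
        // Load the German rule set; this also enables spell checking.
        JLanguageTool langTool = new JLanguageTool(new GermanyGerman());

        // Text coming out of the OCR step (illustrative example).
        String ocrText = "Das ist ein Beispieltext mit einem Fehier.";

        // Each RuleMatch points at a suspicious span and offers replacements.
        List<RuleMatch> matches = langTool.check(ocrText);
        for (RuleMatch match : matches) {
            System.out.println("Possible problem at " + match.getFromPos() + "-" + match.getToPos()
                    + ": " + match.getMessage());
            System.out.println("Suggestions: " + match.getSuggestedReplacements());
        }
    }
}

The matches won't catch every implausible word, but combined with your own word list they give a cheap per-sentence plausibility signal.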

Related

Parsing Street Address Using RegEx

I know there are many questions asked on this topic. I am trying to parse and fetch street addresses from an HTML page. The format of these pages does not follow any pattern. Can someone help me in coming up with a regex that would match a street address, irrespective of the number of tags between them? Are there any other ways to do this other than using regular expressions?
Before you get all traditional, let me share my experience. I've parsed over 1 million web pages this way in Java. When I need small pieces out of a page, regex is perfect when paired with a replace to strip tags; in fact it is efficient and fast, especially when using Java's replaceAll() to do the stripping. Build a fork-join pool for both steps and test some parsing, you won't believe your eyes. I've added that part at the end. What follows is not the full regex but a starting point, since it will take some trial and error to build. I believe the premise was: a bunch of pages with no clear route to the address.
So, yes, there are ways. What follows is a bit of an introduction to thinking about this in regex.
Words and groups of words are always in a pattern, otherwise they aren't readable. Still, there are several things to note. Addresses can vary greatly, so it is important to keep building out your regex. Next, if you have access to a CAS engine, use it on anything you extract; it standardizes your addresses.
One thing you should definitely try is XML parsing; it will narrow everything down and can help get rid of tags before you format. You need to narrow the input as much as possible. If you are using Java or Python, run this step in a ForkJoinPool or a multiprocessing pool.
Your process should be:
Narrow if possible
Execute a regex that exploits formatting
Lastly, here is a regex cheat sheet.
Keep in mind that I don't know what websites you are using or their formats. I have personally had to pull this data with different per-site regexes, but that was for odd formats and other issues present with websites that run like databases of a certain variety.
That said, an address has a format of numbers, then a street name and apartment number that can be pretty much anything, then city, state, then ZIP code. Basically it is \d+ followed by any combination of letters and numbers.
So (in Java, with double backslashes) to start you off:
[\\d]+[A-Za-z0-9\\s,\\.]+
If you want to anchor on, but exclude, surrounding tags to narrow your search (if not using XML), use:
(?<=start)[\\d]+[A-Za-z0-9\\s,\\.]+?(?=end)
HTML pages always seem to have tags, so that would be something like:
(?<=>)[\\d]+[A-Za-z0-9\\s,\\.]+?(?=<)
You may be able to use the ZIP code as your ending anchor; the trailing digits-and-hyphens pattern also covers a multi-part ZIP code:
[\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+
As a final note, you can chain together regexes with a pipe delimiter, e.g.:
(?<=start)[\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+|(?<=start)[A-Za-z0-9\\s,\\.]+?(?=end)
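Putting one of these patterns to work in Java looks something like the following rough sketch (the HTML snippet is illustrative, and the pattern simply combines the tag anchors and ZIP-code ending from above; expect to tune it per site):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressScrape {
    public static void main(String[] args) {
        // Illustrative page fragment; real pages will vary.
        String html = "<div>Visit us at <b>123 Main St, Springfield, IL 62704</b> today.</div>";

        // Start after a '>' and stop before a '<', as suggested above, ending on the ZIP code.
        Pattern address = Pattern.compile("(?<=>)[\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+(?=<)");

        Matcher m = address.matcher(html);
        while (m.find()) {
            System.out.println("Candidate address: " + m.group());
        }
    }
}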
If this is not narrow enough, there are several additional steps:
Compare your results (average word length, etc.) and throw out any great outliers
Write a per-site formatter script to do cleanup, using single or multi-threading to replace what you don't need.
You will probably need to strip out HTML as well. Run this regex in a replace statement to do that.
<.*?>
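In Java that replace step is a one-liner (a quick sketch; replacing with a space so adjacent text doesn't run together):

String plainText = html.replaceAll("<.*?>", " ");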
If you have trouble, use an online regex tester (a website, not my own tool) to build your regex.
Having worked on this problem quite extensively at SmartyStreets, I will tell you "NO" to parsing/finding street addresses with a regex.
Addresses are not a regular language and cannot be matched by a regular expression.
To solve the problem, we developed an API which actually finds and extracts addresses, with notably high accuracy. It's free for low-volume use. (It was not an easy problem to solve.) You can try it for free on the homepage demo. And no, this is not a solicitation. If you want to learn more about street addresses in any amount of detail from very basic to very technical, just email us because we want to educate the community about addresses.
To extract addresses, there are regular expressions under the hood, but results are biased strongly toward those which actually verify, meaning which actually exist. In other words, this is a parser performing complex operations to find and match addresses.
This answer to a very similar question is related, and you may find it useful. The other answers highlight some important points about the difficulties and solutions for parsing street addresses...

Lightweight library capable of suggesting different spellings of words from a bounded set?

I was looking for a lightweight library that'd allow me to feed it a bunch of words, and then ask it whether a given word would have any close matches.
I'm not particularly concerned with the underlying algorithm (I reckon a simple Hamming distance algorithm would probably suffice, were I to undertake the task myself).
I'm in the middle of developing a small language, and I found it nifty to make suggestions to the user when an "Undefined class" error is detected (lots of times it's just a misspelled word). I don't want to lose much time on the issue, though.
Thanks
Levenshtein distance is a common way of handling it. Just add all the words to a list, then brute-force iterate over it and return the smallest distance. Here's one library with a Levenshtein function: http://commons.apache.org/lang/api-2.4/org/apache/commons/lang/StringUtils.html
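A minimal sketch of that brute-force approach, using the Commons Lang method linked above (word list and example input are illustrative):

import java.util.Arrays;
import java.util.List;

import org.apache.commons.lang.StringUtils;

public class ClosestWord {
    // Returns the known word with the smallest Levenshtein distance to the input.
    static String closest(String input, List<String> knownWords) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : knownWords) {
            int d = StringUtils.getLevenshteinDistance(input, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> classNames = Arrays.asList("Integer", "String", "Boolean");
        System.out.println(closest("Strng", classNames)); // prints "String"
    }
}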
If you have a large number of words and you want it to run fast, then you'd have to use n-grams. Split each word into bigrams and then add (bigram, word) to a map. Use the map to look up the bigrams in the target word, and then iterate through the candidates. That's probably more work than you want to do, though.
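If you do go the bigram route, a rough sketch of that index (names are illustrative):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BigramIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    // Map every bigram of a known word back to that word.
    public void add(String word) {
        for (String bigram : bigrams(word)) {
            index.computeIfAbsent(bigram, k -> new HashSet<>()).add(word);
        }
    }

    // Candidates are the stored words sharing at least one bigram with the target.
    public Set<String> candidates(String target) {
        Set<String> result = new HashSet<>();
        for (String bigram : bigrams(target)) {
            result.addAll(index.getOrDefault(bigram, Collections.emptySet()));
        }
        return result;
    }

    private static List<String> bigrams(String word) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= word.length(); i++) {
            grams.add(word.substring(i, i + 2));
        }
        return grams;
    }
}

You would then run the edit-distance check only over candidates(target) instead of the whole word list.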
Not necessarily a library, but I think this article may be really helpful. It mostly describes the general workings of a spelling corrector in Python, but it also links to a Java implementation, which you may use if that is what you are looking for specifically (note that I haven't used the Java one myself).

converting audio file into text file using java

I am developing a desktop application using Java. The application teaches English to school kids: a user can upload some English audio (in any format), which needs to be converted into a text file so they can read the text.
I've found some APIs, but I am not sure about them.
http://cmusphinx.sourceforge.net/wiki/
I've seen many questions on Stack Overflow regarding this, but none was helpful. If someone can help with this, I will be very grateful.
Thank you.
There are many technologies and services available to perform speech recognition. For an intro to some of the choices see https://stackoverflow.com/a/6351055/90236.
I'm not sure that the results will be acceptable for teaching children English as a second language, but it is worth trying.
What you seek is currently bleeding-edge technology. Tools like CMU Sphinx can detect words from a dedicated, limited dictionary (so you can teach it to understand, say, 15 words and that's it; you can't teach it to understand English).
Basically, those tools try to find patterns in the sound waves that you feed them. They don't understand anything; they just use the same algorithm on everything and then try to find the closest match. This works well for small sets of words, but as the number of words increases, the difference between them shrinks and the job gets ever harder (without even getting started on words like "whether" and "weather" or "C" and "see").
What you might consider is "repeat after me" software. Here, you need to record all words for the test as templates. Then you can record the words from the pupils and then compute the difference. If the difference is not too large, the word is correct. But again: This is simple repetition to improve pronunciation - not English.
There is desktop software which can understand a lot of English (for example the products from Nuance, Dragon Naturally Speaking being one of the most prominent). They do offer server solutions but that software isn't free or cheap if you're on a tight budget.
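If you still want to try the CMU Sphinx route from the question, the Sphinx4 API keeps a basic transcription attempt short. A rough sketch, assuming the sphinx4-core and sphinx4-data artifacts are on the classpath and the audio has already been converted to the 16 kHz, 16-bit mono PCM that Sphinx expects (the file name is illustrative):

import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class AudioToText {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Bundled US-English models from the sphinx4-data artifact.
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        try (InputStream audio = new FileInputStream("lesson.wav")) {
            recognizer.startRecognition(audio);
            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                System.out.println(result.getHypothesis());
            }
            recognizer.stopRecognition();
        }
    }
}

Expect the raw hypothesis to need a lot of cleanup; accuracy on unconstrained speech is exactly the limitation discussed above.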

Spelling correction for data normalization in Java

I am looking for a Java library to do some initial spell checking / data normalization on user generated text content, imagine the interests entered in a Facebook profile.
This text will be tokenized at some point (before or after spell correction, whatever works better) and some of it used as keys to search for (exact match). It would be nice to cut down misspellings and the like to produce more matches. It would be even better if the correction would perform well on tokens longer than just one word, e.g. "trinking coffee" would become "drinking coffee" and not "thinking coffee".
I found the following Java libraries for doing spelling correction:
JAZZY does not seem to be under active development. Also, the dictionary-distance based approach seems inadequate because of the use of non-standard language in social network profiles and multi-word tokens.
APACHE LUCENE seems to have a statistical spell checker that should be much more suited. The question here would be how to create a good dictionary? (We are not using Lucene otherwise, so there is no existing index.)
Any suggestions are welcome!
What you want to implement is not a spelling corrector but a fuzzy search. Peter Norvig's essay is a good starting point to build a fuzzy search from candidates checked against a dictionary.
Alternatively have a look at BK-Trees.
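For reference, a BK-tree fits in a few dozen lines. A rough sketch using Levenshtein distance as the metric, reusing the Commons Lang implementation (an assumption of this sketch, not something the answer prescribes):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;

public class BkTree {
    private static class Node {
        final String word;
        final Map<Integer, Node> children = new HashMap<>();
        Node(String word) { this.word = word; }
    }

    private Node root;

    public void add(String word) {
        if (root == null) { root = new Node(word); return; }
        Node node = root;
        while (true) {
            int d = StringUtils.getLevenshteinDistance(word, node.word);
            if (d == 0) return; // already present
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(word)); return; }
            node = child;
        }
    }

    // All stored words within maxDistance of the query.
    public List<String> search(String query, int maxDistance) {
        List<String> results = new ArrayList<>();
        search(root, query, maxDistance, results);
        return results;
    }

    private void search(Node node, String query, int maxDistance, List<String> results) {
        if (node == null) return;
        int d = StringUtils.getLevenshteinDistance(query, node.word);
        if (d <= maxDistance) results.add(node.word);
        // Triangle inequality: only children whose edge weight is within maxDistance of d can match.
        for (int i = Math.max(1, d - maxDistance); i <= d + maxDistance; i++) {
            search(node.children.get(i), query, maxDistance, results);
        }
    }
}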
An n-gram index (used by Lucene) produces better results for longer words. The approach of producing candidates up to a given edit distance will probably work well enough for words found in normal text, but not well enough for names, addresses and scientific texts. It will increase your index size, though.
If you have the texts indexed you have your text corpus (your dictionary). Only what is in your data can be found anyway. You need not use an external dictionary.
A good resource is Introduction to Information Retrieval - Dictionaries and tolerant retrieval. There is a short description of context-sensitive spelling correction.
With regards to populating a Lucene index as the basis of a spell checker, this is a good way to solve the problem. Lucene has an out-of-the-box SpellChecker you can use.
There are plenty of word dictionaries available on the net that you can download and use as the basis for your Lucene index. I would suggest supplementing these with a number of domain-specific texts as well, e.g. if your users are medics then maybe supplement the dictionary with source texts from medical theses and publications.
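A minimal sketch of that out-of-the-box SpellChecker (the exact indexDictionary signature varies between Lucene versions; this follows the older 3.x single-argument form, and the file names are illustrative):

import java.io.File;

import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SpellIndex {
    public static void main(String[] args) throws Exception {
        // Directory that will hold the spell-check index.
        Directory dir = FSDirectory.open(new File("spellIndex"));
        SpellChecker spellChecker = new SpellChecker(dir);

        // One word per line; combine a general word list with domain-specific terms.
        spellChecker.indexDictionary(new PlainTextDictionary(new File("dictionary.txt")));

        // Top 5 suggestions for a misspelled token.
        String[] suggestions = spellChecker.suggestSimilar("trinking", 5);
        for (String s : suggestions) {
            System.out.println(s);
        }
        spellChecker.close();
    }
}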
Try Peter Norvig's spell checker.
You can hit the Gutenberg project or the Internet Archive for lots and lots of corpus text.
Also, I think that the Wiktionary could help you. You can even make a direct download.
http://code.google.com/p/google-api-spelling-java is a good Java spell checking library, but I agree with Thomas Jung, that may not be the answer to your problem.

Online (preferably) lookup API of a word's class

I have a list of words and I want to filter it down so that I only have the nouns from that list of words (Using Java). To do this I am looking for an easy way to query a database of words for their type.
My question is: does anybody know of a free, easy word-lookup API that would enable me to find the class of a word, not necessarily its semantic definition?
Thanks!
Ben.
EDIT: By class of the word I meant 'part-of-speech' thanks for clearing this up
Word type? Such as verb, noun, adjective, etc? If so, you might run into the issue that some words can be used in more than one way. For example: "Can you trade me that card?", "That was a bad trade."
See this thread for some suggestions.
Have a look at this as well, seems like it might do exactly what you're looking for.
I think what you are looking for is the part-of-speech (POS) of a word. In general that will not be possible to determine except in the context of a sentence. There are many words that can have several different potential parts of speech (e.g. 'bank' can be used as a verb or noun).
You could use a POS tagger to get the information you want. However, the following part-of-speech taggers assume that you are tagging words within a well-structured English sentence...
The OpenNLP Java libraries are generally very good and released under the LGPL. There is a part-of-speech tagger for English and a few other languages included in the distribution. Just go to the project page to get the jar (and don't forget to download the models too).
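A rough sketch of the OpenNLP route (the model file name follows the pre-trained English maxent model offered on the project's download page):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class PosFilter {
    public static void main(String[] args) throws Exception {
        // Pre-trained English POS model downloaded from the OpenNLP models page.
        try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
            POSTaggerME tagger = new POSTaggerME(new POSModel(modelIn));

            // Taggers expect tokens in sentence order; isolated words lose the context noted above.
            String[] tokens = { "Can", "you", "trade", "me", "that", "card", "?" };
            String[] tags = tagger.tag(tokens);

            for (int i = 0; i < tokens.length; i++) {
                // Penn Treebank tags: NN* = noun, VB* = verb, JJ = adjective, ...
                System.out.println(tokens[i] + " -> " + tags[i]);
            }
        }
    }
}

Keeping only tokens whose tag starts with "NN" would leave you with the nouns.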
There is also the Stanford part-of-speech tagger, written in Java under the GPL. I haven't had any direct experience with this library, but the Stanford NLP lab is generally pretty awesome.
Querying a database of words is going to lead to the problem that Ben S. mentions, e.g. is it lead (v. to show the way) or lead (n. Pb). If you want to spend some time on the problem, look at Part of Speech tagging. There's some good info in another SO thread.
For English, you could use WordNet with one of the available Java APIs to find the lexical category of a word (which in NLP is most commonly called the part of speech). Using a dedicated POS tagger would be another option.
