Preserve punctuation characters when using Lucene's StandardTokenizer - java

I'm thinking of leveraging Lucene's StandardTokenizer for word tokenization in a non-IR context.
I understand that this tokenizer removes punctuation characters. Would anybody know (or happen to have experience with) making it output punctuation characters as separate tokens?
Example of current behaviour:
Welcome, Dr. Chasuble! => Welcome Dr. Chasuble
Example of desired behaviour:
Welcome, Dr. Chasuble! => Welcome , Dr. Chasuble !

Generally, for custom tokenization of both IR and non-IR content it is a good idea to use ICU (ICU4J is the Java version).
This would be a good place to start:
http://userguide.icu-project.org/boundaryanalysis
The tricky part is preserving the period as part of "Dr.". You would have to use the dictionary-based iterator, or implement your own heuristic, either in your code or by creating your own iterator, which in ICU can be defined as a file with a number of regexp-style rules.
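As a rough illustration, here is a minimal sketch using ICU4J's word BreakIterator. It keeps punctuation as separate tokens, but with the default rules "Dr." still comes out as "Dr" + "." unless you add your own rules or post-processing, as noted above:

```java
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;
import java.util.ArrayList;
import java.util.List;

public class IcuWordSplitter {
    // Splits text on ICU word boundaries, keeping punctuation as separate tokens.
    public static List<String> tokenize(String text) {
        BreakIterator it = BreakIterator.getWordInstance(ULocale.ENGLISH);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            if (!token.trim().isEmpty()) {   // drop pure-whitespace segments
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Welcome, Dr. Chasuble!"));
        // roughly: [Welcome, ,, Dr, ., Chasuble, !]
    }
}
```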

You might consider using a tokenization tool from the NLP community instead; such issues are usually well taken care of there.
Some off-the-shelf tools are Stanford CoreNLP (it ships an individual tokenization component as well). UIUC's pipeline should also handle this elegantly.
http://cogcomp.cs.illinois.edu/page/software/
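For example, Stanford CoreNLP's PTB-style tokenizer keeps "Dr." intact and emits punctuation as separate tokens. A minimal sketch (jar and model setup assumed, class names from the CoreNLP distribution):

```java
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import java.io.StringReader;

public class CoreNlpTokenizeDemo {
    public static void main(String[] args) {
        // The third argument is an options string; "" uses the defaults.
        PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>(
                new StringReader("Welcome, Dr. Chasuble!"),
                new CoreLabelTokenFactory(), "");
        while (tokenizer.hasNext()) {
            System.out.println(tokenizer.next().word());
            // roughly: Welcome , Dr. Chasuble !
        }
    }
}
```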

Related

Is there an easy and standard way to customize Lucene snowball stemmer?

I'm using Lucene 7.x and ItalianStemmer. I have looked at the code of the ItalianStemmer class and it seems like it would take a long time to understand. So I'm looking for a quick (possibly standard) way to customize the Italian stemmer, without extending ItalianStemmer or SnowballProgram, because I only have a few days.
The point is that I don't understand why the noun "saluto" (greeting) is stemmed to "sal". It should be stemmed to "salut", since the verb "salutare" (to greet) is stemmed to "salut". Moreover, "sala" (room) and "sale" (rooms) are also stemmed to "sal", which is confusing, because they have a different meaning.
The standard way would be to copy the source, and create your own.
Stemming is a heuristic process, based on rules. It is designed to generate stems that, while imperfect, are usually good enough to facilitate search. It doesn't have a dictionary of conjugated words and their stems for you to modify. -uto is one of the verb suffixes removed from words by the Italian snowball stemmer, as described here. You could create your own version removing that suffix from the list, but you are probably going to create more problems than you solve, all told.
A tool that returns the correct root word would generally be called a lemmatizer, and I don't believe any come with Lucene out of the box. The morphological analysis tends to be slower and more complex. If it's important to your use case, you might want to look for an Italian lemmatizer and work it into a custom filter, or preprocess your text before passing it off to the analyzer.
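If you do go the copy-the-source route, here is a sketch of where your copy would plug in. It simply wires the stock Italian Snowball stemmer into a Lucene 7.x analyzer so you can inspect its output; swapping in a hypothetical MyItalianStemmer (your edited copy of the generated Snowball class) is a one-line change:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.tartarus.snowball.ext.ItalianStemmer;
import java.io.IOException;

public class ItalianStemDemo {
    // Stock Italian Snowball stemmer wired into an analyzer. To customize,
    // replace ItalianStemmer with your own copy of the generated class.
    // (The real ItalianAnalyzer also lowercases and strips elisions first.)
    static final Analyzer ANALYZER = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream result = new SnowballFilter(source, new ItalianStemmer());
            return new TokenStreamComponents(source, result);
        }
    };

    public static void main(String[] args) throws IOException {
        try (TokenStream ts = ANALYZER.tokenStream("f", "saluto salutare sala sale")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term);   // stems as described in the question
            }
            ts.end();
        }
    }
}
```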

Text processing to identify parts of speech

I have to write a program (in Java) to identify several parts of speech like nouns, adjectives, verbs etc. The program should also identify numbers in numeric form (e.g. 10) and numbers written in plain English (ten, hundred, etc.), and much more. I'm not sure what the way forward is. Is there any library available that can help? Can this be done with regex alone? Or do I need to learn NLP?
Please suggest a way forward.
(1) OpenNLP
(2) LingPipe
(3) Stanford NLP
All three of the above (Java based) will help you identify parts of speech out of the box.
For numbers, use regular expressions.
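As a rough sketch of the first option, here is OpenNLP's pre-trained English POS tagger in use (the model file name and path are assumptions; download the model separately from the OpenNLP site):

```java
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.SimpleTokenizer;
import java.io.FileInputStream;
import java.io.InputStream;

public class PosTagDemo {
    public static void main(String[] args) throws Exception {
        // "en-pos-maxent.bin" is the pre-trained English model; adjust the path.
        try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
            POSModel model = new POSModel(modelIn);
            POSTaggerME tagger = new POSTaggerME(model);

            String[] tokens = SimpleTokenizer.INSTANCE.tokenize("Ten books cost ten dollars.");
            String[] tags = tagger.tag(tokens);
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "/" + tags[i]);   // e.g. books/NNS
            }
        }
    }
}
```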
Part-of-speech (POS) tagging is a pretty standard NLP task. You could in theory write regular expressions that would POS-tag very simple sentences, but you're unlikely to achieve reasonable coverage or accuracy with a regex model. You can do pretty well training a reasonably simple HMM model or a discriminative tagger on a hand-tagged training set.
But to tag a specific corpus, you don't necessarily need to learn all the details of POS tagging and roll your own - learning to use an existing library will probably suffice (e.g. NLTK or the Stanford NLP libraries).
Converting textual number representations to their Arabic-numeral form (or vice versa) falls under the label of 'text normalization'. Regular expressions (or other finite-state transformations) might be more useful there, although again, you might want to look for an existing solution that meets your needs before you start from scratch.
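A toy illustration of the regex approach to number normalization (the word list and mapping here are deliberately minimal; a real normalizer has to handle compounds like "twenty-three" and "two hundred and five", ordinals, and so on):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberWords {
    // Deliberately tiny mapping of English number words to digits.
    private static final Map<String, String> SIMPLE = new HashMap<>();
    static {
        SIMPLE.put("zero", "0");
        SIMPLE.put("one", "1");
        SIMPLE.put("two", "2");
        SIMPLE.put("three", "3");
        SIMPLE.put("ten", "10");
        SIMPLE.put("hundred", "100");
        SIMPLE.put("thousand", "1000");
    }

    private static final Pattern WORD = Pattern.compile(
            "\\b(zero|one|two|three|ten|hundred|thousand)\\b", Pattern.CASE_INSENSITIVE);

    public static String normalize(String text) {
        Matcher m = WORD.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, SIMPLE.get(m.group(1).toLowerCase()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("I bought ten books for three dollars"));
        // -> I bought 10 books for 3 dollars
    }
}
```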

full text search with spelling changes/mistakes

We have many objects, and each object comes with a description of around 100-200 words (for example a book's author name and a short summary).
The user gives input as a series of words. How can I implement search with approximate text and minor spelling changes? For example "Joshua Bloch", "Joshua blosh" and "joshua block" should lead to the same result.
If you are using Lucene for your full-text search, there is a "Did you mean" extension that is probably what you want.
How can I implement search with approximate text and minor spelling changes? For example "Joshua Bloch", "Joshua blosh" and "joshua block" should lead to the same result.
Does your database support Soundex? Soundex matches similar-sounding words, which seems to fit the example you gave above. Even if your database doesn't have native Soundex support, you can still write an implementation and save the Soundex code for each author name in a separate field. This can be used for matching later.
However, Soundex is not a replacement for full-text search; it will only help in specific cases like author names. If you are looking to find some specific text from, say, the book's blurb, then you are better off with a full-text search option (like PostgreSQL's).
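A quick sketch of the application-side route, using Apache Commons Codec's Soundex implementation (the library choice is an assumption; many databases also expose a SOUNDEX function directly):

```java
import org.apache.commons.codec.language.Soundex;

public class SoundexDemo {
    public static void main(String[] args) {
        Soundex soundex = new Soundex();
        // All three variants of the name encode to the same code (B420),
        // so storing the code in a separate column lets them match.
        System.out.println(soundex.encode("Bloch"));
        System.out.println(soundex.encode("blosh"));
        System.out.println(soundex.encode("block"));
    }
}
```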
If you are looking for actual implementation of this feature, here is a brilliant program written by Peter Norvig: http://norvig.com/spell-correct.html
It also has links to implementations in many other languages including Java, C etc.
You can use the spell checker JOrtho. From the text in your database you can generate a custom dictionary and set it. Then all words that are neither in the dictionary nor in your database are marked as misspelled.
Instead of raw Lucene, also check out Solr. Lucene is a library you can use to embed search functionality in your application, while Solr is a search server built on Lucene that you can plug into your application directly via its APIs. For most systems, Solr will save you from dealing with the complexity of Lucene.
Apache Lucene may fit the bill. It is a high-performance, full-text search engine library written entirely in Java.

Spelling correction for data normalization in Java

I am looking for a Java library to do some initial spell checking / data normalization on user generated text content, imagine the interests entered in a Facebook profile.
This text will be tokenized at some point (before or after spell correction, whatever works better) and some of it used as keys to search for (exact match). It would be nice to cut down misspellings and the like to produce more matches. It would be even better if the correction would perform well on tokens longer than just one word, e.g. "trinking coffee" would become "drinking coffee" and not "thinking coffee".
I found the following Java libraries for doing spelling correction:
Jazzy does not seem to be under active development. Also, the dictionary-distance based approach seems inadequate because of the use of non-standard language in social network profiles and multi-word tokens.
Apache Lucene seems to have a statistical spell checker that should be much better suited. The question here would be how to create a good dictionary. (We are not using Lucene otherwise, so there is no existing index.)
Any suggestions are welcome!
What you want to implement is not a spelling corrector but a fuzzy search. Peter Norvig's essay is a good starting point for building a fuzzy search from candidates checked against a dictionary.
Alternatively have a look at BK-Trees.
An n-gram index (used by Lucene) produces better results for longer words. The approach of producing candidates up to a given edit distance will probably work well enough for words found in normal text, but not well enough for names, addresses and scientific texts. It will increase your index size, though.
If you have the texts indexed, you already have your text corpus (your dictionary). Only what is in your data can be found anyway, so you need not use an external dictionary.
A good resource is Introduction to Information Retrieval - Dictionaries and tolerant retrieval. There is a short description of context-sensitive spelling correction.
With regards to populating a Lucene index as the basis of a spell checker, this is a good way to solve the problem. Lucene has an out-of-the-box SpellChecker you can use.
There are plenty of word dictionaries available on the net that you can download and use as the basis for your Lucene index. I would suggest supplementing these with a number of domain-specific texts as well, e.g. if your users are medics then maybe supplement the dictionary with source texts from medical theses and publications.
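A minimal sketch of that approach (the file paths and the plain-text word list "words.txt" are assumptions, and the exact constructors can vary slightly between Lucene versions):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class SpellIndexDemo {
    public static void main(String[] args) throws Exception {
        // "words.txt": one dictionary word per line, ideally supplemented
        // with your own domain-specific terms.
        Directory dir = FSDirectory.open(Paths.get("spell-index"));
        try (SpellChecker spell = new SpellChecker(dir)) {
            spell.indexDictionary(new PlainTextDictionary(Paths.get("words.txt")),
                    new IndexWriterConfig(new StandardAnalyzer()), true);
            String[] suggestions = spell.suggestSimilar("trinking", 5);
            for (String s : suggestions) {
                System.out.println(s);   // hopefully includes "drinking"
            }
        }
    }
}
```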
Try Peter Norvig's spell checker.
You can hit the Gutenberg project or the Internet Archive for lots and lots of corpus.
Also, I think that the Wiktionary could help you. You can even make a direct download.
http://code.google.com/p/google-api-spelling-java is a good Java spell checking library, but I agree with Thomas Jung, that may not be the answer to your problem.

Online (preferably) lookup API of a word's class

I have a list of words and I want to filter it down so that I only have the nouns from that list of words (Using Java). To do this I am looking for an easy way to query a database of words for their type.
My question is: does anybody know of a free, easy word-lookup API that would enable me to find the class of a word, not necessarily its semantic definition?
Thanks!
Ben.
EDIT: By class of the word I meant 'part-of-speech' thanks for clearing this up
Word type? Such as verb, noun, adjective, etc? If so, you might run into the issue that some words can be used in more than one way. For example: "Can you trade me that card?", "That was a bad trade."
See this thread for some suggestions.
Have a look at this as well, seems like it might do exactly what you're looking for.
I think what you are looking for is the part of speech (POS) of a word. In general that will not be possible to determine except in the context of a sentence. There are many words that can have several different parts of speech (e.g. 'bank' can be used as a verb or a noun).
You could use a POS tagger to get the information you want. However, the following part-of-speech taggers assume that you are tagging words within a well-structured English sentence...
The OpenNLP Java libraries are generally very good and released under the LGPL. There is a part-of-speech tagger for English and a few other languages included in the distribution. Just go to the project page to get the jar (and don't forget to download the models too).
There is also the Stanford part-of-speech tagger, written in Java under the GPL. I haven't had any direct experience with this library, but the Stanford NLP lab is generally pretty awesome.
Querying a database of words is going to lead to the problem that Ben S. mentions, e.g. is it lead (v., to show the way) or lead (n., Pb)? If you want to spend some time on the problem, look at part-of-speech tagging. There's some good info in another SO thread.
For English, you could use WordNet with one of the available Java APIs to find the lexical category of a word (which in NLP is most commonly called the part of speech). Using a dedicated POS tagger would be another option.
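As an illustration of the WordNet route, here is a sketch using the MIT JWI library (one of several Java WordNet APIs; the dictionary path is an assumption and WordNet must be installed locally):

```java
import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.POS;
import java.io.File;
import java.io.IOException;

public class WordNetPosLookup {
    public static void main(String[] args) throws IOException {
        // Path to the "dict" directory of a local WordNet installation (assumed).
        IDictionary dict = new Dictionary(new File("/usr/share/wordnet/dict"));
        dict.open();

        String word = "trade";
        // A word may be listed under several lexical categories, e.g. noun and verb.
        for (POS pos : POS.values()) {
            if (dict.getIndexWord(word, pos) != null) {
                System.out.println(word + " can be a " + pos);
            }
        }
        dict.close();
    }
}
```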
