Match words like recruit, recruiter and recruitment in Java

I want to write code to match certain words. I don't care about the form of the word: it could be a noun, and by adding -ing to it, it can become a verb. E.g. add = adding, recruit = recruiting. Similarly, recruit = recruitment = recruiter.
In simple words, all forms of a word are equal. Is there any Java program that I can use to achieve this?
I am somewhat familiar with Apache OpenNLP, so could that help in any way?
Thanks!!

It sounds like you want a stemmer or lemmatizer. You might want to check out Stanford CoreNLP which includes a lemmatizer. You might also want to try the Porter Stemmer.
My guess is that these will cover some of the cases but not all of them. For example "recruitment" won't be lemmatized to "recruit." For that, you'd need a more complex morphological analyzer but I don't know of a good existing system.
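For illustration, here is a minimal lemmatization sketch, assuming Stanford CoreNLP 3.9+ and its English models on the classpath (the pipeline properties are the standard tokenize/ssplit/pos/lemma chain):

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class LemmaDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument doc = new CoreDocument("The recruiter was recruiting for recruitment.");
        pipeline.annotate(doc);
        for (CoreLabel token : doc.tokens()) {
            // "recruiting" maps to "recruit", but "recruiter" and "recruitment"
            // keep their own lemmas, as noted above
            System.out.println(token.word() + " -> " + token.lemma());
        }
    }
}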

Related

StanfordNLP lemmatization cannot handle -ing words

I've been experimenting with the Stanford NLP toolkit and its lemmatization capabilities. I am surprised by how it lemmatizes some words. For example:
depressing -> depressing
depressed -> depressed
depresses -> depress
It is not able to transform depressing and depressed into the same lemma. Something similar happens with confusing and confused, and with hopelessly and hopeless. I am getting the feeling that the only thing it is able to do is remove the s if the word is in such a form (e.g. feels -> feel). Is such behaviour normal for lemmatizers in English? I would expect them to be able to transform such variations of common words into the same lemma.
If this is normal, should I rather use stemmers? And is there a way to use stemmers like Porter (Snowball, etc.) in StanfordNLP? There is no mention of stemmers in their documentation; however, there is a CoreAnnotations.StemAnnotation in the API. If it is not possible with StanfordNLP, which stemmers do you recommend for use in Java?
Lemmatization crucially depends on the part of speech of the token. Only tokens with the same part of speech are mapped to the same lemma.
In the sentence "This is confusing", confusing is analyzed as an adjective, and therefore it is lemmatized to confusing. In the sentence "I was confusing you with someone else", by contrast, confusing is analyzed as a verb, and is lemmatized to confuse.
If you want tokens with different parts of speech to be mapped to the same lemma, you can use a stemming algorithm such as Porter Stemming, which you can simply call on each token.
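For illustration, a minimal sketch of calling a Porter stemmer on each token; this assumes Apache OpenNLP 1.6+ on the classpath, which ships one, but any Porter implementation (e.g. Snowball's) is called the same way:

import opennlp.tools.stemmer.PorterStemmer;

public class StemDemo {
    public static void main(String[] args) {
        PorterStemmer stemmer = new PorterStemmer();
        // Unlike the lemmatizer, all three variants collapse to the stem "confus"
        for (String word : new String[] {"confusing", "confused", "confuses"}) {
            System.out.println(word + " -> " + stemmer.stem(word));
        }
    }
}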
An addition to yvespeirsman's answer:
I see that, when applying lemmatization, we should make sure that the text keeps its punctuation; that is, punctuation removal must come after lemmatization, since the lemmatizer takes the type of the words (part of speech) into account when performing its task.
Notice the words confuse and confusing in the examples below.
With punctuation:
import spacy
nlp = spacy.load("en_core_web_sm")  # spaCy 2.x model (assumed); pronouns lemmatize to -PRON-
for token in nlp("This is confusing. You are confusing me."):
    print(token.lemma_)
Output:
this
be
confusing
.
-PRON-
be
confuse
-PRON-
.
Without punctuation:
for token in nlp("This is confusing You are confusing me"):
    print(token.lemma_)
Output:
this
be
confuse
-PRON-
be
confuse
-PRON-

Lucene get list of matched keywords

I have a Java (Lucene 4) based application and a set of keywords fed into the application as a search query (the terms may include more than one word, e.g. they can be: “memory”, “old house”, “European Union law”, etc.).
I need a way to get the list of matched keywords out of an indexed document and possibly also get keyword positions in the document (also for the multi-word keywords).
I tried the Lucene highlight package, but I need to get only the keywords, without any surrounding portion of text. It also returns multi-word keywords in separate fragments.
I would greatly appreciate any help.
There's a similar (possibly the same) question here:
Get matched terms from Lucene query
Did you see this?
The solution suggested there is to disassemble a complicated query into simpler queries, until you get down to TermQuerys, and then check each one via searcher.explain(query, docId) (because if it matches, you know that's the term).
I don't think it's very efficient, but it worked for me until I ran into SpanQueries. It might be enough for you.
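For illustration, a minimal sketch of that approach against the Lucene 4 API; it assumes the query is built from nested BooleanQuerys over TermQuerys, and other query types would need their own cases:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class MatchedTerms {
    // Recursively collect the TermQuery leaves of a (possibly nested) BooleanQuery
    static void collectTermQueries(Query q, List<TermQuery> out) {
        if (q instanceof TermQuery) {
            out.add((TermQuery) q);
        } else if (q instanceof BooleanQuery) {
            for (BooleanClause clause : ((BooleanQuery) q).clauses()) {
                collectTermQueries(clause.getQuery(), out);
            }
        }
    }

    // Return the terms of the query that actually match document docId
    static List<Term> matchedTerms(IndexSearcher searcher, Query query, int docId)
            throws IOException {
        List<TermQuery> leaves = new ArrayList<>();
        collectTermQueries(query, leaves);
        List<Term> matched = new ArrayList<>();
        for (TermQuery tq : leaves) {
            // explain() reports a match only if this sub-query hits the document
            if (searcher.explain(tq, docId).isMatch()) {
                matched.add(tq.getTerm());
            }
        }
        return matched;
    }
}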

Identifying all the names from a given text

I want to identify all the names written in any text; currently I am using IMDB movie reviews.
I am using the Stanford POS tagger and analysing all the proper nouns (as proper nouns are names of persons, things, and places), but this is slow.
First I tag all the input lines, then I check for all the words tagged NNP, which is a slow process.
Is there any efficient substitute for achieving this task? Any library (preferably in Java)?
Thanks.
Do you know the input language? If yes, you could match each word against a dictionary and flag the word as a proper noun if it is not in the dictionary. It would require a complete dictionary with all the declensions of each word of the language, and you would have to pay attention to numbers and other special cases.
EDIT: See also this answer in the official FAQ: have you tried to change the model used?
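For illustration, a minimal sketch of the dictionary approach; the word-list file name is hypothetical, and a real implementation would need the declensions and special cases mentioned above:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class NameSpotter {
    public static void main(String[] args) throws IOException {
        // Hypothetical word list: one lowercase dictionary word per line
        Set<String> dictionary = new HashSet<>(
                Files.readAllLines(Paths.get("english-words.txt")));

        String review = "Al Pacino carries the film despite the weak script";
        for (String token : review.split("\\s+")) {
            String word = token.replaceAll("\\W", "").toLowerCase();
            // Flag capitalized tokens that are absent from the dictionary
            if (!word.isEmpty() && !dictionary.contains(word)
                    && Character.isUpperCase(token.charAt(0))) {
                System.out.println("Possible name: " + token);
            }
        }
    }
}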
A (paid) web service called GlobalNLP can do it in multiple languages: https://nlp.linguasys.com/docs/services/54131f001c78d802f0f2b28f/operations/5429f9591c78d80a3cd66926

Lucene : Use SpanTermQuery to get results for words with special characters

Is it possible to search for results in Lucene for words containing special characters, for example if I am trying to find results for "word-processing" or "foo-bar"? It doesn't look like they are considered a single term when using SpanTermQuery. I get results for them using QueryParser but not SpanTermQuery. I am just wondering how it works. Any comments/ideas on how to make SpanTermQuery work for them?
I would recommend taking a look at how your field's Tokenizers and Analyzers are configured. Read the javadocs for the existing out-of-the-box Tokenizers/Analyzers to see if one of them meets your needs. If none does, it's pretty easy to extend and create your own Tokenizer and/or Analyzer.
http://wiki.apache.org/lucene-java/LuceneFAQ#How_do_I_write_my_own_Analyzer.3F
http://lucene.apache.org/java/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html
http://lucene.apache.org/java/3_0_3/api/core/org/apache/lucene/analysis/Tokenizer.html
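For illustration, a minimal custom Analyzer sketch against the Lucene 3.0 API linked above; WhitespaceTokenizer splits only on whitespace, so hyphenated terms survive as single tokens (the same field must then be indexed and queried with this analyzer):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class HyphenKeepingAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // "word-processing" and "foo-bar" come through as single terms,
        // so a SpanTermQuery on the lower-cased form can match them
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}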

Create a dataset: extract features from text documents (TF-IDF)

I have to create a dataset from some text files, writing them as vectors of features.
Something like this:
doc1: 1,0.45 6,0.001 94,0.1 ...
doc2: 3,0.5 98,0.2 ...
...
Each position of the vector represents a word, and the score is given by something like TF-IDF.
Do you know of some library/tool/whatever for this? (Java is preferred.)
After some days I found the "perfect tool" for this: Word Vector Tool.
http://sourceforge.net/projects/wvtool/
Mallet, which includes TF-IDF, POS tagging, and classification.
Sure, there are many, e.g. Lucene: http://en.wikipedia.org/wiki/Lucene
However, I recommend that you write a basic IR system from scratch. Looking under the hood is always a great learning experience.
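If you do go the from-scratch route, the core computation is small. Here is a minimal sketch producing the "index,score" vector format from the question, over a toy in-memory corpus, using the classic weighting tf-idf(t, d) = tf(t, d) * log(N / df(t)):

import java.util.*;

public class TfIdfVectors {
    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("old", "house", "law"),
                Arrays.asList("european", "union", "law", "memory"),
                Arrays.asList("memory", "of", "the", "old", "house"));

        // Assign each distinct term a feature index and count document frequency
        Map<String, Integer> index = new LinkedHashMap<>();
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                index.putIfAbsent(term, index.size());
                df.merge(term, 1, Integer::sum);
            }
        }

        int n = docs.size();
        for (int d = 0; d < n; d++) {
            // Term frequency within this document
            Map<String, Integer> tf = new HashMap<>();
            for (String term : docs.get(d)) tf.merge(term, 1, Integer::sum);

            StringBuilder line = new StringBuilder("doc" + (d + 1) + ":");
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double score = e.getValue() * Math.log((double) n / df.get(e.getKey()));
                line.append(String.format(" %d,%.3f", index.get(e.getKey()), score));
            }
            System.out.println(line);
        }
    }
}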
