I want to identify all the names in arbitrary text; currently I am working with IMDB movie reviews.
I am using the Stanford POS tagger and collecting all the proper nouns (since proper nouns are the names of people, places, and things), but this is slow.
First I tag all the input lines, then I check every word tagged NNP, which is a slow process.
Is there a more efficient way to achieve this? Any library (preferably in Java)?
Thanks.
Do you know the input language? If so, you could match each word against a dictionary and flag a word as a proper noun if it is not in the dictionary. This would require a complete dictionary with all the declensions of each word of the language, and you would have to pay attention to numbers and other special cases.
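To illustrate the idea, here is a minimal Python sketch (a Java version would look much the same). The tiny word list is just a stand-in for a real dictionary with full declensions:

```python
# Flag words not found in a dictionary as candidate proper nouns.
# The word set below is a toy stand-in for a complete dictionary.

DICTIONARY = {"the", "movie", "was", "directed", "by", "in", "a", "great"}

def candidate_proper_nouns(text):
    candidates = []
    for word in text.split():
        stripped = word.strip(".,!?;:")
        token = stripped.lower()
        # Skip numbers and other non-alphabetic tokens.
        if not token.isalpha():
            continue
        if token not in DICTIONARY:
            candidates.append(stripped)
    return candidates

print(candidate_proper_nouns("The movie was directed by Spielberg in Hollywood."))
# → ['Spielberg', 'Hollywood']
```

This avoids running a tagger entirely, at the cost of false positives on any word missing from the dictionary.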
EDIT: See also this answer in the official FAQ: have you tried changing the model used?
A (paid) web service called GlobalNLP can do it in multiple languages: https://nlp.linguasys.com/docs/services/54131f001c78d802f0f2b28f/operations/5429f9591c78d80a3cd66926
Related
I've been experimenting with the Stanford NLP toolkit and its lemmatization capabilities, and I am surprised by how it lemmatizes some words. For example:
depressing -> depressing
depressed -> depressed
depresses -> depress
It is not able to transform depressing and depressed into the same lemma. The same happens with confusing and confused, and with hopelessly and hopeless. I am getting the feeling that the only thing it can do is remove the final s (e.g. feels -> feel). Is such behaviour normal for lemmatizers in English? I would expect them to transform such variations of common words into the same lemma.
If this is normal, should I use stemmers instead? And is there a way to use stemmers like Porter (Snowball, etc.) in StanfordNLP? There is no mention of stemmers in their documentation; however, there is a CoreAnnotations.StemAnnotation in the API. If it is not possible with StanfordNLP, which stemmers do you recommend for use in Java?
Lemmatization crucially depends on the part of speech of the token. Only tokens with the same part of speech are mapped to the same lemma.
In the sentence "This is confusing", confusing is analyzed as an adjective, and therefore it is lemmatized to confusing. In the sentence "I was confusing you with someone else", by contrast, confusing is analyzed as a verb, and is lemmatized to confuse.
If you want tokens with different parts of speech to be mapped to the same lemma, you can use a stemming algorithm such as Porter Stemming, which you can simply call on each token.
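For example, even a toy suffix stripper (not the real Porter algorithm, just the same idea) maps all four forms of confuse to one stem, which a POS-aware lemmatizer will not do:

```python
# A toy suffix stripper illustrating stemming: tokens with different
# parts of speech all map to the same stem. A real system would use
# the full Porter algorithm, which handles many more cases.

SUFFIXES = ["ingly", "ing", "ed", "es", "ly", "s", "e"]

def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters of stem to avoid over-stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["confusing", "confused", "confuses", "confuse"]:
    print(toy_stem(w))  # each prints: confus
```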
An addition to yvespeirsman's answer:
Note that, when applying lemmatization, we should make sure the text keeps its punctuation; that is, any punctuation removal must come after lemmatization, since the lemmatizer takes the part of speech of each word into account when performing its task.
Notice the words confuse and confusing in the examples below.
With punctuation:
import spacy

nlp = spacy.load("en_core_web_sm")
for token in nlp("This is confusing. You are confusing me."):
    print(token.lemma_)
Output:
this
be
confusing
.
-PRON-
be
confuse
-PRON-
.
Without punctuation:
for token in nlp("This is confusing You are confusing me"):
    print(token.lemma_)
Output:
this
be
confuse
-PRON-
be
confuse
-PRON-
I want to write code that matches certain words regardless of their form: a noun can take -ing and become a verb, e.g. add = adding, recruit = recruiting; likewise recruit = recruitment = recruiter.
In short, all forms of a word should be treated as equal. Is there any Java program I can use to achieve this?
I am somewhat familiar with Apache OpenNLP, so could that help in any way?
Thanks!!
It sounds like you want a stemmer or lemmatizer. You might want to check out Stanford CoreNLP which includes a lemmatizer. You might also want to try the Porter Stemmer.
My guess is that these will cover some of the cases but not all of them. For example, "recruitment" won't be lemmatized to "recruit". For that you'd need a more complex morphological analyzer, but I don't know of a good existing system.
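As a hedged sketch of stem-based matching (the suffix list is made up; a real system would use the Porter stemmer or CoreNLP's lemmatizer), note how recruiting and recruiter match recruit, but recruitment does not:

```python
# Match word forms by comparing stems. The suffix list is a toy
# stand-in for a real stemmer such as Porter.

SUFFIXES = ["ing", "ed", "er", "s"]

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def same_word(a, b):
    return stem(a.lower()) == stem(b.lower())

print(same_word("recruit", "recruiting"))   # True
print(same_word("recruit", "recruiter"))    # True
print(same_word("recruit", "recruitment"))  # False: needs a morphological analyzer
```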
I have been working on information extraction and was able to run StandAloneAnnie.java:
http://gate.ac.uk/wiki/code-repository/src/sheffield/examples/StandAloneAnnie.java
My question is: how can I use GATE ANNIE to get related words, so that if I input dine I get results like food, eat, dinner, restaurant?
More Information:
I am doing a project where I was assigned to develop a simple webpage that takes user input and passes it to GATE components, which tokenize the query and return a semantic grouping for each phrase in order to make recommendations.
For example, the user would enter "I want to have dinner in Kuala Lumpur" and the system would break it down to (Search for: dinner - Required: restaurant, dinner, eat, food - Location: Kuala Lumpur).
ANNIE has around 15 annotation types by default; see the demo:
http://services.gate.ac.uk/annie/
I have already implemented everything as in the demo, but my question is: can I do this with GATE ANNIE? That is, is it possible to find synonyms of words, or to group words based on their type (nouns, verbs)?
Plain vanilla ANNIE doesn't support this kind of thing but there are third party plugins such as Phil Gooch's WordNet Suggester that might help. Or if your domain is fairly restricted you might get better results with less effort by simply creating your own gazetteer lists of related terms and a few simple JAPE rules. You may find the training materials available on the GATE Wiki useful if you haven't done much of this before.
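The gazetteer idea can be sketched outside GATE like this (the categories and terms below are invented for illustration; in GATE they would live in .lst files, with JAPE rules firing on the resulting Lookup annotations):

```python
# A gazetteer maps surface terms to a semantic category; related
# terms are simply the other members of the same category.

GAZETTEER = {
    "dining": {"dine", "dinner", "eat", "food", "restaurant"},
    "location": {"kuala lumpur", "penang"},
}

def lookup(term):
    # All categories that list this term.
    term = term.lower()
    return [category for category, terms in GAZETTEER.items() if term in terms]

def related_terms(term):
    # Other members of every category the term belongs to.
    related = set()
    for category in lookup(term):
        related |= GAZETTEER[category] - {term.lower()}
    return sorted(related)

print(related_terms("dine"))  # → ['dinner', 'eat', 'food', 'restaurant']
```

For a restricted domain this is often all you need, and it is far simpler than integrating a full synonym resource such as WordNet.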
I need to extract the keywords of a search query. For example, suppose one searches for "latest popular Nokia phones". I want to extract the keywords of this phrase. Are there any libraries written in Java to get this done?
AFAIK Apache Lucene does such a thing (removing stop words like a, an, the, and so on). It also provides a Java-based indexing and search implementation, as well as spellchecking, hit highlighting, and advanced analysis/tokenization capabilities, and much more.
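The core of it is simple stop-word filtering, which Lucene's analysis chain (e.g. StopFilter) performs during tokenization. A minimal sketch with a made-up stop list:

```python
# Keyword extraction by stop-word removal: lowercase, tokenize,
# and drop common function words. The stop list is a small stand-in
# for a real one such as Lucene's default English stop set.

STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "for", "to"}

def keywords(query):
    return [w for w in query.lower().split() if w not in STOP_WORDS]

print(keywords("the latest popular Nokia phones"))
# → ['latest', 'popular', 'nokia', 'phones']
```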
Is there a way to get the subject of a sentence using OpenNLP?
I'm trying to identify the most important part of a user's sentence. Generally, users will be submitting sentences to our "engine", and we want to know exactly what the core topic of each sentence is.
Currently we are using openNlp to:
Chunk the sentence
Identify the noun-phrase, verbs, etc of the sentence
Identify all "topics" of the sentence
(NOT YET DONE!) Identify the "core topic" of the sentence
Please let me know if you have any bright ideas.
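A toy sketch of the chunking step described above (a real pipeline would use OpenNLP's ChunkerME with a trained model; the POS pattern here is deliberately simplified):

```python
# A toy noun-phrase chunker over (word, POS-tag) pairs: a chunk is
# a maximal run of determiners, adjectives, and nouns.

NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}

def np_chunks(tagged):
    chunks, current = [], []
    for word, pos in tagged:
        if pos in NP_TAGS:
            current.append(word)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tagged = [("the", "DT"), ("core", "NN"), ("topic", "NN"),
          ("of", "IN"), ("the", "DT"), ("sentence", "NN")]
print(np_chunks(tagged))  # → ['the core topic', 'the sentence']
```

Ranking those chunks to pick the single "core topic" is the hard part, which is where the parsing approaches below come in.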
Dependency Parser
If you're interested in extracting grammatical relations such as what word or phrase is the subject of a sentence, you should really use a dependency parser. While OpenNLP does support phrase structure parsing, I don't think it does dependency parsing yet.
Opensource Software
Packages written in Java that support dependency parsing include:
MaltParser
MSTParser
Stanford Parser (demo, see typed dependencies section)
RelEx
Of these, the Stanford Parser is the most accurate. However, some configurations of the MaltParser can be insanely fast (Cer et al. 2010).
For the grammatical subject you'd need to rely on configurational information in the tree. If the parse looks something like (TOP (S (NP ----) (VP ----))), you can take the NP as the subject; often, though not always, that will be the case. However, only some sentences have this configuration; one can easily imagine structures whose subjects are not in that position -- passive constructions, for example.
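That heuristic can be sketched as follows (the bracket reader below is a toy, not a real parser; with the Stanford Parser you would instead read the subject directly off the nsubj typed dependency):

```python
# Read a bracketed parse into nested (label, children) tuples, then
# apply the heuristic: the subject is the first NP directly under S.

def read_tree(s):
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def parse(i):
        label = tokens[i + 1]  # tokens[i] is "("
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = parse(i)
                children.append(child)
            else:
                children.append(tokens[i])
                i += 1
        return (label, children), i + 1
    tree, _ = parse(0)
    return tree

def leaves(tree):
    _, children = tree
    out = []
    for c in children:
        out.extend(leaves(c) if isinstance(c, tuple) else [c])
    return out

def subject(tree):
    label, children = tree
    if label == "TOP":
        return subject(children[0])
    if label == "S":
        for child in children:
            if isinstance(child, tuple) and child[0] == "NP":
                return " ".join(leaves(child))
    return None

print(subject(read_tree("(TOP (S (NP (DT the) (NN dog)) (VP (VBZ barks))))")))
# → the dog
```

As noted, this fails on passives and other non-canonical configurations, which is exactly why a dependency parser is the better tool here.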
You're probably better off using MaltParser though.