I've been experimenting with the Stanford NLP toolkit and its lemmatization capabilities. I am surprised by how it lemmatizes some words. For example:
depressing -> depressing
depressed -> depressed
depresses -> depress
It is not able to transform depressing and depressed into the same lemma. Something similar happens with confusing and confused, and with hopelessly and hopeless. I am getting the feeling that the only thing it can do is remove the s when the word has such a form (e.g. feels -> feel). Is such behaviour normal for lemmatizers in English? I would expect them to be able to transform such variations of common words into the same lemma.
If this is normal, should I rather use stemmers? And is there a way to use stemmers like Porter (Snowball, etc.) in StanfordNLP? There is no mention of stemmers in their documentation; however, there is a CoreAnnotations.StemAnnotation in the API. If it is not possible with StanfordNLP, which stemmers do you recommend for use in Java?
Lemmatization crucially depends on the part of speech of the token. Only tokens with the same part of speech are mapped to the same lemma.
In the sentence "This is confusing", confusing is analyzed as an adjective, and therefore it is lemmatized to confusing. In the sentence "I was confusing you with someone else", by contrast, confusing is analyzed as a verb, and is lemmatized to confuse.
If you want tokens with different parts of speech to be mapped to the same lemma, you can use a stemming algorithm such as Porter Stemming, which you can simply call on each token.
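For example, here is a minimal sketch of calling the Porter stemmer on each token, assuming the Snowball stemmer classes bundled with Apache Lucene (org.tartarus.snowball.ext.PorterStemmer) are on the classpath:

import org.tartarus.snowball.ext.PorterStemmer;

public class StemDemo {
    public static void main(String[] args) {
        PorterStemmer stemmer = new PorterStemmer();
        for (String word : new String[] {"depressing", "depressed", "depresses"}) {
            stemmer.setCurrent(word);  // load the raw token
            stemmer.stem();            // run the Porter algorithm in place
            System.out.println(word + " -> " + stemmer.getCurrent());  // all three print "depress"
        }
    }
}

Unlike the lemmatizer, the stemmer ignores part of speech, so all three forms collapse to the same stem.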
Adding to yvespeirsman's answer:
I see that, when applying lemmatization, we should make sure that the text keeps its punctuation; that is, punctuation removal (if any) must come after lemmatization, since the lemmatizer takes the type of each word (its part of speech) into account when performing its task.
Notice the words confuse and confusing in the examples below.
With punctuation:
for token in nlp("This is confusing. You are confusing me."):
print(token.lemma_)
Output:
this
be
confusing
.
-PRON-
be
confuse
-PRON-
.
Without punctuation:
for token in nlp("This is confusing You are confusing me"):
print(token.lemma_)
Output:
this
be
confuse
-PRON-
be
confuse
-PRON-
I'm working with HTML tags, and I need to interpret HTML documents. Here's what I need to achieve:
I have to recognize and remove HTML tags without removing the original content.
I have to store the indexes of the previously existing markup.
So here's an example. Imagine that I have the following markup:
This <strong>is a</strong> message.
In this example, we have a String of 35 characters, marked up with a strong tag. As we know, an HTML tag has a start and an end, and if we interpret the start and end tags as sequences of characters, each also has a start and an end (a character index).
Again, in the previous example, the beginning index of the open/start tag is 5 (counting from index 0), and the end index is 13. The same logic applies to the closing tag.
Now, once we remove the markup, we end up with the following:
This is a message.
The question:
How can I remember, for this sequence, the places where I could insert the markup again?
For example, once the markup has been removed, how do I know that I have to insert the opening tag at index X and the closing tag at index Y? Like so:
This is a message.
     5   9
index 5 = <strong>
index 9 = </strong>
I must remember that it is possible to find the following situation:
<a>T<b attribute="value">h<c>i<d>s</a> <g>i<h>s</h></g> </b>a</c> <e>t</e>e<f>s</d>t</f>.
I need to implement this in Java. I've figured out how to get the start and end index of each tag in a document. For this, I'm using regular expressions (Pattern and Matcher), but I still do not know how to insert the tags again properly (as described). I would like a working example, if possible. It does not have to be the best solution in the world; it only has to work correctly for any kind of situation.
If anyone has not understood my question, please comment and I will try to explain it better.
Thanks in advance.
EDIT
People in the comments are saying that I should not use regular expressions to work with HTML. I am not attached to regular expressions for solving this problem; I just want to solve it, no matter how (but, of course, in the most appropriate way).
I mentioned that I'm using regular expressions, but I do not mind using another approach that leads to the same solution. I read that an XML parser could be the solution. Is that correct? Is there an XML parser capable of doing everything I need?
Again, thanks in advance.
EDIT 2
I'm making this edit to explain the applicability of my problem (as asked). Before I start, I want to say that what I'm trying to do is something I've never done before; it's not in my area, so it may not be the most appropriate approach. Anyway...
I'm developing a site where users are allowed to read content but cannot edit it (change or remove text). However, users can still mark/highlight excerpts (ranges) of the content (with some stylization). This is the big summary.
Now the problem is how to do this (in Java). On the client side, for now, I was thinking of using TinyMCE to enable styling of content without text editing. I could save the stylized text to a database, but this would take up a lot of space, since every client is allowed to do this, and there are many clients. If a client marks snippets of a paragraph, saving the whole paragraph back in the database for each client in the system is somewhat costly in terms of storage.
So I thought of just saving the ranges (indexes) of the markup made by users in a database. It is much easier to save just a few numbers than all the text with the styling applied. For example, I could save a row/record in a table that says:
In paragraph X, from index Y to index Z, user P applied stylization ABC.
This would require a translation/conversion from database to HTML and from HTML to database. Writing a converter may be easy (I guess), but I do not know how to get the indexes (following this logic). And so we are back at the beginning of my question.
Just to make it clear:
If someone offers a solution that will cost money, such as a paid API, tool, or something similar, unfortunately this option is not feasible for me. I'm sorry :/
Likewise, I know it would be ideal to do this processing in JavaScript (client-side). It turns out that I do not have a specialized JavaScript team, so this needs to be done on the server side (unfortunately), which is written in Java. I can only use a JavaScript solution if it is ready-made and easy and quick to use. Do you know of any ready-made, easy-to-use library that can do this in a simple way? Does it exist?
You can't use a regular expression to parse HTML. See this question (which includes this rather epic answer as well as several other interesting answers) for more information, but HTML isn't a regular language because it has a recursive structure.
Any language that allows arbitrarily deep nesting isn't regular by definition, so you can't parse it with a regex.
Keep in mind that HTML is a context-free language (or, at least, pretty close to context-free). See also the Chomsky hierarchy.
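As a concrete alternative, here is a minimal sketch using the jsoup parser (the class name and the Insertion record are hypothetical): it walks the DOM, concatenates the text nodes, and records the plain-text index at which each opening and closing tag would need to be re-inserted. Attributes are ignored for brevity.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import java.util.ArrayList;
import java.util.List;

public class TagOffsets {
    record Insertion(int index, String tag) {}  // where a tag goes back into the plain text

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder();
        List<Insertion> insertions = new ArrayList<>();
        walk(Jsoup.parseBodyFragment("This <strong>is a</strong> message.").body(), text, insertions);
        System.out.println(text);  // This is a message.
        for (Insertion ins : insertions) {
            System.out.println("index " + ins.index() + " = " + ins.tag());  // index 5 = <strong>, index 9 = </strong>
        }
    }

    // Depth-first walk: text accumulates; tags are logged at the current text length.
    private static void walk(Node node, StringBuilder text, List<Insertion> out) {
        for (Node child : node.childNodes()) {
            if (child instanceof TextNode t) {
                text.append(t.getWholeText());
            } else if (child instanceof Element e) {
                out.add(new Insertion(text.length(), "<" + e.tagName() + ">"));
                walk(e, text, out);
                out.add(new Insertion(text.length(), "</" + e.tagName() + ">"));
            }
        }
    }
}

To restore the markup, insert the recorded tags back into the plain text from the highest index to the lowest, so earlier insertions do not shift later offsets. One caveat: a DOM parser will normalize the improperly nested tags in your pathological example, so truly overlapping ranges would have to be stored as (start, end) ranges rather than as a tag tree.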
I want to identify all the names written in any text; currently I am using IMDB movie reviews.
I am using the Stanford POS tagger and analysing all the proper nouns (as proper nouns are names of persons, things, and places), but this is slow.
First I tag all the input lines, then I check for every word tagged NNP, which is a slow process.
Is there any efficient substitute to achieve this task? Any library (preferably in Java)?
Thanks.
Do you know the input language? If yes, you could match each word against a dictionary and flag the word as a proper noun if it is not in the dictionary. This would require a complete dictionary with all the declensions of each word of the language, and you would need to pay attention to numbers and other special cases.
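A minimal sketch of that idea (words.txt is a hypothetical word list, one lowercase word per line):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.stream.Collectors;

public class NameFlagger {
    public static void main(String[] args) throws IOException {
        // Hypothetical dictionary file: one lowercase word per line.
        Set<String> dictionary = Files.lines(Path.of("words.txt"))
                .map(String::toLowerCase)
                .collect(Collectors.toSet());
        for (String word : "John met the recruiter in Paris".split("\\s+")) {
            // Flag words absent from the dictionary as candidate proper nouns.
            if (!dictionary.contains(word.toLowerCase())) {
                System.out.println("Possible name: " + word);
            }
        }
    }
}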
EDIT: See also this answer in the official FAQ: have you tried changing the model used?
A (paid) web service called GlobalNLP can do it in multiple languages: https://nlp.linguasys.com/docs/services/54131f001c78d802f0f2b28f/operations/5429f9591c78d80a3cd66926
I want to write code to match certain words. I don't care about the form of the word: it could be a noun, and by adding -ing it can become a verb. E.g., add = adding, recruit = recruiting. Also, recruit = recruitment = recruiter.
In simple words, all forms of a word are equal. Is there any Java program that I can use to achieve this?
I am somewhat familiar with Apache OpenNLP; could that help in any way?
Thanks!!
It sounds like you want a stemmer or lemmatizer. You might want to check out Stanford CoreNLP which includes a lemmatizer. You might also want to try the Porter Stemmer.
My guess is that these will cover some of the cases but not all of them. For example "recruitment" won't be lemmatized to "recruit." For that, you'd need a more complex morphological analyzer but I don't know of a good existing system.
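For instance, a minimal sketch of the CoreNLP lemmatizer (the CoreNLP models jar must be on the classpath):

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class LemmaDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation doc = new Annotation("He recruits; she was recruiting recruiters.");
        pipeline.annotate(doc);
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // "recruits" and "recruiting" map to "recruit"; "recruiters" only to "recruiter".
                System.out.println(token.word() + " -> "
                        + token.get(CoreAnnotations.LemmaAnnotation.class));
            }
        }
    }
}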
Hi, I just started learning NLP and chose the Stanford API to do all my required tasks. I am able to do POS and NER tasks, but I am stuck with coreference resolution. I am even able to get the 'corefChaingraph' and to print all the representative mentions and corresponding mentions to the console. But I really would like to know how to get the final text after resolving the coreferences. Can someone help me with this?
example:
Input sentence:
John Smith talks about the EU. He likes the family of nations.
Expected output:
John Smith talks about the EU. John Smith likes the family of nations.
It depends a lot on what approach you take. Personally, I would try to solve this by looking at what role a word plays in a sentence and what context is carried forward. Based on the POS tags, try to map a subject-verb-object model. Once you have subjects and objects identified, you can build a simple context carry-forward rule system to achieve what you want.
e.g.
Based on the tags below:
[('John', 'NNP'), ('Smith', 'NNP'), ('talks', 'VBZ'), ('about', 'IN'), ('the', 'DT'), ('EU.', 'NNP'), ('He', 'NNP'), ('likes', 'VBZ'), ('the', 'DT'), ('family', 'NN'), ('of', 'IN'), ('nations', 'NNS'), ('.', '.')]
You can create chunks:
[['noun_type', 'John', 'Smith'], ['verb_type', 'talks'], ['in_type', 'about'], ['noun_type', 'the', 'EU']]
[['noun_type', 'He'], ['verb_type', 'likes'], ['noun_type', 'the', 'family'], ['in_type', 'of'], ['noun_type', 'nations']]
Once you have these chunks, parse them from left to right, putting them into subject-verb-object form.
Based on this, you then know what context is carried forward.
E.g., "He" means the subject is carried forward; "It" means the object (this is a very basic example; you can build a robust rule-based system for such patterns). I have tried many approaches in the past and this one gave me the best results.
I hope I helped.
In my experience, the problem you are trying to solve is not completely solved, but many people are working on it. I tried the "karaka" approach: not just getting subject-verb-object, but also the other references in the sentence.
Here is how I approached the problem:
Step 1: Detect the voice of the sentence.
Step 2: For active voice, parse a POS-tagged sentence from left to right to get subject-verb-object (it will always be in that order for active voice). For passive voice, look for "by" and take the next noun as the subject.
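A minimal sketch of the voice check in Step 1, using a common heuristic (a form of "be" followed, possibly after adverbs, by a past participle suggests passive voice); this is illustrative, not a complete detector:

public class VoiceCheck {
    static final java.util.Set<String> BE =
            java.util.Set.of("is", "are", "was", "were", "am", "be", "been", "being");

    // words[i] are tokens, tags[i] their Penn Treebank tags.
    static boolean looksPassive(String[] words, String[] tags) {
        for (int i = 0; i < words.length - 1; i++) {
            if (BE.contains(words[i].toLowerCase())) {
                int j = i + 1;
                while (j < tags.length && tags[j].startsWith("RB")) j++;  // skip adverbs
                if (j < tags.length && tags[j].equals("VBN")) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksPassive(
                new String[] {"The", "EU", "was", "discussed", "by", "John"},
                new String[] {"DT", "NNP", "VBD", "VBN", "IN", "NNP"}));  // prints true
    }
}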
Looking at your example:
Both sentences have a Noun-Verb-IN-Noun structure, which you can easily parse: the first noun is the subject, then comes the verb, then IN ("about" is indicative of the object), and then a noun again. From those rules: "John Smith" is the subject, "talks" is the action, and "EU" is the object.
Karaka theory in linguistics will also help you with other roles.
E.g.: John Smith talks about EU in Paris.
Here, when you encounter "in" (IN tag) followed by "Paris" (NNP tag), you can have a rule that tells you "in/on/around/inside/outside" are locative references.
Similarly, "with/without" are instrumental references, and "for" is dative.
I generally trust this deep parsing and rule system when I have to deal with a single word and the role it plays in a sentence.
I get good accuracy with this approach.
I'm parsing a text file made from this Wikipedia article; basically I pressed Ctrl+A and copied/pasted all the content into a text file (I use it as an example).
I'm trying to make a list of words with their counts, and for that I use a Scanner with this delimiter:
sc.useDelimiter("[\\p{javaWhitespace}\\p{Punct}]+");
It works great for my needs, but analysing the result, I saw something that looks like a blank token (again...). The character is after "(nynorsk)" in the article (funnily, when I copy/paste it here the character disappears; in gedit I can press → and ← and the cursor doesn't move).
After further research I've found out that this token was actually the POP DIRECTIONAL FORMATTING (U+202C).
It's not the only directional character; looking at the Character documentation, Java seems to define them all.
So I'm wondering if there is a standard way to detect these characters, and if possible a way that can be easily integrated in the delimiter pattern.
I'd like to avoid making my own list because I fear I will forget some of them.
You could always go the other way round and use a whitelist rather than a blacklist:
sc.useDelimiter("[^\\p{L}]+");