First off, let me say that I am a complete newbie with NLP, although, as you read on, that is probably going to become strikingly apparent.
I'm parsing Wikipedia pages to find all mentions of the page title. I do this by going through the CorefChainAnnotations to find "proper" mentions, and I then assume that the most common ones are talking about the page title. I do it by running this:
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,coref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String content = "Abraham Lincoln was an American politician and lawyer who served as the 16th President of the United States from March 1861 until his assassination in April 1865. Lincoln led the United States through its Civil War—its bloodiest war and perhaps its greatest moral, constitutional, and political crisis.";
Annotation document = new Annotation(content);
pipeline.annotate(document);
for (CorefChain cc : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
    List<CorefChain.CorefMention> corefMentions = cc.getMentionsInTextualOrder();
    for (CorefChain.CorefMention cm : corefMentions) {
        if (cm.mentionType == Dictionaries.MentionType.PROPER) {
            log("Proper ref using " + cm.mentionSpan + ", " + cm.mentionType);
        }
    }
}
This returns:
Proper ref using the United States
Proper ref using the United States
Proper ref using Abraham Lincoln
Proper ref using Lincoln
I know already that "Abraham Lincoln" is definitely what I am looking for, and because "Lincoln" appears a lot as well, I can surmise that it must be another way of talking about the main subject. (I realise that right now the most common named entity is "the United States", but once I've fed it the whole page it works fine.)
This works great until I have a page like "Gone with the Wind". If I change my code to use that:
String content = "Gone with the Wind has been criticized as historical revisionism glorifying slavery, but nevertheless, it has been credited for triggering changes to the way African-Americans are depicted cinematically.";
then I get no Proper mentions back at all. I suspect this is because none of the words in the title are recognised as named entities.
Is there any way I can get Stanford NLP to recognise "Gone with the Wind" as an already-known named entity? From looking around on the internet it seems to involve training a model, but I want this to be a known named entity just for this single run, and I don't want the model to remember this training later.
I can just imagine NLP experts rolling their eyes at the awfulness of this approach, but it gets better! I came up with the great idea of changing any occurrences of the page title to "Thingamijig" before passing the text to Stanford NLP, which works great for "Gone with the Wind" but then fails for "Abraham Lincoln" because (I think) the NER no longer associates "Lincoln" with "Thingamijig" in the corefMentions.
In my dream world I would do something like:
pipeline.addKnownNamedEntity("Gone with the Wind");
But that doesn't seem to be something I can do and I'm not exactly sure how to go about it.
You can submit a dictionary with any phrases you want and have them recognized as named entities.
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.regexner.mapping additional.rules -file example.txt -outputFormat text
additional.rules
Gone With The Wind MOVIE MISC 1
Note that the columns above should be tab-delimited. You can have as many lines as you'd like in the additional.rules file.
One warning: EVERY TIME that token pattern occurs, it will be tagged.
More details here: https://stanfordnlp.github.io/CoreNLP/ner.html
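If you would rather wire this into the Java pipeline from the question than run it from the command line, the same mapping file can be passed as a pipeline property. A minimal sketch, assuming a recent CoreNLP version where the ner.additional.regexner.mapping property described in that documentation is available:
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,coref");
// Point the NER annotator at the extra rules file described above
props.setProperty("ner.additional.regexner.mapping", "additional.rules");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);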
Related
In some languages (for example, Vietnamese), some vocabulary items consist of multiple words, so some tokens that contain more than one word cannot be produced just by splitting on whitespace.
I have the following input:
Người dân địa phương đã nhiều lần báo Điện lực Bến Tre nhưng chưa được giải quyết .
Expected output:
["Người dân", "địa phương", "đã", "nhiều", "lần", "báo", "Điện lực", "Bến Tre", "nhưng", "chưa", "được", "giải quyết"]
In my training data, _ connects the words that need to stick together into one token:
Người_dân địa_phương đã nhiều lần báo Điện_lực Bến_Tre nhưng chưa được giải_quyết .
Here is the command line I use to train:
opennlp TokenizerTrainer -model "model/vi-token.bin" -alphaNumOpt 1 -lang "vi" -data "data/merge_vlsp_removehtml" -encoding "UTF-8" -params param/wordseg.param
with the param file:
Iterations=1000
However, the output does not join multiple words into one token; it just splits on whitespace.
Command I run to get the output:
opennlp TokenizerME model/vi-token.bin < sample/sample_text > sample/sample_text.out
What should I do with the training data or the config params to train the tokenizer to produce multi-word tokens?
Rather than using the underscore for training, use tags; OpenNLP uses tags as the reference for training. Follow the instructions given for NER and for training your Tokenizer.
OpenNLP provides the 'TokenizerTrainer' tool to train on your data. The OpenNLP format contains one sentence per line; tokens are either separated by whitespace or by a special tag.
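For reference, a sketch of that format as I understand it from the OpenNLP manual: the special tag is <SPLIT>, which marks a token boundary that has no surrounding whitespace, for example:
Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a nonexecutive director Nov. 29<SPLIT>.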
You can follow this blog for a head start in OpenNLP for various purposes. The post will show you how to create a training file and build a new model.
You can easily create your own training data set using the modelbuilder addon and follow the rules mentioned here to train a good NER model.
You can find some help on using the modelbuilder addon here.
Basically, you put all the information in one text file and the NER entities in another. The addon searches for a particular entity and replaces it with the required tag, hence producing the tagged data. It should be pretty easy to use this tool!
Also, follow mr. markg's answer to get an understanding of creating new models on your own. This will help you build your own models, which can be customized for your applications.
Hope this helps!
From what I understand from the POS tagging example given in the jcrfsuite examples, the training file is tab-separated and the first token is the label. But I do not get the BigCluster| thing. Can somebody help me with how to specify tokens in the training file?
Example below:
O BigCluster|00 BigCluster|0000 BigCluster|000000 BigCluster|00000000 BigCluster|0000000000 BigCluster|000000000000 BigCluster|00000000000000 BigCluster|0000000000000000 NextBigCluster|0100 NextBigCluster|01000101 NextBigCluster|010001011111 POSTagDict|D POSTagDict|N POSTagDict|^ POSTagDict|$ POSTagDict|G NextPOSTag|V 1gramSuff|i 1gramPref|i prevword| prevcurr||i nextword|predict nextword|predict currnext|i|predict Word|I Lower|i Xxdshape|X charclass|1, first-shortcap prevnext||predict t=0
Test file format:
! BigCluster|01 BigCluster|0110 BigCluster|011011 BigCluster|01101100 BigCluster|0110110011 BigCluster|011011001100 BigCluster|01101100110000 BigCluster|0110110011000000 NextBigCluster|1000 NextBigCluster|10001000 NextBigCluster|100010000000 POSTagDict|V NextPOSTag|, metaph_POSDict|N 1gramSuff|n 2gramSuff|nn 3gramSuff|mnn 4gramSuff|mmnn 5gramSuff|mmmnn 6gramSuff|ammmnn 7gramSuff|aammmnn 8gramSuff|aaammmnn 9gramSuff|daaammmnn 1gramPref|d 2gramPref|da 3gramPref|daa 4gramPref|daaa 5gramPref|daaam 6gramPref|daaamm 7gramPref|daaammm 8gramPref|daaammmn 9gramPref|daaammmnn prevword| prevcurr||daaammmnn nextword|. nextword|. currnext|daaammmnn|. Word|Daaammmnn Lower|daaammmnn Xxdshape|Xxxxxxxxx charclass|1,2,2,2,2,2,2,2,2, first-initcap prevnext||. t=0
What is specified after the label is a list of feature-name and feature-value pairs.
It is a sparse representation instead of a tabular representation.
BigCluster is just one of the features and it's relevant to the specific example only. You should create your own features if you are training from scratch.
I have noticed that CRFsuite does not care about the naming convention or feature design of labels and attributes, because it treats them as strings.
CRFsuite learns weights of associations (feature weights) between attributes and labels, without knowing the meaning of labels and attributes. In other words, one can design and use arbitrary features just by writing label and attribute names in the data sets. Just find the best possible attributes for your example and run some experiments with different sets of attributes and features, and you will be good to go.
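As a hedged illustration of that sparse format (the feature names below are just made up in the style of your example, not anything jcrfsuite requires): a training line is simply a label followed by whatever attribute strings you choose, tab-separated, e.g.:
N	Word|dog	Lower|dog	1gramSuff|g	prevword|the	nextword|barks	t=0
The trainer just learns a weight for each (attribute, label) pair it sees in lines like this.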
I downloaded and used the OpenIE 4.1 jar file (downloadable from http://knowitall.github.io/openie/) to process some free-text documents, and it produced triplet-like outputs along with the text and a confidence score, for instance:
The rail launchers are conceptually similar to the underslung SM-1
0.93 (The rail launchers; are; conceptually similar to the underslung SM-1)
I wrote a Java parser to extract the OpenIE triplets whose confidence score is >= 0.85, and I need to know how to convert them to N-Triples (NT) format.
I'm not sure if I need to be familiar with the ontology that I'm trying to map to.
After discussion with my colleagues, this is what I should do to create N-Triples (NT). Detailed Java code can be found in another question: Use RDF API (Jena, OpenRDF or Protege) to convert OpenIE outputs. A minimal Jena sketch also follows the list below.
Create a blank node identifier for each distinct :subject in the file (call it node_s)
Create a blank node identifier for each distinct :object in the file (call it node_o)
Define a URI for each distinct predicate
Create these triples:
1. node_s rdf:type <http://mypage.org/vocab#Corpus>
2. node_s dc:title “The rail launchers”
3. node_s dc:source “Sample File”
4. node_s rdf:predicate <http://mypage.org/vocab#are>
5. node_o rdf:type <http://mypage.org/vocab#Corpus>
6. node_o dc:title “conceptually similar to the underslung SM-1”
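As mentioned above, here is a minimal sketch of those steps using Apache Jena (the vocabulary URI http://mypage.org/vocab# and the literal values are just the examples from the list; this is an illustration, not the exact code from the other question):
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.DC;
import org.apache.jena.vocabulary.RDF;

public class OpenIeToNt {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        String vocab = "http://mypage.org/vocab#";

        // Blank nodes for the OpenIE subject (node_s) and object (node_o)
        Resource nodeS = model.createResource();
        Resource nodeO = model.createResource();

        nodeS.addProperty(RDF.type, model.createResource(vocab + "Corpus"));
        nodeS.addProperty(DC.title, "The rail launchers");
        nodeS.addProperty(DC.source, "Sample File");
        nodeS.addProperty(RDF.predicate, model.createProperty(vocab, "are"));

        nodeO.addProperty(RDF.type, model.createResource(vocab + "Corpus"));
        nodeO.addProperty(DC.title, "conceptually similar to the underslung SM-1");

        // Serialize the model as N-Triples
        model.write(System.out, "N-TRIPLES");
    }
}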
I am starting to learn the OpenNLP API in Java.
I found some good examples on this website:
http://www.programcreek.com/2012/05/opennlp-tutorial/
I have tried the Name Finder API but I found something strange.
If I set the input to
String []sentence = new String[]{
"John",
"is",
"good"
};
the code still works, but if I change it to
String []sentence = new String[]{
"John",
"is",
"fine"
};
There is no output.
I cannot understand what causes the problem. Is it from the model I use? (en-ner-person.bin)
And does anyone know how I can build my own model?
Thanks!
Assuming it is not throwing an exception and just can't find the name "John": it's not working because OpenNLP uses a machine-learning approach and finds named entities based on a model. The en-ner-person.bin model apparently does not have sufficient samples of sentences similar enough to "John is fine" to return a probability high enough to give you a response.
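If you want to inspect this yourself, NameFinderME exposes the probabilities for the spans of the last find() call. A minimal sketch, assuming en-ner-person.bin is in the working directory and the OpenNLP tools jar is on the classpath:
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class NameFinderProbs {
    public static void main(String[] args) throws Exception {
        try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            NameFinderME finder = new NameFinderME(model);

            String[] sentence = new String[]{"John", "is", "fine"};
            Span[] spans = finder.find(sentence);

            // probs() reports the model's confidence for each span from the last find() call
            double[] probs = finder.probs(spans);
            for (int i = 0; i < spans.length; i++) {
                System.out.println(spans[i] + " -> " + probs[i]);
            }
        }
    }
}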
I have a String from an HTML web page like this:
String htmlString =
<span style="mso-bidi-font-family:Gautami;mso-bidi-theme-font:minor-bidi">President Pranab pay great
tributes to Motilal Nehru on occasion of
</span>
150th birth anniversary. Pranab said institutions evolved by
leaders like him should be strengthened instead of being destroyed.
<span style="mso-spacerun:yes">
</span>
He listed his achievements like his role in evolving of Public Accounts Committee and protecting independence of
Legislature from the influence of the Executive by establishing a separate cadre for the Central Legislative Assembly,
the first set of coins and postal stamps released at the function to commemorate the event.
</p>
I need to extract the text from the above String; after extraction my output should look like:
Output:
President Pranab pay great tributes to Motilal Nehru on occasion of 150th birth anniversary. Pranab said institutions evolved by leaders like him should be strengthened instead of being destroyed. He listed his achievements like his role in evolving of Public Accounts Committee and protecting independence of Legislature from the influence of the Executive by establishing a separate cadre for the Central Legislative Assembly, now Parliament. Calling himself a student of history, he said Motilal's Swaraj Party acted as a disciplined assault force in the Legislative Assembly and he was credited with evolving the system of a Public Accounts Committee which is now one of the most effective watchdogs over executive in matters of money and finance. Mukherjee also received the first set of coins and postal stamps released at the function to commemorate the event.
For this I have used the below logic:
int spanIndex = content.indexOf("<span");
spanIndex = content.indexOf(">", spanIndex);
int endspanndex = content.indexOf("</span>", spanIndex);
content = content.substring(spanIndex + 1, endspanndex);
and my resultant output is:
President Pranab pay great tributes to Motilal Nehru on occasion of
I have used different HTML parsers, but those are not working in the case of J2ME.
Can anyone help me to get the full description text? Thanks.
If you are using BlackBerry OS 5.0 or later you can use the BrowserField to parse HTML into a DOM document.
You may continue the same way as you propose with the rest of the string. Alternatively, a simple finite-state automaton would solve this. I have seen such a solution in the moJab project (you can download the sources here). In the mojab.xml package there is a minimalistic XML parser designed for J2ME; it would parse your example as well. Take a look at the sources, it's just three simple classes. It seems to be usable without modifications.
We can extract the text in the case of J2ME, since it does not support HTML parsers, like this:
private String removeHtmlTags(String content) {
    // Repeatedly strip the first remaining tag until none are left
    while (content.indexOf("<") != -1) {
        int beginTag = content.indexOf("<");
        int endTag = content.indexOf(">");
        if (beginTag == 0) {
            // Tag is at the start: keep only the text after it
            content = content.substring(endTag + 1, content.length());
        } else {
            // Tag is in the middle: keep the text before and after it
            content = content.substring(0, beginTag) + content.substring(endTag + 1, content.length());
        }
    }
    return content;
}
JSoup is a very popular library for extracting text from HTML documents. Here is one such example:
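A minimal sketch of using it (note that jsoup targets standard Java SE/Android rather than J2ME, so this covers the general case; the HTML here is shortened from the question):
import org.jsoup.Jsoup;

public class HtmlTextExtractor {
    public static void main(String[] args) {
        String htmlString = "<span style=\"mso-spacerun:yes\">President Pranab pay great tributes to Motilal Nehru</span> on occasion of 150th birth anniversary.";
        // parse() builds a Document; text() returns the combined visible text with tags removed
        String text = Jsoup.parse(htmlString).text();
        System.out.println(text);
    }
}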