Customized Relationship Extraction Between Two Entities with Stanford NLP - Java

I am looking for logic similar to what is described here: RelationExtraction NLP
Following the process explained in that answer, I am able to get as far as NER and entity linking, but I am confused by the "slot filling" step and cannot find good resources on it on the Internet.
Here is my code sample
public static void main(String[] args) throws IOException, ClassNotFoundException {
    // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // read some text in the text variable
    //String text = "Mary has a little lamb. She is very cute."; // Add your text here!
    String text = "Matrix Partners along with existing investors Sequoia Capital and Nexus Venture Partners has invested R100 Cr in Mumbai based food ordering app, TinyOwl. The series B funding will be used by the company to expand its geographical presence to over 50 cities, upgrade technology and enhance user experience.";
    text += "In December last year, it raised $3 Mn from Sequoia Capital India and Nexus Venture Partners to deepen its presence in home market Mumbai. It was seeded by Deap Ubhi (who had earlier founded Burrp) and Sandeep Tandon.";
    text += "Kunal Bahl and Rohit Bansal, were also said to be planning to invest in the company’s second round of fund raise.";
    text += "Founded by Harshvardhan Mandad and Gaurav Choudhary, TinyOwl claims to have tie-up with 5,000 restaurants and processes almost 2000 orders. The app which competes with the likes of FoodPanda aims to process over 50,000 daily orders.";
    text += "The top-line comes from the cut the company takes from each order placed through its app.";
    text += "The startup is also planning to come with reviews which would make it a competitor of Zomato, valued at $660 Mn. Also, Zomato is entering the food ordering business to expand its offerings.";
    text += "Recently another peer, Bengaluru based food delivery startup, SpoonJoy raised an undisclosed amount of funding from Sachin Bansal (Co-Founder Flipkart) and Mekin Maheshwari (CPO Flipkart), Abhishek Goyal (Founder, Tracxn) and Sahil Barua (Co-Founder, Delhivery).";
    text += "-TechCrunch";

    // create an empty Annotation just with the given text
    Annotation document = new Annotation(text);

    // run all Annotators on this text
    pipeline.annotate(document);

    // these are all the sentences in this document
    // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
        // traversing the words in the current sentence
        // a CoreLabel is a CoreMap with additional token-specific methods
        for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
            // this is the text of the token
            String word = token.get(TextAnnotation.class);
            //System.out.println(" word \n" + word);
            // this is the POS tag of the token
            String pos = token.get(PartOfSpeechAnnotation.class);
            //System.out.println(" pos \n" + pos);
            // this is the NER label of the token
            String ne = token.get(NamedEntityTagAnnotation.class);
            //System.out.println(" ne \n" + ne);
        }
        // this is the parse tree of the current sentence
        Tree tree = sentence.get(TreeAnnotation.class);
        System.out.println(" TREE \n" + tree);
        // this is the Stanford dependency graph of the current sentence
        SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
        System.out.println(" dependencies \n" + dependencies);
    }

    // This is the coreference link graph
    // Each chain stores a set of mentions that link to each other,
    // along with a method for getting the most representative mention
    // Both sentence and token offsets start at 1!
    Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
    System.out.println("graph \n " + graph);
}
}
This gives output with mentions of the same entity grouped together. Now I have to take this further and find the relationships between these entities. For example, from the text in the code I should get as output that "Matrix Partners" and "Sequoia Capital" have the relation "investor", or a similar kind of structure.
Please correct me if I am wrong somewhere and point me in the right direction.
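One way to get output in that (subject, relation, object) shape is CoreNLP's kbp annotator. This is only a minimal sketch, assuming a recent CoreNLP release with the English and English-KBP model jars on the classpath, and reusing the text string built above:
// Sketch: relation triples from the kbp annotator (not the slot-filling pipeline itself).
// Needs imports from edu.stanford.nlp.pipeline.* and edu.stanford.nlp.ie.util.RelationTriple.
Properties kbpProps = new Properties();
kbpProps.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,depparse,coref,kbp");
StanfordCoreNLP kbpPipeline = new StanfordCoreNLP(kbpProps);

CoreDocument kbpDoc = new CoreDocument(text);
kbpPipeline.annotate(kbpDoc);

for (CoreSentence s : kbpDoc.sentences()) {
    for (RelationTriple triple : s.relations()) {
        // each triple carries a subject, a KBP relation label, and an object
        System.out.println(triple.subjectGloss() + "\t" + triple.relationGloss() + "\t" + triple.objectGloss());
    }
}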

Related

How to extract Wikipedia entity matched to CoreEntityMention (WikiDictAnnotator)

I am running CoreNLP over some text, and matching the entities found to Wikipedia entities. I want to reconstruct the sentence providing the link and other useful information for the entities found.
The CoreEntityMention has an entity() method, but it just returns a String.
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitylink");
// set up pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
CoreDocument doc = new CoreDocument("text goes here");
pipeline.annotate(doc);
// Iterate the sentences
for (CoreSentence sentence : doc.sentences()) {
    // Go through all mentions
    for (CoreEntityMention em : sentence.entityMentions()) {
        System.out.println(em.sentence());
        // Here I would like to extract the Wikipedia entity information
        System.out.println(em.entity());
    }
}
You just need to prepend the Wikipedia page URL to that string.
So Neil_Armstrong maps to https://en.wikipedia.org/wiki/Neil_Armstrong.
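Inside the entity-mention loop above, a minimal sketch of doing that (assuming, as described, that em.entity() returns the page title):
// Sketch: turn the entitylink result into a Wikipedia URL.
// Assumption: em.entity() returns the page title, e.g. "Neil_Armstrong".
String wikiTitle = em.entity();
// skip unlinked mentions; the exact "no link" marker may vary by version
if (wikiTitle != null && !wikiTitle.isEmpty() && !"O".equals(wikiTitle)) {
    String url = "https://en.wikipedia.org/wiki/" + wikiTitle;
    System.out.println(em.text() + " -> " + url);
}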

Extracting Date using StanfordCoreNLP pipeline instead of AnnotationPipeline

When I used the SUTime feature of StanfordCoreNLP with the code given in its documentation, which uses an AnnotationPipeline to create the pipeline object, I was able to extract TIME from the string successfully.
The code used is:
But my project requires the StanfordCoreNLP pipeline, and when I used that pipeline to extract the TIME it gave me a NullPointerException.
My code is as follows:
The error I am encountering is as follows:
I also tried the solution suggested by @StanfordNLPHelp in this link:
Dates when using StanfordCoreNLP pipeline
The code is as follows:
But the error still persists:
The standard ner annotator will run SUTime. Please see this link for the Java API info:
https://stanfordnlp.github.io/CoreNLP/api.html
basic example:
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ie.util.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.trees.*;
import java.util.*;

public class BasicPipelineExample {

    public static String text = "Joe Smith was born in California. " +
            "In 2017, he went to Paris, France in the summer. " +
            "His flight left at 3:00pm on July 10th, 2017. " +
            "After eating some escargot for the first time, Joe said, \"That was delicious!\" " +
            "He sent a postcard to his sister Jane Smith. " +
            "After hearing about Joe's trip, Jane decided she might go to France one day.";

    public static void main(String[] args) {
        // set up pipeline properties
        Properties props = new Properties();
        // set the list of annotators to run
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,depparse,coref,kbp,quote");
        // set a property for an annotator, in this case the coref annotator is being set to use the neural algorithm
        props.setProperty("coref.algorithm", "neural");
        // build pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // create a document object
        CoreDocument document = new CoreDocument(text);
        // annotate the document
        pipeline.annotate(document);

        // examples

        // 10th token of the document
        CoreLabel token = document.tokens().get(10);
        System.out.println("Example: token");
        System.out.println(token);
        System.out.println();

        // text of the first sentence
        String sentenceText = document.sentences().get(0).text();
        System.out.println("Example: sentence");
        System.out.println(sentenceText);
        System.out.println();

        // second sentence
        CoreSentence sentence = document.sentences().get(1);

        // list of the part-of-speech tags for the second sentence
        List<String> posTags = sentence.posTags();
        System.out.println("Example: pos tags");
        System.out.println(posTags);
        System.out.println();

        // list of the ner tags for the second sentence
        List<String> nerTags = sentence.nerTags();
        System.out.println("Example: ner tags");
        System.out.println(nerTags);
        System.out.println();

        // constituency parse for the second sentence
        Tree constituencyParse = sentence.constituencyParse();
        System.out.println("Example: constituency parse");
        System.out.println(constituencyParse);
        System.out.println();

        // dependency parse for the second sentence
        SemanticGraph dependencyParse = sentence.dependencyParse();
        System.out.println("Example: dependency parse");
        System.out.println(dependencyParse);
        System.out.println();

        // kbp relations found in fifth sentence
        List<RelationTriple> relations =
                document.sentences().get(4).relations();
        System.out.println("Example: relation");
        System.out.println(relations.get(0));
        System.out.println();

        // entity mentions in the second sentence
        List<CoreEntityMention> entityMentions = sentence.entityMentions();
        System.out.println("Example: entity mentions");
        System.out.println(entityMentions);
        System.out.println();

        // coreference between entity mentions
        CoreEntityMention originalEntityMention = document.sentences().get(3).entityMentions().get(1);
        System.out.println("Example: original entity mention");
        System.out.println(originalEntityMention);
        System.out.println("Example: canonical entity mention");
        System.out.println(originalEntityMention.canonicalEntityMention().get());
        System.out.println();

        // get document wide coref info
        Map<Integer, CorefChain> corefChains = document.corefChains();
        System.out.println("Example: coref chains for document");
        System.out.println(corefChains);
        System.out.println();

        // get quotes in document
        List<CoreQuote> quotes = document.quotes();
        CoreQuote quote = quotes.get(0);
        System.out.println("Example: quote");
        System.out.println(quote);
        System.out.println();

        // original speaker of quote
        // note that quote.speaker() returns an Optional
        System.out.println("Example: original speaker of quote");
        System.out.println(quote.speaker().get());
        System.out.println();

        // canonical speaker of quote
        System.out.println("Example: canonical speaker of quote");
        System.out.println(quote.canonicalSpeaker().get());
        System.out.println();
    }

}
You can remove the annotators after ner if you only want DATEs.
The same TIMEX3 value that resulted from using
obj.get(TimeExpression.Annotation.class).getTemporal() ---> 2018-06-29T17:00
gets stored in NormalizedNamedEntityTagAnnotation.class when the ner tagger is run as part of the StanfordCoreNLP pipeline. Detailed information can be found in the documentation of the Stanford Temporal Tagger.
The following code worked fine in extracting the date:
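A minimal sketch of that kind of extraction, assuming only the standard tokenize,ssplit,pos,lemma,ner annotators and the English models on the classpath:
// Sketch: read TIMEX3-normalized DATE/TIME values produced by the ner annotator.
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation ann = new Annotation("The meeting was moved to June 29th, 2018 at 5pm.");
pipeline.annotate(ann);

for (CoreMap sentence : ann.get(CoreAnnotations.SentencesAnnotation.class)) {
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        String ner = token.ner();
        if ("DATE".equals(ner) || "TIME".equals(ner)) {
            // the normalized TIMEX3 value, e.g. 2018-06-29T17:00
            System.out.println(token.word() + " -> "
                    + token.get(CoreAnnotations.NormalizedNamedEntityTagAnnotation.class));
        }
    }
}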

StanfordNLP: models from kbp not found (Eclipse)

I am a bit new to Java and Eclipse. I usually use Python and NLTK for NLP tasks.
I am trying to follow the tutorial provided here:
package edu.stanford.nlp.examples;

import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ie.util.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.trees.*;
import java.util.*;

public class BasicPipelineExample {

    public static String text = "Joe Smith was born in California. " +
            "In 2017, he went to Paris, France in the summer. " +
            "His flight left at 3:00pm on July 10th, 2017. " +
            "After eating some escargot for the first time, Joe said, \"That was delicious!\" " +
            "He sent a postcard to his sister Jane Smith. " +
            "After hearing about Joe's trip, Jane decided she might go to France one day.";

    public static void main(String[] args) {
        // set up pipeline properties
        Properties props = new Properties();
        // set the list of annotators to run
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,depparse,coref,kbp,quote");
        // set a property for an annotator, in this case the coref annotator is being set to use the neural algorithm
        props.setProperty("coref.algorithm", "neural");
        // build pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // create a document object
        CoreDocument document = new CoreDocument(text);
        // annotate the document
        pipeline.annotate(document);

        // examples

        // 10th token of the document
        CoreLabel token = document.tokens().get(10);
        System.out.println("Example: token");
        System.out.println(token);
        System.out.println();

        // text of the first sentence
        String sentenceText = document.sentences().get(0).text();
        System.out.println("Example: sentence");
        System.out.println(sentenceText);
        System.out.println();

        // second sentence
        CoreSentence sentence = document.sentences().get(1);

        // list of the part-of-speech tags for the second sentence
        List<String> posTags = sentence.posTags();
        System.out.println("Example: pos tags");
        System.out.println(posTags);
        System.out.println();

        // list of the ner tags for the second sentence
        List<String> nerTags = sentence.nerTags();
        System.out.println("Example: ner tags");
        System.out.println(nerTags);
        System.out.println();

        // constituency parse for the second sentence
        Tree constituencyParse = sentence.constituencyParse();
        System.out.println("Example: constituency parse");
        System.out.println(constituencyParse);
        System.out.println();

        // dependency parse for the second sentence
        SemanticGraph dependencyParse = sentence.dependencyParse();
        System.out.println("Example: dependency parse");
        System.out.println(dependencyParse);
        System.out.println();

        // kbp relations found in fifth sentence
        List<RelationTriple> relations =
                document.sentences().get(4).relations();
        System.out.println("Example: relation");
        System.out.println(relations.get(0));
        System.out.println();

        // entity mentions in the second sentence
        List<CoreEntityMention> entityMentions = sentence.entityMentions();
        System.out.println("Example: entity mentions");
        System.out.println(entityMentions);
        System.out.println();

        // coreference between entity mentions
        CoreEntityMention originalEntityMention = document.sentences().get(3).entityMentions().get(1);
        System.out.println("Example: original entity mention");
        System.out.println(originalEntityMention);
        System.out.println("Example: canonical entity mention");
        System.out.println(originalEntityMention.canonicalEntityMention().get());
        System.out.println();

        // get document wide coref info
        Map<Integer, CorefChain> corefChains = document.corefChains();
        System.out.println("Example: coref chains for document");
        System.out.println(corefChains);
        System.out.println();

        // get quotes in document
        List<CoreQuote> quotes = document.quotes();
        CoreQuote quote = quotes.get(0);
        System.out.println("Example: quote");
        System.out.println(quote);
        System.out.println();

        // original speaker of quote
        // note that quote.speaker() returns an Optional
        System.out.println("Example: original speaker of quote");
        System.out.println(quote.speaker().get());
        System.out.println();

        // canonical speaker of quote
        System.out.println("Example: canonical speaker of quote");
        System.out.println(quote.canonicalSpeaker().get());
        System.out.println();
    }

}
but I always get the following output containing an error. This happens for all modules relating to kbp, and I did add the jar files as requested by the tutorial:
Adding annotator tokenize
No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.9 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.4 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [1.8 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.6 sec].
Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: Couldn't read TokensRegexNER from edu/stanford/nlp/models/kbp/regexner_caseless.tab
    at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.readEntries(TokensRegexNERAnnotator.java:593)
    at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.<init>(TokensRegexNERAnnotator.java:293)
    at edu.stanford.nlp.pipeline.NERCombinerAnnotator.setUpFineGrainedNER(NERCombinerAnnotator.java:209)
    at edu.stanford.nlp.pipeline.NERCombinerAnnotator.<init>(NERCombinerAnnotator.java:152)
    at edu.stanford.nlp.pipeline.AnnotatorImplementations.ner(AnnotatorImplementations.java:68)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$getNamedAnnotators$45(StanfordCoreNLP.java:546)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$null$70(StanfordCoreNLP.java:625)
    at edu.stanford.nlp.util.Lazy$3.compute(Lazy.java:126)
    at edu.stanford.nlp.util.Lazy.get(Lazy.java:31)
    at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:149)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:495)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:201)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:194)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:181)
    at NLP.Start.main(Start.java:13)
Caused by: java.io.IOException: Unable to open "edu/stanford/nlp/models/kbp/regexner_caseless.tab" as class path, filename or URL
    at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:481)
    at edu.stanford.nlp.io.IOUtils.readerFromString(IOUtils.java:618)
    at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.readEntries(TokensRegexNERAnnotator.java:590)
    ... 14 more
Do you have any idea how to fix this? Thanks in advance!
Probably you forgot to add stanford-corenlp-3.9.1-models.jar to your class path.
Well, according to the models page, there is a separate model download for the kbp material. Perhaps you have access to stanford-english-corenlp-2018-02-27-models, but not to stanford-english-kbp-corenlp-2018-02-27-models? I would guess this because, from what you provided in the question, it appears that the other models were found.

testing OpenNLP classifier model

I'm currently training a model for a classifier. Yesterday I found out that it will be more accurate if you also test the created classification model. I tried searching on the internet for how to test a model: testing openNLP model. But I can't get it to work. I think the reason is that I'm using OpenNLP version 1.8.3 instead of 1.5. Could anyone explain to me how to properly test my model in this version of OpenNLP?
Thanks in advance.
Below is the way I'm training my model:
public static DoccatModel trainClassifier() throws IOException
{
    // read the training data
    final int iterations = 100;
    InputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File("src/main/resources/trainingSets/trainingssetTest.txt"));
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

    // define the training parameters
    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ITERATIONS_PARAM, iterations + "");
    params.put(TrainingParameters.CUTOFF_PARAM, 0 + "");
    params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);

    // create a model from the training data
    DoccatModel model = DocumentCategorizerME.train("NL", sampleStream, params, new DoccatFactory());
    return model;
}
I can think of two ways to test your model. Either way, you will need to have annotated documents (and by annotated I really mean expert-classified).
The first way involves using the opennlp command-line DoccatEvaluator. The syntax would be something akin to
opennlp DoccatEvaluator -model model -data sampleData
The format of your sampleData should be
OUTCOME <document text....>
with documents separated by the newline character.
The second way involves creating a DocumentCategorizer. Something like this
(the model is the DoccatModel from your question):
DocumentCategorizer categorizer = new DocumentCategorizerME(model);
// could also use: Tokenizer tokenizer = new TokenizerME(tokenizerModel)
Tokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
// linesample is like in your question...
for (String sample = linesample.read(); sample != null; sample = linesample.read()) {
    String[] tokens = tokenizer.tokenize(sample);
    double[] outcomeProb = categorizer.categorize(tokens);
    String sampleOutcome = categorizer.getBestCategory(outcomeProb);
    // check if the outcome is right...
    // keep track of # right and wrong...
}
// calculate agreement metric of your choice
Since I typed the code here there may be a syntax error or two (which either I or the SO community can fix), but the idea of running through your data, tokenizing each sample, passing it through the document categorizer, and keeping track of the results is how you want to evaluate your model.
Hope it helps...
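As an alternative to counting hits by hand, OpenNLP also ships an evaluator you can call from Java. A minimal sketch, assuming testSampleStream is an ObjectStream<DocumentSample> over your held-out, expert-classified data (the name is made up here):
// Sketch: evaluate a DoccatModel with OpenNLP's built-in evaluator (1.8.x API).
DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
DocumentCategorizerEvaluator evaluator = new DocumentCategorizerEvaluator(categorizer);
evaluator.evaluate(testSampleStream);
System.out.println("Documents evaluated: " + evaluator.getDocumentCount());
System.out.println("Accuracy: " + evaluator.getAccuracy());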

parsing multiple lines with regex

I'm writing a program in Java that parses a BibTeX library file. Each entry should be parsed into
field and value. This is an example of a single BibTeX entry from a library:
@INPROCEEDINGS{conf/icsm/Ceccato07,
author = {Mariano Ceccato},
title = {Migrating Object Oriented code to Aspect Oriented Programming},
booktitle = {ICSM},
year = {2007},
pages = {497--498},
publisher = {IEEE},
bibdate = {2008-11-18},
bibsource = {DBLP, http://dblp.uni-trier.de/db/conf/icsm/icsm2007.html#Ceccato07},
crossref = {conf/icsm/2007},
owner = {Administrator},
timestamp = {2009.04.30},
url = {http://dx.doi.org/10.1109/ICSM.2007.4362668}
}
In this case, I just read the line and split it using the split method. For example, the first field (author) is parsed like this:
Scanner in = new Scanner(new File("library.bib"));
in.nextLine(); // skip the header
String input = in.nextLine(); // read (author = {Mariano Ceccato},)
String field = input.split("=")[0].trim(); // field = "author"
String value = input.split("=")[1]; // value = "{Mariano Ceccato},"
value = value.split("\\}")[0]; // value = "{Mariano Ceccato"
value = value.split("\\{")[1]; // value = "Mariano Ceccato"
value = value.trim(); // remove any white spaces (if any)
Up to now everything is good. However, there are BibTeX entries in the library whose values span multiple lines:
@ARTICLE{Aksit94AbstractingCF,
author = {Mehmet Aksit and Ken Wakita and Jan Bosch and Lodewijk Bergmans and
Akinori Yonezawa },
title = {{Abstracting Object Interactions Using Composition Filters}},
journal = {Lecture Notes in Computer Science},
year = {1994},
volume = {791},
pages = {152--??},
acknowledgement = {Nelson H. F. Beebe, Center for Scientific Computing, University of
Utah, Department of Mathematics, 110 LCB, 155 S 1400 E RM 233, Salt
Lake City, UT 84112-0090, USA, Tel: +1 801 581 5254, FAX: +1 801
581 4148, e-mail: \path|beebe@math.utah.edu|, \path|beebe@acm.org|,
\path|beebe@computer.org|, \path|beebe@ieee.org| (Internet), URL:
\path|http://www.math.utah.edu/~beebe/|},
bibdate = {Mon May 13 11:52:14 MDT 1996},
coden = {LNCSD9},
issn = {0302-9743},
owner = {aljasser},
timestamp = {2009.01.08}
}
As you can see, the acknowledgement field spans more than one line, so I can't read it using nextLine(). My parsing function works fine with it if I pass it as a single String. So what is the best way to read this entry and other multi-line entries, and still be able to read single-line entries?
The form of these entries is
@<type>{<Id>
<name>={<value>},
....
<name>={<value>}
}
Note that the last name-value pair is not followed by a comma.
If a value is split over several lines, then that simply means that a particular line does not yet contain the closing brace. In that case, scan the next line and append it to the string you are about to split. Keep doing this until the last characters in the string are "}," or "}" (this latter would happen if the 'acknowledgement' was the last name-value pair in the record).
For extra safety, count that the number of closing braces matches the number of opening braces, and keep appending lines to your string until it does. This would be to cover situations where you have a long title in an article that happened to unfortunately break at the wrong place, such as
title = {{Abstracting Object Interactions Using Composition Filters, and other stuff}
},
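A minimal sketch of that idea (the helper names are hypothetical, and it assumes you read the file with the same Scanner as in the question):
// Sketch: read one logical "name = {value}," entry, appending physical lines
// until the number of '{' and '}' characters balances.
static String readLogicalLine(Scanner in) {
    StringBuilder entry = new StringBuilder(in.nextLine());
    while (count(entry, '{') > count(entry, '}') && in.hasNextLine()) {
        entry.append(' ').append(in.nextLine().trim());
    }
    return entry.toString();
}

static int count(CharSequence s, char c) {
    int n = 0;
    for (int i = 0; i < s.length(); i++) {
        if (s.charAt(i) == c) n++;
    }
    return n;
}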
For these kinds of issues, it is always better to use a dedicated parser.
I googled for a BibTeX parser and found this.
If you would like to keep your own, as you are doing, one solution to this problem is to check whether
the line ends with "},"; if not, append the next line to the current one.
Having said that, there might be other issues; that's why I suggested using a parser.
