Java, Stanford NLP: Extract specific speech labels from parser

I recently discovered the Stanford NLP parser and it seems quite amazing. I currently have a working instance of it running in our project, but I am facing the two problems mentioned below.
How can I parse text and then extract only specific speech labels from the parsed data? For example, how can I extract only NNPS and PRP from the sentence?
Our platform works in both English and German, so there is always a possibility that the text is either in English or German. How can I accommodate this scenario? Thank you.
Code :
private final String PCG_MODEL = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz";
private final TokenizerFactory<CoreLabel> tokenizerFactory =
        PTBTokenizer.factory(new CoreLabelTokenFactory(), "invertible=true");

public void testParser() {
    LexicalizedParser lp = LexicalizedParser.loadModel(PCG_MODEL);
    String sent = "Complete Howto guide to install EC2 Linux server in Amazon Web services cloud.";
    Tree parse = lp.parse(sent);
    List<TaggedWord> taggedWords = parse.taggedYield();
    System.out.println(taggedWords);
}
The above example works, but as you can see, I am loading the English model. Thank you.

Try this:
for (Tree subTree : parse) { // traverse the sentence's parse tree
    if (subTree.label().value().equals("NNPS")) { // if the node's label is NNPS
        // do what you want with subTree
    }
}
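Building on that traversal, here is a minimal sketch that instead filters the tagged yield of the parse for just the NNPS and PRP tokens the question asks about. The taggedYield(), word(), and tag() calls are from the Stanford parser API already used in the question; the Set-based filter is my own addition (it needs java.util.Arrays, java.util.HashSet, java.util.Set, and edu.stanford.nlp.ling.TaggedWord on the import list):
Set<String> wantedTags = new HashSet<>(Arrays.asList("NNPS", "PRP"));
for (TaggedWord tw : parse.taggedYield()) {
    if (wantedTags.contains(tw.tag())) { // keep only the tags we care about
        System.out.println(tw.word() + "/" + tw.tag());
    }
}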

For query 1, I don't think stanford-nlp has a built-in option to extract only specific POS tags.
However, using custom-trained models, we can achieve the same. I tried a similar requirement with custom models for NER (named-entity recognition).
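For query 2 (text that may be English or German), one straightforward approach is to pick the parser model by language. Here is a minimal sketch, assuming you already know or detect the input language; the German model path below is the one shipped in Stanford's German models package, so verify it against the jar you actually download:
// Hypothetical language switch: load the matching PCFG model per input language.
String model = "de".equals(language)
        ? "edu/stanford/nlp/models/lexparser/germanPCFG.ser.gz"
        : "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz";
LexicalizedParser lp = LexicalizedParser.loadModel(model);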

Related

Is there any Polish implementation for similar words in word2vec?

I found the GoogleNews-vectors-negative300.bin model, but it covers only English words. Is there any Polish implementation for similar words in word2vec?
I have already tried the cc.pl.300.bin and NKJP-PodkorpusMilionowy models...
public Word2Vec getWord2Vec() {
    File gModel = new File("C:/Users/user/Desktop/GoogleNews-vectors-negative300.bin.gz");
    return WordVectorSerializer.readWord2VecModel(gModel);
}
The file...
https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.pl.vec
...as linked from...
https://fasttext.cc/docs/en/pretrained-vectors.html
...may work for you, if your library loads the simple 'text' format for exchanging word-vectors. (It's not in the Facebook FastText-specific binary format, as your cc.pl.300.bin file was.)
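If you are using deeplearning4j (which the WordVectorSerializer call in the question suggests), loading the text-format vectors could look like this minimal sketch; the local file path is hypothetical, and readWord2VecModel is documented to auto-detect the plain-text .vec format:
public Word2Vec getPolishWord2Vec() {
    // wiki.pl.vec downloaded from the fastText pretrained-vectors page
    File plModel = new File("C:/Users/user/Desktop/wiki.pl.vec");
    return WordVectorSerializer.readWord2VecModel(plModel);
}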

How to prepare training data for OpenNLP to tokenize tokens that contain more than one word?

In some languages (for example, Vietnamese), vocabulary items can consist of multiple words, so some tokens contain more than one word and cannot be produced by splitting on whitespace alone.
I have following input:
Người dân địa phương đã nhiều lần báo Điện lực Bến Tre nhưng chưa được giải quyết .
Expected output:
["Người dân", "địa phương", "đã", "nhiều", "lần", "báo", "Điện lực", "Bến Tre", "nhưng", "chưa", "được", "giải quyết"]
In my training data, an underscore (_) connects the words that need to stick together in one token:
Người_dân địa_phương đã nhiều lần báo Điện_lực Bến_Tre nhưng chưa được giải_quyết .
Here is the command line I use to train:
opennlp TokenizerTrainer -model "model/vi-token.bin" -alphaNumOpt 1 -lang "vi" -data "data/merge_vlsp_removehtml" -encoding "UTF-8" -params param/wordseg.param
with the param file:
Iterations=1000
However, the output does not join multiple words into one token; it still splits on whitespace.
Here is the command I run to get the output:
opennlp TokenizerME model/vi-token.bin < sample/sample_text > sample/sample_text.out
What should I do with the training data or the config params to train the tokenizer to produce multi-word tokens?
Rather than using the underscore for training, use tags. OpenNLP uses tags as the reference for training. Follow the instructions given for NER when training your tokenizer.
OpenNLP provides the 'TokenizerTrainer' tool to train on your data. The OpenNLP format contains one sentence per line. You can specify tokens either separated by whitespace or by a special tag, as in the sample line below.
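For reference, the special tag in OpenNLP's tokenizer training format is <SPLIT>, which marks a token boundary where there is no whitespace. A sample training line (the sentence is my own example, in the style of the OpenNLP documentation):
Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board.
Note that, as far as I know, TokenizerME only decides whether to split inside whitespace-separated chunks, so out of the box it cannot emit tokens that themselves contain spaces; for the Vietnamese case you may still need a post-processing step that rejoins words (for example via the underscore convention) after tokenization.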
You can follow this blog for a head start in OpenNLP for various purposes. The post will show you how to create a training file and build a new model.
You can easily create your own training dataset using the modelbuilder addon and follow the rules mentioned here to create a good NER model.
You can find some help on using the modelbuilder addon here.
Basically, you put all the information in one text file and the NER entities in another. The addon searches for a particular entity and replaces it with the required tag, producing the tagged data. It should be pretty easy to use this tool!
Also, follow markg's answer to get an understanding of creating new models on your own. This will help you build your own models, which can be customized for your applications.
Hope this helps!

Stanford-NER customization to classify software programming keywords

I am new to NLP, and I used the Stanford NER tool to classify some random text to extract special keywords used in software programming.
The problem is, I don't know how to change the classifiers and text annotators in Stanford NER so that they recognize software programming keywords. For example:
today Java used in different operating systems (Windows, Linux, ..)
the classification results should be such as:
Java "Programming_Language"
Windows "Operating_System"
Linux "Operating_system"
Would you please help on how to customize the StanfordNER classifiers to satisfied my needs?
I think it is documented quite well in the Stanford NER FAQ section: http://nlp.stanford.edu/software/crf-faq.shtml#a.
Here are the steps:
In your properties file, change the map to specify how your training data is annotated (or structured):
map = word=0,myfeature=1,answer=2
In src/edu/stanford/nlp/sequences/SeqClassifierFlags.java, add a flag stating that you want to use your new feature; let's call it useMyFeature. Below public boolean useLabelSource = false, add:
public boolean useMyFeature = true;
In the same file, in the setProperties(Properties props, boolean printProps) method, after else if (key.equalsIgnoreCase("useTrainLexicon")) { ... }, tell the tool whether this flag is on or off for you:
else if (key.equalsIgnoreCase("useMyFeature")) {
    useMyFeature = Boolean.parseBoolean(val);
}
In src/edu/stanford/nlp/ling/CoreAnnotations.java, add the following section:
public static class myfeature implements CoreAnnotation<String> {
    public Class<String> getType() {
        return String.class;
    }
}
In src/edu/stanford/nlp/ling/AnnotationLookup.java, in public enum KeyLookup { ... }, add at the bottom:
MY_TAG(CoreAnnotations.myfeature.class, "myfeature")
In src/edu/stanford/nlp/ie/NERFeatureFactory.java, depending on the "type" of feature it is, add inside protected Collection<String> featuresC(PaddedList<IN> cInfo, int loc):
if (flags.useMyFeature) {
    featuresC.add(c.get(CoreAnnotations.myfeature.class) + "-my_tag");
}
Debugging:
In addition to this, there are methods that dump the features to a file; use them to see how things are done under the hood. Also, I think you will have to spend some time with the debugger too :P
It seems you want to train your own custom NER model.
Here is a detailed tutorial with full code:
https://dataturks.com/blog/stanford-core-nlp-ner-training-java-example.php?s=so
Training data format
Training data is passed as a text file where each line is one word-label pair. Each word in the line should be labeled in a format like "word\tLABEL"; the word and the label name are separated by a tab '\t'. For a text sentence, we should break it down into words and add one line per word to the training file. To mark the start of the next sentence, we add an empty line in the training file.
Here is a sample of the input training file:
hp Brand
spectre ModelName
x360 ModelName
home Category
theater Category
system 0
horizon ModelName
zero ModelName
dawn ModelName
ps4 0
Depending upon your domain, you can build such a dataset either automatically or manually. Building such a dataset manually can be really painful; tools like a NER annotation tool can help make the process much easier.
Train model
public void trainAndWrite(String modelOutPath, String prop, String trainingFilepath) {
    Properties props = StringUtils.propFileToProperties(prop);
    props.setProperty("serializeTo", modelOutPath);
    // if a training file is passed in, use it; else use the one from the properties file
    if (trainingFilepath != null) {
        props.setProperty("trainFile", trainingFilepath);
    }
    SeqClassifierFlags flags = new SeqClassifierFlags(props);
    CRFClassifier<CoreLabel> crf = new CRFClassifier<>(flags);
    crf.train();
    crf.serializeClassifier(modelOutPath);
}
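For reference, a minimal properties file for such a run might look like the sketch below; the flags are taken from the sample configuration in the Stanford NER FAQ, so treat the exact set as a starting point rather than a requirement:
trainFile = training-data.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC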
Use the model to generate tags:
public void doTagging(CRFClassifier model, String input) {
    input = input.trim();
    System.out.println(input + "=>" + model.classifyToString(input));
}
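Here is a minimal usage sketch tying the two methods together; the file paths are hypothetical, and getClassifierNoExceptions is one of the standard CRFClassifier loading helpers:
// Train a model, load it back, and tag a sample sentence.
trainAndWrite("ner-model.ser.gz", "ner.prop", "training-data.tsv");
CRFClassifier<CoreLabel> model =
        CRFClassifier.getClassifierNoExceptions("ner-model.ser.gz");
doTagging(model, "i bought a new hp spectre x360");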
Hope this helps.

Using the Stanford Dependency Parser on a previously tagged sentence

I'm currently using the Twitter POS tagger available here to tag tweets with the Penn Treebank tag set.
Here is that code:
import java.io.IOException;
import java.util.List;
import cmu.arktweetnlp.Tagger;
import cmu.arktweetnlp.Tagger.TaggedToken;

Tagger tagger = new Tagger();

/* Tags the tweet text */
List<TaggedToken> tagTweet(String text) throws IOException {
    // Loads Penn Treebank POS tags
    tagger.loadModel("res/model.ritter_ptb_alldata_fixed.txt");
    // Tags the tweet text
    List<TaggedToken> taggedTokens = tagger.tokenizeAndTag(text);
    return taggedTokens;
}
Now I need to identify where the direct objects are in these tags. After some searching, I've discovered that the Stanford Parser can do this by way of the Stanford typed dependencies (online example). By using the dobj() call, I should be able to get what I need.
However, I have not found any good documentation on how to feed already-tagged sentences into this tool. From what I understand, before using the dependency parser I need to create a tree from the sentence's tokens/tags. How is this done? I have not been able to find any example code.
The Twitter POS tagger contains an instance of the Stanford NLP tools, so I'm not far off; however, I am not familiar enough with the Stanford tools to feed my POS-tagged text into it and get the dependency parser to work properly. The FAQ does mention this functionality, but without any example code to go off of, I'm a bit stuck.
Here is how it is done with completely manual creation of the List discussed in the FAQ:
String[] sent3 = { "It", "can", "can", "it", "." };
// Parser gets second "can" wrong without help (parsing it as modal MD)
String[] tag3 = { "PRP", "MD", "VB", "PRP", "." };
List<TaggedWord> sentence3 = new ArrayList<TaggedWord>();
for (int i = 0; i < sent3.length; i++) {
    sentence3.add(new TaggedWord(sent3[i], tag3[i]));
}
Tree parse = lp.parse(sentence3);
parse.pennPrint();
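From there, to get at the direct objects the question asks about, here is a minimal sketch, assuming the stock English TreebankLanguagePack and the typed-dependency API shown elsewhere on this page; "dobj" is the short name of the direct-object relation:
// Convert the parse tree into typed dependencies and keep only direct objects.
TreebankLanguagePack tlp = new PennTreebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
for (TypedDependency td : gs.typedDependenciesCCprocessed()) {
    if ("dobj".equals(td.reln().getShortName())) {
        System.out.println("direct object: " + td.gov() + " -> " + td.dep());
    }
}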

How to obtain the "Grammatical Relation" using Stanford NLP Parser?

I am absolutely new to Java development.
Can someone please elaborate on how to obtain "grammatical relations" using Stanford's Natural Language Processing Lexical Parser (open-source Java code)?
Thanks!
See line 88 of the first file in my code for how to run the Stanford Parser programmatically:
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
Collection<TypedDependency> tdl = gs.typedDependenciesCollapsed();
System.out.println("words: " + words);
System.out.println("POStags: " + tags);
System.out.println("stemmedWordsAndTags: " + stems);
System.out.println("typedDependencies: " + tdl);
The collection tdl is a list of these typed dependencies. If you look at the javadoc for TypedDependency, you'll see that the .reln() method gets you the grammatical relation.
Lines 311-318 of the third file in my code show how to use that list of typed dependencies. I happen to get the name of the relation, but you could get the relation itself, which would be of the class GrammaticalRelation.
for (Iterator<TypedDependency> iter = tdl.iterator(); iter.hasNext(); ) {
    TypedDependency var = iter.next();
    TreeGraphNode dep = var.dep();
    TreeGraphNode gov = var.gov();
    // All useful information for a node in the tree
    String reln = var.reln().getShortName();
}
Don't feel bad, I spent a miserable day or two trying to figure out how to use the parser. I don't know if the docs have improved, but when I used it they were pretty damn awful.
