Stanford-NER customization to classify software programming keywords


I am new to NLP, and I used the Stanford NER tool to classify some random text in order to extract special keywords used in software programming.
The problem is that I don't know how to change the classifiers and text annotators in Stanford NER so that they recognize software programming keywords. For example:
today Java used in different operating systems (Windows, Linux, ..)
The classification results should look like this:
Java "Programming_Language"
Windows "Operating_System"
Linux "Operating_System"
Could you please help me customize the Stanford NER classifiers to satisfy my needs?

I think this is documented quite well in the Stanford NER FAQ: http://nlp.stanford.edu/software/crf-faq.shtml#a.
Here are the steps:
In your properties file, change the map to specify how your training data is annotated (or structured):
map = word=0,myfeature=1,answer=2
In src\edu\stanford\nlp\sequences\SeqClassifierFlags.java, add a flag stating that you want to use your new feature; let's call it useMyFeature. Below public boolean useLabelSource = false; add:
public boolean useMyFeature = true;
In the same file, in the setProperties(Properties props, boolean printProps) method, after
else if (key.equalsIgnoreCase("useTrainLexicon")) { ..}
tell the tool whether this flag is on or off:
else if (key.equalsIgnoreCase("useMyFeature")) {
    useMyFeature = Boolean.parseBoolean(val);
}
In src/edu/stanford/nlp/ling/CoreAnnotations.java, add the following section:
public static class myfeature implements CoreAnnotation<String> {
    public Class<String> getType() {
        return String.class;
    }
}
In src/edu/stanford/nlp/ling/AnnotationLookup.java, at the bottom of public enum KeyLookup {..}, add:
MY_TAG(CoreAnnotations.myfeature.class,"myfeature")
In src\edu\stanford\nlp\ie\NERFeatureFactory.java, depending on the "type" of feature it is, add the following in
protected Collection<String> featuresC(PaddedList<IN> cInfo, int loc)
(the flag checked here must be the one you declared above, useMyFeature):
if (flags.useMyFeature) {
    featuresC.add(c.get(CoreAnnotations.myfeature.class) + "-my_tag");
}
Debugging:
In addition to this, there are methods that dump the features to a file; use them to see what is happening under the hood. I think you will also have to spend some time with the debugger :P

It seems you want to train your own custom NER model.
Here is a detailed tutorial with full code:
https://dataturks.com/blog/stanford-core-nlp-ner-training-java-example.php?s=so
Training data format
Training data is passed as a text file in which each line is one word-label pair, in the format "word\tLABEL": the word and the label name are separated by a tab ('\t'). Each sentence has to be broken down into words, with one line per word in the training file. To mark the start of the next sentence, add an empty line to the training file.
Here is a sample of the input training file (the two columns are separated by a tab):
hp Brand
spectre ModelName
x360 ModelName
home Category
theater Category
system 0
horizon ModelName
zero ModelName
dawn ModelName
ps4 0
Depending on your domain, you can build such a dataset either automatically or manually. Building it manually can be really painful; an NER annotation tool can make the process much easier.
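As a rough illustration of building such a file automatically, here is a minimal sketch that labels tokens from a small lookup table and writes them in the word<TAB>LABEL format described above. The class name, the gazetteer contents, and the naive tokenization are assumptions made for this example only; they are not part of Stanford NER or of the tutorial.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TrainingFileSketch {
    // Label for tokens outside every class; the sample above uses "0".
    static final String BACKGROUND = "0";

    /** Emits one "word<TAB>LABEL" line per token and an empty line after each sentence. */
    static String toTrainingLines(List<String> sentences, Map<String, String> gazetteer) {
        StringBuilder out = new StringBuilder();
        for (String sentence : sentences) {
            // Naive tokenization (assumption): split on anything that is not a letter, digit, '+' or '#'.
            for (String token : sentence.split("[^\\p{Alnum}+#]+")) {
                if (token.isEmpty()) {
                    continue;
                }
                // Gazetteer keys are stored in lower case.
                String label = gazetteer.getOrDefault(token.toLowerCase(), BACKGROUND);
                out.append(token).append('\t').append(label).append('\n');
            }
            out.append('\n'); // sentence boundary
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> gazetteer = new LinkedHashMap<>();
        gazetteer.put("java", "Programming_Language");
        gazetteer.put("windows", "Operating_System");
        gazetteer.put("linux", "Operating_System");
        System.out.print(toTrainingLines(
                List.of("today Java used in different operating systems (Windows, Linux, ..)"),
                gazetteer));
    }
}
```

A dictionary lookup like this only bootstraps the data: ambiguous words (for example "Java" the island) still need manual review before training.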
Train model
public void trainAndWrite(String modelOutPath, String prop, String trainingFilepath) {
    Properties props = StringUtils.propFileToProperties(prop);
    props.setProperty("serializeTo", modelOutPath);
    // If a training file is passed in, use it; otherwise use the one from the properties file.
    if (trainingFilepath != null) {
        props.setProperty("trainFile", trainingFilepath);
    }
    SeqClassifierFlags flags = new SeqClassifierFlags(props);
    CRFClassifier<CoreLabel> crf = new CRFClassifier<>(flags);
    crf.train();
    crf.serializeClassifier(modelOutPath);
}
Use the model to generate tags:
public void doTagging(CRFClassifier<CoreLabel> model, String input) {
    input = input.trim();
    System.out.println(input + "=>" + model.classifyToString(input));
}
Hope this helps.

Related

Parse a single POJO from multiple YAML documents representing different classes

I want to use a single YAML file which contains several different objects - for different applications. I need to fetch one object to get an instance of MyClass1, ignoring the rest of docs for MyClass2, MyClass3, etc. Some sort of selective de-serializing: now this class, then that one... The structure of MyClass2, MyClass3 is totally unknown to the application working with MyClass1. The file is always a valid YAML, of course.
The YAML may be of any structure we need to implement such a multi-class container. The preferred parsing tool is snakeyaml.
Is it sensible? How can I ignore all but one object?
UPD: replaced all "document" with "object". I think we have to speak about a single YAML document containing several objects of different structure. Moreover, the parser knows only one of those structures and wants to ignore the rest.
UPD2: I think it is impossible with snakeyaml. We have to read all objects anyway and select the needed one later. But maybe I'm wrong.
UPD3: sample config file
---
-
  exportConfiguration781:
    attachmentFieldName: "name"
    baseSftpInboxPath: /home/user/somedir/
    somebool: false
    days: 9999
    expected:
      - ABC w/o quotes
      - "Cat ABC"
      - "Some string"
    dateFormat: yyyy-MMdd-HHmm
    user: someuser
-
  anotherConfiguration:
    k1: v1
    k2:
      - v21
      - v22

This is definitely possible with SnakeYAML, albeit not trivial. Here's a general rundown of what you need to do:
First, let's have a look at what loading with SnakeYAML does. Here's the important part of the Yaml class:
private Object loadFromReader(StreamReader sreader, Class<?> type) {
Composer composer = new Composer(new ParserImpl(sreader), resolver, loadingConfig);
constructor.setComposer(composer);
return constructor.getSingleData(type);
}
The composer parses YAML input into Nodes. To do that, it doesn't need any knowledge about the structure of your classes, since every node is either a ScalarNode, a SequenceNode or a MappingNode and they just represent the YAML structure.
The constructor takes a root node generated by the composer and generates native POJOs from it. So what you want to do is to throw away parts of the node graph before they reach the constructor.
The easiest way to do that is probably to derive from Composer and override two methods like this:
public class MyComposer extends Composer {
    private final int objIndex;

    public MyComposer(Parser parser, Resolver resolver, int objIndex) {
        super(parser, resolver);
        this.objIndex = objIndex;
    }

    public MyComposer(Parser parser, Resolver resolver, LoaderOptions loadingConfig, int objIndex) {
        super(parser, resolver, loadingConfig);
        this.objIndex = objIndex;
    }

    @Override
    public Node getNode() {
        return strip(super.getNode());
    }

    // Keep only the selected item of the top-level sequence
    private Node strip(Node input) {
        return ((SequenceNode) input).getValue().get(objIndex);
    }
}
The strip implementation is just an example. In this case, I assumed your YAML looks like this (object content is arbitrary):
- {first: obj}
- {second: obj}
- {third: obj}
And you simply select the object you actually want to deserialize by its index in the sequence. But you can also have something more complex like a searching algorithm.
Now that you have your own composer, you can do
Constructor constructor = new Constructor();
// assuming we want to get the object at index 1 (i.e. second object)
Composer composer = new MyComposer(new ParserImpl(sreader), new Resolver(), 1);
constructor.setComposer(composer);
MyObject result = (MyObject)constructor.getSingleData(MyObject.class);

The answer of @flyx was very helpful for me, showing how to work around the library's limitations (in our case, snakeyaml) by overriding some methods. Thanks a lot! It may well contain a final solution, but I have not got there yet. Besides, the simple solution below is robust and should be considered even if we find a complete solution that hooks into the library.
I've decided to solve the task by double distilling, sorry, double processing the configuration file. Imagine the file consisting of several parts, each marked by a unique token-delimiter. To keep it YAML-like, it may be
---
#this is a unique key for the configuration A
<some YAML document>
---
#this is another key for the configuration B
<some YAML document>
The first pass is pre-processing. For a given String fileString and String key (and, for example, DELIMITER = "\n---\n"), we select the substring holding the configuration identified by the key:
int begIndex;
do {
    begIndex = fileString.indexOf(DELIMITER);
    if (begIndex == -1) {
        break;
    }
    if (fileString.startsWith(DELIMITER + key, begIndex)) {
        fileString = fileString.substring(begIndex + DELIMITER.length() + key.length());
        break;
    }
    // spoil alien delimiter and repeat search
    fileString = fileString.replaceFirst(DELIMITER, " ");
} while (true);
int endIndex = fileString.indexOf(DELIMITER);
if (endIndex != -1) {
    fileString = fileString.substring(0, endIndex);
}
Now we feed the fileString to the simple YAML parsing
ExportConfiguration configuration = new Yaml(new Constructor(ExportConfiguration.class))
.loadAs(fileString, ExportConfiguration.class);
This time we have a single document that must correspond to the ExportConfiguration class.
Note 1: The structure, and even the content, of the rest of the configuration file plays absolutely no role. This was the main idea: to keep independent configurations in a single file.
Note 2: The other configurations may be JSON or XML or whatever. We have a pre-processing method that returns one configuration as a String, and the next processor parses it properly.
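The pre-processing pass can also be written as a small standalone method that does not modify the input while searching. This is only a sketch under a few assumptions: the key occupies the whole first line of its section, sections are separated by "\n---\n", and the class and method names are made up for the example.

```java
class ConfigSectionSketch {
    static final String DELIMITER = "\n---\n";

    /** Returns the text that follows the key line of the matching section, or null if no section has that key. */
    static String extractSection(String fileString, String key) {
        // Prepend a newline so that a "---" on the very first line also counts as a delimiter.
        String text = "\n" + fileString;
        int from = 0;
        while (true) {
            int delim = text.indexOf(DELIMITER, from);
            if (delim == -1) {
                return null; // no section starts with this key
            }
            int bodyStart = delim + DELIMITER.length();
            if (text.startsWith(key + "\n", bodyStart)) {
                int keyEnd = bodyStart + key.length(); // index of the newline that ends the key line
                int end = text.indexOf(DELIMITER, keyEnd);
                if (end == -1) {
                    return text.substring(keyEnd + 1); // last section: take everything that is left
                }
                return end == keyEnd ? "" : text.substring(keyEnd + 1, end);
            }
            from = delim + 1; // not our section, keep searching
        }
    }

    public static void main(String[] args) {
        String file = "---\n#A\nname: first\n---\n#B\nname: second\n";
        System.out.println(extractSection(file, "#B"));
    }
}
```

The returned String can then be handed to whichever parser fits that section, as described in Note 2.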

Java, Stanford NLP : Extract specific speech labels from parser

I recently discovered the Stanford NLP parser and it seems quite amazing. I currently have a working instance of it running in our project, but I am facing the following two problems.
How can I parse text and then extract only specific part-of-speech labels from the parsed data? For example, how can I extract only NNPS and PRP from the sentence?
Our platform works in both English and German, so the text can always be in either language. How can I accommodate this scenario? Thank you.
Code :
private final String PCG_MODEL = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz";
private final TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "invertible=true");

public void testParser() {
    LexicalizedParser lp = LexicalizedParser.loadModel(PCG_MODEL);
    String sent = "Complete Howto guide to install EC2 Linux server in Amazon Web services cloud.";
    Tree parse = lp.parse(sent);
    // Each TaggedWord pairs a token with its part-of-speech tag
    List<TaggedWord> taggedWords = parse.taggedYield();
    System.out.println(taggedWords);
}
The above example works, but as you can see, I am loading only the English model. Thank you.

Try this:
for (Tree subTree : parse) { // traverse every node of the sentence's parse tree
    if (subTree.label().value().equals("NNPS")) { // the node's label is NNPS
        // Do what you want
    }
}

For query 1, I don't think stanford-nlp has a built-in option to extract only specific POS tags.
However, the same can be achieved with custom-trained models. I tried something similar for NER (named entity recognition) with custom models.

Stanford Core NLP: Entity type non deterministic

I have built a Java parser using Stanford CoreNLP, and I am having trouble getting consistent results from the CoreNLP object: I get different entity types for the same input text. This looks like a bug in CoreNLP to me. I am wondering whether any Stanford NLP users have encountered this issue and found a workaround. This is my service class, which I instantiate once and reuse.
class StanfordNLPService {
    //private static final Logger logger = LogConfiguration.getInstance().getLogger(StanfordNLPServer.class.getName());
    private StanfordCoreNLP nerPipeline;

    /*
       Initialize the nlp instances for ner and sentiments.
     */
    public void init() {
        Properties nerAnnotators = new Properties();
        nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
        nerPipeline = new StanfordCoreNLP(nerAnnotators);
    }

    /**
     * @param text Text from which entities are to be extracted.
     */
    public void printEntities(String text) {
        // boolean tracking = PerformanceMonitor.start("StanfordNLPServer.getEntities");
        try {
            // Properties nerAnnotators = new Properties();
            // nerAnnotators.put("annotators", "tokenize,ssplit,pos,lemma,ner");
            // nerPipeline = new StanfordCoreNLP(nerAnnotators);
            Annotation document = nerPipeline.process(text);
            // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
            List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
            for (CoreMap sentence : sentences) {
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    // Get the entity type and offset information needed.
                    String currEntityType = token.get(CoreAnnotations.NamedEntityTagAnnotation.class); // NER type
                    int currStart = token.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class); // token offset_start
                    int currEnd = token.get(CoreAnnotations.CharacterOffsetEndAnnotation.class); // token offset_end
                    String currPos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class); // POS type
                    System.out.println("(Type:value:offset)\t" + currEntityType + ":\t" + text.substring(currStart, currEnd) + "\t" + currStart);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Discrepancy in the results: the type changed from MISC to O after the first use.
Iteration 1:
(Type:value:offset) MISC: Appropriate 100
(Type:value:offset) MISC: Time 112
Iteration 2:
(Type:value:offset) O: Appropriate 100
(Type:value:offset) O: Time 112

Here is the answer from the NER FAQ:
http://nlp.stanford.edu/software/crf-faq.shtml
Is the NER deterministic? Why do the results change for the same data?
Yes, the underlying CRF is deterministic. If you apply the NER to the same sentence more than once, though, it is possible to get different answers the second time. The reason for this is the NER remembers whether it has seen a word in lowercase form before.
The exact way this is used as a feature is in the word shape feature, which treats words such as "Brown" differently if it has or has not seen "brown" as a lowercase word before. If it has, the word shape will be "Initial upper, have seen all lowercase", and if it has not, the word shape will be "Initial upper, have not seen all lowercase".
This feature can be turned off in recent versions with the flag -useKnownLCWords false

I've looked over the code a bit, and here is a possible way to resolve this:
You could load each of the 3 serialized CRFs with useKnownLCWords set to false and serialize them again, then supply the new serialized CRFs to your StanfordCoreNLP.
Here is a command for loading a serialized CRF with useKnownLCWords set to false, and then dumping it again:
java -mx600m -cp "*:." edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -useKnownLCWords false -serializeTo classifiers/new.english.all.3class.distsim.crf.ser.gz
Use whatever file names you like, obviously! This command assumes you are in stanford-corenlp-full-2015-04-20/ and have a directory classifiers with the serialized CRFs. Change as appropriate for your setup.
This command should load the serialized CRF, override the setting with useKnownLCWords set to false, and then re-dump the CRF to new.english.all.3class.distsim.crf.ser.gz
Then in your original code:
nerAnnotators.put("ner.model","comma-separated-list-of-paths-to-new-serialized-crfs");
Please let me know if this works or if it's not working, and I can look more deeply into this!

After doing some research, I found that the issue is in the ClassifierCombiner.classify() method. One of the baseClassifiers loaded by default, edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz, returns a different type on some occasions. I am trying to load only the first model to resolve this issue.
The problem is in the following area of the code:
CRFClassifier.classifyMaxEnt()
int[] bestSequence = tagInference.bestSequence(model); Line 1249
ExactBestSequenceFinder.bestSequence() returns a different sequence for the above model and the same input when called multiple times.
I am not sure whether this needs a code fix or some configuration changes to the model. Any additional insight is appreciated.

How to extract one boolean field from XML?

I have a model in XML format, as shown below, and I need to parse the XML and check whether internal-flag is set to true. In my other models, internal-flag may be set to false. Sometimes the field is missing entirely, in which case my code should treat it as false by default.
<?xml version="1.0"?>
<ClientMetadata
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.google.com client.xsd"
xmlns="http://www.google.com">
<client id="200" version="13">
<name>hello world</name>
<description>hello hello</description>
<organization>TESTER</organization>
<author>david</author>
<internal-flag>true</internal-flag>
<clock>
<clock>
<for>
<init>val(tmp1) = 1</init>
<clock>
<eval><![CDATA[result("," + $convert(val(tmp1)))]]></eval>
</clock>
</for>
<for>
<incr>val(tmp1) -= 1</incr>
<clock>
<eval><![CDATA[result("," + $convert(val(tmp1)))]]></eval>
</clock>
</for>
</clock>
</clock>
</client>
</ClientMetadata>
I have a POJO in which I am storing my above model -
public class ModelMetadata {
private int modelId;
private String modelValue; // this string will have my above XML data as string
// setters and getters here
}
Now what is the best way to determine whether my model has internal-flag set as true or not?
// this list will have all my Models stored
List<ModelMetadata> metadata = getModelMetadata();
for (ModelMetadata model : metadata) {
// my model will be stored in below variable in XML format
String modelValue = model.getModelValue();
// now parse modelValue variable and extract `internal-flag` field property
}
Do I need to use XML parsing for this or is there any better way to do this?
Update:-
I have started using Stax and this is what I have tried so far but not sure how can I extract that field -
InputStream is = new ByteArrayInputStream(modelValue.getBytes());
XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(is);
while(r.hasNext()) {
// now what should I do here?
}

There is an easy solution using XMLBeam (Disclosure: I'm affiliated with that project), just a few lines:
public class ReadBoolean {
    public interface ClientMetaData {
        @XBRead("//xbdefaultns:internal-flag")
        boolean hasFlag();
    }

    public static void main(String[] args) throws IOException {
        ClientMetaData clientMetaData = new XBProjector().io().url("res://xmlWithBoolean.xml").read(ClientMetaData.class);
        System.out.println("Has flag:" + clientMetaData.hasFlag());
    }
}
This program prints out
Has flag:true
for your XML.

You could also do some simple string parsing, but this will only work for small cases with proper XML and if there's only a single <internal-flag> element.
This is a simple solution to your problem without using any XML parsing utilities. Other solutions may be more robust or powerful.
Find the index of the string literal <internal-flag>. If it doesn't exist, return false.
Go forward "<internal-flag>".length (15) characters. Read up to the next </internal-flag>, which should be the string true or false.
Take that string, use Boolean.parseBoolean(String) to get a boolean value.
If you want me to help you out with the code just drop a comment!
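The steps above can be sketched roughly as follows. The class and method names are made up for this example, and the same caveats apply: it assumes well-formed XML with at most one <internal-flag> element, and it treats a missing or unclosed element as false.

```java
class InternalFlagSketch {
    private static final String OPEN = "<internal-flag>";
    private static final String CLOSE = "</internal-flag>";

    /** Returns the value of the first internal-flag element, or false if there is none. */
    static boolean hasInternalFlag(String xml) {
        int start = xml.indexOf(OPEN);
        if (start == -1) {
            return false; // element missing: default to false
        }
        int valueStart = start + OPEN.length();
        int end = xml.indexOf(CLOSE, valueStart);
        if (end == -1) {
            return false; // no closing tag: treat as not set
        }
        // parseBoolean yields true only for "true" (ignoring case)
        return Boolean.parseBoolean(xml.substring(valueStart, end).trim());
    }

    public static void main(String[] args) {
        String xml = "<client><internal-flag>true</internal-flag></client>";
        System.out.println(hasInternalFlag(xml));
    }
}
```

In the loop from the question, this would be called as hasInternalFlag(model.getModelValue()).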

If you are willing to consider adding Groovy to your mix (e.g. see the book Making Java Groovy) then using a Groovy XMLParser and associated classes will make this simple.
If you need to stick to Java, let me put in a shameless plug for my Xen library, which mimics a lot of the "Groovy way". The answer to your question would be:
Xen doc = new XenParser().parseText(YOUR_XML_STRING);
String internalFlag = doc.getText(".client.internal-flag");
boolean isSet = "true".equals(internalFlag);
If the XML comes from a File, Stream, or URI, that can be handled too.
Caveat emptor, (even though it is free) this is a fairly new library, written solely by a random person (me), and not thoroughly tested on all the crazy XML out there. If anybody knows of a similar, more "mainstream" library I'd be very interested in hearing about it.

Best way to parse commands in a java text-based game

I'm developing a text based game in java and I'm looking for the best way to deal with player's commands. Commands allow the player to interact with the environment, like :
"look north" : to have a full description of what you have in the north direction
"drink potion" : to pick an object named "potion" in your inventory and drink it
"touch 'strange button'" : touch the object called 'strange button' and trigger an action if there is one attached to it, like "oops you died..."
"inventory" : to have a full description of your inventory
etc...
My objective now is to develop a complete set of such simple commands, but I'm having trouble finding an easy way to parse them. I would like to develop a flexible and extensible parser that dispatches on the main command, like "look", "use", "attack", etc., where each command has its own syntax and actions in the game.
I found a lot of tools for parsing command-line arguments like -i -v --verbose, but none of them seems flexible enough to fit my needs. They can parse arguments one by one, but without taking into account a specific syntax for each command. I tried JCommander, which seems to be perfect, but I got lost between what is an argument and what is a parameter, what calls what, etc.
So if someone could help me pick the right Java library for this, that would be great :)

Unless you're dealing with complex command strings that involve, for instance, arithmetic expressions or well-balanced parentheses, I would suggest you go with a plain Scanner.
Here's an example that I would find readable and easy to maintain:
interface Action {
    void run(Scanner args);
}

class Drink implements Action {
    @Override
    public void run(Scanner args) {
        if (!args.hasNext())
            throw new IllegalArgumentException("What should I drink?");
        System.out.println("Drinking " + args.next());
    }
}

class Look implements Action {
    @Override
    public void run(Scanner args) {
        if (!args.hasNext())
            throw new IllegalArgumentException("Where should I look?");
        System.out.println("Looking " + args.next());
    }
}
And use it as
Map<String, Action> actions = new HashMap<>();
actions.put("look", new Look());
actions.put("drink", new Drink());
String command = "drink coke";
// Parse
Scanner cmdScanner = new Scanner(command);
actions.get(cmdScanner.next()).run(cmdScanner);
You could even make it fancier and use annotations instead as follows:
@Retention(RetentionPolicy.RUNTIME)
@interface Command {
    String value();
}

@Command("drink")
class Drink implements Action {
    ...
}

@Command("look")
class Look implements Action {
    ...
}
And use it as follows:
List<Action> actions = Arrays.asList(new Drink(), new Look());
String command = "drink coke";
// Parse
Scanner cmdScanner = new Scanner(command);
String cmd = cmdScanner.next();
for (Action a : actions) {
if (a.getClass().getAnnotation(Command.class).value().equals(cmd))
a.run(cmdScanner);
}
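One thing a plain whitespace-splitting Scanner does not cover is a quoted object name such as 'strange button' from the question. A possible way to handle it, sketched here with a regular expression, is to tokenize the command first and then dispatch on the first token. The class name and the quoting rule (single quotes, no escaping) are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommandTokenizerSketch {
    // Either text between single quotes, or a run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("'([^']*)'|(\\S+)");

    /** Splits a command into tokens, keeping a single-quoted phrase together as one token. */
    static List<String> tokenize(String command) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(command);
        while (m.find()) {
            // group(1) is set for a quoted phrase, group(2) for a bare word
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("touch 'strange button'"));
    }
}
```

The first token selects the Action, and the remaining tokens can be passed to it as arguments instead of the raw Scanner.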

I don't think you want to parse command-line arguments. That would mean each "move" in your game requires starting a new JVM instance to run a different program, plus the extra complexity of saving state between JVM sessions, etc.
This looks like a text-based game where you prompt users for what to do next. You probably just want users to enter input on STDIN.
Example, let's say your screen says:
You are now in a dark room. There is a light switch
what do you want to do?
1. turn on light
2. Leave room back the way you came.
Please choose option:
Then the user types 1 or 2 (or, if you want to be fancy, "turn on light", etc.). You then read a line from STDIN and parse the String to see what the user chose. I recommend you look at java.util.Scanner to see how to easily parse text:
Scanner scanner = new Scanner(System.in);
String userInput = scanner.nextLine();
// parse userInput string here

The fun part is to have commands that are human-readable and, at the same time, machine-parsable.
First of all, you need to define the syntax of your language, for example:
look (north|south|east|west)
But that is a regular expression, which is generally not the best way to express a syntactical rule, so I would say this is better:
Sequence("look", Xor("north", "south", "east", "west"));
I think you get the idea. You need to define something like:
public abstract class Syntax { public abstract boolean match(String cmd); }
then
public class Atom extends Syntax { private String keyword; }
public class Sequence extends Syntax { private List<Syntax> atoms; }
public class Xor extends Syntax { private List<Syntax> atoms; }
Use a bunch of factory functions that wrap the constructors and return Syntax. Then you will eventually have something like this:
class GlobeSyntax
{
Syntax syntax = Xor( // exclusive or
Sequence(Atom("look"),
Xor(Atom("north"), Atom("south"), Atom("east"), Atom("west"))),
Sequence(Atom("drink"),
Or(Atom("Wine"), Atom("Drug"), Atom("Portion"))), // may drink multiple at the same time
/* ... */
);
}
or so.
Now all you need is a recursive parser that follows these rules.
As you can see, the structure is recursive, which makes it easy to code and easy to maintain. This way your commands are not only human-readable but also machine-parsable.
Of course it's not finished yet: you still need to define the actions. But that's easy, right? It's a typical OO trick: all you need to do is perform something when an Atom is matched.
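To make the idea concrete, here is a minimal sketch of such a recursive matcher. It departs from the outline above in a few ways that are assumptions of this example: rules work on a list of whitespace-separated tokens instead of the raw String, the factory methods are lower-case statics on Syntax, and xor simply takes the first alternative that matches, without backtracking. Or and the actions are left out.

```java
import java.util.Arrays;
import java.util.List;

abstract class Syntax {
    /** Tries to match the tokens starting at pos; returns the position after the match, or -1. */
    abstract int match(List<String> tokens, int pos);

    /** A command matches when the rule consumes every token. */
    boolean matches(String cmd) {
        String trimmed = cmd.trim();
        List<String> tokens = trimmed.isEmpty()
                ? List.<String>of()
                : Arrays.asList(trimmed.split("\\s+"));
        return match(tokens, 0) == tokens.size();
    }

    /** Matches exactly one keyword, ignoring case. */
    static Syntax atom(String keyword) {
        return new Syntax() {
            @Override
            int match(List<String> tokens, int pos) {
                return pos < tokens.size() && tokens.get(pos).equalsIgnoreCase(keyword) ? pos + 1 : -1;
            }
        };
    }

    /** Matches all parts one after another. */
    static Syntax sequence(Syntax... parts) {
        return new Syntax() {
            @Override
            int match(List<String> tokens, int pos) {
                int p = pos;
                for (Syntax part : parts) {
                    p = part.match(tokens, p);
                    if (p == -1) {
                        return -1;
                    }
                }
                return p;
            }
        };
    }

    /** Matches the first alternative that fits (ordered choice, no backtracking). */
    static Syntax xor(Syntax... alternatives) {
        return new Syntax() {
            @Override
            int match(List<String> tokens, int pos) {
                for (Syntax alternative : alternatives) {
                    int p = alternative.match(tokens, pos);
                    if (p != -1) {
                        return p;
                    }
                }
                return -1;
            }
        };
    }
}
```

With this, a rule such as Syntax.sequence(Syntax.atom("look"), Syntax.xor(Syntax.atom("north"), Syntax.atom("south"))) accepts "look north" and rejects "look up". Attaching an action would mean giving each rule a callback that runs when it matches.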
