How to retrieve all variants of a lexeme in Java? - java

I am searching for a way to retrieve all variants of the lexeme of a specific word.
Example: running -> (run, runs, ran, running…)
I tried out Stanford NLP according to this post. However, the lemma-annotator only retrieves the lemma (running -> run), not the complete set of variants. Is there a way to do this with Stanford NLP or another Java Lib/Framework?
Clarification: I do not search for a stemmer. Also, I would like to avoid programming a new algorithm from scratch to crawl WordNet or similar dictionaries.

The short answer is that a standard NLP library or toolkit is unlikely to solve this problem. Like Stanford NLP, most libraries will only provide a mapping from word --> lemma. Note that this is a many-to-one function, i.e., the inverse is not well-defined as a function from words to words. It is, however, a well-defined function from the space of lemmas to the space of sets of words (i.e., a one-to-many mapping in word space).
Without some form of explicit mapping being maintained, it is impossible to generate all the variants from a given lemma. This is a theoretical impossibility because lemmatization is a lossy, one-way function.
You can, however, generate a mapping of lemma --> set-of-words without much coding (and definitely without coding a new algorithm):
// Java
Map<String, Set<String>> inverseLemmaMap = new HashMap<>();
// Guava
Multimap<String, String> inverseLemmaMap = HashMultimap.create();
Then, as you annotate your corpus using Stanford NLP, you can obtain the lemma and its corresponding token, and populate the above map (or multimap). This way, after a single pass over your dataset, you will have the required inverse lemmatization.
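For concreteness, here is a minimal sketch of that single pass, using the standard CoreNLP lemma annotator and Guava's HashMultimap (the corpus-reading loop and error handling are omitted; the sample sentence is just an illustration):
import java.util.Properties;
import com.google.common.collect.HashMultimap;
import com.google.common.collect.Multimap;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class InverseLemmaDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Multimap<String, String> inverseLemmaMap = HashMultimap.create();

        Annotation document = new Annotation("I ran. She runs. We were running.");
        pipeline.annotate(document);

        // For every token, record lemma -> surface form.
        for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
            String word = token.get(CoreAnnotations.TextAnnotation.class);
            String lemma = token.get(CoreAnnotations.LemmaAnnotation.class);
            inverseLemmaMap.put(lemma.toLowerCase(), word.toLowerCase());
        }

        System.out.println(inverseLemmaMap.get("run")); // e.g. [ran, runs, running]
    }
}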
Note that this will be restricted to the corpus/dataset you are using, and not all words in the English language will be included.
Another note: people often assume there is a one-to-one relationship between an inflected form and its part of speech. This is incorrect:
String s = "My running was beginning to hurt me. I was running all day.";
The first instance of running is tagged NN (it is used as a noun), while the second instance is the present participle of the verb, tagged VBG. This is what I meant by "lossy, one-way function" earlier in my answer.

Related

Is there any way to write parsing logic using json?

I have a map in java Map<String,Object> dataMap whose content looks like this -
{country=Australia, animal=Elephant, age=18}
Now, while parsing the map, various conditional statements may be used, like:
if(dataMap.get("country").contains("stra"))
OR
if(dataMap.get("animal") || 100 ==0)
OR
Some other operation inside if
I want to create a config file that contains all the rules for how the data inside the Map should look. In simple words, I want to define the conditions that the values corresponding to the keys country, animal, and age should satisfy, and what operations should be performed on them, all in the config file, so that the if-elses and extra code can be removed. The config file will be used for parsing the map.
Can someone tell me how such a config file can be written, and how can it be used inside Java?
Sample examples and code references will be of help.
I am thinking of creating a json file for this purpose
Example -
Boolean b = true;
List<String> conditions = new ArrayList<>();
if (dataMap.get("animal").toString().contains("pha")) {
    conditions.add("condition1 satisfied");
    if (((Integer.parseInt(dataMap.get("age").toString()) || 100) == 0)) {
        conditions.add("condition2 satisfied");
        if (dataMap.get("country").equals("Australia")) {
            conditions.add("condition3 satisfied");
        } else {
            b = false;
        }
    } else {
        b = false;
    }
} else {
    b = false;
}
Now suppose that, instead of using if-elses, I want to define the conditions for each map value in a config file, including the operation (equals, OR, contains) and the test values. The config file would then be used for parsing the Java map.
Just to manage expectations: Doing this in JSON is a horrible, horrible idea.
To give you some idea of what you're trying to make:
Grammars like this are best visualized as a tree structure. The 'nodes' in this tree are:
'atomics' (100 is an atom, so is "animal", so is dataMap).
'operations' (+ is an operation, so is or / ||).
potentially, 'actions', though you can encode those as operations.
Java works like this, so do almost all programming languages, and so does a relatively simple 'mathematical expression engine', such as something that can evaluate e.g. the string "(1 + 2) * 3 + 5 * 10" into 59.
In java, dataMap.get("animal") || 100 ==0 is parsed into this tree:
           OR operation
          /            \
   INVOKE get[1]      equality
    /       \          /     \
dataMap   "animal"  INT(100) INT(0)
where [1] is stored as INVOKEVIRTUAL java.util.Map :: get(Object) with as 'receiver' an IDENT node, which is an atomic, with value dataMap, and an args list node which contains 1 element, the string literal atomic "animal", to be very precise.
Once you see this tree, you see how the notion of precedence works: your engine will need to be capable of representing both (1 + 2) * 3 and 1 + (2 * 3), so doing this without trees is not really possible unless you delve into bizarre syntax where the lexical ordering matches the processing ordering (if you want that, look at how reverse Polish notation calculators work, or at stack-based language designs like Forth. I don't think you'll like what you find there).
You're already making language design decisions here. Apparently, you think the language should adopt a 'truthy'/'falsy' concept, where dataMap.get("animal") which presumably returns an animal object, is to be considered as 'true' (as you're using it in a boolean operation) if, presumably, it isn't null or whatnot.
So, you're designing an entire programming language here. Why handicap yourself by enforcing that it is written in, of all things, JSON, which is epically unsuitable for the job? Go whole hog and write an entire language. It'll take 2 to 3 years, of course. Doing it in JSON isn't going to knock more than a week off that total, and it will produce something so incredibly annoying to write that nobody would ever use it, buying you nothing.
The language will also naturally trend towards Turing completeness. Once a language is Turing complete, it becomes mathematically impossible to answer questions such as: "Is this code ever going to actually finish running, or will it loop forever?" (see the halting problem); you have no idea how much memory or CPU power it takes; and other issues follow that then result in security needs. These are solvable problems (sandboxing, for example), but it's all very complicated.
The JVM is, what, 2000 person-years worth of experience and effort?
If you've got 2000 person-years to write all this, by all means. The point is: there is no 'simple' way here. Either it's a woefully incomplete thing where you never feel you can actually do what you want to do (which is to express arbitrary ideas in a manner that feels natural enough, can be parsed by your system, and still makes sense when you read it back), or it's as complex as any language would be.
Why not just ... use a language? Let folks write not JSON but full-blown Java, or JS, or Python, or Ruby, or Lua, or anything else that already exists, is open source, and seems well designed?
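As a rough illustration of that last suggestion, here is a minimal sketch of evaluating rules written in JavaScript against the map through the standard JSR-223 scripting API. It assumes a JavaScript engine is available (e.g., Nashorn on JDK 8-14), and the rule string itself is an invented example:
import java.util.HashMap;
import java.util.Map;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class RuleDemo {
    public static void main(String[] args) throws Exception {
        Map<String, Object> dataMap = new HashMap<>();
        dataMap.put("country", "Australia");
        dataMap.put("animal", "Elephant");
        dataMap.put("age", 18);

        // Returns null if no JavaScript engine is on the classpath.
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
        engine.put("data", dataMap);

        // The rule lives in a plain text file (or database column) instead of JSON.
        String rule = "data.get('animal').contains('pha') && data.get('country').equals('Australia')";
        Object result = engine.eval(rule);
        System.out.println(result); // true
    }
}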

What are trained models in NLP?

I am new to natural language processing. Can anyone tell me what the trained models in either OpenNLP or Stanford CoreNLP are? While coding in Java using the Apache OpenNLP package, we always have to include some trained models (found here: http://opennlp.sourceforge.net/models-1.5/ ). What are they?
A "model" as downloadable for OpenNLP is a set of data representing a set of probability distributions used for predicting the structure you want (e.g. part-of-speech tags) from the input you supply (in the case of OpenNLP, typically text files).
Given that natural language is context-sensitive†, this model is used in lieu of a rule-based system because it generally works better than the latter for a number of reasons which I won't expound here for the sake of brevity. For example, as you already mentioned, the token perfect could be either a verb (VB) or an adjective (JJ) and this can only be disambiguated in context:
This answer is perfect — for this example, the following sequences of POS tags are possible (in addition to many more‡):
DT NN VBZ JJ
DT NN VBZ VB
However, according to a model which accurately represents ("correct") English§, the probability of example 1 is greater than that of example 2: P([DT, NN, VBZ, JJ] | ["This", "answer", "is", "perfect"]) > P([DT, NN, VBZ, VB] | ["This", "answer", "is", "perfect"])
†In reality, this is quite contentious, but I stress here that I'm talking about natural language as a whole (including semantics/pragmatics/etc.) and not just about natural-language syntax, which (in the case of English, at least) is considered by some to be context-free.
‡When analyzing language in a data-driven manner, in fact any combination of POS tags is "possible", but, given a sample of "correct" contemporary English with little noise, tag assignments which native speakers would judge to be "wrong" should have an extremely low probability of occurrence.
§In practice, this means a model trained on a large, diverse corpus of (contemporary) English (or some other target domain you want to analyze) with appropriate tuning parameters (If I want to be even more precise, this footnote could easily be multiple paragraphs long).
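To tie this back to the question: using one of the downloadable OpenNLP models in Java looks roughly like the following minimal sketch. It assumes the en-pos-maxent.bin POS model from the models page linked in the question has been downloaded to the working directory:
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class PosTagDemo {
    public static void main(String[] args) throws Exception {
        try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {
            POSModel model = new POSModel(modelIn);      // the trained model: the probability data
            POSTaggerME tagger = new POSTaggerME(model); // the decoder that applies it

            String[] tokens = {"This", "answer", "is", "perfect"};
            String[] tags = tagger.tag(tokens);          // most probable tag sequence
            System.out.println(Arrays.toString(tags));   // e.g. [DT, NN, VBZ, JJ]
        }
    }
}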
Think of a trained model as a "wise brain with existing information".
When you start out with machine learning, the brain for your model is clean and empty. You can either download a trained model or you can train your own model (like teaching a child).
Usually you only train models for edge cases; otherwise you download trained models and get to work on prediction/machine learning.

Best algorithm for analyzing unique sentences and filtering them?

I am in the middle of writing some code to filter sentences into different groups.
The sentences are formed from the descriptions of incident tickets that my service desk has processed.
I have to filter them based on 5 categories: Laptop, Telephony, Network, Printer, Application.
An example of a description from the Application category is: "Please can you install CMS on XXXX YYYYYYY laptop"
I understand that it is impossible to get this perfect, but I was wondering what the best way to tackle this is? As you can see from the example, it falls into the Application category but contains the keyword "laptop".
If there's any more information I can provide, please let me know. Every little helps. Thanks
Maintain a different list or queue for each category.
When you receive a sentence, check for keyword occurrences in it and add/push it to the appropriate list/queue.
You can maintain a map that tells you which list/queue to use for which keyword, as in the sketch below.
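A minimal sketch of that idea; the keyword lists here are invented examples, not a complete rule set:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeywordRouter {
    public static void main(String[] args) {
        // keyword -> category
        Map<String, String> keywordToCategory = new HashMap<>();
        keywordToCategory.put("install", "Application");
        keywordToCategory.put("printer", "Printer");
        keywordToCategory.put("vpn", "Network");

        // category -> list of matched sentences
        Map<String, List<String>> buckets = new HashMap<>();

        String sentence = "Please can you install CMS on this laptop";
        for (String word : sentence.toLowerCase().split("\\s+")) {
            String category = keywordToCategory.get(word);
            if (category != null) {
                buckets.computeIfAbsent(category, k -> new ArrayList<>()).add(sentence);
                break; // in this naive version, the first keyword hit decides the category
            }
        }
        System.out.println(buckets); // {Application=[Please can you install CMS on this laptop]}
    }
}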
Interesting question! As seen in your example, there can be multiple keywords within the same sentence, making it difficult to decipher which category the sentence will belong to.
In order to get around this, I would suggest possibly using a separate priority queue for each category, containing keywords for each category in order of priority.
For example, you would have a priority queue of keywords for the Application category, and (within that priority queue) "install" would be of higher priority than "laptop" or "computer", because "install" is more closely related to applications than "laptop".
In your algorithm for choosing which category a sentence is part of, I would do a round-robin search through all five priority queues until a match is found - the highest priority match out of all five categories takes the sentence. This is one possible solution I can think of.
NOTE: For this to work properly, of course it is important to pick and choose carefully which keywords go into which categories; for example, in the Laptop category, it may seem natural to have "laptop" be the highest priority keyword - however, this would cause lots of collisions because laptop will probably be a very commonly used word in sentences. You should have very specific keywords pertaining to each category, rather than having broad/surface level keywords like "laptop" (or have "laptop" be a very low priority keyword).
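Here is a rough sketch of that priority idea, under the simplifying assumption that each category's keywords are kept in an ordered list (index 0 = highest priority) rather than a literal PriorityQueue; the keywords themselves are invented:
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PriorityRouter {
    public static void main(String[] args) {
        // category -> keywords in priority order (highest priority first)
        Map<String, List<String>> priorities = new LinkedHashMap<>();
        priorities.put("Application", Arrays.asList("install", "cms", "license"));
        priorities.put("Laptop", Arrays.asList("battery", "screen", "laptop"));

        String sentence = "Please can you install CMS on this laptop".toLowerCase();

        String bestCategory = null;
        int bestRank = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> entry : priorities.entrySet()) {
            List<String> keywords = entry.getValue();
            for (int rank = 0; rank < keywords.size(); rank++) {
                // lower rank = higher priority; keep the best-ranked match seen so far
                if (sentence.contains(keywords.get(rank)) && rank < bestRank) {
                    bestRank = rank;
                    bestCategory = entry.getKey();
                    break;
                }
            }
        }
        System.out.println(bestCategory); // Application ("install" outranks "laptop")
    }
}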
This is actually a machine learning problem (text categorization) that you could solve using several algorithms: support vector machines, multinomial logistic regression, naive Bayes and more.
There are many libraries that will help you; here is one for Java:
http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.html
Python also has a very good library:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#training-a-classifier
If you want to take this approach, you are going to need a training dataset, meaning that you need to manually label a set of documents that the algorithm will use to automatically learn which keywords are important.
Hope it helps!
If all you can do is receive these sentences and apply some logic to them, why not just filter them with a regex?
See for example,
Regex to find a specific word in a string in java
e.g.
List<String> laptopList = new ArrayList<String>();
for (String item : sentenceList) {
    if (item.matches(".*\\blaptop\\b.*")) {
        laptopList.add(item);
    }
}
You are looking at the keyword "laptop", but there is also the keyword "install", which primarily indicates the installation of some application.
So you can try something like:
if (sentence.contains("install") || (sentence.contains("install") && sentence.contains("laptop"))) {
    applicationTickets.add(sentence);
}
else if (sentence.contains("laptop") || /* other conditions */) {
    laptopTickets.add(sentence);
}
else if ( ... ) {
    ..........
}
else if ( ... ) {
    ..........
}
If you observe the code, the Application category is checked first because its descriptions can also match Laptop terms; checking it first keeps such sentences from falling into the Laptop category.
You can use loops for checking all the conditions, and the keywords can be kept in a specific list for every category.

Using multiple classifiers in a Java program

I am using the Stanford named entity recognition system to identify named entities in my queries.
I discovered that one of the classifiers (english.all.3class.distsim.crf.ser.gz) identifies the Person named entity more often than the other (english.muc.7class.distsim.crf.ser.gz), while the second classifier identifies the Organization named entity more often than the first.
The question is: how do I modify my code to combine the performance of both the 3class and 7class classifiers? I mean, how do I combine lines 2 and 3? Below is my program:
public static void main(String[] args) {
    // String serializedClassifier = "classifiers/english.all.3class.distsim.crf.ser.gz";
    String serializedClassifier = "classifiers/english.muc.7class.distsim.crf.ser.gz";
    AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier);

    // String s5 = "Access Team Microsoft";
    String s5 = "Victor Vianu";
    String ans4 = classifier.classifyToString(s5);
    System.out.println(ans4);
}
You are actually rephrasing a common optimization problem for classification tasks. It is possible to combine inputs from different named entity recognizers, but it is very unlikely that you will be able to do this in code. In that case, you would in practice be writing rule sets to express the preference of one annotation over the other. Such rule sets can become very complex and are very difficult to maintain, even if you are using a specialized framework.
Normally, an additional classifier is trained using both inputs along with annotation performed by humans. This is called supervised machine learning. If you are interested in the topic, have a look at the GATE framework, which provides a relatively gentle introduction (GUI and lots of documentation available) to text engineering and machine learning. You may be interested in the section about setting up machine learning: https://gate.ac.uk/sale/tao/splitch19.html#x24-46100019.2
One idea is to get the scores or confidence of the NER tagger (3class, 7class, etc.) and, for each token that has been identified as an entity of a certain type, select the result of one classifier or the other based on this score.
You can do that by creating, e.g.,
List<List<CoreLabel>> results3Class = classifier3Class.classify(textWithInstances);
(you get the results of the 7class classifier in the same way), and then you have access to the scores, for instance:
for (List<CoreLabel> sentence : results3Class) {
    Triple<Counter<Integer>, Counter<Integer>, TwoDimensionalCounter<Integer, String>> scoresPerClass =
            classifier3Class.printProbsDocument(sentence);
    // take the score corresponding to a tagged token
    // put it into a set of scores so you can later pick the maximum confidence
}
This will give you something like:
Victor null O=0.011843980924431554 PERSON=0.9836181561256115 DATE=1.8909193023530869E-6 LOCATION=9.292607205855801E-5 ORGANIZATION=0.004434710463296233 PERCENT=1.168471927873708E-6 MONEY=3.925377823223501E-6 TIME=3.241645549570373E-6
Vianu null O=0.0019172569594321069 PERSON=0.9933531725355365 DATE=8.384789134033624E-6 LOCATION=2.134536699512499E-4 ORGANIZATION=0.004496303036216309 PERCENT=1.6256270957396572E-6 MONEY=7.507677140678878E-6 TIME=2.2957054950451435E-6
For Victor Vianu (a great researcher!) this is straightforward, but for less well-known entities the fact that you can get the confidence is useful in practice.
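For completeness, here is a hedged sketch of a simpler merge: run both classifiers on the same text and prefer the 3class label when it reports PERSON, otherwise keep the 7class label. This is a hand-rolled heuristic, not an official CoreNLP feature (CoreNLP also ships an NERClassifierCombiner class that chains models, which is worth looking at); the alignment by token index assumes both classifiers tokenize the text identically:
import java.util.List;
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;

public class CombinedNerDemo {
    public static void main(String[] args) {
        AbstractSequenceClassifier<CoreLabel> threeClass =
                CRFClassifier.getClassifierNoExceptions("classifiers/english.all.3class.distsim.crf.ser.gz");
        AbstractSequenceClassifier<CoreLabel> sevenClass =
                CRFClassifier.getClassifierNoExceptions("classifiers/english.muc.7class.distsim.crf.ser.gz");

        String text = "Victor Vianu visited Microsoft.";
        List<List<CoreLabel>> three = threeClass.classify(text);
        List<List<CoreLabel>> seven = sevenClass.classify(text);

        for (int s = 0; s < three.size(); s++) {
            for (int t = 0; t < three.get(s).size(); t++) {
                String tag3 = three.get(s).get(t).get(CoreAnnotations.AnswerAnnotation.class);
                String tag7 = seven.get(s).get(t).get(CoreAnnotations.AnswerAnnotation.class);
                // trust the 3class model for PERSON, the 7class model for everything else
                String merged = "PERSON".equals(tag3) ? tag3 : tag7;
                System.out.println(three.get(s).get(t).word() + "\t" + merged);
            }
        }
    }
}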

ANTLR: Multiple ASTs using the same ambiguous grammar?

I'm building an ANTLR parser for a small query language. The query language is by definition ambiguous, and we need all possible interpretations (ASTs) to process the query.
Example:
query : CLASSIFIED_TOKEN UNCLASSIFIED_TOKEN
| ANY_TOKEN UNCLASSIFIED_TOKEN
;
In this case, if the input matches both rules, I need to get 2 ASTs with both interpretations. ANTLR will only return the AST for the first matching rule.
Do you know a simple way to get all possible ASTs for the same grammar? I'm thinking about running the parser multiple times, "turning off" already matched rules between iterations; this seems dirty. Is there a better idea? Maybe another lex/parser tool with Java support that can do this?
Thanks
If I were you, I'd remove the ambiguities. You can often do that by using contextual information to determine which grammar rules actually trigger. For instance, in
C* X;
in C (not your language, but this is just to make a point), you can't tell if this is just a pointless multiplication (legal to write in C), or a declaration of a variable X of type "pointer to C". So, there are two valid (ambiguous) parses. But if you know that C is a type declaration (from some context, perhaps an earlier code declaration), you can hack the parser to kill off the inappropriate choices and end up with just the one "correct" parse, no ambiguities.
If you really don't have the context, then you likely need a GLR parser, which will happily generate both parses in your final tree. I don't know of any available for Java.
Our DMS Software Reengineering Toolkit [not a Java-based product] has GLR parsing support, and we use that all the time to parse difficult languages with ambiguities. The way we handle the C example above is to produce both parses, because the GLR parser is happy to do this, and then, if we have additional information (such as a symbol table), post-process the tree to remove the inappropriate parses.
DMS is designed to support the customized analysis and transformation of arbitrary languages, such as your query language, and makes it easy to define the grammar. Once you have a context-free grammar (ambiguities or not), DMS can parse code and you can decide what to do later.
I doubt you're going to get ANTLR to return multiple parse trees without wholesale rewriting of the code.
I believe you're going to have to partition the ambiguities, each into its own unambiguous grammar and run the parse multiple times. If the total number of ambiguous productions is large you could have an unmanageable set of distinct grammars. For example, for three binary ambiguities (two choices) you'll end up with 8 distinct grammars, though there might be slightly fewer if one ambiguous branch eliminates one or more of the other ambiguities.
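As a purely hypothetical sketch of that multi-pass idea with ANTLR 4: suppose the ambiguity has been split into two grammars, QueryA and QueryB, each generating its own lexer/parser. The class and rule names below are invented and only exist once ANTLR has been run on those grammars:
import java.util.ArrayList;
import java.util.List;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class MultiParse {
    // Returns one parse tree per disambiguated grammar variant.
    static List<ParseTree> parseAllWays(String input) {
        List<ParseTree> trees = new ArrayList<>();

        QueryALexer lexerA = new QueryALexer(CharStreams.fromString(input));
        QueryAParser parserA = new QueryAParser(new CommonTokenStream(lexerA));
        trees.add(parserA.query()); // interpretation 1

        QueryBLexer lexerB = new QueryBLexer(CharStreams.fromString(input));
        QueryBParser parserB = new QueryBParser(new CommonTokenStream(lexerB));
        trees.add(parserB.query()); // interpretation 2

        return trees;
    }
}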
Good luck
