How can we find the word phrases in a synset? In particular, take this synset for the adjective "booked":
booked, engaged, set-aside -- (reserved in advance)
I use the RitaWN Java package (WordNet version is 2.1), and cannot seem to find the phrases. In the example above, when I run
RiWordnet wordnet = new RiWordnet(null);
String[] syn = wordnet.getSynset(word, "a", true);
for(int i = 0; i < syn.length; i++)
System.out.println(syn[i]);
It only outputs
booked engaged
while "set-aside" is not listed.
I have tested many words and none of the phrases are found. Another example:
commodity, trade good, good -- (articles of commerce)
Here "trade good" is not returned by the getSynset() method. So how can we actually get phrases?
(the ritawn package is obtained from http://rednoise.org/rita/wordnet/documentation/index.htm)
RiTaWN seems to ignore "compound-words" by default. You can disable this to get the full list of phrases (line 2 below).
RiWordnet wordnet = new RiWordnet();
wordnet.ignoreCompoundWords(false);
String[] syn = wordnet.getSynset("booked", "a", true);
System.out.println(Arrays.asList(syn));
Result:
[INFO] RiTa.WordNet.version [033]
[booked, engaged, set-aside]
This answer is a bit out of left field, but in any case...
Idilia has an online Wordnet-like database that is actually much more complete and richer than Wordnet. Depending on where you are in your application it may make sense so I'm mentioning it. There are coding examples for Java access on the site.
In this case the query:
[{"fs":"booked/J1", "lemma":[], "definition":null}]
would return
{
  "fs" : "booked/J1",
  "lemma" : [
    "set_aside",
    "set-aside",
    "engaged",
    "booked"
  ],
  "definition" : "reserved in advance."
}
I am working on a project for a beginner's Java course. I need to read a file and turn each line into an object, which I will eventually print out as a job listing. (Please, no ArrayList suggestions.)
So far I have saved the file into a String[], which contains strings like this:
"iOS/Android Mobile App Developer - Java, Swift","Freshop, Inc.","$88,000 - $103,000 a year"
"Security Engineer - Offensive Security","Indeed","$104,000 - $130,000 a year"
"Front End Developer - CSS/HTML/Vue","HiddenLevers","$80,000 - $130,000 a year"
What I'm having trouble with is splitting each string into its three parts so they can be passed to my JobService createJob method, shown here:
public Job createJob(String[] Arrs) {
    Job job = new Job();
    job.setTitle(Arrs[0]);
    job.setCompany(Arrs[1]);
    job.setCompensation(Arrs[2]);
    return job;
}
I am terrible at regex, but I know that .split(",") will break up the salary portion as well. If anyone could help me figure out a reliable way to split these strings to fit my method, I would be grateful!
Also, I'm super new, so please use language that commoners like me will understand...
You need a slightly better split criterion, such as a quote followed by a comma (written "\"," in Java), for example...
String text = "\"iOS/Android Mobile App Developer - Java, Swift\",\"Freshop, Inc.\",\"$88,000 - $103,000 a year\"";
String[] parts = text.split("\",");
for (String part : parts) {
System.out.println(part);
}
Which prints...
"iOS/Android Mobile App Developer - Java, Swift
"Freshop, Inc.
"$88,000 - $103,000 a year"
Now, if you want to remove the quotes, you can do something like....
String text = "\"iOS/Android Mobile App Developer - Java, Swift\",\"Freshop, Inc.\",\"$88,000 - $103,000 a year\"";
String[] parts = text.split("\",");
for (String part : parts) {
System.out.println(part.replace("\"", ""));
}
Regular Expression
No, I'm not that good at it either. I tried...
String[] parts = text.split("^\"|\",\"|\"$");
And while this works, it produces 4 elements, not 3 (first match is blank).
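The extra element appears because Java's String.split keeps an empty leading substring when the pattern matches (with non-zero width) at the very start of the input, while trailing empty strings are dropped. A minimal sketch of that behaviour follows; the class and method names are made up for illustration and are not from the original answer.

```java
import java.util.Arrays;

class SplitLeadingEmptyDemo {
    // Splits on a quote at the start, a quote-comma-quote separator, or a quote at the end.
    static String[] splitWithAnchors(String text) {
        return text.split("^\"|\",\"|\"$");
    }

    // Drops the first element when it is the empty string left by the leading quote.
    static String[] dropLeadingEmpty(String[] parts) {
        if (parts.length > 0 && parts[0].isEmpty()) {
            return Arrays.copyOfRange(parts, 1, parts.length);
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "\"Security Engineer - Offensive Security\",\"Indeed\",\"$104,000 - $130,000 a year\"";
        String[] raw = splitWithAnchors(text);
        System.out.println(raw.length); // 4: the empty string before the first quote is kept
        for (String field : dropLeadingEmpty(raw)) {
            System.out.println(field);
        }
    }
}
```

Copying the array is only one way to discard the blank element; reading the fields from index 1 onward would work just as well.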
You could remove the first and trailing quotes and then just use "," instead...
text = text.substring(1, text.length() - 1); // the end index is exclusive, so this drops only the first and last quote
String[] parts = text.split("\",\"");
1. Trim the leading and trailing quotes.
2. Split on ",".
As code:
String[] columns = line.replaceAll("^\"|\"$", "").split("\",\"");
^"|"$ means "a quote at start or a quote at end"
The regex for the split is just a literal ","
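To tie this back to the question, here is a rough sketch that turns a String[] of lines into an array of jobs without using ArrayList. The nested Job class is a simplified stand-in for the asker's own class (plain fields instead of setters), and the names JobLineParser, splitLine, and parseAll are assumptions made for this example.

```java
class JobLineParser {
    // Simplified stand-in for the asker's Job class: just the three fields from the question.
    static class Job {
        String title;
        String company;
        String compensation;
    }

    // Strips the outer quotes, then splits on the literal "," between fields.
    static String[] splitLine(String line) {
        return line.trim().replaceAll("^\"|\"$", "").split("\",\"");
    }

    // Converts every line into a Job, using a plain array sized to the input.
    static Job[] parseAll(String[] lines) {
        Job[] jobs = new Job[lines.length];
        for (int i = 0; i < lines.length; i++) {
            String[] columns = splitLine(lines[i]);
            if (columns.length != 3) {
                throw new IllegalArgumentException(
                        "Expected 3 fields but got " + columns.length + ": " + lines[i]);
            }
            Job job = new Job();
            job.title = columns[0];
            job.company = columns[1];
            job.compensation = columns[2];
            jobs[i] = job;
        }
        return jobs;
    }

    public static void main(String[] args) {
        String[] lines = {
            "\"iOS/Android Mobile App Developer - Java, Swift\",\"Freshop, Inc.\",\"$88,000 - $103,000 a year\"",
            "\"Security Engineer - Offensive Security\",\"Indeed\",\"$104,000 - $130,000 a year\""
        };
        for (Job job : parseAll(lines)) {
            System.out.println(job.title + " | " + job.company + " | " + job.compensation);
        }
    }
}
```

The length check is there so that a malformed line fails loudly instead of producing an ArrayIndexOutOfBoundsException later on.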
First off let me say that I am a complete newbie with NLP. Although, as you read on, that is probably going to become strikingly apparent.
I'm parsing Wikipedia pages to find all mentions of the page title. I do this by going through the CorefChainAnnotations to find "proper" mentions - I then assume that the most common ones are talking about the page title. I do it by running this:
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,coref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String content = "Abraham Lincoln was an American politician and lawyer who served as the 16th President of the United States from March 1861 until his assassination in April 1865. Lincoln led the United States through its Civil War—its bloodiest war and perhaps its greatest moral, constitutional, and political crisis.";
Annotation document = new Annotation(content);
pipeline.annotate(document);
for (CorefChain cc : document.get(CorefCoreAnnotations.CorefChainAnnotation.class).values()) {
    List<CorefChain.CorefMention> corefMentions = cc.getMentionsInTextualOrder();
    for (CorefChain.CorefMention cm : corefMentions) {
        if (cm.mentionType == Dictionaries.MentionType.PROPER) {
            log("Proper ref using " + cm.mentionSpan + ", " + cm.mentionType);
        }
    }
}
This returns:
Proper ref using the United States
Proper ref using the United States
Proper ref using Abraham Lincoln
Proper ref using Lincoln
I know already that "Abraham Lincoln" is definitely what I am looking for and I can surmise that because "Lincoln" appears a lot as well then that must be another way of talking about the main subject. (I realise right now the most common named entity is "the United States", but once I've fed it the whole page it works fine).
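The "most common mention" step described here could be sketched roughly as follows. This is plain Java with no CoreNLP dependency: the mention strings are hard-coded from the output above, whereas in the real program they would be collected from cm.mentionSpan inside the loop. The class and method names are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MentionCounter {
    // Returns the mention text that occurs most often, or null for an empty input.
    // On a tie, the mention that appeared first in the input wins.
    static String mostCommon(String[] mentions) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String mention : mentions) {
            counts.merge(mention, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] mentions = {
            "the United States", "the United States", "Abraham Lincoln", "Lincoln"
        };
        System.out.println(mostCommon(mentions)); // the United States
    }
}
```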
This works great until I have a page like "Gone with the Wind". If I change my code to use that:
String content = "Gone with the Wind has been criticized as historical revisionism glorifying slavery, but nevertheless, it has been credited for triggering changes to the way African-Americans are depicted cinematically.";
then I get no Proper mentions back at all. I suspect this is because none of the words in the title are recognised as named entities.
Is there any way I can get Stanford NLP to recognise "Gone with the Wind" as an already-known named entity? From looking around on the internet it seems to involve training a model, but I want this to be a known named entity just for this single run and I don't want the model to remember this training later.
I can just imagine NLP experts rolling their eyes at the awfulness of this approach, but it gets better! I came up with the great idea of changing any occurrences of the page title to "Thingamijig" before passing the text to Stanford NLP, which works great for "Gone with the Wind" but then fails for "Abraham Lincoln" because (I think) the NER no longer associates "Lincoln" with "Thingamijig" in the corefMentions.
In my dream world I would do something like:
pipeline.addKnownNamedEntity("Gone with the Wind");
But that doesn't seem to be something I can do and I'm not exactly sure how to go about it.
You can submit a dictionary with any phrases you want and have them recognized as named entities.
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.regexner.mapping additional.rules -file example.txt -outputFormat text
additional.rules
Gone With The Wind MOVIE MISC 1
Note that the columns above should be tab-delimited. You can have as many lines as you'd like in the additional.rules file.
One warning: the phrase will be tagged EVERY TIME that token pattern occurs.
More details here: https://stanfordnlp.github.io/CoreNLP/ner.html
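Since the question asks for a one-off entity that applies to a single run only, the rules file could be generated on the fly. The sketch below uses only the Java standard library: it writes a one-line, tab-delimited rules file to a temporary location and builds a Properties object that mirrors the command-line flags above. The class and method names are hypothetical, the MOVIE label and MISC column are copied from the example rule, and the sketch assumes that the resulting Properties would be passed to new StanfordCoreNLP(props), as in the question's code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

class RegexNerRulesWriter {
    // Builds one tab-delimited rule line in the same column order as the example rule.
    static String ruleLine(String phrase, String label, String overwritable, int priority) {
        return String.join("\t", phrase, label, overwritable, Integer.toString(priority));
    }

    // Writes a single-rule file to a temporary location and returns its path.
    static Path writeTempRules(String phrase, String label) {
        try {
            Path rules = Files.createTempFile("additional", ".rules");
            String content = ruleLine(phrase, label, "MISC", 1) + "\n";
            Files.write(rules, content.getBytes(StandardCharsets.UTF_8));
            return rules;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Mirrors the command-line flags from the answer as pipeline properties.
    static Properties pipelineProperties(Path rules) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        props.setProperty("ner.additional.regexner.mapping", rules.toString());
        return props;
    }

    public static void main(String[] args) {
        // The phrase is written exactly as the page title appears in the text.
        Path rules = writeTempRules("Gone with the Wind", "MOVIE");
        Properties props = pipelineProperties(rules);
        System.out.println(props.getProperty("ner.additional.regexner.mapping"));
        // Assumed next step (needs CoreNLP on the classpath): new StanfordCoreNLP(props)
    }
}
```

Because the file lives in a temporary location and is regenerated for each page title, nothing is remembered between runs.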
I am starting to learn the OpenNLP API in Java.
I found some good examples on this website:
http://www.programcreek.com/2012/05/opennlp-tutorial/
I have tried the Name Finder API but I found something strange.
If I replace the input with
String []sentence = new String[]{
"John",
"is",
"good"
};
the code still works, but if I change it to
String []sentence = new String[]{
"John",
"is",
"fine"
};
There is no output.
I cannot understand what causes the problem. Is it from the model I use? (en-ner-person.bin)
And does anyone know how I can build my own model?
Thanks!
Assuming it is not throwing an exception and simply fails to find the name "John": OpenNLP takes a machine-learning approach and finds named entities based on a trained model. The en-ner-person.bin model apparently does not contain enough training sentences similar to "John is fine" to produce a probability high enough for the name to be returned.
I wanted to add a new synset to Wordnet using the extjwnl library. In order to do this, I wrote the following sample code. After saving, I observe that the new synonym and word do get added, but the semantic pointer created (which identifies the hyponymy relation) is not saved. How do I relate the pointer to the dictionary?
JWNL.initialize(new FileInputStream(propsFile));
Dictionary dictionary = Dictionary.getInstance();
Iterator<Synset> synsets = dictionary.getSynsetIterator(POS.NOUN);
dictionary.edit();
Synset newSynset = new Synset(dictionary, POS.NOUN);
IndexWord newWord = new IndexWord(dictionary, "hublabooboo", POS.NOUN, newSynset);
Synset topmostSynset = synsets.next();
Pointer newPointer = new Pointer(PointerType.HYPONYM, topmostSynset, newSynset);
dictionary.save();
I'd suggest you add the pointer to the synset's list of pointers:
topmostSynset.getPointers().add(newPointer);
If the pointer is symmetric (like hypernym, which has a mirror pointer, hyponym) and dictionary.getManageSymmetricPointers() returns true, then the reverse pointer (e.g. hyponym) is added automatically.
By the way, the line Synset topmostSynset = synsets.next(); suggests you assume that the first synset returned by the iterator is the "entity" one. This is not guaranteed anywhere; it is dictionary-dependent: it might work for a file-based dictionary, most likely won't for a map-based one, and is unpredictable for a database-backed one.
Source : SourceForge
I am trying to get all the noun phrases using the edu.stanford.nlp.* package. I got all the subtrees of label value "NP", but I am not able to get the normal original String format (not Penn Tree format).
E.g. subtree.toString() gives (NP (ND all)(NSS times)) but I want the string "all times". Can anyone please help me? Thanks in advance.
I believe what you want is something like:
final StringBuilder sb = new StringBuilder();
for (final Tree t : tree.getLeaves()) {
    sb.append(t.toString()).append(" ");
}
While I'm not 100% sure, I seem to recall this being the solution used for some software I worked on a few years back.
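This can be accomplished using the yield() method of the subtree instead of building the string with a separate StringBuilder object. Note that in more recent CoreNLP releases the listToString helper lives in SentenceUtils rather than Sentence.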
if (subtree.label().value().equals("NP")) {
    out.println(subtree); // print subtree
    out.println(Sentence.listToString(subtree.yield())); // print phrase
    break;
}
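To illustrate what "the yield of a subtree" means without needing CoreNLP on the classpath, here is a small standard-library sketch that pulls the leaf words out of a bracketed (Penn Treebank style) string. It is only a stand-in for the case where the bracketed text is all you have; when a Tree object is available, yield() or getLeaves() as shown in the answers is the better route. BracketedTreeText and leafText are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketedTreeText {
    // Matches a preterminal such as "(NSS times)": a tag, one word, then a closing parenthesis.
    private static final Pattern PRETERMINAL =
            Pattern.compile("\\(\\s*[^\\s()]+\\s+([^\\s()]+)\\s*\\)");

    // Joins the leaf words of a bracketed tree string with single spaces.
    static String leafText(String bracketed) {
        StringBuilder sb = new StringBuilder();
        Matcher m = PRETERMINAL.matcher(bracketed);
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(m.group(1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(leafText("(NP (ND all)(NSS times))")); // all times
    }
}
```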