I wanted to add a new synset to WordNet using the extjwnl library, so I wrote the following sample code. After saving, I observe that the new synset and word do get added, but the semantic pointer I created (which encodes the hyponymy relation) is not saved. How do I relate the pointer to the dictionary?
JWNL.initialize(new FileInputStream(propsFile));
Dictionary dictionary = Dictionary.getInstance();
Iterator<Synset> synsets = dictionary.getSynsetIterator(POS.NOUN);
dictionary.edit(); // switch the dictionary into edit mode
Synset newSynset = new Synset(dictionary, POS.NOUN);
IndexWord newWord = new IndexWord(dictionary, "hublabooboo", POS.NOUN, newSynset);
Synset topmostSynset = synsets.next(); // assumed to be the "entity" synset
Pointer newPointer = new Pointer(PointerType.HYPONYM, topmostSynset, newSynset);
dictionary.save();
I'd suggest you add the pointer to the synset's list of pointers:
topmostSynset.getPointers().add(newPointer);
If the pointer type is symmetric (like hypernym, whose mirror type is hyponym) and dictionary.getManageSymmetricPointers() returns true, then the reverse pointer (e.g. the hyponym one) is added automatically.
By the way, the line Synset topmostSynset = synsets.next(); suggests you assume that the first synset returned by the iterator is the "entity" one. That is not guaranteed anywhere; it is dictionary-dependent: it might work for a file-based dictionary, but it most likely won't for a map-based one and is unpredictable for a database-based one.
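If you need the "entity" synset specifically, a safer route is to look it up by lemma. Here is a minimal sketch, assuming extjwnl's Dictionary.getIndexWord(POS, lemma) and IndexWord.getSenses() behave the way I understand them; treat it as an illustration, not the library's guaranteed behavior:

// Sketch: look "entity" up by lemma instead of relying on iterator order.
IndexWord entity = dictionary.getIndexWord(POS.NOUN, "entity");
Synset entitySynset = entity.getSenses().get(0); // first (top) sense of "entity"
Pointer newPointer = new Pointer(PointerType.HYPONYM, entitySynset, newSynset);
entitySynset.getPointers().add(newPointer); // register the relation on its source synset
dictionary.save();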
Source: SourceForge
Related
I am looking for a way to write a custom index with Apache Lucene (PyLucene to be precise, but a Java answer is fine).
What I would like to do is the following: when adding a document to the index, Lucene will tokenize it, remove stop words, etc. This is usually done with the Analyzer, if I am not mistaken.
What I would like to implement is the following: before Lucene stores a given term, I would like to perform a lookup (say, in a dictionary) to check whether to keep the term or discard it (if the term is present in my dictionary, I keep it; otherwise I discard it).
How should I proceed?
Here is (in Python) my custom implementation of the Analyzer:
class CustomAnalyzer(PythonAnalyzer):

    def createComponents(self, fieldName, reader):
        source = StandardTokenizer(Version.LUCENE_4_10_1, reader)
        filter = StandardFilter(Version.LUCENE_4_10_1, source)
        filter = LowerCaseFilter(Version.LUCENE_4_10_1, filter)
        filter = StopFilter(Version.LUCENE_4_10_1, filter,
                            StopAnalyzer.ENGLISH_STOP_WORDS_SET)

        ts = tokenStream.getTokenStream()
        token = ts.addAttribute(CharTermAttribute.class_)
        offset = ts.addAttribute(OffsetAttribute.class_)
        ts.reset()

        while ts.incrementToken():
            startOffset = offset.startOffset()
            endOffset = offset.endOffset()
            term = token.toString()
            # accept or reject term
        ts.end()
        ts.close()

        # How to store the terms in the index now ?
        return ????
Thank you in advance for your guidance!
EDIT 1: After digging into Lucene's documentation, I figured it had something to do with TokenStreamComponents. It returns a TokenStream with which you can iterate through the token list of the field you are indexing.
Now there is something about the Attributes that I do not understand. Or, more precisely: I can read the tokens, but have no idea how to proceed afterward.
EDIT 2: I found this post where they mention the use of CharTermAttribute. However, in Python I cannot access or get a CharTermAttribute. Any thoughts?
EDIT 3: I can now access each term; see the updated code snippet above. What is left to be done is actually storing the desired terms...
The way I was trying to solve the problem was wrong. This post and femtoRgon's answer were the solution.
By defining a filter extending PythonFilteringTokenFilter, I can make use of the accept() method (like the one used in StopFilter, for instance).
Here is the corresponding code snippet:
class MyFilter(PythonFilteringTokenFilter):

    def __init__(self, version, tokenStream):
        super(MyFilter, self).__init__(version, tokenStream)
        self.termAtt = self.addAttribute(CharTermAttribute.class_)

    def accept(self):
        term = self.termAtt.toString()
        accepted = False
        # Do whatever is needed with the term
        # accepted = ... (True/False)
        return accepted
Then just append the filter to the other filters (as in the code snippet of the question):
filter = MyFilter(Version.LUCENE_4_10_1, filter)
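Since the question notes that a Java answer is fine: PythonFilteringTokenFilter wraps Lucene's FilteringTokenFilter, so the Java-side equivalent looks like the sketch below. It assumes Lucene 4.10; DictionaryFilter and the whitelist set are illustrative names, not part of the library:

import java.io.IOException;
import java.util.Set;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.FilteringTokenFilter;

// Keeps only tokens whose term appears in a whitelist dictionary.
public final class DictionaryFilter extends FilteringTokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final Set<String> dictionary; // the whitelist of allowed terms

    public DictionaryFilter(TokenStream in, Set<String> dictionary) {
        super(in);
        this.dictionary = dictionary;
    }

    @Override
    protected boolean accept() throws IOException {
        // A token is kept only if its term is present in the dictionary.
        return dictionary.contains(termAtt.toString());
    }
}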
I hope you can help me...
How to get the synonyms of a word into an array using the extended Java WordNet library
Waiting for your valuable response...
After looking over the API, it seems that synonym sets in WordNet are referred to as Synsets.
Assuming you have already called System.setProperty("wordnet.database.dir", "<location_to_WordNet_database>/dict"), you can declare and initialize a WordNetDatabase like so:
WordNetDatabase database = WordNetDatabase.getFileInstance();
and then declare and initialize a Synset array:
Synset[] synsets = database.getSynsets("your word", SynsetType.<WORDTYPE>/*like NOUN, or VERB*/);
I'm assuming that passing SynsetType.NOUN as the second parameter restricts the array to synsets that are nouns only.
You could then declare a Synset which corresponds to the synset array you just initialized (for example, if you called database.getSynsets("your word", SynsetType.NOUN), you would do this):
NounSynset nounSynset;
and finally you could iterate through your synsets array in a for loop, setting
nounSynset = (NounSynset) synsets[i];
and assign its primary word form to a String via
String currentSynonym = nounSynset.getWordForms()[0];
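Putting those pieces together, here is a self-contained sketch; the database path and the word "dog" are illustrative only:

import java.util.ArrayList;
import java.util.List;
import edu.smu.tspell.wordnet.NounSynset;
import edu.smu.tspell.wordnet.Synset;
import edu.smu.tspell.wordnet.SynsetType;
import edu.smu.tspell.wordnet.WordNetDatabase;

public class SynonymLookup {
    public static void main(String[] args) {
        // Point JAWS at the WordNet database directory first (path is an example).
        System.setProperty("wordnet.database.dir", "/usr/local/WordNet-3.0/dict");

        WordNetDatabase database = WordNetDatabase.getFileInstance();
        Synset[] synsets = database.getSynsets("dog", SynsetType.NOUN);

        List<String> synonyms = new ArrayList<String>();
        for (int i = 0; i < synsets.length; i++) {
            NounSynset nounSynset = (NounSynset) synsets[i];
            // Collect every word form in the synset, not just the first one.
            for (String form : nounSynset.getWordForms()) {
                synonyms.add(form);
            }
        }
        System.out.println(synonyms);
    }
}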
For more information, see the Java API for WordNet main page and the documentation overview.
I am absolutely new to Java development.
Can someone please elaborate on how to obtain grammatical relations using Stanford's Natural Language Processing Lexical Parser (open-source Java code)?
Thanks!
See line 88 of the first file in my code for how to run the Stanford Parser programmatically:
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
Collection<TypedDependency> tdl = gs.typedDependenciesCollapsed();
System.out.println("words: " + words);
System.out.println("POStags: " + tags);
System.out.println("stemmedWordsAndTags: " + stems);
System.out.println("typedDependencies: " + tdl);
The collection tdl is a list of these typed dependencies. If you look at the javadoc for TypedDependency, you'll see that the .reln() method gets you the grammatical relation.
Lines 311-318 of the third file in my code show how to use that list of typed dependencies. I happen to get the name of the relation, but you could get the relation itself, which would be of the class GrammaticalRelation.
for (Iterator<TypedDependency> iter = tdl.iterator(); iter.hasNext(); ) {
    TypedDependency var = iter.next();
    TreeGraphNode dep = var.dep(); // the dependent node
    TreeGraphNode gov = var.gov(); // the governor node
    // All useful information for a node in the tree
    String reln = var.reln().getShortName(); // name of the grammatical relation
}
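For context, here is a minimal end-to-end sketch modeled on the ParserDemo that ships with the parser distribution. The model path and example sentence are illustrative, and the exact API differs a little between releases, so treat this as a sketch rather than the canonical usage:

import java.io.StringReader;
import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.TokenizerFactory;
import edu.stanford.nlp.trees.*;

public class RelationsDemo {
    public static void main(String[] args) {
        // The englishPCFG model ships inside the stanford-parser models jar.
        LexicalizedParser lp =
            LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

        // Tokenize and parse one sentence.
        TokenizerFactory<CoreLabel> tf = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
        List<CoreLabel> words =
            tf.getTokenizer(new StringReader("My dog likes eating sausage.")).tokenize();
        Tree parse = lp.apply(words);

        // Build the grammatical structure and walk its typed dependencies.
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        for (TypedDependency td : gs.typedDependenciesCollapsed()) {
            GrammaticalRelation reln = td.reln(); // the relation object itself
            System.out.println(reln.getShortName() + "(" + td.gov() + ", " + td.dep() + ")");
        }
    }
}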
Don't feel bad, I spent a miserable day or two trying to figure out how to use the parser. I don't know if the docs have improved, but when I used it they were pretty damn awful.
I'm using Lucene's Highlighter to highlight parts of a string. The code below seems to work fine for finding the stemmed words but not for prefix matching.
EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_34);
QueryParser parser = new QueryParser(Version.LUCENE_30, "", analyzer);
Query query = parser.parse(pQuery);
QueryScorer scorer = new QueryScorer(query);
Fragmenter fragmenter = new SimpleSpanFragmenter(scorer, 40);
Highlighter highlighter = new Highlighter(scorer);
highlighter.setTextFragmenter(fragmenter);
String[] frags = highlighter.getBestFragments(analyzer, "", pText, 4);
I've read in a few different places that I need to call Query.rewrite to get prefix matching to work. That method takes an IndexReader argument, though, and I'm not sure how to get one. All of the examples I've found that call Query.rewrite don't show where the IndexReader came from. I'll add that this is the only Lucene code I'm using; I'm not using Lucene to do the searching itself, just the highlighting.
How do I create an IndexReader, and is it possible to create one if I'm using Lucene the way that I am? Or perhaps there's a different way to get it to highlight the prefix matches? I'm very new to Lucene and I'm not sure what all of these pieces do or whether they're all necessary. I've just copied them from various examples I've found online, so if I'm doing anything else wrong please let me know. Thanks.
Suppose you have the query field:abc*. What query.rewrite basically does is read the index (this is why you need an IndexReader), find all terms that start with abc, and change your query to, for example, field:abc1 field:abc2 field:abc3. If you know the location of the index, you can use IndexReader.open to get an IndexReader. If you don't have an index at all, you should search your pText yourself, find all words that start with abc, and update your query accordingly.
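For example, here is a sketch assuming Lucene 3.x and an index on disk; the path is hypothetical, and query is the Query parsed earlier in your snippet:

Directory dir = FSDirectory.open(new File("/path/to/index")); // hypothetical index location
IndexReader reader = IndexReader.open(dir);
try {
    Query rewritten = query.rewrite(reader); // expands prefix/wildcard terms against the index
    QueryScorer scorer = new QueryScorer(rewritten);
    // ... build the Highlighter from this scorer exactly as before ...
} finally {
    reader.close();
}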
I haven't found the answer to my problem, so I decided to write this question to get some help.
I use Lucene to index objects in memory (they exist only in my Java code). While processing the code I index (using WhitespaceAnalyzer) a field with the value objA/4.
My problem starts when I want to find it after indexing (also using WhitespaceAnalyzer).
When I create the query obj*, I find all objects that start with obj; if I create the query objA/4 I can also find this object.
However, I don't know how to find all objects starting with objA/. When I create the query objA/*, Lucene changes it to obja/* and finds nothing.
I've checked and "/" is not a special character, so I don't need any "\" preceding it.
So my question is: how do I ask for all objects that start with objA/ (for example objA/0, objA/1, objA/2, objA/3)?
Are you using QueryParser.escape(String) to escape everything correctly?
The code I'm using:
String node = "objA/*";
Query node_query = MultiFieldQueryParser.parse(node, "nodeName", new WhitespaceAnalyzer());
BooleanQuery bq = new BooleanQuery();
bq.add(node_query, BooleanClause.Occur.MUST);
System.out.println("We're asking for - " + bq);
IndexSearcher looker = new IndexSearcher(rep_index);
Hits hits = looker.search(bq);
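One note on the lowercasing itself: QueryParser lowercases wildcard and prefix terms by default, which is exactly why objA/* turns into obja/*. If your release has QueryParser.setLowercaseExpandedTerms(boolean) (the 2.x/3.x line does, and the Hits API above suggests 2.x), you can switch to the instance form of the parser and turn that off. A sketch, with the constructor signature hedged to the 2.x style:

// The parser lowercases wildcard/prefix terms by default, turning objA/* into obja/*.
MultiFieldQueryParser parser =
        new MultiFieldQueryParser(new String[] { "nodeName" }, new WhitespaceAnalyzer());
parser.setLowercaseExpandedTerms(false); // keep objA/* exactly as typed
Query node_query = parser.parse("objA/*");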