TSurgeon - relabel node using old value

I am trying to use Tsurgeon on a Stanford parse tree (from the CoreNLP API). The intended operation adds a prefix to the label of each node that I find (e.g. the node found is NN and I would like to rename it to Skip-NN).
What I am trying is this:
    TsurgeonPattern surgery = Tsurgeon.parseOperation("relabel target Skip-target");
    for (TregexPattern pat : patterns) {
        Tsurgeon.processPattern(pat, surgery, tree).pennPrint();
    }
An example of one of the TregexPatterns used would be NP << NP=target
Although, as you might have guessed, the result is similar to:
NP -> "Skip-target" instead of NP -> "Skip-NP"
I am quite new to using TSurgeon and am unsure as to where to look for information regarding an issue like this.
EDIT: Essentially, what I'm asking is: is there a way to use the current label of a node when relabeling it?

You should be able to use regexes for this. Something like
    relabel target /^(.*)$/Skip-$1/
You will have to be careful with your Tregex pattern, though: it must ignore nodes whose labels already begin with Skip-, otherwise already-relabeled nodes would match and be prefixed again.
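To see what the substitution in that relabel operation does, here is a minimal sketch using plain java.util.regex rather than Tsurgeon itself. The class name RelabelSketch and the method addPrefix are hypothetical, and the negative lookahead (?!Skip-) is only one possible way to skip labels that already carry the prefix; whether you put that guard in the Tregex pattern or in the relabel regex is something to check against the Tsurgeon documentation.

```java
import java.util.regex.Pattern;

// Hypothetical helper: illustrates the regex substitution only, not Tsurgeon.
final class RelabelSketch {
    // Captures the whole label, but refuses to match labels starting with "Skip-".
    private static final Pattern LABEL = Pattern.compile("^(?!Skip-)(.*)$");

    // Returns the label with "Skip-" prepended, or unchanged if already prefixed.
    static String addPrefix(String label) {
        return LABEL.matcher(label).replaceFirst("Skip-$1");
    }

    public static void main(String[] args) {
        System.out.println(addPrefix("NN"));       // Skip-NN
        System.out.println(addPrefix("Skip-NN"));  // Skip-NN (left alone)
    }
}
```

The $1 in the replacement refers to the captured old label, which is the piece that the literal "Skip-target" in the question could not provide.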

Related

Apache Lucene: How to use TokenStream to manually accept or reject a token when indexing

I am looking for a way to write a custom index with Apache Lucene (PyLucene to be precise, but a Java answer is fine).
When a document is added to the index, Lucene tokenizes it, removes stop words, etc. This is usually done by the Analyzer, if I am not mistaken.
What I would like to implement is the following: before Lucene stores a given term, I would like to perform a lookup (say, in a dictionary) to check whether to keep the term or discard it (if the term is present in my dictionary, I keep it; otherwise I discard it).
How should I proceed?
Here is (in Python) my custom implementation of the Analyzer:
    class CustomAnalyzer(PythonAnalyzer):
        def createComponents(self, fieldName, reader):
            source = StandardTokenizer(Version.LUCENE_4_10_1, reader)
            filter = StandardFilter(Version.LUCENE_4_10_1, source)
            filter = LowerCaseFilter(Version.LUCENE_4_10_1, filter)
            filter = StopFilter(Version.LUCENE_4_10_1, filter,
                                StopAnalyzer.ENGLISH_STOP_WORDS_SET)
            ts = tokenStream.getTokenStream()
            token = ts.addAttribute(CharTermAttribute.class_)
            offset = ts.addAttribute(OffsetAttribute.class_)
            ts.reset()
            while ts.incrementToken():
                startOffset = offset.startOffset()
                endOffset = offset.endOffset()
                term = token.toString()
                # accept or reject term
            ts.end()
            ts.close()
            # How to store the terms in the index now ?
            return ????
Thank you for your guidance in advance !
EDIT 1: After digging into Lucene's documentation, I figured it had something to do with the TokenStreamComponents. It returns a TokenStream with which you can iterate through the tokens of the field you are indexing.
Now there is something to do with the Attributes that I do not understand. More precisely, I can read the tokens, but have no idea how I should proceed afterward.
EDIT 2: I found this post where they mention the use of CharTermAttribute. However (in Python, though) I cannot access or get a CharTermAttribute. Any thoughts?
EDIT 3: I can now access each term; see the updated code snippet. What is left to be done is actually storing the desired terms...

The way I was trying to solve the problem was wrong. This post and femtoRgon's answer were the solution.
By defining a filter that extends PythonFilteringTokenFilter, I can override accept() (as StopFilter does, for instance).
Here is the corresponding code snippet:
    class MyFilter(PythonFilteringTokenFilter):
        def __init__(self, version, tokenStream):
            super(MyFilter, self).__init__(version, tokenStream)
            self.termAtt = self.addAttribute(CharTermAttribute.class_)

        def accept(self):
            term = self.termAtt.toString()
            accepted = False
            # Do whatever is needed with the term
            # accepted = ... (True/False)
            return accepted
Then just append the filter to the other filters (as in the code snippet of the question):
    filter = MyFilter(Version.LUCENE_4_10_1, filter)

unknown number of children in ANTLR tree

I am working on a parser for a calculator, which also needs to build a tree.
For example:
    exp returns [Tree tree]
        : e1=exp e2=operator e3=exp {
            Tree tempTree = ($e2.tree);
            tempTree.insertChild($e1.tree);
            tempTree.insertChild($e3.tree);
            $tree = tempTree;
        }
        ;
I would like to know how I can build a tree for a function with multiple arguments without assuming the number of children.
For example: max(a,b,c,d,..)
I thought of using something like FUNCTION LEFTBRACKET exp (COMMA exp)* RIGHTBRACKET
but I am not sure how to build the tree for the * expression.

Something like:
    FUNCTION: FUNCTION_NAME LEFTBRACKET PARAMETERS RIGHTBRACKET;
    PARAMETERS: exp | exp COMMA PARAMETERS;
may help.

What you did works fine: the children matched by exp (COMMA exp)* will be put into a list that you can access via the rule's accessor on the generated context (expr() if the rule is named expr).

Java: How to execute an XPath query on a node

So I'm reading from an XML file with many layers of nesting in Java using XPath.
At the moment I have a method that takes the path to the XML file and an XPath query as parameters, and returns a NodeIterator.
Then I iterate through those nodes, and for some of the nodes (if their name matches) I need to execute another query on them and get a NodeIterator of their children, etc.
Is it possible to have a function with 2 parameters, one an already existing Node and the other an XPath query to execute on that Node?
So replacing: NodeIterator ni = XPathAPI.selectNodeIterator(document, xpathQuery);
With something like: NodeIterator ni2 = XPathAPI.selectNodeIterator(parentNode, query);
I've searched on the internet and I can't find any examples, and I'm not sure what the syntax for the above would be, or whether it's even possible.
Many thanks in advance :)

Presumably your XPathAPI class is the Apache/Xalan org.apache.xpath.XPathAPI?
In that case, what's wrong with
static NodeIterator selectNodeIterator(Node contextNode, java.lang.String str)
It seems to do exactly what you want.
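If you would rather not depend on Xalan's XPathAPI, the JDK's own javax.xml.xpath package also accepts any Node as the evaluation context. The following is only a sketch of that alternative, not the Xalan call discussed above; the class NodeXPathSketch, its helper methods and the sample XML are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class using only JDK APIs (not org.apache.xpath.XPathAPI).
final class NodeXPathSketch {
    // Evaluates the query relative to contextNode and returns the matching nodes.
    static NodeList select(Node contextNode, String query) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return (NodeList) xpath.evaluate(query, contextNode, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + query, e);
        }
    }

    // Parses an XML string into a DOM document (sample input only).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<root><a><b>1</b><b>2</b></a><a><b>3</b></a></root>");
        NodeList parents = select(doc, "/root/a");
        for (int i = 0; i < parents.getLength(); i++) {
            // Relative query: only the b children of this particular a element.
            NodeList children = select(parents.item(i), "b");
            System.out.println("a[" + i + "] has " + children.getLength() + " b children");
        }
    }
}
```

Note that the second query is relative ("b", not "/root/a/b"); an absolute path starting with "/" would be resolved from the document root regardless of the context node.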

DOM Parser Example for Objects within Objects

So say I have an XML file that looks like this:
    <Object1s>
      <Object1>
        <Field1></Field1>
        <Object2s>
          <Object2>
            <Field1a></Field1a>
            <Field1b></Field1b>
          </Object2>
          <Object2>
            <Field1a></Field1a>
            <Field1b></Field1b>
          </Object2>
        </Object2s>
      </Object1>
      <Object1>
        <Field1></Field1>
        <Object2s>
          <Object2>
            <Field1a></Field1a>
            <Field1b></Field1b>
          </Object2>
        </Object2s>
      </Object1>
    </Object1s>
The DOM tutorials I've found have not worked when I try to do the same sort of thing, because their XML files have no nesting like this. For instance, I want to be able to separate the Object2s by the Object1 that they are in. When I follow the tutorials' examples and look up the Object2s, I get every Object2 from every Object1 at once.
Can someone show me an example that handles something like this?

Okay, figured it out. What I do is take the Element I get for each Object1 and call .getElementsByTagName() on that element (rather than on the document) to get only the elements nested within it.
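A minimal sketch of that approach, using only the JDK's DOM classes and the element names from the question. The class NestedDomSketch and its methods are hypothetical names, and the inline XML is a shortened version of the sample above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class; element names follow the question's sample XML.
final class NestedDomSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // For each Object1, counts the Object2 elements nested inside that Object1 only.
    static List<Integer> object2CountsPerObject1(Document doc) {
        List<Integer> counts = new ArrayList<>();
        NodeList object1s = doc.getElementsByTagName("Object1");
        for (int i = 0; i < object1s.getLength(); i++) {
            Element object1 = (Element) object1s.item(i);
            // Called on the element, so the search is limited to its descendants.
            NodeList object2s = object1.getElementsByTagName("Object2");
            counts.add(object2s.getLength());
        }
        return counts;
    }

    public static void main(String[] args) {
        String xml = "<Object1s>"
                + "<Object1><Object2s><Object2/><Object2/></Object2s></Object1>"
                + "<Object1><Object2s><Object2/></Object2s></Object1>"
                + "</Object1s>";
        System.out.println(object2CountsPerObject1(parse(xml)));  // [2, 1]
    }
}
```

Calling doc.getElementsByTagName("Object2") instead would return all three Object2 elements together, which is the behaviour described in the question.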

Convert a list of URLs to a Tree

Just to be sure I'm not reinventing the wheel, I want to see if there is some known algorithm, class, or something else that can help me solve my problem. I have a huge list of URLs from an application. I'd like to feed those URLs into a tree to create a sitemap-like data structure.
It seems that something like this may have been done before. However, everything I see from my searches appears to go from XML to a tree. Ideally I'd like to have an answer in Java, but I'm sure I could translate it to Java myself if necessary. If I need to do it myself, I'd probably take each URL and break it into indexes.
[root] [0] [1] [1] -file
www.site.com/dir1/dir2/file.html
[root] [0] [1] [1]
www.site.com/dirabc/dir2/file.html
So, I'd parse each URL into offsets [0], [1], [2], … etc., and those would be the depths in the tree at which to add each part. That was at least my initial plan. I'm open to any and all suggestions!

You could define your UrlTree as nested HashMaps:
    public class UrlTree {
        private final Map<String, UrlTree> branches = new HashMap<String, UrlTree>();

        public void add(String[] tokens, int i) {
            if (i >= tokens.length) {
                return;
            }
            final String token = tokens[i];
            UrlTree branch = branches.get(token);
            if (branch == null) {
                branch = new UrlTree();
                branches.put(token, branch);
            }
            branch.add(tokens, i + 1);
        }
        ...
    }
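To show how such a nested-map tree might be fed and inspected, here is a self-contained variant. It is a sketch, not the answer's full class: UrlTreeSketch, the splitting on "/" and the render() method are assumptions added for illustration, and TreeMap replaces HashMap only so that the printed order is deterministic.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical variant of the nested-map idea; names and helpers are illustrative.
final class UrlTreeSketch {
    // TreeMap (instead of HashMap) keeps children sorted for stable output.
    private final Map<String, UrlTreeSketch> branches = new TreeMap<>();

    // Splits the URL on "/" and walks down, creating branches as needed.
    void add(String url) {
        UrlTreeSketch node = this;
        for (String token : url.split("/")) {
            if (token.isEmpty()) {
                continue;  // skip empty segments such as those from "//"
            }
            node = node.branches.computeIfAbsent(token, k -> new UrlTreeSketch());
        }
    }

    // Renders the tree as text, one segment per line, two spaces per level.
    String render() {
        StringBuilder out = new StringBuilder();
        render(out, 0);
        return out.toString();
    }

    private void render(StringBuilder out, int depth) {
        for (Map.Entry<String, UrlTreeSketch> entry : branches.entrySet()) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(entry.getKey()).append('\n');
            entry.getValue().render(out, depth + 1);
        }
    }

    public static void main(String[] args) {
        UrlTreeSketch tree = new UrlTreeSketch();
        tree.add("www.site.com/dir1/dir2/file.html");
        tree.add("www.site.com/dirabc/dir2/file.html");
        System.out.print(tree.render());
    }
}
```

Both URLs share the www.site.com branch, so the host appears once with dir1 and dirabc beneath it. URLs that include a scheme (http://...) would need the scheme stripped first, which this sketch does not attempt.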

You'll need to implement TreeModel in a way that reflects the hierarchy of your observed directory structure. FileTreeModel is an example, and ac.Name is a simple class that parses paths for a vintage file system. See also How to Use Trees. An instance of NetBeans Outline, illustrated here, would make a nice alternative view.
