I have started learning the GATE application and I would like to use it to extract information from an unstructured document. The information I am interested in is dates, locations, event information and person names. I would like to get information about events that happened at a specific location on a specific date, along with the name(s) of the person(s) involved. I have been reading the GATE manual, and that is how I got a glimpse of how to build a pipeline. However, I cannot figure out how to create my new annotation types and make sure that they are annotated into a new annotation set, which should appear under the annotation sets on the right. I found similar questions like GATE - How to create a new annotation SET? but they didn't help me either.
Let me explain what I did so far:
Created a .lst file for my new NEs and put it under the ANNIE resources/gazetteer directory
Added the .lst file description to the lists.def file
Identified my patterns in the document, e.g. date formats like ddmm, dd.mm.yyyy
Wrote a JAPE rule for each pattern in a separate .jape file
Added the JAPE file names to the main.jape file
Loaded the PRs and my document into GATE
Ran the application
This is how my JAPE rule looks for one date format:
Phase: datesearching
Input: Token Lookup SpaceToken
Options: control = appelt

//////////////////////////////////// Macros
// Initialization of regular expressions
Macro: DAY_ONE
({Token.kind == number, Token.category == CD, Token.length == "1"})

Macro: DAY_TWO
({Token.kind == number, Token.category == CD, Token.length == "2"})

Macro: YEAR
({Token.kind == number, Token.category == CD, Token.length == "4"})

Macro: MONTH
({Lookup.majorType == "Month"})

Rule: ddmmyyyydash
(
  (DAY_ONE | DAY_TWO)
  ({Token.string == ","} | {Token.string == "."} | {Token.string == "-"})
  (MONTH)
  ({Token.string == ","} | {Token.string == "."} | {Token.string == "-"})
  (YEAR)
)
:ddmmyyyydash
-->
:ddmmyyyydash.DateMonthYearDash = {rule = "ddmmyyyydash"}
Can someone please help me with what I should do to make sure that DateMonthYearDash is created in a new annotation set? How do I do it? Thanks a lot.
When I change the outputASName of the JAPE transducer, the new set does not appear like the rest.
As said, linked or quoted in the question you mention (GATE - How to create a new annotation SET?), you have two options:
Change the outputASName of your JAPE transducer PR.
Use Annotation Set Transfer PR to copy or move desired annotations from one annotation set to another one.
JAPE transducer - explanation
A JAPE transducer (like many other GATE PRs) simply takes some input annotations and, based on them, creates some new output annotations. The input and output annotation set names can be configured by the inputASName and outputASName run-time parameters: inputASName says where the transducer should look for input annotations, and outputASName says where it should put the output annotations.
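For example, in GATE Developer you simply type the set name into the outputASName run-time parameter of the transducer. In GATE Embedded the equivalent looks roughly like this (a sketch; it assumes Gate.init() has already been called, and grammarUrl is a placeholder for the URL of your main.jape):

import gate.Factory;
import gate.ProcessingResource;
import java.net.URL;

public class CreateTransducer {
    static ProcessingResource newJapeTransducer(URL grammarUrl) throws Exception {
        // Create a JAPE transducer; grammarUrl points at your main.jape (placeholder).
        ProcessingResource jape = (ProcessingResource) Factory.createResource(
                "gate.creole.Transducer",
                gate.Utils.featureMap("grammarURL", grammarUrl));
        // Direct the transducer's output to the "MyNewSet" annotation set;
        // leaving inputASName unset means it reads from the default annotation set.
        jape.setParameterValue("outputASName", "MyNewSet");
        return jape;
    }
}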
What should be where
The input annotation set must contain the necessary input annotations before the JAPE transducer PR is executed. These annotations are usually created by preceding PRs in the pipeline. Otherwise, the transducer will not see the necessary input annotations and will not produce anything.
The output annotation set may be empty or may contain anything before the JAPE execution; it doesn't matter. The important thing is that the new output annotations (DateMonthYearDash in your case) are created there when the JAPE transducer PR execution has finished. So after a successful JAPE execution you should see the new annotations there.
Some terminology
Note that annotation sets have names, while annotations have a type, an id, offsets, features and an annotation set they belong to.
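Given a gate.Document doc, the distinction looks like this in code (a sketch):

import gate.Annotation;
import gate.AnnotationSet;
import gate.Document;

class Terminology {
    static void dump(Document doc) {
        // An annotation set is looked up on the document by its name...
        AnnotationSet mySet = doc.getAnnotations("MyNewSet");   // a named set
        AnnotationSet defaults = doc.getAnnotations();          // the default (unnamed) set
        System.out.println("default set size: " + defaults.size());

        // ...while each annotation carries a type, an id, offsets and features.
        for (Annotation a : mySet.get("DateMonthYearDash")) {
            System.out.println(a.getType() + " #" + a.getId()
                    + " [" + a.getStartNode().getOffset()
                    + ".." + a.getEndNode().getOffset() + "] "
                    + a.getFeatures());
        }
    }
}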
JAPE correction
I found some issues in your JAPE grammar:
Don't include SpaceToken unless you explicitly use it in your grammar or you are sure there will be none inside the pattern. See also: Concept of Space Token in JAPE
({Lookup.majorType=="Month"}) -> ({Lookup.minorType=="month"})
(DAY_ONE|DAY_TWO) -> (DAY_ONE)
After these corrections, and after running the ANNIE pipeline on the document 9 - January - 2017, the rule matches.
JAPE grammar after corrections:
Phase: datesearching
Input: Token Lookup
Options: control = appelt

Macro: DAY_ONE
({Token.kind == number, Token.category == CD, Token.length == "1"})

Macro: YEAR
({Token.kind == number, Token.category == CD, Token.length == "4"})

Macro: MONTH
({Lookup.minorType == "month"})

Rule: ddmmyyyydash
(
  (DAY_ONE)
  ({Token.string == ","} | {Token.string == "."} | {Token.string == "-"})
  (MONTH)
  ({Token.string == ","} | {Token.string == "."} | {Token.string == "-"})
  (YEAR)
)
:ddmmyyyydash
-->
:ddmmyyyydash.DateMonthYearDash = {rule = "ddmmyyyydash"}
What to do when JAPE does not produce anything
You have to investigate the input annotations and "debug" your JAPE grammar. Usually some expected input annotation is missing, or there is some extra annotation you did not expect to be there. There is a nice view in GATE for this purpose: the annotation stack. Also, some features of the input annotations can have a different name or value than you expected (e.g. which is correct: {Lookup.majorType=="Month"} or {Lookup.minorType=="month"}?).
By "debugging" a JAPE grammar I mean: try to simplify the rule as far as it starts working. Keep trying it on a simple document where it should match for sure. So in your case you can try it without the (DAY_ONE) part. If it still doesn't work, try only (MONTH)({Token.string == "-"})(YEAR), or even (MONTH) only, etc. Until you find the mistake in the grammar...
I have a couple of XML files which need to be compared with a different set of similar XML files, and while comparing I need to ignore tags based on a condition. For example:
personal.xml - ignore fullname
address.xml - ignore zipcode
contact.xml - ignore homephone
Here is the code:
Diff documentDiff = DiffBuilder
        .compare(actualxmlfile)
        .withTest(expectedxmlfile)
        .withNodeFilter(node -> !node.getNodeName().equals("FullName"))
        .ignoreWhitespace()
        .build();
How can I add conditions at .withNodeFilter(node -> !node.getNodeName().equals("FullName")), or is there a smarter way to do this?
You can join multiple conditions together using "and" (&&):
private static void doDemo1(File actual, File expected) {
    Diff docDiff = DiffBuilder
            .compare(actual)
            .withTest(expected)
            .withNodeFilter(
                node -> !node.getNodeName().equals("FullName")
                        && !node.getNodeName().equals("ZipCode")
                        && !node.getNodeName().equals("HomePhone")
            )
            .ignoreWhitespace()
            .build();
    System.out.println(docDiff.toString());
}
If you want to keep your builder tidy, you can move the node filter to a separate method:
private static void doDemo2(File actual, File expected) {
    Diff docDiff = DiffBuilder
            .compare(actual)
            .withTest(expected)
            .withNodeFilter(node -> testNode(node))
            .ignoreWhitespace()
            .build();
    System.out.println(docDiff.toString());
}

private static boolean testNode(Node node) {
    return !node.getNodeName().equals("FullName")
            && !node.getNodeName().equals("ZipCode")
            && !node.getNodeName().equals("HomePhone");
}
The risk with this is that you may have element names which appear in more than one type of file, where that node needs to be filtered from one type of file but not the others.
In this case, you would also need to take into account the type of file you are handling. For example, you can use the file names (if they follow a suitable naming convention) or the root elements (assuming they are different) - such as <Personal>, <Address>, <Contact> - or whatever they are in your case.
However, if you need to distinguish between XML file types, for this reason, you may be better off using that information to have separate DiffBuilder objects, with different filters. That may result in clearer code.
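For example, the choice of filter could be keyed on the root element (a minimal sketch, assuming XMLUnit 2.x; the class name and map contents are illustrative):

import org.xmlunit.builder.DiffBuilder;
import org.xmlunit.diff.Diff;

import java.io.File;
import java.util.Map;
import java.util.Set;

public class PerTypeDiff {

    // Element names to ignore, keyed by each file type's root element.
    private static final Map<String, Set<String>> IGNORED = Map.of(
            "Personal", Set.of("FullName"),
            "Address",  Set.of("ZipCode"),
            "Contact",  Set.of("HomePhone"));

    static Diff diff(File actual, File expected, String rootElement) {
        Set<String> ignored = IGNORED.getOrDefault(rootElement, Set.of());
        return DiffBuilder
                .compare(actual)
                .withTest(expected)
                .withNodeFilter(node -> !ignored.contains(node.getNodeName()))
                .ignoreWhitespace()
                .build();
    }
}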
I provided a separate method in the link below for !node.getNodeName().equals("FullName") (which you are using in your code). Using that separate method, you can just pass the array of node names you want to ignore and see the results. And in case you wish to add any other conditions based on your requirements, you can experiment in that method.
https://stackoverflow.com/a/68099435/13451711
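The idea is roughly the following (a sketch; the helper name is illustrative, and the actual method in the linked answer may differ):

import org.w3c.dom.Node;
import org.xmlunit.util.Predicate;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class NodeFilters {
    // Build a node filter that ignores all of the given element names.
    static Predicate<Node> ignoring(String... names) {
        Set<String> ignored = new HashSet<>(Arrays.asList(names));
        return node -> !ignored.contains(node.getNodeName());
    }
}

// usage: .withNodeFilter(NodeFilters.ignoring("FullName", "ZipCode", "HomePhone"))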
I am looking for a way to write a custom index with Apache Lucene (PyLucene to be precise, but a Java answer is fine).
What I would like to do is the following : When adding a document to the index, Lucene will tokenize it, remove stop words, etc. This is usually done with the Analyzer if I am not mistaken.
What I would like to implement is the following : Before Lucene stores a given term, I would like to perform a lookup (say, in a dictionary) to check whether to keep the term or discard it (if the term is present in my dictionary, I keep it, otherwise I discard it).
How should I proceed?
Here is (in Python) my custom implementation of the Analyzer :
class CustomAnalyzer(PythonAnalyzer):

    def createComponents(self, fieldName, reader):

        source = StandardTokenizer(Version.LUCENE_4_10_1, reader)
        filter = StandardFilter(Version.LUCENE_4_10_1, source)
        filter = LowerCaseFilter(Version.LUCENE_4_10_1, filter)
        filter = StopFilter(Version.LUCENE_4_10_1, filter,
                            StopAnalyzer.ENGLISH_STOP_WORDS_SET)

        ts = tokenStream.getTokenStream()
        token = ts.addAttribute(CharTermAttribute.class_)
        offset = ts.addAttribute(OffsetAttribute.class_)

        ts.reset()
        while ts.incrementToken():
            startOffset = offset.startOffset()
            endOffset = offset.endOffset()
            term = token.toString()
            # accept or reject term
        ts.end()
        ts.close()

        # How to store the terms in the index now ?
        return ????
Thank you in advance for your guidance!
EDIT 1: After digging into Lucene's documentation, I figured it had something to do with TokenStreamComponents. It returns a TokenStream with which you can iterate through the token list of the field you are indexing.
Now there is something to do with the Attributes that I do not understand. Or more precisely: I can read the tokens, but have no idea how I should proceed afterward.
EDIT 2: I found this post where they mention the use of CharTermAttribute. However, in Python I cannot access or get a CharTermAttribute. Any thoughts?
EDIT 3: I can now access each term, see the updated code snippet. Now what is left to be done is actually storing the desired terms...
The way I was trying to solve the problem was wrong. This post and femtoRgon's answer were the solution.
By defining a filter extending PythonFilteringTokenFilter, I can make use of the function accept() (as the one used in the StopFilter for instance).
Here is the corresponding code snippet:
class MyFilter(PythonFilteringTokenFilter):

    def __init__(self, version, tokenStream):
        super(MyFilter, self).__init__(version, tokenStream)
        self.termAtt = self.addAttribute(CharTermAttribute.class_)

    def accept(self):
        term = self.termAtt.toString()
        accepted = False
        # Do whatever is needed with the term
        # accepted = ... (True/False)
        return accepted
Then just append the filter to the other filters (as in the code snippet of the question):
filter = MyFilter(Version.LUCENE_4_10_1, filter)
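Since the question says a Java answer is fine: the plain-Java equivalent extends FilteringTokenFilter in the same way (a sketch against the Lucene 4.x API; the dictionary set is a placeholder for whatever lookup you need):

import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.FilteringTokenFilter;

public final class DictionaryFilter extends FilteringTokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final Set<String> dictionary;  // terms to keep (placeholder)

    public DictionaryFilter(TokenStream in, Set<String> dictionary) {
        super(in);
        this.dictionary = dictionary;
    }

    @Override
    protected boolean accept() {
        // Keep the current term only if the dictionary contains it.
        return dictionary.contains(termAtt.toString());
    }
}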
I was wondering if someone can help me out here. I think this could be of use for anyone trying to conduct machine learning in GATE (General Architecture for Text Engineering). So basically, to conduct machine learning I first need to add some code to a few JAPE files so my output XML file prints out the Annotation Id value as a feature. An example is provided below:
<Annotation Id="1491" Type="Person" StartNode="288" EndNode="301">
<Feature>
<Name className="java.lang.String">id</Name>
<Value className="java.lang.String">1491</Value>
</Feature>
(Note that the feature value of 1491 matches the Annotation Id="1491". This is what I want.)
WHY I NEED THIS:
I am conducting machine learning on a plain text document that initially contains no annotations. I am using the June 2012 training course that is on the GATE website as a guide while doing this. I am specifically following the Module 11: Relations tutorial (it finds employment relationships between persons and organizations). I utilize the corpus of 93 pre-annotated documents for training and then apply the learned model to my document. But first I run my document through ANNIE. It creates many annotations and features, but not everything that I need for machine learning. I've learned through trial/error and investigation that my annotated document must contain id features with the Annotation Id for every "Person" and "Organization" type. I recognize that the configuration file (relations-config.xml) that is used in the Batch Learning PR looks for id features for "Person" and "Organization" types. It will not run if these id features are not present. So I add them manually and then run it through the machine learning "APPLICATION" mode. It works rather nicely. However, I clearly do not want to add the id features to my XML file manually every time.
WHAT I HAVE FIGURED OUT WITH THE GATE CODE:
I believe I have found the code files (final.jape, org_context.jape and name_context.jape) that I need to alter so they can add that id feature to every annotation of type "Person" and "Organization". I don't understand the language that GATE uses very well (I'm a mechanical engineer, not a software engineer) and this is probably why I can't figure this out (Ha!). Anyhow, I could be off and may need to add a few more lines for the JAPE files to work properly, but I feel like I've pinpointed it pretty closely. There are two sections of code that are similar but slightly different, which are currently the bane of my existence. The first one goes through an iterator loop, the second one does not. I copy/pasted those two below with a line stating WHAT_DO_I_PUT_HERE that indicates where I think my problem and solution lie. I would be very grateful if someone can help me with what I need to write to get my result.
Thank you!
- Todd
//////////// First section of code ////////////////
Rule: PersonFinal
Priority: 30
//({JobTitle}
//)?
(
  {TempPerson.kind == personName}
)
:person
-->
{
  gate.FeatureMap features = Factory.newFeatureMap();
  gate.AnnotationSet personSet = (gate.AnnotationSet) bindings.get("person");
  gate.Annotation person1Ann = (gate.Annotation) personSet.iterator().next();
  gate.AnnotationSet firstPerson = (gate.AnnotationSet) personSet.get("TempPerson");
  if (firstPerson != null && firstPerson.size() > 0)
  {
    gate.Annotation personAnn = (gate.Annotation) firstPerson.iterator().next();
    if (personAnn.getFeatures().containsKey("gender")) features.put("gender", personAnn.getFeatures().get("gender"));
  }
  features.put("id", WHAT_DO_I_PUT_HERE.getId().toString());
  features.put("rule1", person1Ann.getFeatures().get("rule"));
  features.put("rule", "PersonFinal");
  outputAS.add(personSet.firstNode(), personSet.lastNode(), "Person", features);
  outputAS.removeAll(personSet);
}
//////////// Second section of code ////////////////
Rule: OrgContext1
Priority: 1
// company X
// company called X
(
  {Token.string == "company"}
  (
    ({Token.string == "called"} |
     {Token.string == "dubbed"} |
     {Token.string == "named"})
  )?
)
(
  {Unknown.kind == PN}
)
:org
-->
{
  gate.AnnotationSet org = (gate.AnnotationSet) bindings.get("org");
  gate.FeatureMap features = Factory.newFeatureMap();
  features.put("id", WHAT_DO_I_PUT_HERE.getId().toString());
  features.put("rule", "OrgContext1");
  outputAS.add(org.firstNode(), org.lastNode(), "Organization", features);
  outputAS.removeAll(org);
}
You cannot access the annotation id before the actual annotation is created. My solution to this problem:
Rule: PojemId
(
  {PojemD}
):pojem
-->
{
  AnnotationSet matchedAnns = bindings.get("pojem");
  Annotation ann = matchedAnns.get("PojemD").iterator().next();
  FeatureMap pojemFeatures = ann.getFeatures();

  gate.FeatureMap features = Factory.newFeatureMap();
  features.putAll(pojemFeatures);
  features.put("annId", ann.getId());

  inputAS.remove(ann);
  Integer id = outputAS.add(matchedAnns.firstNode(), matchedAnns.lastNode(), "PojemD", features);
  features.put("id", id);
}
It's quite simple. You have to mark the annotation on the Left Hand Side (LHS) of the rule with some label (token_match in my example below) and then, on the Right Hand Side (RHS) of the rule, just obtain the corresponding AnnotationSet from the bindings variable, iterate through its annotations (usually there is only a single annotation in it) and copy the corresponding IDs to the output.
Phase: Main
Input: Token

Rule: WriteTokenID
(
  ({Token}):token_match
)
-->
{
  AnnotationSet as = bindings.get("token_match");
  for (Annotation a : as)
  {
    FeatureMap features = Factory.newFeatureMap();
    features.put("origTokenId", a.getId());
    outputAS.add(a.getStartNode(), a.getEndNode(), "NewToken", features);
  }
}
In your code, you probably want to mark {TempPerson.kind == personName} and {Unknown.kind == PN} somehow like below:
(
  ({TempPerson.kind == personName}):temp_person
)
:person

and

(
  {Token.string == "company"}
  (
    ({Token.string == "called"} |
     {Token.string == "dubbed"} |
     {Token.string == "named"})
  )?
)
(
  ({Unknown.kind == PN}):unknown_org
)
:org
And then use bindings.get("temp_person") and bindings.get("unknown_org") respectively.
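With those labels in place, the WHAT_DO_I_PUT_HERE lines could become something like this (a sketch; whether you want the matched annotation's ID, as here, or the ID of the newly created annotation, as in the other answer, depends on what relations-config.xml expects):

// In the PersonFinal RHS: copy the ID of the matched TempPerson annotation
gate.AnnotationSet tempPersonSet = (gate.AnnotationSet) bindings.get("temp_person");
gate.Annotation tempPersonAnn = (gate.Annotation) tempPersonSet.iterator().next();
features.put("id", tempPersonAnn.getId().toString());

// In the OrgContext1 RHS: the same idea for the Unknown annotation
gate.AnnotationSet unknownOrgSet = (gate.AnnotationSet) bindings.get("unknown_org");
gate.Annotation unknownOrgAnn = (gate.Annotation) unknownOrgSet.iterator().next();
features.put("id", unknownOrgAnn.getId().toString());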
I'm trying to implement a tool for merging different versions of some source code. Given two versions of the same source code, the idea would be to parse them, generate the respective Abstract Syntax Trees (AST), and finally merge them into a single output source while keeping grammatical consistency. The lexer and parser are those of the question ANTLR: How to skip multiline comments.
I know there is the class ParserRuleReturnScope that helps... but getStop() and getStart() always return null :-(
Here is a snippet that illustrates how I modified my parser to get rules printed:
parser grammar CodeTableParser;

options {
    tokenVocab = CodeTableLexer;
    backtrack = true;
    output = AST;
}

@header {
    package ch.bsource.ice.parsers;
}

@members {
    private void log(ParserRuleReturnScope rule) {
        System.out.println("Rule: " + rule.getClass().getName());
        System.out.println("  getStart(): " + rule.getStart());
        System.out.println("  getStop(): " + rule.getStop());
        System.out.println("  getTree(): " + rule.getTree());
    }
}

parse
    : codeTabHeader codeTable endCodeTable eof { log(retval); }
    ;

codeTabHeader
    : comment CodeTabHeader^ { log(retval); }
    ;
...
Assuming you have the ASTs (often difficult to get in the first place; parsing real languages is often harder than it looks), you first have to determine what they have in common and build a mapping collecting that information. That's not as easy as it looks: do you count a block of code that has moved, but is the exact same subtree, as "common"? What about two subtrees that are the same except for a consistent renaming of an identifier? What about changed comments? (Most ASTs lose the comments; most programmers will think this is a really bad idea.)
You can build a variation of the "Longest Common Substring" algorithm to compare trees. I've used that in tools that I have built.
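As a starting point, the dynamic-programming core of such a comparison looks like this (a sketch: List<String> stands in for sequences of subtree fingerprints, e.g. hashes of serialized nodes; a real tree-matching pass layers move and rename detection on top):

import java.util.List;

class TreeCompare {
    // Classic O(n*m) longest-common-substring DP: length of the longest run of
    // consecutive equal fingerprints shared by the two sequences.
    static int longestCommonRun(List<String> a, List<String> b) {
        int best = 0;
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                    best = Math.max(best, dp[i][j]);
                }
            }
        }
        return best;
    }
}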
Finally, after you've merged the trees, you need to regenerate the text, ideally preserving most of the layout of the original code. (Programmers hate it when you change the layout they so lovingly produced.) So your ASTs need to capture position information, and your regeneration has to honor that where it can.
The call to log(retval) in your parser code looks like it's going to happen at the end of the rule, but it doesn't. You'll want to move the call into an @after block.
I changed log to spit out a message as well as the scope information and added calls to it to my own grammar like so:
script
    @init  {log("@init", retval);}
    @after {log("@after", retval);}
    : statement* EOF {log("after last rule reference", retval);}
      -> ^(STMTS statement*)
    ;
Parsing test input produced the following output:
Logging from @init
getStart(): [#0,0:4='Print',<10>,1:0]
getStop(): null
getTree(): null
Logging from after last rule reference
getStart(): [#0,0:4='Print',<10>,1:0]
getStop(): null
getTree(): null
Logging from @after
getStart(): [#0,0:4='Print',<10>,1:0]
getStop(): [#4,15:15='<EOF>',<-1>,1:15]
getTree(): STMTS
The call in the @after block has both the stop and tree fields populated.
I can't say whether this will help you with your merging tool, but I think this will at least get you past the problem with the half-populated scope object.
I am trying to build an ANTLR grammar that parses tagged sentences such as:
DT The NP cat VB ate DT a NP rat
and have the grammar:
fragment TOKEN : (('A'..'Z') | ('a'..'z'))+;
fragment WS : (' ' | '\t')+;
WSX : WS;
DTTOK : ('DT' WS TOKEN);
NPTOK : ('NP' WS TOKEN);
nounPhrase: (DTTOK WSX NPTOK);
chunker : nounPhrase {System.out.println("chunk found "+"("+$nounPhrase+")");};
The grammar generator produces the error "missing attribute access on rule scope: nounPhrase" in the last line.
[I am still new to ANTLR and although some grammars work it's still trial and error. I also frequently get an "OutOfMemory" error when running grammars as small as this - any help welcome.]
I am using ANTLRWorks 1.3 to generate the code and am running under Java 1.6.
"missing attribute access" means that you've referenced a scope ($nounPhrase) rather than an attribute of the scope (such as $nounPhrase.text).
In general, a good way to troubleshoot problems with attributes is to look at the generated parser method for the rule in question.
For example, my initial attempt at creating a new rule when I was a little rusty:
multiple_names returns [List<Name> names]
@init {
    names = new ArrayList<Name>(4);
}
    : a=fullname ' AND ' b=fullname { names.add($a.value); names.add($b.value); };
resulted in "unknown attribute for rule fullname". So I tried
multiple_names returns [List<Name> names]
@init {
    names = new ArrayList<Name>(4);
}
    : a=fullname ' AND ' b=fullname { names.add($a); names.add($b); };
which results in "missing attribute access". Looking at the generated parser method made it clear what I needed to do though. While there are some cryptic pieces, the parts relevant to scopes (variables) are easily understood:
public final List<Name> multiple_names() throws RecognitionException {
    List<Name> names = null;        // based on "returns" clause of rule definition
    Name a = null;                  // based on scopes declared in rule definition
    Name b = null;                  // based on scopes declared in rule definition
    names = new ArrayList<Name>(4); // snippet inserted from @init block
    try {
        pushFollow(FOLLOW_fullname_in_multiple_names42);
        a = fullname();
        state._fsp--;
        match(input, 189, FOLLOW_189_in_multiple_names44);
        pushFollow(FOLLOW_fullname_in_multiple_names48);
        b = fullname();
        state._fsp--;
        names.add($a); names.add($b); // code inserted from {...} block
    }
    catch (RecognitionException re) {
        reportError(re);
        recover(input, re);
    }
    finally {
        // do for sure before leaving
    }
    return names;                   // based on "returns" clause of rule definition
}
After looking at the generated code, it's easy to see that the fullname rule is returning instances of the Name class, so what I needed in this case was simply:
multiple_names returns [List<Name> names]
@init {
    names = new ArrayList<Name>(4);
}
    : a=fullname ' AND ' b=fullname { names.add(a); names.add(b); };
The version you need in your situation may be different, but you'll generally be able to figure it out pretty easily by looking at the generated code.
In the original grammar, why not include the attribute it is asking for? Most likely:
chunker : nounPhrase {System.out.println("chunk found "+"("+$nounPhrase.text+")");};
Each of your rules (chunker being the one I can spot quickly) has attributes (extra information) associated with it. You can find a quick list of the different attributes for the different types of rules at http://www.antlr.org/wiki/display/ANTLR3/Attribute+and+Dynamic+Scopes. It would be nice if the page described each of those attributes (for example, the start and stop attributes of parser rules refer to tokens from your lexer, which would allow you to get back to your line number and position).
I think your chunker rule should just be changed slightly: instead of $nounPhrase you should use $nounPhrase.text. text is an attribute of your nounPhrase rule.
You might want to do a little other formatting as well; usually the parser rules (starting with a lowercase letter) appear before the lexer rules (starting with an uppercase letter).
If you accidentally do something silly like $thing.$attribute where you mean $thing.attribute, you will also see the "missing attribute access on rule scope" error message. (I know this question was answered a long time ago, but this bit of trivia might help someone else who sees the error message!)
Answering my own question after having found a better way...
WS : (' '|'\t')+;
TOKEN : (('A'..'Z') | ('a'..'z'))+;
dttok : 'DT' WS TOKEN;
nntok : 'NN' WS TOKEN;
nounPhrase : (dttok WS nntok);
chunker : nounPhrase ;
The problem was that I was getting muddled between the lexer and the parser (this is apparently very common). The uppercase items are lexical, the lowercase ones are in the parser. This now seems to work. (NB: I have changed NP to NN.)