GATE write annotation ID as a feature - java

I was wondering if someone can help me out here. I think this could be of use for anyone trying to conduct machine learning on GATE (General Architecture for Text Engineering). Basically, to conduct machine learning I first need to add some code to a few JAPE files so that my output XML file prints the Annotation Id value as a feature. An example is provided below:
<Annotation Id="1491" Type="Person" StartNode="288" EndNode="301">
<Feature>
<Name className="java.lang.String">id</Name>
<Value className="java.lang.String">1491</Value>
</Feature>
(Note that the feature value of 1491 matches the Annotation Id="1491". This is what I want.)
WHY I NEED THIS:
I am conducting machine learning on a plain text document that initially contains no annotations. I am using the June 2012 training course on the GATE website as a guide, specifically following the Module 11: Relations tutorial (it finds employment relationships between Person and Organization). I use the corpus of 93 pre-annotated documents for training and then apply the learned model to my document. But first I run my document through ANNIE. It creates many annotations and features, but not everything I need for machine learning. I've learned through trial and error that my annotated document must contain features with the Annotation Id for every "Person" and "Organization" annotation. The configuration file (relations-config.xml) used by the Batch Learning PR looks for id features on the "Person" and "Organization" types and will not run if these id features are not present. So I add them manually and then run the Batch Learning PR in APPLICATION mode, and it works rather nicely. However, I clearly do not want to add the id features to my XML file manually every time.
WHAT I HAVE FIGURED OUT WITH THE GATE CODE:
I believe I have found the code files (final.jape, org_context.jape and name_context.jape) that I need to alter so they add that id feature to every "Person" and "Organization" annotation. I don't understand the language that GATE uses very well (I'm a mechanical engineer, not a software engineer), which is probably why I can't figure this out (Ha!). I could be off and may need to add a few more lines for the JAPE file to work properly, but I feel like I've pinpointed it pretty closely. There are two sections of code that are similar but slightly different, which are currently the bane of my existence. The first one goes through an iterator loop, the second one does not. I copied both below, with a line stating WHAT_DO_I_PUT_HERE indicating where I think my problem and its solution lie. I would be very grateful if someone can help me with what I need to write to get my result.
Thank you!
- Todd
//////////// First section of code ////////////////
Rule: PersonFinal
Priority: 30
//({JobTitle}
//)?
(
{TempPerson.kind == personName}
)
:person
-->
{
  gate.FeatureMap features = Factory.newFeatureMap();
  gate.AnnotationSet personSet = (gate.AnnotationSet) bindings.get("person");
  gate.Annotation person1Ann = (gate.Annotation) personSet.iterator().next();
  gate.AnnotationSet firstPerson = (gate.AnnotationSet) personSet.get("TempPerson");
  if (firstPerson != null && firstPerson.size() > 0)
  {
    gate.Annotation personAnn = (gate.Annotation) firstPerson.iterator().next();
    if (personAnn.getFeatures().containsKey("gender")) features.put("gender", personAnn.getFeatures().get("gender"));
  }
  features.put("id", WHAT_DO_I_PUT_HERE.getId().toString());
  features.put("rule1", person1Ann.getFeatures().get("rule"));
  features.put("rule", "PersonFinal");
  outputAS.add(personSet.firstNode(), personSet.lastNode(), "Person", features);
  outputAS.removeAll(personSet);
}
//////////// Second section of code ////////////////
Rule:OrgContext1
Priority: 1
// company X
// company called X
(
{Token.string == "company"}
(({Token.string == "called"}|
{Token.string == "dubbed"}|
{Token.string == "named"}
)
)?
)
(
{Unknown.kind == PN}
)
:org
-->
{
  gate.AnnotationSet org = (gate.AnnotationSet) bindings.get("org");
  gate.FeatureMap features = Factory.newFeatureMap();
  features.put("id", WHAT_DO_I_PUT_HERE.getId().toString());
  features.put("rule", "OrgContext1");
  outputAS.add(org.firstNode(), org.lastNode(), "Organization", features);
  outputAS.removeAll(org);
}

You cannot access the annotation id before the actual annotation is created. My solution to this problem:
Rule:PojemId
(
{PojemD}
):pojem
-->
{
  AnnotationSet matchedAnns = bindings.get("pojem");
  Annotation ann = matchedAnns.get("PojemD").iterator().next();
  FeatureMap pojemFeatures = ann.getFeatures();
  gate.FeatureMap features = Factory.newFeatureMap();
  features.putAll(pojemFeatures);
  // Id of the original matched annotation
  features.put("annId", ann.getId());
  inputAS.remove(ann);
  // add() returns the Id of the newly created annotation; the feature map is held
  // by reference, so the "id" feature can still be written after the add() call
  Integer id = outputAS.add(matchedAnns.firstNode(), matchedAnns.lastNode(), "PojemD", features);
  features.put("id", id);
}

It's quite simple. You have to mark the annotation on the Left Hand Side (LHS) of the rule with some label (token_match in my example below) and then, on the Right Hand Side (RHS) of the rule, obtain the corresponding AnnotationSet from the bindings variable, iterate through its annotations (usually there is only a single annotation in it) and copy the corresponding Ids to the output.
Phase: Main
Input: Token
Rule: WriteTokenID
(
({Token}):token_match
)
-->
{
  AnnotationSet as = bindings.get("token_match");
  for (Annotation a : as)
  {
    FeatureMap features = Factory.newFeatureMap();
    features.put("origTokenId", a.getId());
    outputAS.add(a.getStartNode(), a.getEndNode(), "NewToken", features);
  }
}
In your code, you probably want to mark {TempPerson.kind == personName} and {Unknown.kind == PN} somehow like below.
(
({TempPerson.kind == personName}):temp_person
)
:person
and
(
{Token.string == "company"}
(({Token.string == "called"}|
{Token.string == "dubbed"}|
{Token.string == "named"}
)
)?
)
(
({Unknown.kind == PN}):unknown_org
)
:org
And then use bindings.get("temp_person") and bindings.get("unknown_org") respectively, as in the sketch below.
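For instance, the RHS of the OrgContext1 rule could then look roughly like this. This is only a sketch combining the two answers above: the unknown_org binding gives access to the matched Unknown annotation (copied here into an illustrative origAnnId feature), and the new Organization annotation's own Id is written into the feature map after outputAS.add returns, exactly as in the PojemId rule:
{
  gate.AnnotationSet org = (gate.AnnotationSet) bindings.get("org");
  gate.AnnotationSet unknownSet = (gate.AnnotationSet) bindings.get("unknown_org");
  gate.Annotation unknownAnn = (gate.Annotation) unknownSet.iterator().next();
  gate.FeatureMap features = Factory.newFeatureMap();
  features.put("rule", "OrgContext1");
  // illustrative extra feature: the Id of the matched Unknown annotation
  features.put("origAnnId", unknownAnn.getId().toString());
  // create the Organization annotation first; add() returns its new Id
  Integer newId = outputAS.add(org.firstNode(), org.lastNode(), "Organization", features);
  // the feature map is held by reference, so the Id can still be added afterwards;
  // this is the value that appears as Annotation Id="..." in the exported XML
  features.put("id", newId.toString());
  outputAS.removeAll(org);
}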

Related

Parse a single POJO from multiple YAML documents representing different classes

I want to use a single YAML file which contains several different objects - for different applications. I need to fetch one object to get an instance of MyClass1, ignoring the rest of the docs for MyClass2, MyClass3, etc. Some sort of selective de-serialization: now this class, then that one... The structure of MyClass2, MyClass3 is totally unknown to the application working with MyClass1. The file is always valid YAML, of course.
The YAML may be of any structure we need to implement such a multi-class container. The preferred parsing tool is snakeyaml.
Is it sensible? How can I ignore all but one object?
UPD: replaced all "document" with "object". I think we have to speak about a single YAML document containing several objects of different structure. Moreover, the parser knows exactly one structure and wants to ignore the rest.
UPD2: I think it is impossible with snakeyaml. We have to read all objects anyway - and select the needed one later. But maybe I'm wrong.
UPD2: sample config file
---
-
  exportConfiguration781:
    attachmentFieldName: "name"
    baseSftpInboxPath: /home/user/somedir/
    somebool: false
    days: 9999
    expected:
      - ABC w/o quotes
      - "Cat ABC"
      - "Some string"
    dateFormat: yyyy-MMdd-HHmm
    user: someuser
-
  anotherConfiguration:
    k1: v1
    k2:
      - v21
      - v22
This is definitely possible with SnakeYAML, albeit not trivial. Here's a general rundown of what you need to do:
First, let's have a look at what loading with SnakeYAML does. Here's the important part of the YAML class:
private Object loadFromReader(StreamReader sreader, Class<?> type) {
    Composer composer = new Composer(new ParserImpl(sreader), resolver, loadingConfig);
    constructor.setComposer(composer);
    return constructor.getSingleData(type);
}
The composer parses YAML input into Nodes. To do that, it doesn't need any knowledge about the structure of your classes, since every node is either a ScalarNode, a SequenceNode or a MappingNode and they just represent the YAML structure.
The constructor takes a root node generated by the composer and generates native POJOs from it. So what you want to do is to throw away parts of the node graph before they reach the constructor.
The easiest way to do that is probably to derive from Composer and override two methods like this:
public class MyComposer extends Composer {
    private final int objIndex;

    public MyComposer(Parser parser, Resolver resolver, int objIndex) {
        super(parser, resolver);
        this.objIndex = objIndex;
    }

    public MyComposer(Parser parser, Resolver resolver, LoaderOptions loadingConfig, int objIndex) {
        super(parser, resolver, loadingConfig);
        this.objIndex = objIndex;
    }

    @Override
    public Node getNode() {
        return strip(super.getNode());
    }

    private Node strip(Node input) {
        return ((SequenceNode) input).getValue().get(objIndex);
    }
}
The strip implementation is just an example. In this case, I assumed your YAML looks like this (object content is arbitrary):
- {first: obj}
- {second: obj}
- {third: obj}
And you simply select the object you actually want to deserialize by its index in the sequence. But you can also have something more complex like a searching algorithm.
Now that you have your own composer, you can do
Constructor constructor = new Constructor();
// assuming we want to get the object at index 1 (i.e. second object)
Composer composer = new MyComposer(new ParserImpl(sreader), new Resolver(), 1);
constructor.setComposer(composer);
MyObject result = (MyObject)constructor.getSingleData(MyObject.class);
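Put together as a complete helper, the wiring might look like this. This is only a sketch: MyObject, MyComposer and the config.yaml path are the placeholder names used above, and the constructor signatures match the older SnakeYAML API shown in the snippets (newer versions additionally take LoaderOptions):
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.yaml.snakeyaml.composer.Composer;
import org.yaml.snakeyaml.constructor.Constructor;
import org.yaml.snakeyaml.parser.ParserImpl;
import org.yaml.snakeyaml.reader.StreamReader;
import org.yaml.snakeyaml.resolver.Resolver;

public class SelectiveLoader {
    // Loads only the object at the given index of the top-level YAML sequence.
    static MyObject loadAt(String path, int index) throws IOException {
        try (Reader reader = Files.newBufferedReader(Paths.get(path))) {
            StreamReader sreader = new StreamReader(reader);
            Constructor constructor = new Constructor();
            Composer composer = new MyComposer(new ParserImpl(sreader), new Resolver(), index);
            constructor.setComposer(composer);
            return (MyObject) constructor.getSingleData(MyObject.class);
        }
    }
}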
The answer of @flyx was very helpful for me, opening the way to work around the library limitations (in our case, snakeyaml) by overriding some methods. Thanks a lot! It's quite possible there is a final solution in it - but not now. Besides, the simple solution below is robust and should be considered even if we'd found the complete library-intruding solution.
I've decided to solve the task by double distilling, sorry, processing the configuration file. Imagine the latter consisting of several parts, every part marked by a unique token-delimiter. For the sake of keeping the YAML-likeness, it may be
---
#this is a unique key for the configuration A
<some YAML document>
---
#this is another key for the configuration B
<some YAML document>
The first pass is pre-processing. For the given String fileString and String key (and DELIMITER = "\n---\n", for example) we select the substring with the key-defined configuration:
int begIndex;
do {
    begIndex = fileString.indexOf(DELIMITER);
    if (begIndex == -1) {
        break;
    }
    if (fileString.startsWith(DELIMITER + key, begIndex)) {
        fileString = fileString.substring(begIndex + DELIMITER.length() + key.length());
        break;
    }
    // spoil alien delimiter and repeat search
    fileString = fileString.replaceFirst(DELIMITER, " ");
} while (true);
int endIndex = fileString.indexOf(DELIMITER);
if (endIndex != -1) {
    fileString = fileString.substring(0, endIndex);
}
Now we feed the fileString to plain YAML parsing:
ExportConfiguration configuration = new Yaml(new Constructor(ExportConfiguration.class))
.loadAs(fileString, ExportConfiguration.class);
This time we have a single document that must correspond to the ExportConfiguration class.
Note 1: The structure and even the very content of the rest of the configuration file plays absolutely no role. This was the main idea: to get independent configurations in a single file.
Note 2: The rest of the configurations may be JSON or XML or whatever. We have a preprocessor method that returns a String configuration - and the next processor parses it properly.
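Put together, the two passes might be wrapped into a single helper like this (a sketch; ExportConfiguration is the class used above and the delimiter/key convention is the one just described):
import org.yaml.snakeyaml.Yaml;
import org.yaml.snakeyaml.constructor.Constructor;

public class ConfigSelector {
    private static final String DELIMITER = "\n---\n";

    // Pass 1: cut out the part of the file marked by the given key.
    // Pass 2: feed only that substring to SnakeYAML.
    static ExportConfiguration load(String fileString, String key) {
        int begIndex;
        do {
            begIndex = fileString.indexOf(DELIMITER);
            if (begIndex == -1) {
                break;
            }
            if (fileString.startsWith(DELIMITER + key, begIndex)) {
                fileString = fileString.substring(begIndex + DELIMITER.length() + key.length());
                break;
            }
            // spoil alien delimiter and repeat search
            fileString = fileString.replaceFirst(DELIMITER, " ");
        } while (true);
        int endIndex = fileString.indexOf(DELIMITER);
        if (endIndex != -1) {
            fileString = fileString.substring(0, endIndex);
        }
        return new Yaml(new Constructor(ExportConfiguration.class))
                .loadAs(fileString, ExportConfiguration.class);
    }
}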

Creating new annotation sets in GATE

I have started learning the GATE application and I would like to use it to extract information from an unstructured document. The information I am interested in is dates, locations, event information and persons' names. I would like to get information about events that happened at a specific location on a specific date and the person(s) involved. I have been reading the GATE manual and that's how I got a glimpse of how to build a pipeline. However, I cannot figure out how to create my new annotation types and make sure that they are added to a new annotation set, which should appear under the annotation sets on the right. I found similar questions like GATE - How to create a new annotation SET? but they didn't help me either.
Let me explain what I did so far:
Created .lst file for my new NE and put them under ANNIE resources/gazetteer directory
I added the .lst file description in the list.def file
I identified my patterns in the document e.g for Date formats like ddmm, dd.mm.yyyy
I wrote JAPE rule for each pattern in a separate .jape file
Added the JAPE file names into the main.jape file
Loaded the PR and my document into GATE
Run the application
This is how my JAPE rule looks for one date format:
Phase: datesearching
Input: Token Lookup SpaceToken
Options: control = appelt
////////////////////////////////////Macros
//Initialization of regular expressions
Macro: DAY_ONE
({Token.kind == number,Token.category==CD, Token.length == "1"})
Macro: DAY_TWO
({Token.kind == number,Token.category==CD, Token.length == "2"})
Macro: YEAR
({Token.kind == number,Token.category==CD, Token.length == "4"})
Macro: MONTH
({Lookup.majorType=="Month"})
Rule: ddmmyyydash
(
(DAY_ONE|DAY_TWO)
({Token.string == ","}|{Token.string == "."} |{Token.string == "-"})
(MONTH)
({Token.string == ","}|{Token.string == "."} |{Token.string == "-"})
(YEAR)
)
:ddmmyyyydash
-->
:ddmmyyyydash.DateMonthYearDash= {rule = "ddmmyyyydash"}
Can someone please help me with what I should do to make sure that DateMonthYearDash is created as a new annotation set? How do I do it? Thanks a lot.
When I change the outputASName of the JAPE transducer, the new set does not appear like the rest.
As said, linked or quoted in the question you mention (GATE - How to create a new annotation SET?), you have two options:
Change the outputASName of your JAPE transducer PR.
Use Annotation Set Transfer PR to copy or move desired annotations from one annotation set to another one.
JAPE function - explanation
JAPE transducer (similarly to many other GATE PRs) simply takes some input annotations and based on them it creates some new output annotations. The input and output annotation sets names can be configured by inputASName and outputASName run-time parameters. inputASName says where it should look for input annotations and outputASName says where it should put output annotations to.
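If you run the pipeline from Java (GATE Embedded) rather than from GATE Developer, the same run-time parameter can be set programmatically. A minimal sketch, assuming Gate.init() has already been called, the plugin providing the stock JAPE transducer (gate.creole.Transducer) is loaded, and the grammar path and set name below are placeholders:
import java.net.URL;
import gate.Factory;
import gate.FeatureMap;
import gate.ProcessingResource;

public class JapeSetup {
    // A sketch only: the grammar path and the annotation set name are placeholders.
    static ProcessingResource createTransducer() throws Exception {
        FeatureMap initParams = Factory.newFeatureMap();
        initParams.put("grammarURL", new URL("file:/path/to/main.jape"));
        ProcessingResource jape =
            (ProcessingResource) Factory.createResource("gate.creole.Transducer", initParams);
        // direct everything the grammar creates into a named set instead of the default set
        jape.setParameterValue("outputASName", "MyDateAnnotations");
        return jape;
    }
}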
What should be where
The input annotation set must contain the necessary input annotations before the JAPE transducer PR is executed. These annotations are usually created by preceding PRs in the pipeline. Otherwise it will not see the necessary input annotations and it will not produce anything.
The output annotation set may be empty or it may contain anything before the JAPE execution; it doesn't matter. The important thing is that the new output annotations (DateMonthYearDash in your case) are created there when the JAPE transducer PR execution finishes. So after a successful JAPE execution you should see the new annotations there.
Some terminology
Note that annotation sets have names, while annotations have a type, an id, offsets, features and an annotation set they belong to.
JAPE correction
I found some issues in your JAPE grammar:
Don't include SpaceToken unless you explicitly use it in your grammar or you are sure there will be none inside the pattern. See also: Concept of Space Token in JAPE
({Lookup.majorType=="Month"}) -> ({Lookup.minorType=="month"})
(DAY_ONE|DAY_TWO) -> (DAY_ONE)
JAPE grammar after corrections (tested after the ANNIE pipeline on the document text 9 - January - 2017):
Phase: datesearching
Input: Token Lookup
Options: control = appelt
Macro: DAY_ONE
({Token.kind == number,Token.category==CD, Token.length == "1"})
Macro: YEAR
({Token.kind == number,Token.category==CD, Token.length == "4"})
Macro: MONTH
({Lookup.minorType=="month"})
Rule: ddmmyyydash
(
(DAY_ONE)
({Token.string == ","}|{Token.string == "."} |{Token.string == "-"})
(MONTH)
({Token.string == ","}|{Token.string == "."} |{Token.string == "-"})
(YEAR)
)
:ddmmyyyydash
-->
:ddmmyyyydash.DateMonthYearDash= {rule = "ddmmyyyydash"}
What to do when JAPE does not produce anything
You have to investigate the input annotations and "debug" your JAPE grammar. Usually some expected input annotation is missing, or there is some extra annotation you did not expect to be there. There is a nice view in GATE for this purpose: the annotation stack. Also, some features of the input annotations can have a different name or value than you expected (e.g. which is correct: {Lookup.majorType=="Month"} or {Lookup.minorType=="month"}?).
By "debugging" a JAPE grammar I mean: try to simplify the rule as far as it starts working. Keep trying it on a simple document where it should match for sure. So in your case you can try it without the (DAY_ONE) part. If it still doesn't work, try only (MONTH)({Token.string == "-"})(YEAR), or even (MONTH) only, etc. Until you find the mistake in the grammar...

How to merge two ASTs?

I'm trying to implement a tool for merging different versions of some source code. Given two versions of the same source code, the idea is to parse them, generate the respective Abstract Syntax Trees (ASTs), and finally merge them into a single output source while keeping grammatical consistency - the lexer and parser are those of the question ANTLR: How to skip multiline comments.
I know there is a class ParserRuleReturnScope that helps... but getStop() and getStart() always return null :-(
Here is a snippet that illustrates how I modified my parser to get rules printed:
parser grammar CodeTableParser;

options {
    tokenVocab = CodeTableLexer;
    backtrack = true;
    output = AST;
}

@header {
    package ch.bsource.ice.parsers;
}

@members {
    private void log(ParserRuleReturnScope rule) {
        System.out.println("Rule: " + rule.getClass().getName());
        System.out.println(" getStart(): " + rule.getStart());
        System.out.println(" getStop(): " + rule.getStop());
        System.out.println(" getTree(): " + rule.getTree());
    }
}

parse
    : codeTabHeader codeTable endCodeTable eof { log(retval); }
    ;

codeTabHeader
    : comment CodeTabHeader^ { log(retval); }
    ;
...
Assuming you have the ASTs (often difficult to get in the first place; parsing real languages is often harder than it looks), you first have to determine what they have in common and build a mapping collecting that information. That's not as easy as it looks: do you count a block of code that has moved, but is the exact same subtree, as "common"? What about two subtrees that are the same except for a consistent renaming of an identifier? What about changed comments? (Most ASTs lose the comments; most programmers will think this is a really bad idea.)
You can build a variation of the "Longest Common Substring" algorithm to compare trees. I've used that in tools that I have built.
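As a toy illustration of the kind of subtree comparison such an algorithm is built on, here is a sketch with a made-up Node class (not tied to any particular parser's AST API):
// Hypothetical minimal AST node, only for illustrating structural comparison.
final class Node {
    final String label;
    final java.util.List<Node> children;

    Node(String label, java.util.List<Node> children) {
        this.label = label;
        this.children = children;
    }
}

final class TreeCompare {
    // Two subtrees count as "common" here only if labels and shapes match exactly;
    // a real merge tool also has to decide about moved blocks, renamed identifiers
    // and comments, as discussed above.
    static boolean sameSubtree(Node a, Node b) {
        if (!a.label.equals(b.label) || a.children.size() != b.children.size()) {
            return false;
        }
        for (int i = 0; i < a.children.size(); i++) {
            if (!sameSubtree(a.children.get(i), b.children.get(i))) {
                return false;
            }
        }
        return true;
    }
}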
Finally, after you've merged the trees, you need to regenerate the text, ideally preserving most of the layout of the original code. (Programmers hate it when you change the layout they so lovingly produced.) So your ASTs need to capture position information, and your regeneration has to honor that where it can.
The call to log(retval) in your parser code looks like it's going to happen at the end of the rule, but it's not. You'll want to move the call into an @after block.
I changed log to spit out a message as well as the scope information and added calls to it to my own grammar like so:
script
  @init {log("@init", retval);}
  @after {log("@after", retval);}
  : statement* EOF {log("after last rule reference", retval);}
  -> ^(STMTS statement*)
  ;
Parsing test input produced the following output:
Logging from @init
 getStart(): [@0,0:4='Print',<10>,1:0]
 getStop(): null
 getTree(): null
Logging from after last rule reference
 getStart(): [@0,0:4='Print',<10>,1:0]
 getStop(): null
 getTree(): null
Logging from @after
 getStart(): [@0,0:4='Print',<10>,1:0]
 getStop(): [@4,15:15='<EOF>',<-1>,1:15]
 getTree(): STMTS
The call in the after block has both the stop and tree fields populated.
I can't say whether this will help you with your merging tool, but I think this will at least get you past the problem with the half-populated scope object.

ANTLR: "missing attribute access on rule scope" problem

I am trying to build an ANTLR grammar that parses tagged sentences such as:
DT The NP cat VB ate DT a NP rat
and have the grammar:
fragment TOKEN : (('A'..'Z') | ('a'..'z'))+;
fragment WS : (' ' | '\t')+;
WSX : WS;
DTTOK : ('DT' WS TOKEN);
NPTOK : ('NP' WS TOKEN);
nounPhrase: (DTTOK WSX NPTOK);
chunker : nounPhrase {System.out.println("chunk found "+"("+$nounPhrase+")");};
The grammar generator reports "missing attribute access on rule scope: nounPhrase" for the last line.
[I am still new to ANTLR and although some grammars work it's still trial and error. I also frequently get an "OutOfMemory" error when running grammars as small as this - any help welcome.]
I am using ANTLRWorks 1.3 to generate the code and am running under Java 1.6.
"missing attribute access" means that you've referenced a scope ($nounPhrase) rather than an attribute of the scope (such as $nounPhrase.text).
In general, a good way to troubleshoot problems with attributes is to look at the generated parser method for the rule in question.
For example, my initial attempt at creating a new rule when I was a little rusty:
multiple_names returns [List<Name> names]
@init {
    names = new ArrayList<Name>(4);
}
    : a=fullname ' AND ' b=fullname { names.add($a.value); names.add($b.value); };
resulted in "unknown attribute for rule fullname". So I tried
multiple_names returns [List<Name> names]
@init {
    names = new ArrayList<Name>(4);
}
    : a=fullname ' AND ' b=fullname { names.add($a); names.add($b); };
which results in "missing attribute access". Looking at the generated parser method made it clear what I needed to do though. While there are some cryptic pieces, the parts relevant to scopes (variables) are easily understood:
public final List<Name> multiple_names() throws RecognitionException {
    List<Name> names = null; // based on "returns" clause of rule definition
    Name a = null;           // based on scopes declared in rule definition
    Name b = null;           // based on scopes declared in rule definition

    names = new ArrayList<Name>(4); // snippet inserted from `@init` block

    try {
        pushFollow(FOLLOW_fullname_in_multiple_names42);
        a=fullname();
        state._fsp--;

        match(input,189,FOLLOW_189_in_multiple_names44);
        pushFollow(FOLLOW_fullname_in_multiple_names48);
        b=fullname();
        state._fsp--;

        names.add($a); names.add($b); // code inserted from {...} block
    }
    catch (RecognitionException re) {
        reportError(re);
        recover(input,re);
    }
    finally {
        // do for sure before leaving
    }
    return names; // based on "returns" clause of rule definition
}
After looking at the generated code, it's easy to see that the fullname rule is returning instances of the Name class, so what I needed in this case was simply:
multiple_names returns [List<Name> names]
@init {
    names = new ArrayList<Name>(4);
}
    : a=fullname ' AND ' b=fullname { names.add(a); names.add(b); };
The version you need in your situation may be different, but you'll generally be able to figure it out pretty easily by looking at the generated code.
In the original grammar, why not include the attribute it is asking for? Most likely:
chunker : nounPhrase {System.out.println("chunk found "+"("+$nounPhrase.text+")");};
Each of your rules (chunker being the one I can spot quickly) has attributes (extra information) associated with it. You can find a quick list of the different attributes for the different types of rules at http://www.antlr.org/wiki/display/ANTLR3/Attribute+and+Dynamic+Scopes. It would be nice if descriptions were put on that page for each of those attributes (for instance, the start and stop attributes of parser rules refer to tokens from your lexer, which would allow you to get back to your line number and position).
I think your chunker rule should just be changed slightly: instead of $nounPhrase you should use $nounPhrase.text. text is an attribute of your nounPhrase rule.
You might want to do a little other formatting as well; usually the parser rules (starting with a lowercase letter) appear before the lexer rules (starting with an uppercase letter).
PS. When I type in the answer box, the chunker rule starts on a new line, but in my original answer it didn't.
If you accidentally do something silly like $thing.$attribute where you mean $thing.attribute, you will also see the missing attribute access on rule scope error message. (I know this question was answered a long time ago, but this bit of trivia might help someone else who sees the error message!)
Answering my own question after having found a better way...
WS : (' '|'\t')+;
TOKEN : (('A'..'Z') | ('a'..'z'))+;
dttok : 'DT' WS TOKEN;
nntok : 'NN' WS TOKEN;
nounPhrase : (dttok WS nntok);
chunker : nounPhrase ;
The problem was that I was getting muddled between the lexer and the parser (this is apparently very common). The uppercase items are lexer rules; the lowercase ones are parser rules. This now seems to work. (NB: I have changed NP to NN.)

Regex Named Groups in Java

It is my understanding that the java.util.regex package does not have support for named groups (http://www.regular-expressions.info/named.html), so can anyone point me towards a third-party library that does?
I've looked at jregex but its last release was in 2002 and it didn't work for me (admittedly I only tried briefly) under java5.
(Update: August 2011)
As geofflane mentions in his answer, Java 7 now supports named groups.
tchrist points out in the comments that the support is limited.
He details the limitations in his great answer "Java Regex Helper"
Java 7 regex named group support was presented back in September 2010 in Oracle's blog.
In the official release of Java 7, the constructs to support named capturing groups are (a short example follows this list):
(?<name>capturing text) to define a named group "name"
\k<name> to backreference a named group "name"
${name} to reference a captured group in Matcher's replacement string
Matcher.group(String name) to return the captured input subsequence by the given "named group".
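A short, self-contained example of these constructs (the group names login and id are arbitrary):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupDemo {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("(?<login>\\w+) (?<id>\\d+)");
        Matcher m = p.matcher("TEST 123");
        if (m.matches()) {
            System.out.println(m.group("login"));           // TEST
            System.out.println(m.group("id"));              // 123
        }
        System.out.println(m.replaceAll("${id}_${login}")); // 123_TEST
    }
}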
Other alternatives for pre-Java 7 were:
Google named-regex (see John Hardy's answer)
Gábor Lipták mentions (November 2012) that this project might not be active (with several outstanding bugs), and its GitHub fork could be considered instead.
jregex (See Brian Clozel's answer)
(Original answer: Jan 2009, with the next two links now broken)
You cannot refer to named groups unless you code your own version of Regex...
That is precisely what Gorbush2 did in this thread.
Regex2
(a limited implementation, as pointed out again by tchrist, since it looks only for ASCII identifiers. tchrist details the limitation as:
only being able to have one named group per same name (which you don’t always have control over!) and not being able to use them for in-regex recursion.
Note: You can find true regex recursion examples in Perl and PCRE regexes, as mentioned in Regexp Power, PCRE specs and Matching Strings with Balanced Parentheses slide)
Example:
String:
"TEST 123"
RegExp:
"(?<login>\\w+) (?<id>\\d+)"
Access
matcher.group(1) ==> TEST
matcher.group("login") ==> TEST
matcher.name(1) ==> login
Replace
matcher.replaceAll("aaaaa_$1_sssss_$2____") ==> aaaaa_TEST_sssss_123____
matcher.replaceAll("aaaaa_${login}_sssss_${id}____") ==> aaaaa_TEST_sssss_123____
(extract from the implementation)
public final class Pattern
    implements java.io.Serializable
{
    [...]
    /**
     * Parses a group and returns the head node of a set of nodes that process
     * the group. Sometimes a double return system is used where the tail is
     * returned in root.
     */
    private Node group0() {
        boolean capturingGroup = false;
        Node head = null;
        Node tail = null;
        int save = flags;
        root = null;
        int ch = next();
        if (ch == '?') {
            ch = skip();
            switch (ch) {
            case '<': // (?<xxx) look behind or group name
                ch = read();
                int start = cursor;
                [...]
                // test forGroupName
                int startChar = ch;
                while(ASCII.isWord(ch) && ch != '>') ch=read();
                if(ch == '>'){
                    // valid group name
                    int len = cursor-start;
                    int[] newtemp = new int[2*(len) + 2];
                    //System.arraycopy(temp, start, newtemp, 0, len);
                    StringBuilder name = new StringBuilder();
                    for(int i = start; i< cursor; i++){
                        name.append((char)temp[i-1]);
                    }
                    // create Named group
                    head = createGroup(false);
                    ((GroupTail)root).name = name.toString();
                    capturingGroup = true;
                    tail = root;
                    head.next = expr(tail);
                    break;
                }
For people coming to this late: Java 7 adds named groups. Matcher.group(String groupName) documentation.
Yes, but it's messy hacking the Sun classes. There is a simpler way:
http://code.google.com/p/named-regexp/
named-regexp is a thin wrapper for the standard JDK regular expressions implementation, with the single purpose of handling named capturing groups in the .NET style: (?<name>...). It can be used with Java 5 and 6 (generics are used). Java 7 will handle named capturing groups, so this project is not meant to last.
What kind of problem do you get with jregex?
It worked well for me under java5 and java6.
Jregex does the job well (even if the last version is from 2002), unless you want to wait for javaSE 7.
For those running pre-Java 7, named groups are supported by joni (the Java port of the Oniguruma regexp library). Documentation is sparse, but it has worked well for us.
Binaries are available via Maven (http://repository.codehaus.org/org/jruby/joni/joni/).
A bit of an old question, but I found myself needing this as well and found that the suggestions above were inadequate - and as such, I developed a thin wrapper myself: https://github.com/hofmeister/MatchIt
