It is my understanding that the java.regex package does not have support for named groups (http://www.regular-expressions.info/named.html) so can anyone point me towards a third-party library that does?
I've looked at jregex but its last release was in 2002 and it didn't work for me (admittedly I only tried briefly) under java5.
(Update: August 2011)
As geofflane mentions in his answer, Java 7 now support named groups.
tchrist points out in the comment that the support is limited.
He details the limitations in his great answer "Java Regex Helper"
Java 7 regex named group support was presented back in September 2010 in Oracle's blog.
In the official release of Java 7, the constructs to support the named capturing group are:
(?<name>capturing text) to define a named group "name"
\k<name> to backreference a named group "name"
${name} to reference to captured group in Matcher's replacement string
Matcher.group(String name) to return the captured input subsequence by the given "named group".
Other alternatives for pre-Java 7 were:
Google named-regex (see John Hardy's answer)
Gábor Lipták mentions (November 2012) that this project might not be active (with several outstanding bugs), and its GitHub fork could be considered instead.
jregex (See Brian Clozel's answer)
(Original answer: Jan 2009, with the next two links now broken)
You can not refer to named group, unless you code your own version of Regex...
That is precisely what Gorbush2 did in this thread.
Regex2
(limited implementation, as pointed out again by tchrist, as it looks only for ASCII identifiers. tchrist details the limitation as:
only being able to have one named group per same name (which you don’t always have control over!) and not being able to use them for in-regex recursion.
Note: You can find true regex recursion examples in Perl and PCRE regexes, as mentioned in Regexp Power, PCRE specs and Matching Strings with Balanced Parentheses slide)
Example:
String:
"TEST 123"
RegExp:
"(?<login>\\w+) (?<id>\\d+)"
Access
matcher.group(1) ==> TEST
matcher.group("login") ==> TEST
matcher.name(1) ==> login
Replace
matcher.replaceAll("aaaaa_$1_sssss_$2____") ==> aaaaa_TEST_sssss_123____
matcher.replaceAll("aaaaa_${login}_sssss_${id}____") ==> aaaaa_TEST_sssss_123____
(extract from the implementation)
public final class Pattern
implements java.io.Serializable
{
[...]
/**
* Parses a group and returns the head node of a set of nodes that process
* the group. Sometimes a double return system is used where the tail is
* returned in root.
*/
private Node group0() {
boolean capturingGroup = false;
Node head = null;
Node tail = null;
int save = flags;
root = null;
int ch = next();
if (ch == '?') {
ch = skip();
switch (ch) {
case '<': // (?<xxx) look behind or group name
ch = read();
int start = cursor;
[...]
// test forGroupName
int startChar = ch;
while(ASCII.isWord(ch) && ch != '>') ch=read();
if(ch == '>'){
// valid group name
int len = cursor-start;
int[] newtemp = new int[2*(len) + 2];
//System.arraycopy(temp, start, newtemp, 0, len);
StringBuilder name = new StringBuilder();
for(int i = start; i< cursor; i++){
name.append((char)temp[i-1]);
}
// create Named group
head = createGroup(false);
((GroupTail)root).name = name.toString();
capturingGroup = true;
tail = root;
head.next = expr(tail);
break;
}
For people coming to this late: Java 7 adds named groups. Matcher.group(String groupName) documentation.
Yes but its messy hacking the sun classes. There is a simpler way:
http://code.google.com/p/named-regexp/
named-regexp is a thin wrapper for the
standard JDK regular expressions
implementation, with the single
purpose of handling named capturing
groups in the .net style :
(?...).
It can be used with Java 5 and 6
(generics are used).
Java 7 will handle named capturing
groups , so this project is not meant
to last.
What kind of problem do you get with jregex?
It worked well for me under java5 and java6.
Jregex does the job well (even if the last version is from 2002), unless you want to wait for javaSE 7.
For those running pre-java7, named groups are supported by joni (Java port of the Oniguruma regexp library). Documentation is sparse, but it has worked well for us.
Binaries are available via Maven (http://repository.codehaus.org/org/jruby/joni/joni/).
A bit old question but I found myself needing this also and that the suggestions above were inaduquate - and as such - developed a thin wrapper myself: https://github.com/hofmeister/MatchIt
Related
I have written a program which is going to read from csv file using a delimiter and at the same time I have a use case where I needed to create a string from of the delimited data so for that I have created a regex to split the column data using the delimiter.
Now the challenge is when delimiter is present in in double quotes ideally I should not be splitting the data, spark is escaping that delimiter but my regex somehow are not.
private static void readFromSourceFile(SparkSession sparkSession) {
String delType = ",";
final String regex = "["+delType+ "]{"+delType.length()+"}(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
Dataset<Row> csv = sparkSession
.read().option("delimiter",delType)
.option("header",false)
.option("inferSchema",true)
.csv("src/main/resources/quotes2.csv");
char separator= '\u0001';
csv.show(false);
List<Row> df = csv.collectAsList();
String split[] = df.get(0).toString().split(regex);
System.out.println(split.length);
Arrays.stream(split).forEach(System.out::println);
}
The O/P for the program is -
The Red marked area is the string with double quotes, it shouldn't have split that column.
Input file csv file -
New,667.88,In Stock.,Now,true,true,B09D7MQ69X,B09D7MQ69X,NUC10i5FNHN 16GB+512GB,"Intel NUC10 NUC10i5FNHN Home & Business Desktop Mini PC,10th Generation Intel® Core™ i5-10210U, Upto 4.2 GHz, 4 core, 8 Thread, 25W Intel® UHD Graphics, 16GB RAM, 512GB PCIe SSD, Win 10 Pro 8GB RAM + 256GB SSD",false,"【Intel NUC10i5FNHN with RAM & SSD】 Intel NUC10 NUC10i5FNHN Mini PC/HTPC With All New Parts Assembled. Our store is HOT selling Intel NUC11 i5, i7, NUC10 i5, i7, NUC8, Barebone and Mini PC with various sizes of RAM or SSD. If you need to know more, please click on our Store Name:""GEEK + Computer Mall"" --------- ""Products"", OR click ""Visit the GEEK+ Store"" under the title.:BRK:【Quad Core Processor & Graphic 】 10th Generation Intel Core i5-10210U,1.6 GHz – 4.2 GHz Turbo, 4 core, 8 thread, 6MB Cache,25W Intel UHD Graphics, up to 1.0 GHz, 80 EU units.:BRK:【Storage Expansion Options】 Kingston 16GB DDR4 RAM"
Can someone suggest or provide a hint to improve the regex.
I find that splitting by complex delimiters leads to convoluted regular expressions, as your code showcases. In fact, the sub-expression "["+delType+ "]{"+delType.length()+"}" doesn’t make a lot of sense, and I strongly suspect this is a bug in your code (for instance if delType is <> your code would also split on occurrences of ><).
As an alternative, consider using a regular expression that exhaustively describes the lexical syntax of your input, and then match all tokens. This works particularly well when using named groups.
In your case (CSV with quoted fields, using doubled-up quotes to escape them), the lexical syntax of the tokens can be described by the following token types:
A delimiter (which is configurable, so we might need to handle strings of length > 1)
A quoted field (arbitrary tokens surrounded by "…", where the characters in … can be anything except ", but they can also include "")
An unquoted field (arbitrary tokens up to the next delimiter).
As a regular expression, this can be written as follows in Java:
Pattern.compile(
"(?<delim>" + d + ")|" +
"\"(?<quotedField>(?:[^\"]|\"\")*)\"|" +
"(?<field>.*?(?:(?=" + d + ")|$))"
);
Where d is defined as Pattern.quote(delim) (the quoting is important, in case the delimiter contains a regex special char!).
The only slight complication here is the last token type, because we are matching everything up to the next delimiter (.*? matches non-greedily), or until the end of the string.
Afterwards, we iterate over all matches and collect those where either the group field or quotedField is set. Putting it all together inside a method:
static String[] parseCsvRow(String row, String delim) {
final String d = Pattern.quote(delim);
final Pattern pattern = Pattern.compile(
"(?<delim>" + d + ")|" +
"\"(?<quotedField>(?:[^\"]|\"\")*)\"|" +
"(?<field>.*?(?:(?=" + d + ")|$))"
);
final Matcher matcher = pattern.matcher(row);
final List<String> results = new ArrayList<>();
while (matcher.find()) {
if (matcher.group("field") != null) {
results.add(matcher.group("field"));
} else if (matcher.group("quotedField") != null) {
results.add(matcher.group("quotedField").replaceAll("\"\"", "\""));
}
}
return results.toArray(new String[0]);
}
In real code I would wrap this in a CsvParser class instead of a single method, where the constructor creates the pattern based on the delimiter so that the pattern doesn’t have to be recompiled for each row.
I was wondering is someone can help me out here. I think this could be of use for anyone trying to conduct machine learning on GATE (General Architecture for Text Engineering). So basically to conduct machine learning I first need to add some code to a few jape files so my output XML file would print out the Annotation Id value as a feature. An example is provided below:
<Annotation Id="1491" Type="Person" StartNode="288" EndNode="301">
<Feature>
<Name className="java.lang.String">id</Name>
<Value className="java.lang.String">1491</Value>
</Feature>
(Note that the feature value of 1491 matches the Annotation Id="1491". This is what I want.)
WHY I NEED THIS:
I am conducting machine learning on a plain text document that initially contains no annotation. I am using the June 2012 training course that is on the GATE website as a guide while doing this. I am specifically following the Module 11: Relations tutorial (it finds employment relationships between person and organization). I utilize the corpus of 93 pre-annotated documents for training and then apply that learned module on my document. But first I run my document through ANNIE. It creates many annotations and features but not everything that I need for machine learning. I've learned through trial/error and investigation that my annotated document must contain features with the Annotation Id for every "Person" and "Organization" type. I recognize that the configuration file (relations-config.xml) that is used in the Batch Learning PR looks for id features for "Person" and "Organization" types. It will not run if these ID features are not present. So I add this manually and then run it through the machine learning "APPLICATION" mode. It works rather nicely. However I clearly do not want to add the id features to my XML file manually every time.
WHAT I HAVE FIGURED OUT WITH THE GATE CODE:
I believe I have found the code files (final.jape, org_context.jape and name_context.jape) that I need to alter so they can add that id feature to every annotation that contains "Person" and "Organization". I don't understand the language that GATE uses very well (I'm a mechanical engineer, not a software engineer) and this is probably why I can't figure this out (Ha!). Anyhow, I could be off and may need to add a few more lines in for the jape file to work properly, but I feel like I've pinpointed it pretty closely. There are two sections of code that are similar but slightly different, which are currently the bane of my existence. The first one goes through an iterator loop, the second one does not. I copy/pasted those 2 those below with a line stating WHAT_DO_I_PUT_HERE that indicate where I think my problem and solution lies. I would be very grateful if someone can help me with what I need to write to get my result.
Thank you!
- Todd
//////////// First section of code ////////////////
Rule: PersonFinal
Priority: 30
//({JobTitle}
//)?
(
{TempPerson.kind == personName}
)
:person
-->
{
gate.FeatureMap features = Factory.newFeatureMap();
gate.AnnotationSet personSet = (gate.AnnotationSet)bindings.get("person");
gate.Annotation person1Ann = (gate.Annotation)personSet.iterator().next();
gate.AnnotationSet firstPerson = (gate.AnnotationSet)personSet.get("TempPerson");
if (firstPerson != null && firstPerson.size()>0)
{
gate.Annotation personAnn = (gate.Annotation)firstPerson.iterator().next();
if (personAnn.getFeatures().containsKey("gender")) features.put("gender", personAnn.getFeatures().get("gender"));
}
features.put("id", WHAT_DO_I_PUT_HERE.getId().toString());
features.put("rule1", person1Ann.getFeatures().get("rule"));
features.put("rule", "PersonFinal");
outputAS.add(personSet.firstNode(), personSet.lastNode(), "Person", features);
outputAS.removeAll(personSet);
}
//////////// Second section of code ////////////////
Rule:OrgContext1
Priority: 1
// company X
// company called X
(
{Token.string == "company"}
(({Token.string == "called"}|
{Token.string == "dubbed"}|
{Token.string == "named"}
)
)?
)
(
{Unknown.kind == PN}
)
:org
-->
{
gate.AnnotationSet org = (gate.AnnotationSet) bindings.get("org");
gate.FeatureMap features = Factory.newFeatureMap();
features.put("id", WHAT_DO_I_PUT_HERE.getId().toString());
features.put("rule ", "OrgContext1");
outputAS.add(org.firstNode(), org.lastNode(), "Organization", features);
outputAS.removeAll(org);
}
You cannot access the annotation id before the actual annotation is created. My solution of this problem:
Rule:PojemId
(
{PojemD}
):pojem
-->
{
AnnotationSet matchedAnns = bindings.get("pojem");
Annotation ann = matchedAnns.get("PojemD").iterator().next();
FeatureMap pojemFeatures = ann.getFeatures();
gate.FeatureMap features = Factory.newFeatureMap();
features.putAll(pojemFeatures);
features.put("annId", ann.getId());
inputAS.remove(ann);
Integer id = outputAS.add(matchedAnns.firstNode(), matchedAnns.lastNode(), "PojemD", features);
features.put("id", id);
}
It's quite simple. You have to mark the annotation on the Right Hand Side (RHS) of the rule by some label (token_match in my example bellow) and then, on the Left Hand Side (LHS) of the rule, just obtain corresponding AnnotationSet form bindings variable and iterate through annotations (usually there is only a single annotation in it) and copy corresponding IDs to the output.
Phase: Main
Input: Token
Rule: WriteTokenID
(
({Token}):token_match
)
-->
{
AnnotationSet as = bindings.get("token_match");
for (Annotation a : as)
{
FeatureMap features = Factory.newFeatureMap();
features.put("origTokenId", a.getId());
outputAS.add(a.getStartNode(), a.getEndNode(), "NewToken", features);
}
}
In your code, you probably want to mark {TempPerson.kind == personName}and {Unknown.kind == PN} somehow like bellow.
(
({TempPerson.kind == personName}):temp_person
)
:person
and
(
{Token.string == "company"}
(({Token.string == "called"}|
{Token.string == "dubbed"}|
{Token.string == "named"}
)
)?
)
(
({Unknown.kind == PN}):unknown_org
)
:org
And them use bindings.get("temp_person") and bindings.get("unknown_org") respectively.
I have to code a Lexer in Java for a dialect of BASIC.
I group all the TokenType in Enum
public enum TokenType {
INT("-?[0-9]+"),
BOOLEAN("(TRUE|FALSE)"),
PLUS("\\+"),
MINUS("\\-"),
//others.....
}
The name is the TokenType name and into the brackets there is the regex that I use to match the Type.
If i want to match the INT type i use "-?[0-9]+".
But now i have a problem. I put into a StringBuffer all the regex of the TokenType with this:
private String pattern() {
StringBuffer tokenPatternsBuffer = new StringBuffer();
for(TokenType token : TokenType.values())
tokenPatternsBuffer.append("|(?<" + token.name() + ">" + token.getPattern() + ")");
String tokenPatternsString = tokenPatternsBuffer.toString().substring(1);
return tokenPatternsString;
}
So it returns a String like:
(?<INT>-?[0-9]+)|(?<BOOLEAN>(TRUE|FALSE))|(?<PLUS>\+)|(?<MINUS>\-)|(?<PRINT>PRINT)....
Now i use this string to create a Pattern
Pattern pattern = Pattern.compile(STRING);
Then I create a Matcher
Matcher match = pattern.match("line of code");
Now i want to match all the TokenType and group them into an ArrayList of Token. If the code syntax is correct it returns an ArrayList of Token (Token name, value).
But i don't know how to exit the while-loop if the syntax is incorrect and then Print an Error.
This is a piece of code used to create the ArrayList of Token.
private void lex() {
ArrayList<Token> tokens = new ArrayList<Token>();
int tokenSize = TokenType.values().length;
int counter = 0;
//Iterate over the arrayLinee (ArrayList of String) to get matches of pattern
for(String linea : arrayLinee) {
counter = 0;
Matcher match = pattern.matcher(linea);
while(match.find()) {
System.out.println(match.group(1));
counter = 0;
for(TokenType token : TokenType.values()) {
counter++;
if(match.group(token.name()) != null) {
tokens.add(new Token(token , match.group(token.name())));
counter = 0;
continue;
}
}
if(counter==tokenSize) {
System.out.println("Syntax Error in line : " + linea);
break;
}
}
tokenList.add("EOL");
}
}
The code doesn't break if the for-loop iterate over all TokenType and doesn't match any regex of TokenType. How can I return an Error if the Syntax isn't correct?
Or do you know where I can find information on developing a lexer?
All you need to do is add an extra "INVALID" token at the end of your enum type with a regex like ".+" (match everything). Because the regexs are evaluated in order, it will only match if no other token was found. You then check to see if the last token in your list was the INVALID token.
If you are working in Java, I recommend trying out ANTLR 4 for creating your lexer. The grammar syntax is much cleaner than regular expressions, and the lexer generated from your grammar will automatically support reporting syntax errors.
If you are writing a full lexer, I'd recommend use an existing grammar builder. Antlr is one solution but I personally recommend parboiled instead, which allows to write grammars in pure Java.
Not sure if this was answered, or you came to an answer, but a lexer is broken into two distinct phases, the scanning phase and the parsing phase. You can combine them into one single pass (regex matching) but you'll find that a single pass lexer has weaknesses if you need to do anything more than the most basic of string translations.
In the scanning phase you're breaking the character sequence apart based on specific tokens that you've specified. What you should have done was included an example of the text you were trying to parse. But Wiki has a great example of a simple text lexer that turns a sentence into tokens (eg. str.split(' ')). So with the scanner you're going to tokenize the block of text into chunks by spaces(this should be the first action almost always) and then you're going to tokenize even further based on other tokens, such as what you're attempting to match.
Then the parsing/evaluation phase will iterate over each token and decide what to do with each token depending on the business logic, syntax rules etc., whatever you set it. This could be expressing some sort of math function to perform (eg. max(3,2)), or a more common example is for query language building. You might make a web app that has a specific query language (SOLR comes to mind, as well as any SQL/NoSQL DB) that is translated into another language to make requests against a datasource. Lexers are commonly used in IDE's for code hinting and auto-completion as well.
This isn't a code-based answer, but it's an answer that should give you an idea on how to tackle the problem.
I've used the java compiler tree api to generate the ast for java source files. However, i'm unable to access th comments in the source files.
So far, i've been unable to find a way to extract comments from source file .. is there a way using the compiler api or some other tool ?
Our SD Java Front End is a Java parser that builds ASTs (and optionally symbol tables). It captures comments directly on tree nodes.
The Java Front End is a member of a family of compiler langauge front ends (C, C++, C#, COBOL, JavaScript, ...) all of which are supported by DMS Software Reengineering Toolkit. DMS is designed to process languages for the purposes of transformation, and thus can capture comments, layout and formats to enable regeneration of code preserving the original layout as much as possible.
EDIT 3/29/2012: (in contrast to answer posted for doing this with ANTLR)
To get a comment on an AST node in DMS, one calls the DMS (lisp-like) function
(AST:GetComments <node>)
which provide access to the array of comments associated with the AST node. One can inquire about the length of this array (may be null), or for each array element, ask for any of these properties: (AST:Get... FileIndex, Line, Column, EndLine, EndColumn, String (exact Unicode comment content).
The comments obtained through getCommentList method of CompilationUnit will not have the comment body. Also the comments will not be visited, during and AST Visit. Inorder to visit the comments we have call accept method for each comment in the Comment List.
for (Comment comment : (List<Comment>) compilationUnit.getCommentList()) {
comment.accept(new CommentVisitor(compilationUnit, classSource.split("\n")));
}
The body of the comments can be obtained using some simple logic. In the below AST Visitor for comments, we need to specify the Complied class unit and the source code of the class during initialization.
import org.eclipse.jdt.core.dom.ASTNode;
import org.eclipse.jdt.core.dom.ASTVisitor;
import org.eclipse.jdt.core.dom.BlockComment;
import org.eclipse.jdt.core.dom.CompilationUnit;
import org.eclipse.jdt.core.dom.LineComment;
public class CommentVisitor extends ASTVisitor {
CompilationUnit compilationUnit;
private String[] source;
public CommentVisitor(CompilationUnit compilationUnit, String[] source) {
super();
this.compilationUnit = compilationUnit;
this.source = source;
}
public boolean visit(LineComment node) {
int startLineNumber = compilationUnit.getLineNumber(node.getStartPosition()) - 1;
String lineComment = source[startLineNumber].trim();
System.out.println(lineComment);
return true;
}
public boolean visit(BlockComment node) {
int startLineNumber = compilationUnit.getLineNumber(node.getStartPosition()) - 1;
int endLineNumber = compilationUnit.getLineNumber(node.getStartPosition() + node.getLength()) - 1;
StringBuffer blockComment = new StringBuffer();
for (int lineCount = startLineNumber ; lineCount<= endLineNumber; lineCount++) {
String blockCommentLine = source[lineCount].trim();
blockComment.append(blockCommentLine);
if (lineCount != endLineNumber) {
blockComment.append("\n");
}
}
System.out.println(blockComment.toString());
return true;
}
public void preVisit(ASTNode node) {
}
}
Edit: Moved splitting of source out of the visitor.
Just for the record. Now with Java 8 you have a whole interface to play with comments and documentation details here.
You might to use a different tool, like ANTLR's Java grammar. javac has no use for comments, and is likely to discard them completely. The parsers upon which tools like IDEs are built are more likely to retain comments in their AST.
Managed to solve the problem by using the getsourceposition() and some string manipulation (no regex was needed)
I would like a regular expression that will extract email addresses from a String (using Java regular expressions).
That really works.
Here's the regular expression that really works.
I've spent an hour surfing on the web and testing different approaches,
and most of them didn't work although Google top-ranked those pages.
I want to share with you a working regular expression:
[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*#[A-Za-z0-9]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})
Here's the original link:
http://www.mkyong.com/regular-expressions/how-to-validate-email-address-with-regular-expression/
I had to add some dashes to allow for them. So a final result in Javanese:
final String MAIL_REGEX = "([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*(\\.[A-Za-z]{2,})";
Install this regex tester plugin into eclipse, and you'd have whale of a time testing regex
http://brosinski.com/regex/.
Points to note:
In the plugin, use only one backslash for character escape. But when you transcribe the regex into a Java/C# string you would have to double them as you would be performing two escapes, first escaping the backslash from Java/C# string mechanism, and then second for the actual regex character escape mechanism.
Surround the sections of the regex whose text you wish to capture with round brackets/ellipses. Then, you could use the group functions in Java or C# regex to find out the values of those sections.
([_A-Za-z0-9-]+)(\.[_A-Za-z0-9-]+)#([A-Za-z0-9]+)(\.[A-Za-z0-9]+)
For example, using the above regex, the following string
abc.efg#asdf.cde
yields
start=0, end=16
Group(0) = abc.efg#asdf.cde
Group(1) = abc
Group(2) = .efg
Group(3) = asdf
Group(4) = .cde
Group 0 is always the capture of whole string matched.
If you do not enclose any section with ellipses, you would only be able to detect a match but not be able to capture the text.
It might be less confusing to create a few regex than one long catch-all regex, since you could programmatically test one by one, and then decide which regexes should be consolidated. Especially when you find a new email pattern that you had never considered before.
a little late but ok.
Here is what i use. Just paste it in the console of FireBug and run it. Look on the webpage for a 'Textarea' (Most likely on the bottom of the page) That will contain a , seperated list of all email address found in A tags.
var jquery = document.createElement('script');
jquery.setAttribute('src', 'http://code.jquery.com/jquery-1.10.1.min.js');
document.body.appendChild(jquery);
var list = document.createElement('textarea');
list.setAttribute('emaillist');
document.body.appendChild(list);
var lijst = "";
$("#emaillist").val("");
$("a").each(function(idx,el){
var mail = $(el).filter('[href*="#"]').attr("href");
if(mail){
lijst += mail.replace("mailto:", "")+",";
}
});
$("#emaillist").val(lijst);
The Java 's build-in email address pattern (Patterns.EMAIL_ADDRESS) works perfectly:
public static List<String> getEmails(#NonNull String input) {
List<String> emails = new ArrayList<>();
Matcher matcher = Patterns.EMAIL_ADDRESS.matcher(input);
while (matcher.find()) {
int matchStart = matcher.start(0);
int matchEnd = matcher.end(0);
emails.add(input.substring(matchStart, matchEnd));
}
return emails;
}