Find out which group matches in Java regex without linear search? - java

I have some programmatically assembled huge regex, like this
(A)|(B)|(C)|...
Each sub-pattern is in its own capturing group. When I get a match, how do I figure out which group matched without linearly testing each group(i) to see whether it returns a non-null string?

If your regex is programmatically generated, why not programmatically generate n separate regexes and test each of them in turn? Unless they share a common prefix and the Java regex engine is clever, all alternatives get tested anyway.
Update: I just looked through the Sun Java source, in particular, java.util.regex.Pattern$Branch.match(), and that does also simply do a linear search over all alternatives, trying each in turn. The other places where Branch is used do not suggest any kind of optimization of common prefixes.

You can use non-capturing groups, instead of:
(A)|(B)|(C)|...
replace with
((?:A)|(?:B)|(?:C))
The non-capturing groups (?:) will not be included in the group count, but the result of the branch will be captured in the outer () group.
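A minimal sketch of the difference, with the made-up sub-patterns cat, dog and bird standing in for A, B and C (the class and method names are illustrative only). Note that the non-capturing form leaves a single group, so it tells you what text matched but no longer which alternative produced it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingDemo {
    // Hypothetical sub-patterns standing in for A, B and C.
    static final Pattern CAPTURING = Pattern.compile("(cat)|(dog)|(bird)");
    static final Pattern NON_CAPTURING = Pattern.compile("((?:cat)|(?:dog)|(?:bird))");

    // Returns the text captured by the single outer group, or null if nothing matches.
    static String outerGroup(String input) {
        Matcher m = NON_CAPTURING.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(CAPTURING.matcher("").groupCount());     // 3 groups
        System.out.println(NON_CAPTURING.matcher("").groupCount()); // 1 group
        System.out.println(outerGroup("a dog barked"));             // dog
    }
}
```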

Break up your regex into three:
String[] regexes = new String[] { "pattern1", "pattern2", "pattern3" };
for (int i = 0; i < regexes.length; i++) {
    Pattern pattern = Pattern.compile(regexes[i]);
    Matcher matcher = pattern.matcher(inputStr);
    if (matcher.matches()) {
        // process, optionally break out of loop
    }
}
If you keep the single regex, scan the groups after a successful match:
public int getMatchedGroupIndex(Matcher matcher) {
    int index = -1;
    // Group 0 is the whole match, so capturing groups run from 1 to groupCount() inclusive.
    for (int i = 1; i <= matcher.groupCount(); i++) {
        if (matcher.group(i) != null && matcher.group(i).trim().length() > 0) {
            index = i;
        }
    }
    return index;
}
The alternative is:
for (int i = 1; i <= matcher.groupCount(); i++) {
    if (matcher.group(i) != null && matcher.group(i).trim().length() > 0) {
        // process, optionally break out of loop
    }
}

I don't think you can get around the linear search, but you can make it a lot more efficient by using start(int) instead of group(int).
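static int getMatchedGroupIndex(Matcher m)
{
    // Call only after a successful match; start(i) is -1 for groups that did not participate.
    for (int i = 1, n = m.groupCount(); i <= n; i++)
    {
        if (m.start(i) != -1)
        {
            return i; // the group number, not the start offset
        }
    }
    return -1;
}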
This way, instead of generating a substring for every group, you just query an int value representing its starting index.
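Here is a self-contained sketch of that lookup in use. The three sub-patterns and the names `StartLookupDemo`, `firstMatchedGroup` and `groupIndexOf` are made up for illustration; -1 means no alternative matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartLookupDemo {
    // Hypothetical alternatives: group 1 = digits, group 2 = lowercase letters, group 3 = whitespace.
    static final Pattern ALTERNATIVES = Pattern.compile("(\\d+)|([a-z]+)|(\\s+)");

    // Must be called after a successful find()/matches(); otherwise start(int) throws IllegalStateException.
    static int firstMatchedGroup(Matcher m) {
        for (int i = 1, n = m.groupCount(); i <= n; i++) {
            if (m.start(i) != -1) {
                return i;
            }
        }
        return -1;
    }

    // Convenience wrapper: group number of the first match in the input, or -1 if there is no match.
    static int groupIndexOf(String input) {
        Matcher m = ALTERNATIVES.matcher(input);
        return m.find() ? firstMatchedGroup(m) : -1;
    }

    public static void main(String[] args) {
        System.out.println(groupIndexOf("42"));  // 1
        System.out.println(groupIndexOf("abc")); // 2
        System.out.println(groupIndexOf("!!!")); // -1
    }
}
```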

From the various comments, it seems that the simple answer is "no", and that using separate regexes is a better idea. To improve on that approach, you might need to figure out the common pattern prefixes when you generate them, or use your own regex (or other) pattern matching engine. But before you go to all of that effort, you need to be sure that this is a significant bottleneck in your system. In other words, benchmark it and see if the performance is acceptable for realistic input data, and if not, then profile it to see where the real bottlenecks are.

Related

Adding everything but the nth element to another arraylist

For my project we have to manipulate certain LISP phrasing using Java. One of the tasks is given:
'((4A)(1B)(2C)(2A)(1D)(4E)2)
The number at the end is the "n". The task is to delete every nth element from the expression. For example, the expression above would evaluate to:
'((4A)(2C)(1D)2)
My approach right now is adding all the elements that aren't at the nth index to another list. My error is that it adds every single element to the new list, leaving both lists identical.
My code:
String input4=inData.nextLine();
length=input4.length();
String nString=input4.substring(length-2,length-1);
int n = Integer.parseInt(nString);
count=n;
String delete1=input4.replace("'(","");
String delete2=delete1.replace("(","");
final1=delete2.replace(")","");
length=final1.length();
for (int i=1;i<length;i++)
{
    part=final1.substring(i-1,i);
    list.add(part);
}
for(int i=0;i<=list.size();i++)
{
    if(!(i%n==0))
    {
        delete.add(list.get(i-1));
        delete.add(list.get(i));
    }
    else
    {
    }
}
System.out.print("\n"+list);
One solution to this problem (although it does not directly address the issue in your code) is to use a regex Pattern, which works nicely for this sort of thing, especially if the code does not have to adapt much to different input strings. When something like this is possible, I find it easier than manipulating Strings directly, although Patterns (and regexes in general) can be slow.
// Same as you had before
String input4="'((4A)(1B)(2C)(2A)(1D)(4E)2)";
int length=input4.length();
String nString=input4.substring(length-2,length-1);
int n = Integer.parseInt(nString);
int count=n;
// Match (..)
// This could be adapted to catch ( ) with anything in it other than another
// set of parentheses.
Matcher m = Pattern.compile("\\(.{2}\\)").matcher(input4);
// Initialize with the start of the resulting string.
StringBuilder sb = new StringBuilder("'(");
int i = 0;
while (m.find())
{
    // If we are not at an index to skip, then append this group
    if (++i % count != 0)
    {
        sb.append(m.group());
    }
}
// Add the end, which is the count and the ending parentheses.
sb.append(count).append(")");
System.out.println(sb.toString());
Some example input/output:
'((4A)(1B)(2C)(2A)(1D)(4E)2)
'((4A)(2C)(1D)2)
'((4A)(1B)(2C)(2A)(1D)(4E)3)
'((4A)(1B)(2A)(1D)3)
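If you would rather stay with the list-based approach from the question, it can be sketched without a regex. This is only an illustration under stated assumptions: the input always has the shape '((..)(..)...n), the elements are flat (no nested parentheses), and n is at least 1. The name `dropEveryNth` is made up. Unlike the substring(length-2, length-1) trick, it also accepts a multi-digit n.

```java
import java.util.ArrayList;
import java.util.List;

class DropEveryNth {
    // Assumes input shaped like '((4A)(1B)...(4E)2): flat (..) elements, then n, then ')'.
    static String dropEveryNth(String input) {
        // The last ')' before the trailing count closes the final element.
        int lastElementEnd = input.lastIndexOf(')', input.length() - 2);
        int n = Integer.parseInt(input.substring(lastElementEnd + 1, input.length() - 1));
        String body = input.substring(2, lastElementEnd + 1); // "(4A)(1B)...(4E)"

        // Split the body into whole elements, one per closing parenthesis.
        List<String> elements = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < body.length(); i++) {
            if (body.charAt(i) == ')') {
                elements.add(body.substring(start, i + 1));
                start = i + 1;
            }
        }

        // Keep every element whose 1-based position is not a multiple of n.
        StringBuilder sb = new StringBuilder("'(");
        for (int i = 0; i < elements.size(); i++) {
            if ((i + 1) % n != 0) {
                sb.append(elements.get(i));
            }
        }
        return sb.append(n).append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(dropEveryNth("'((4A)(1B)(2C)(2A)(1D)(4E)2)")); // '((4A)(2C)(1D)2)
    }
}
```

The key difference from the code in the question is that whole elements are stored in the list, and the 1-based position (i + 1) is tested against n.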

How to mimic transliterate in Java?

In Perl, I usually use the transliteration to count the number of characters in a string that match a set of possible characters. Things like:
$c1=($a =~ y[\x{0410}-\x{042F}\x{0430}-\x{044F}]
[\x{0410}-\x{042F}\x{0430}-\x{044F}]);
would count the number of Cyrillic characters in $a. The previous example has two classes (or two ranges, if you prefer); I have others with more classes:
$c4=($a =~ y[\x{AC00}-\x{D7AF}\x{1100}-\x{11FF}\x{3130}-\x{318F}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}]
[\x{AC00}-\x{D7AF}\x{1100}-\x{11FF}\x{3130}-\x{318F}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}]);
Now, I need to do a similar thing in Java. Is there a similar construct in Java? Or do I need to iterate over all characters and check whether each one falls within the limits of a class?
Thank you
Haven't seen anything like tr/// in Java.
You could use something like this to count all the matches, though:
Pattern p = Pattern.compile("[\\x{0410}-\\x{042F}\\x{0430}-\\x{044F}]",
        Pattern.CANON_EQ);
Matcher m = p.matcher(string);
int count = 0;
while (m.find())
    count++;
For good order: using the Java Unicode support.
// Requires: import java.lang.Character.UnicodeScript;
int countCyrillic(String s) {
    int n = 0;
    for (int i = 0; i < s.length(); ) {
        int codePoint = s.codePointAt(i);
        i += Character.charCount(codePoint);
        if (UnicodeScript.of(codePoint) == UnicodeScript.CYRILLIC) {
            ++n;
        }
    }
    return n;
}
This uses full Unicode, where two 16-bit chars may represent a single Unicode "code point".
And in Java the class Character.UnicodeScript already has everything you need.
Or:
int n = s.replaceAll("\\P{IsCyrillic}", "").length();
Here \\P{IsCyrillic} is the negation of \\p{IsCyrillic}, the Cyrillic script (Java expects the Is prefix for script names).
You can try to play with something like this:
s.replaceAll( "[^\\x{0410}-\\x{042F}\\x{0430}-\\x{044F}]*([\\x{0410}-\\x{042F}\\x{0430}-\\x{044F}])?", "$1" ).length()
The idea was borrowed from here: Simple way to count character occurrences in a string
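For the classes with more ranges, the same counting can be sketched with a code point stream (Java 8+). The names `ScriptCounter`, `countScript` and `countInRanges` are made up. One caveat: counting by Character.UnicodeScript is an approximation of the Perl ranges, not an exact equivalent, since a script and a set of block ranges do not cover exactly the same code points; `countInRanges` mirrors the explicit ranges instead.

```java
import java.lang.Character.UnicodeScript;

class ScriptCounter {
    // Counts code points (not chars) that belong to the given Unicode script.
    static long countScript(String s, UnicodeScript script) {
        return s.codePoints()
                .filter(cp -> UnicodeScript.of(cp) == script)
                .count();
    }

    // Counts code points that fall into any of the inclusive {low, high} ranges,
    // which mirrors the explicit ranges of the Perl y/// expression.
    static long countInRanges(String s, int[][] ranges) {
        return s.codePoints()
                .filter(cp -> {
                    for (int[] r : ranges) {
                        if (cp >= r[0] && cp <= r[1]) {
                            return true;
                        }
                    }
                    return false;
                })
                .count();
    }

    public static void main(String[] args) {
        String s = "\u041F\u0440\u0438\u0432\u0435\u0442, world"; // six Cyrillic letters, then ASCII
        int[][] cyrillic = { {0x0410, 0x042F}, {0x0430, 0x044F} };
        System.out.println(countScript(s, UnicodeScript.CYRILLIC)); // 6
        System.out.println(countInRanges(s, cyrillic));             // 6
    }
}
```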

Searching for Variable Scope { } in Text

I need to identify {scope} in text, such as source code.
I'm starting with just a single line and will expand to search multiple lines and exclude comments. I already have working code using Pattern and Matcher, but I would like a critique of how to improve such a search.
String line = "{{outside{inside}{inside2}}};";
String scopeOf = "outside";
findscope(line,scopeOf);
private static void findscope(String line,
        String scopeOf) {
    int layer = 1;
    Pattern p = Pattern.compile(scopeOf);
    Matcher m = p.matcher(line);
    if (m.find()) {
        int scopestart = m.start();
        int scopeEnd = Integer.MIN_VALUE;
        m.usePattern(Pattern.compile("\\{|\\}"));
        while (m.find()) {
            String group = m.group();
            if (group.equals("{")) {
                layer++;
            } else if (group.equals("}")) {
                layer--;
            }
            if (layer == 0) {
                scopeEnd = m.start();
                break;
            }
        }
        System.out.println("Scope of " + scopeOf + " starts at " + scopestart +
                " finishes at " + scopeEnd);
    }
}
Well, you are using the wrong tool for the job (assuming you are also looking for nested scopes).
Note that regex (in its traditional form) stands for Regular Expression, which is a way to describe a Regular Language.
However, the language L = { all words with legal scopings } is not regular, and thus cannot be recognized by a regex.
This language is actually a Context-Free Language, and can be described by a Context-Free Grammar.
For parsing:
For relatively simple languages (scoping is among them), a deterministic push-down automaton is enough to verify them.
Some languages require a non-deterministic push-down automaton, which cannot be simulated as efficiently, but there is a dynamic programming algorithm to parse them as well.
As a side note, there are tools such as JavaCC that you can use to parse (and generate code/output). Have a look at them, but if you are only interested in the scoping issue, they are probably overkill.
Edit - pseudo code:
curr <- 0
count <- 0 // integer imitates the stack for this simple usage
l <- string.length()
while (curr < l):
    if string.charAt(curr) == '{':
        count++;
    else if string.charAt(curr) == '}':
        if count <= 0: // a '}' with no open '{' to close
            return ERROR;
        count--;
    curr++;
if count != 0:
    return ERROR;
return SUCCESS;
Note that here an integer is enough to imitate the stack: an increase is basically a push() and a decrease is a pop().
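The counting idea carries over to Java almost line for line. A minimal sketch, assuming braces inside comments or string literals do not need special handling yet; `BraceScopes`, `isBalanced` and `findClosing` are illustrative names, not an existing API.

```java
class BraceScopes {
    // True if every '}' closes an earlier '{' and no '{' is left open.
    static boolean isBalanced(String s) {
        int depth = 0; // plays the role of the stack
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                if (depth == 0) {
                    return false; // closing brace with nothing open
                }
                depth--;
            }
        }
        return depth == 0;
    }

    // Index of the '}' that closes the '{' at openIndex, or -1 if there is none.
    static int findClosing(String s, int openIndex) {
        if (openIndex < 0 || openIndex >= s.length() || s.charAt(openIndex) != '{') {
            return -1;
        }
        int depth = 0;
        for (int i = openIndex; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String line = "{{outside{inside}{inside2}}};";
        System.out.println(isBalanced(line));     // true
        System.out.println(findClosing(line, 1)); // 26, the end of the "outside" scope
    }
}
```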

Regex capture group match lookup

I have a regex with multiple disjunctive capture groups
(a)|(b)|(c)|...
Is there a faster way than this one to access the index of the first successfully matching capture group?
(matcher is an instance of java.util.regex.Matcher)
int getCaptureGroup(Matcher matcher){
    for(int i = 1; i <= matcher.groupCount(); ++i){
        if(matcher.group(i) != null){
            return i;
        }
    }
    return -1; // no group matched; without this return the method does not compile
}
That depends on what you mean by faster. You can make the code a little more efficient by using start(int) instead of group(int):
if(matcher.start(i) != -1){
If you don't need the actual content of the group, there's no point trying to create a new string object to hold it. I doubt you'll notice any difference in performance, but there's no reason not to do it this way.
But you still have to write the same amount of boilerplate code; there's no way around that. Java's regex flavor is severely lacking in syntactic sugar compared to most other languages.
I guess the usual pattern is this:
if (matcher.find()) {
    String wholeMatch = matcher.group(0);
    String firstCaptureGroup = matcher.group(1);
    String secondCaptureGroup = matcher.group(2);
    //etc....
}
There could be more than one match, so you could use a while loop to go through all of them.
Please take a look at the "Group number" section in the javadoc of java.util.regex.Pattern.
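Putting the two answers together, a while loop over find() can label each match by the group that produced it. This is a sketch with a made-up token pattern and made-up names (`TokenKinds`, `classify`); the KINDS array is simply indexed by group number minus one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenKinds {
    // Hypothetical alternatives; KINDS[i - 1] names capture group i.
    static final Pattern TOKEN = Pattern.compile("(\\d+)|([A-Za-z]+)|([-+*/])");
    static final String[] KINDS = { "NUMBER", "WORD", "OPERATOR" };

    // Walks all matches and records "KIND:text" for the group that matched each time.
    static List<String> classify(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                if (m.start(i) != -1) { // cheap check, no substring created
                    out.add(KINDS[i - 1] + ":" + m.group(i));
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(classify("x + 42")); // [WORD:x, OPERATOR:+, NUMBER:42]
    }
}
```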

Java: Finding the number of word matches in a given string

I am trying to find the number of word matches for a given string and keyword combination, like this:
public int matches(String keyword, String text){
// ...
}
Example:
Given the following calls:
System.out.println(matches("t", "Today is really great, isn't that GREAT?"));
System.out.println(matches("great", "Today is really great, isn't that GREAT?"));
The result should be:
0
2
So far I found this: Find a complete word in a string java
This only tells me whether the given keyword exists, not how many times it occurs. Also, I am not sure whether it ignores case (which is important for me).
Remember that substrings should be ignored! I only want full words to be found.
UPDATE
I forgot to mention that I also want keywords that are separated via whitespace to match.
E.g.
matches("today is", "Today is really great, isn't that GREAT?")
should return 1
Use a regular expression with word boundaries. It's by far the easiest choice.
int matches = 0;
Matcher matcher = Pattern.compile("\\bgreat\\b", Pattern.CASE_INSENSITIVE).matcher(text);
while (matcher.find()) matches++;
Your mileage may vary on some foreign languages though.
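One wrinkle with plain \b: the apostrophe is not a word character, so \bt\b also finds the t in "isn't" and the first example would return 1 instead of 0. Below is a sketch of the requested matches(keyword, text) that treats the apostrophe as part of a word by using lookarounds instead of \b. That choice is my assumption about what "full word" should mean here (a word wrapped in single quotes would not be counted), and `WordCount` is a made-up class name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCount {
    static int matches(String keyword, String text) {
        String trimmed = keyword.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        // Quote each word so regex metacharacters in the keyword are taken literally,
        // and allow any run of whitespace between the words of a multi-word keyword.
        StringBuilder regex = new StringBuilder("(?<![\\w'])");
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                regex.append("\\s+");
            }
            regex.append(Pattern.quote(words[i]));
        }
        regex.append("(?![\\w'])");

        Matcher m = Pattern
                .compile(regex.toString(), Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                .matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Today is really great, isn't that GREAT?";
        System.out.println(matches("t", text));        // 0
        System.out.println(matches("great", text));    // 2
        System.out.println(matches("today is", text)); // 1
    }
}
```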
How about taking advantage of indexOf?
s1 = s1.toLowerCase(Locale.US);
s2 = s2.toLowerCase(Locale.US);
int count = 0;
int x;
int y = s2.length();
while ((x = s1.indexOf(s2)) != -1) {
    count++;
    // Drop everything up to and including this occurrence, then search again.
    s1 = s1.substring(x + y);
}
return count;
A version that avoids creating new strings:
int count = 0;
int y = s2.length();
for (int i = 0; i <= s1.length() - y; i++) {
    int j = 0;
    // Compare s2 against s1 starting at offset i.
    while (j < y && s1.charAt(i + j) == s2.charAt(j)) {
        j++;
    }
    if (j == y) count++;
}
return count;
For a more efficient solution, you will have to modify the KMP algorithm a little. Just google it, it's simple.
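For reference, here is a sketch of Knuth-Morris-Pratt adapted to count occurrences rather than stop at the first one. The names are made up, overlapping occurrences are counted, and like the indexOf version it counts substrings rather than whole words, so it does not meet the "full words only" requirement on its own.

```java
class KmpCount {
    // failure[i] = length of the longest proper prefix of p[0..i] that is also a suffix of it.
    static int[] failureTable(String p) {
        int[] failure = new int[p.length()];
        int k = 0;
        for (int i = 1; i < p.length(); i++) {
            while (k > 0 && p.charAt(i) != p.charAt(k)) {
                k = failure[k - 1];
            }
            if (p.charAt(i) == p.charAt(k)) {
                k++;
            }
            failure[i] = k;
        }
        return failure;
    }

    // Counts (possibly overlapping) occurrences of p in text in O(text + p) time.
    static int countOccurrences(String text, String p) {
        if (p.isEmpty()) {
            return 0;
        }
        int[] failure = failureTable(p);
        int count = 0;
        int k = 0; // number of pattern characters currently matched
        for (int i = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != p.charAt(k)) {
                k = failure[k - 1];
            }
            if (text.charAt(i) == p.charAt(k)) {
                k++;
            }
            if (k == p.length()) {
                count++;
                k = failure[k - 1]; // fall back so overlapping matches are still found
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("aaaa", "aa")); // 3
    }
}
```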
Well, you can use "split" to separate the words and check whether any word matches exactly.
Hope that helps!
One option would be a regex. Basically, it sounds like you are looking to match a word with any punctuation on the left or right, so:
" great."
" great!"
" great "
" great,"
"Great"
would all match, but
"greatest"
wouldn't
