Regex to capture unpaired brackets or parentheses - java

As the title indicates, how do I capture unpaired brackets or parentheses with a regex in Java? I am new to Java. For instance, suppose I have the string below:
Programming is productive, (achieving a lot, and getting good results), it is often 1) demanding and 2) costly.
How do I capture 1) and 2).
I have tried:
([^\(\)][\)])
But, the result I am getting includes s) as below, instead of 1) and 2):
s), 1) and 2)
I have checked the link: Regular expression to match balanced parentheses, but that question seems to refer to recursive or nested structures, which is quite different from my situation.
My situation is to match the right parenthesis or right bracket, along with any associated text that does not have an associated left parenthesis or bracket.

Maybe,
\b\d+\)
might simply return the desired output, I guess.
Demo 1
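As a minimal sketch of how that first expression could be tried from Java (the class name UnpairedLabelDemo and the method findLabels are made up for illustration), the pattern below is applied to the sentence from the question. Note that it only looks for digits followed by a closing parenthesis; it does not check whether an opening parenthesis appears earlier, so something like "(item 1)" would match as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnpairedLabelDemo {
    // A run of digits that starts at a word boundary and is directly followed by ')'.
    private static final Pattern LABEL = Pattern.compile("\\b\\d+\\)");

    // Hypothetical helper: returns every match of LABEL in the text.
    static List<String> findLabels(String text) {
        List<String> labels = new ArrayList<>();
        Matcher m = LABEL.matcher(text);
        while (m.find()) {
            labels.add(m.group());
        }
        return labels;
    }

    public static void main(String[] args) {
        String text = "Programming is productive, (achieving a lot, and getting good results), "
                + "it is often 1) demanding and 2) costly.";
        System.out.println(findLabels(text)); // [1), 2)]
    }
}
```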
Another way is to look at what left boundary you might have, which in this case appears to be digits, then at what other characters may appear before the closing parenthesis, and then design some other simple expression similar to:
\b\d[^)]*\)
Demo 2
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpression {
    public static void main(String[] args) {
        final String regex = "\\b\\d[^)]*\\)";
        final String string = "Programming is productive, (achieving a lot, and getting good results), it is often 1) demanding and 2) costly.\n\n"
            + "Programming is productive, (achieving a lot, and getting good results), it is often 1a b) demanding and 2a a) costly.\n\n\n"
            + "Programming is productive, (achieving a lot, and getting good results), it is often 1b) demanding and 2b) costly.\n\n"
            + "It is not supposed to match ( s s 1) \n";
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}
Output
Full match: 1)
Full match: 2)
Full match: 1a b)
Full match: 2a a)
Full match: 1b)
Full match: 2b)
Full match: 1)

This is not a regex solution (obviously), but I can't think of a good way to do it with a regex. It simply uses a counter to keep track of the parentheses and a stack to collect the characters of the current token.
For the input String "(*(**)**) first) second) (**) (*ksks*) third) ** fourth)( **)"
it prints out
first)
second)
third)
fourth)
All other parentheses are ignored because they are matched.
// Requires java.util.ArrayList, java.util.List and java.util.Stack.
String s =
    "(*(**)**) first) second) (**) (*ksks*) third) ** fourth)( **)";
List<String> found = new ArrayList<>();
Stack<Character> tokens = new Stack<>(); // characters since the last space
int pcount = 0;                          // number of currently open '('
for (char c : s.toCharArray()) {
    switch (c) {
        case ' ':
            tokens.clear();
            break;
        case '(':
            pcount++;
            break;
        case ')':
            pcount--;
            if (pcount == -1) {
                // No open '(' to pair with: collect the preceding token.
                String v = ")";
                while (!tokens.isEmpty()) {
                    v = tokens.pop() + v;
                }
                found.add(v);
                pcount = 0;
            }
            break;
        default:
            tokens.push(c);
    }
}
found.forEach(System.out::println);
Note: Integrating brackets (]) into the above would be a challenge (though not impossible) because one would need to check constructs like ( [ ) ] where it is unclear how to interpret it. That's why when specifying requirements of this sort they need to be spelled out precisely.
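One possible way to extend the idea to square brackets is sketched below. It rests on an assumption that is not part of the original answer: parentheses and brackets are counted independently of each other, so an interleaved construct such as ( [ ) ] is simply treated as balanced. The class name UnpairedClosers and the method find are made up for this illustration, and a different reading of the requirements would need different logic.

```java
import java.util.ArrayList;
import java.util.List;

class UnpairedClosers {
    /**
     * Collects every ')' or ']' that has no opener of the same kind before it,
     * together with the non-space characters that directly precede it.
     * Assumption: the two bracket kinds are counted independently.
     */
    static List<String> find(String s) {
        List<String> found = new ArrayList<>();
        StringBuilder token = new StringBuilder(); // text since the last space
        int[] open = new int[2];                   // open[0]: '(' count, open[1]: '[' count
        for (char c : s.toCharArray()) {
            switch (c) {
                case ' ':
                    token.setLength(0);
                    break;
                case '(':
                    open[0]++;
                    break;
                case '[':
                    open[1]++;
                    break;
                case ')':
                case ']':
                    int kind = (c == ')') ? 0 : 1;
                    if (open[kind] > 0) {
                        open[kind]--;                    // pairs with an earlier opener
                    } else {
                        found.add(token.toString() + c); // unpaired closer
                        token.setLength(0);
                    }
                    break;
                default:
                    token.append(c);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(find("(*(**)**) first) second) (**) (*ksks*) third) ** fourth)( **)"));
        // [first), second), third), fourth)]
        System.out.println(find("see [1] and (a [b) c] then 2] or 3)"));
        // [2], 3)]
    }
}
```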

Related

Getting data between single and double quotes (special case)

I am writing a parser that extracts all string literals from a text file. The strings can be inside single or double quotes. Pretty simple, right? Well, not really. I wrote a regex that matches strings the way I want, but it throws a StackOverflowError on big strings (I am aware Java isn't really good with regex on big strings). This is the regex pattern: (['"])(?:(?!\1|\\).|\\.)*\1
This works for all the string inputs that I need, but as soon as there is a big string it throws a StackOverflowError. I have read similar questions about this, such as this one, which suggests using StringUtils.substringsBetween, but that fails on strings like '""', "\\\""
So my question is: what should I do to solve this issue? I can provide more context if needed, just comment.
Edit: After testing the answer
Code:
public static void main(String[] args) {
    final String regex = "'([^']*)'|\"(.*)\"";
    final String string = "local b = { [\"\\\\\"] = \"\\\\\\\\\", [\"\\\"\"] = \"\\\\\\\"\", [\"\\b\"] = \"\\\\b\", [\"\\f\"] = \"\\\\f\", [\"\\n\"] = \"\\\\n\", [\"\\r\"] = \"\\\\r\", [\"\\t\"] = \"\\\\t\" }\n" +
        "local c = { [\"\\\\/\"] = \"/\" }";
    final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
    final Matcher matcher = pattern.matcher(string);
    while (matcher.find()) {
        System.out.println("Full match: " + matcher.group(0));
        for (int i = 1; i <= matcher.groupCount(); i++) {
            System.out.println("Group " + i + ": " + matcher.group(i));
        }
    }
}
Output:
Full match: "\\"] = "\\\\", ["\""] = "\\\"", ["\b"] = "\\b", ["\f"] = "\\f", ["\n"] = "\\n", ["\r"] = "\\r", ["\t"] = "\\t"
Group 1: null
Group 2: \\"] = "\\\\", ["\""] = "\\\"", ["\b"] = "\\b", ["\f"] = "\\f", ["\n"] = "\\n", ["\r"] = "\\r", ["\t"] = "\\t
Full match: "\\/"] = "/"
Group 1: null
Group 2: \\/"] = "/
It's not handling the escaped quotes correctly.
I would try it without capturing the quote type and without the lookahead and backreference, to improve performance. See this question for escaped characters in quoted strings. It contains a nice answer that uses an unrolled loop. Try something like
'[^\\']*(?:\\.[^\\']*)*'|"[^\\"]*(?:\\.[^\\"]*)*"
As a Java String:
String regex = "'[^\\\\']*(?:\\\\.[^\\\\']*)*'|\"[^\\\\\"]*(?:\\\\.[^\\\\\"]*)*\"";
The left side handles single-quoted strings, the right side double-quoted ones. If one kind is more common than the other in your source, preferably put it on the left side of the pipe.
See this demo at regex101 (if you need to capture what's inside the quotes, use groups).
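To see how the unrolled pattern behaves on the inputs that tripped up the other attempts, here is a small sketch. The class name QuotedStringDemo, the method findQuoted and the sample text are assumptions made for illustration; only the regex itself comes from the answer above. It has not been measured against very large inputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedStringDemo {
    // The unrolled pattern from the answer: single-quoted on the left, double-quoted on the right.
    private static final Pattern QUOTED = Pattern.compile(
            "'[^\\\\']*(?:\\\\.[^\\\\']*)*'|\"[^\\\\\"]*(?:\\\\.[^\\\\\"]*)*\"");

    // Hypothetical helper: returns every quoted literal, quotes included.
    static List<String> findQuoted(String text) {
        List<String> literals = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            literals.add(m.group());
        }
        return literals;
    }

    public static void main(String[] args) {
        // Sample text: local c = { ["\\/"] = "/", ['""'] = "\"" }
        String text = "local c = { [\"\\\\/\"] = \"/\", ['\"\"'] = \"\\\"\" }";
        for (String literal : findQuoted(text)) {
            System.out.println(literal);
        }
    }
}
```

For the sample text this should print four literals, including '""' and "\"", each as a separate match rather than one long span.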
For the stack overflow, you would probably need to allocate whatever resources the regex engine requires. It is worth designing small benchmark tests to find out how much is practically necessary to complete your task.
Another option would be to look for other strategies, or maybe other languages, to solve your problem. For instance, you could classify your strings into two categories, ' wrapped or " wrapped, and handle each with a simpler solution.
Otherwise, you might want to try designing simple expressions that avoid back-referencing, such as:
'([^']*)'|"(.*)"
which would probably fail for some other inputs that you might have and that we don't know of.
Or you could present your question in slightly more technical terms, so that experienced users might be able to provide better answers, such as this answer.
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpression {
    public static void main(String[] args) {
        final String regex = "'([^']*)'|\"(.*)\"";
        final String string = "'\"\"'\n"
            + "\"\\\\\\\"\"";
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}
Output
Full match: '""'
Group 1: ""
Group 2: null
Full match: "\\\""
Group 1: null
Group 2: \\\"
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.

Regex grouping and matching

I have a regex: https://regex101.com/r/PPbhRn/1. There I see that when "and" is captured, some whitespace is captured as well. Is there any way to get rid of that whitespace? I also want to know whether the pattern will match only when the groups are captured correctly.
String validRegex="(((?:[(]* ?[a-z][a-z]+ ?[)]*)|[(]* ?(NOT) (?:[(]* ?[a-z][a-z]+ ?[)]*) ?[)]*)( (AND|OR) ((?:[(]* ?[a-z][a-z]+ ?[)]*)|[(]* ?(NOT) (?:[(]* ?[a-z][a-z]+ ?[)]*) ?[)]*))*)";
String formula = "mean AND trip OR (mean OR mango) AND (mean AND orange) OR mango AND (test OR NOT help)";
Pattern p1 = Pattern.compile(validRegex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE);
final Matcher matcher = p1.matcher(formula);
boolean result = MarketMeasureUtil.isValidFormula(formula);
System.out.println(result);
while (matcher.find()) {
    System.out.println("Full match: " + matcher.group(0));
    for (int i = 1; i <= matcher.groupCount(); i++) {
        System.out.println("Group " + i + ": " + matcher.group(i));
        System.out.println(matcher.group() + " starting at index " + matcher.start() + " and ending at index " + matcher.end());
    }
}
I'm not able to capture the groups properly. I need to capture groups like "mean AND trip", "OR", "mean OR mango", etc.
isValidFormula() invokes matches() with this regex. In our case matches() works fine, but grouping does not work as expected.
Regular expressions are not suitable for this task. I doubt that it's even possible to validate the expression if you can add as many parentheses as you like.
You have to write a parser that builds a tree, using a class like:
class Node {
    boolean[] isAnd = null;
    Node[] children = null;
    String literal = null;

    Node(String literal) { // creator for literals
        this.literal = literal;
    }

    Node(boolean[] isAnd) { // creator for intermediate nodes
        this.isAnd = isAnd;
        children = new Node[isAnd.length + 1];
    }
}
And the method would look like this:
Node parse(String expression) throws ParseException // returns the root
First remove the superfluous parentheses on the left and right side by counting all parentheses. Then find the 0-level ANDs and ORs (i.e. those that are not inside parentheses) and create an intermediate node. If you don't find any 0-level ANDs and ORs, the string has to be a literal or it is invalid. If it is an intermediate node, add the children by calling the parse method recursively on the substrings surrounding the 0-level ANDs and ORs.
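The steps described above can be sketched roughly as follows. This is not the answerer's implementation: the class name FormulaParser, the list-based Node (instead of the boolean[] shown earlier), the NOT handling, and the rule that operators are upper-case words surrounded by single spaces are all assumptions made to keep the sketch short.

```java
import java.text.ParseException;
import java.util.ArrayList;
import java.util.List;

class FormulaParser {
    /** Tree node: a leaf holds a word, an inner node holds children joined by operators. */
    static class Node {
        String literal;                              // set for leaves only
        boolean negated;                             // true if preceded by NOT
        final List<Node> children = new ArrayList<>();
        final List<String> ops = new ArrayList<>();  // ops.get(i) joins child i and child i + 1

        @Override
        public String toString() {
            String prefix = negated ? "NOT " : "";
            if (literal != null) return prefix + literal;
            StringBuilder sb = new StringBuilder(prefix).append('(').append(children.get(0));
            for (int i = 0; i < ops.size(); i++) {
                sb.append(' ').append(ops.get(i)).append(' ').append(children.get(i + 1));
            }
            return sb.append(')').toString();
        }
    }

    static Node parse(String input) throws ParseException {
        String s = input.trim();
        List<String> parts = new ArrayList<>();
        List<String> ops = new ArrayList<>();
        int depth = 0, start = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                if (--depth < 0) throw new ParseException("Unmatched ')' in: " + s, i);
            } else if (c == ' ' && depth == 0) {
                String op = s.startsWith(" AND ", i) ? "AND" : s.startsWith(" OR ", i) ? "OR" : null;
                if (op != null) {                    // 0-level operator: split here
                    parts.add(s.substring(start, i));
                    ops.add(op);
                    i += op.length() + 1;            // move to the space after the operator
                    start = i + 1;
                }
            }
        }
        if (depth != 0) throw new ParseException("Unmatched '(' in: " + s, s.length());

        Node node = new Node();
        if (!ops.isEmpty()) {                        // intermediate node: recurse into the parts
            parts.add(s.substring(start));
            for (String part : parts) node.children.add(parse(part));
            node.ops.addAll(ops);
        } else if (s.startsWith("NOT ")) {
            node = parse(s.substring(4));
            node.negated = !node.negated;
        } else if (s.startsWith("(") && s.endsWith(")")) {
            node = parse(s.substring(1, s.length() - 1)); // strip one pair of outer parentheses
        } else if (s.matches("[A-Za-z]+") && !s.matches("AND|OR|NOT")) {
            node.literal = s;
        } else {
            throw new ParseException("Not a valid literal: '" + s + "'", 0);
        }
        return node;
    }

    static boolean isValid(String formula) {
        try {
            parse(formula);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(parse("a AND (b OR NOT c)")); // (a AND (b OR NOT c))
        System.out.println(isValid("mean AND (trip OR")); // false
    }
}
```

The sketch keeps AND and OR at the same level in one node, as the description does; operator precedence would have to be applied in a later step.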
Looks like you created some kind of DSL.
You should consider using a parser or implementing your own if your "language" is not complex.
I assume you just evaluate OR/AND operations. It is very similar to coding a calculator, where AND (multiplication) takes precedence over OR (addition). Therefore you could implement your own.
You can first tokenize the statement and validate it but don't try to do both at the same time with regex. If validating is the only purpose you can end here.
Next if you have to evaluate the expression you can create a binary tree with the tokens (OR operand as left leaf and AND operand as right leaf for instance) and apply your grammar to evaluate the expression.
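A rough sketch of that two-step approach is shown below: a regex is used only to tokenize, and a small recursive-descent evaluator then applies the grammar, with AND binding tighter than OR. It evaluates directly instead of building the binary tree mentioned above. The class name BooleanFormula, the set of "true" words and the upper-case-only operators are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BooleanFormula {
    // One token: "(", ")" or a run of letters (operators and operands alike).
    private static final Pattern TOKEN = Pattern.compile("\\s*(\\(|\\)|[A-Za-z]+)");

    private final List<String> tokens;
    private final Set<String> trueWords; // operands that evaluate to true
    private int pos = 0;

    private BooleanFormula(List<String> tokens, Set<String> trueWords) {
        this.tokens = tokens;
        this.trueWords = trueWords;
    }

    /** Step 1: split into tokens, rejecting anything that is not a token. */
    static List<String> tokenize(String formula) {
        String input = formula.trim();
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int at = 0;
        while (at < input.length()) {
            m.region(at, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character near index " + at);
            }
            tokens.add(m.group(1));
            at = m.end();
        }
        return tokens;
    }

    /** Step 2: evaluate the token list with AND binding tighter than OR. */
    static boolean evaluate(String formula, Set<String> trueWords) {
        BooleanFormula f = new BooleanFormula(tokenize(formula), trueWords);
        boolean value = f.parseOr();
        if (f.pos != f.tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + f.tokens.get(f.pos));
        }
        return value;
    }

    private boolean peekIs(String expected) {
        return pos < tokens.size() && tokens.get(pos).equals(expected);
    }

    private boolean parseOr() {            // or := and ("OR" and)*
        boolean value = parseAnd();
        while (peekIs("OR")) {
            pos++;
            boolean rhs = parseAnd();      // always consume the right side
            value = value || rhs;
        }
        return value;
    }

    private boolean parseAnd() {           // and := unary ("AND" unary)*
        boolean value = parseUnary();
        while (peekIs("AND")) {
            pos++;
            boolean rhs = parseUnary();
            value = value && rhs;
        }
        return value;
    }

    private boolean parseUnary() {         // unary := "NOT" unary | "(" or ")" | word
        if (pos >= tokens.size()) throw new IllegalArgumentException("Unexpected end of formula");
        String token = tokens.get(pos++);
        if (token.equals("NOT")) return !parseUnary();
        if (token.equals("(")) {
            boolean value = parseOr();
            if (!peekIs(")")) throw new IllegalArgumentException("Missing ')'");
            pos++;
            return value;
        }
        if (token.equals(")") || token.equals("AND") || token.equals("OR")) {
            throw new IllegalArgumentException("Unexpected token: " + token);
        }
        return trueWords.contains(token);
    }

    public static void main(String[] args) {
        Set<String> trueWords = new HashSet<>(Arrays.asList("mean", "mango"));
        System.out.println(evaluate("mean AND trip OR (mean OR mango) AND NOT help", trueWords)); // true
    }
}
```

If validation is the only goal, calling evaluate with an empty set and catching IllegalArgumentException would be enough under these assumptions.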

Efficient Regular Expression for big data, if a String contains a word

I have code that works but is extremely slow. It determines whether a string contains a keyword. It needs to be efficient for hundreds of keywords that I will search for in thousands of documents.
What can I do to find the keywords efficiently (without falsely matching a word that merely contains the keyword)?
For example:
String keyword="ac";
String document = "..."; // few-page-long file
If I use:
if(document.contains(keyword) ){
//do something
}
It will also return true if the document contains a word like "account", so I tried to use a regular expression as follows:
String pattern = "(.*)([^A-Za-z]"+ keyword +"[^A-Za-z])(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(document);
if(m.find()){
//do something
}
Summary:
This is the summary; hopefully it will be useful to someone else:
My regular expression would work but was extremely impractical for big data (it didn't terminate).
#anubhava perfected the regular expression. It was easy to understand and implement, and it managed to terminate, which is a big thing, but it was still a bit slow (roughly 240 seconds).
#Tomalak's solution is a bit complex to implement and understand, but it was the fastest solution, so hats off, mate (18 seconds).
So #Tomalak's solution was ~15 times faster than #anubhava's.
Don't think you need to have .* in your regex.
Try this regex:
String pattern = "\\b"+ Pattern.quote(keyword) + "\\b";
Here \\b is used for word boundary. If the keyword can contain special characters, make sure they are not at the start or end of the word, or the word boundaries will fail to match.
You must also use Pattern.quote if your keyword contains special regex characters.
EDIT: You might use this regex if your keywords are separated by space.
String pattern = "(?<=\\s|^)"+ Pattern.quote(keyword) + "(?=\\s|$)";
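Since the question mentions hundreds of keywords and thousands of documents, one way to apply the word-boundary pattern is to compile it once and reuse it for every document. The sketch below goes one step further and combines all keywords into a single alternation; that combination, the class name KeywordSearch and the method findIn are assumptions for illustration, and the speed of this variant has not been measured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordSearch {
    private final Pattern pattern; // compiled once, reused for every document

    // Assumes a non-empty list of keywords that start and end with word characters.
    KeywordSearch(List<String> keywords) {
        List<String> quoted = new ArrayList<>();
        for (String keyword : keywords) {
            quoted.add(Pattern.quote(keyword)); // neutralize regex metacharacters
        }
        pattern = Pattern.compile("\\b(?:" + String.join("|", quoted) + ")\\b");
    }

    /** Returns the keywords that occur as whole words in the document, sorted. */
    Set<String> findIn(String document) {
        Set<String> hits = new TreeSet<>();
        Matcher m = pattern.matcher(document);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        KeywordSearch search = new KeywordSearch(Arrays.asList("ac", "power", "bank"));
        System.out.println(search.findIn("my account has ac power")); // [ac, power]
    }
}
```

If one keyword is a prefix of another multi-word keyword (say "ac" and "ac dc"), the alternation finds the first alternative that fits, so listing longer keywords first is probably safer.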
The fastest-possible way to find substrings in Java is to use String.indexOf().
To achieve "entire-word-only" matches, you would need to add a little bit of logic to check the characters before and after a possible match to make sure they are non-word characters:
public class IndexOfWordSample {
    public static void main(String[] args) {
        String input = "There are longer strings than this not very long one.";
        String search = "long";
        int index = indexOfWord(input, search);
        if (index > -1) {
            System.out.println("Hit for \"" + search + "\" at position " + index + ".");
        } else {
            System.out.println("No hit for \"" + search + "\".");
        }
    }

    public static int indexOfWord(String input, String word) {
        String nonWord = "^\\W?$", before, after;
        int index, before_i, after_i = 0;
        while (true) {
            index = input.indexOf(word, after_i);
            if (index == -1 || word.isEmpty()) break;
            before_i = index - 1;
            after_i = index + word.length();
            before = "" + (before_i > -1 ? input.charAt(before_i) : "");
            after = "" + (after_i < input.length() ? input.charAt(after_i) : "");
            if (before.matches(nonWord) && after.matches(nonWord)) {
                return index;
            }
        }
        return -1;
    }
}
This would print:
Hit for "long" at position 44.
This should perform better than a pure regular expressions approach.
Consider whether ^\W?$ already matches your expectation of a "non-word" character. The regular expression is a compromise here and may cost performance if your input string contains many "almost" matches.
For extra speed, ditch the regex and work with the Character class, checking a combination of the many properties it provides (like isAlphabetic, etc.) for before and after.
I've created a Gist with an alternative implementation that does that.
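The Gist itself is not reproduced here. As an independent sketch of the Character-based idea, the version below treats any letter or digit as a word character; that choice, like the class name IndexOfWord, is an assumption (the regex \W, for instance, also treats the underscore as a word character, which this sketch does not).

```java
class IndexOfWord {
    /**
     * Returns the index of the first occurrence of word that is not directly
     * preceded or followed by a letter or digit, or -1 if there is none.
     */
    static int indexOfWord(String input, String word) {
        if (word.isEmpty()) {
            return -1;
        }
        int from = 0;
        while (true) {
            int index = input.indexOf(word, from);
            if (index == -1) {
                return -1;
            }
            int end = index + word.length();
            boolean startOk = index == 0
                    || !Character.isLetterOrDigit(input.charAt(index - 1));
            boolean endOk = end == input.length()
                    || !Character.isLetterOrDigit(input.charAt(end));
            if (startOk && endOk) {
                return index;
            }
            from = index + 1; // keep searching after this partial hit
        }
    }

    public static void main(String[] args) {
        String input = "There are longer strings than this not very long one.";
        System.out.println(indexOfWord(input, "long")); // 44
    }
}
```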

Searching for Variable Scope { } in Text

I need to identify {scope} in text, such as source code.
I'm starting with just a single line and will expand to search multiple lines and exclude comments. I already have working code using Pattern and Matcher, but I would like a critique on how to improve such a search.
String line = "{{outside{inside}{inside2}}};";
String scopeOf = "outside";
findscope(line,scopeOf);

private static void findscope(String line,
        String scopeOf) {
    int layer = 1;
    Pattern p = Pattern.compile(scopeOf);
    Matcher m = p.matcher(line);
    if (m.find()) {
        int scopestart = m.start();
        int scopeEnd = Integer.MIN_VALUE;
        m.usePattern(Pattern.compile("\\{|\\}"));
        while (m.find()) {
            String group = m.group();
            if (group.equals("{")) {
                layer++;
            } else if (group.equals("}")) {
                layer--;
            }
            if (layer == 0) {
                scopeEnd = m.start();
                break;
            }
        }
        System.out.println("Scope of " + scopeOf + " starts at " + scopestart +
            " finishes at " + scopeEnd);
    }
}
Well, you are using the wrong tool for the job (assuming you are also looking for nested scopes).
Note that regex (in its traditional form) stands for Regular Expression, which is a way to describe a Regular Language.
However, the language L = { all words with legal scopings } is not regular, and thus cannot be recognized by a regex.
This language is actually a Context-Free Language, and can be represented by a Context-Free Grammar.
For parsing:
For relatively simple languages (scoping is among them), a deterministic push-down automaton is enough to verify them.
Some languages require a non-deterministic push-down automaton, which cannot be built very efficiently, but there is a dynamic programming algorithm to parse them as well.
As a side note, there are tools such as JavaCC that you can use to parse (and generate code/output). Have a look at them, but if you are simply dealing with the scoping issue, they are probably overkill.
Edit - pseudo code:
curr <- 0
count <- 0 //integer imitates the stack for this simple usage
l <- string.length()
while (curr < l):
    if string.charAt(curr) == '{':
        count++;
    else if string.charAt(curr) == '}':
        if count <= 0: //a '}' with no open '{' left to close
            return ERROR;
        count--;
    curr++;
if count != 0:
    return ERROR;
return SUCCESS;
Note that here we can use an integer to imitate the stack: an increase is basically a push() and a decrease is a pop().
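Translated to Java, the counter idea might look like the sketch below. The class name ScopeFinder and both method names are made up for illustration, and the code ignores comments and string literals, which the question says it will need to handle later.

```java
class ScopeFinder {
    /** Returns true if every '{' has a matching '}' and no '}' comes too early. */
    static boolean isBalanced(String text) {
        int count = 0; // the integer imitates the stack
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '{') {
                count++;            // push
            } else if (c == '}') {
                if (count <= 0) {
                    return false;   // nothing left to pop
                }
                count--;            // pop
            }
        }
        return count == 0;
    }

    /** Returns the index of the '}' closing the '{' at openIndex, or -1 if it is never closed. */
    static int findClosingBrace(String text, int openIndex) {
        if (openIndex < 0 || openIndex >= text.length() || text.charAt(openIndex) != '{') {
            throw new IllegalArgumentException("No '{' at index " + openIndex);
        }
        int depth = 0;
        for (int i = openIndex; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String line = "{{outside{inside}{inside2}}};";
        System.out.println(isBalanced(line));          // true
        System.out.println(findClosingBrace(line, 1)); // 26
    }
}
```

Index 1 is the brace that opens the scope containing "outside", so 26 agrees with the closing position the question's own code looks for.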

Iterating through String with .find() in Java regex

I'm currently trying to solve a problem from codingbat.com with regular expressions.
I'm new to this, so step-by-step explanations would be appreciated. I could solve this with String methods relatively easily, but I am trying to use regular expressions.
Here is the prompt:
Given a string and a non-empty word string, return a string made of each char just before and just after every appearance of the word in the string. Ignore cases where there is no char before or after the word, and a char may be included twice if it is between two words.
wordEnds("abcXY123XYijk", "XY") → "c13i"
wordEnds("XY123XY", "XY") → "13"
wordEnds("XY1XY", "XY") → "11"
etc
My code thus far:
String regex = ".?" + word+ ".?";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(str);
String newStr = "";
while(m.find())
newStr += m.group().replace(word, "");
return newStr;
The problem is that when there are multiple instances of word in a row, the program misses the character preceding the word because m.find() progresses beyond it.
For example: wordEnds("abc1xyz1i1j", "1") should return "cxziij", but my method returns "cxzij", not repeating the "i"
I would appreciate a non-messy solution with an explanation I can apply to other general regex problems.
This is a one-liner solution:
String wordEnds = input.replaceAll(".*?(.)" + word + "(?:(?=(.)" + word + ")|(.).*?(?=$|." + word + "))", "$1$2$3");
This matches your edge case as a look ahead within a non-capturing group, then matches the usual (consuming) case.
Note that your requirements don't require iteration, only your question title assumes it's necessary, which it isn't.
Note also that to be absolutely safe, you should escape all characters in word in case any of them are special "regex" characters, so if you can't guarantee that, you need to use Pattern.quote(word) instead of word.
Here's a test of the usual case and the edge case, showing it works:
public static String wordEnds(String input, String word) {
    word = Pattern.quote(word); // add this line to be 100% safe
    return input.replaceAll(".*?(.)" + word + "(?:(?=(.)" + word + ")|(.).*?(?=$|." + word + "))", "$1$2$3");
}

public static void main(String[] args) {
    System.out.println(wordEnds("abcXY123XYijk", "XY"));
    System.out.println(wordEnds("abc1xyz1i1j", "1"));
}
Output:
c13i
cxziij
Use positive lookbehind and positive lookahead, which are zero-width assertions:
(?<=(.)|^)1(?=(.)|$)
(?<=(.)|^) looks for a character just before 1 and captures it in group 1. This is a zero-width assertion: it doesn't move forward in the input, it is just a test, and thus allows us to capture the value.
1 matches 1; you can replace it with any word.
(?=(.)|$) looks for a character after 1 and captures it in group 2.
Groups 1 and 2 then contain your values. Keep finding until the end of the input.
So this should look like
String s1 = "abcXY123XYiXYjk";
String s2 = java.util.regex.Pattern.quote("XY");
String s3 = "";
String r = "(?<=(.)|^)"+s2+"(?=(.)|$)";
Pattern p = Pattern.compile(r);
Matcher m = p.matcher(s1);
while(m.find()) s3 += m.group(1)+m.group(2);
//s3 now contains c13iij
works here
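One detail to watch with this approach: when the word sits at the very start or end of the input, the corresponding group is null, and concatenating it with += would append the text "null". A null-safe variant wrapped in a method might look like the sketch below; the class name WordEnds and the use of Pattern.DOTALL (so that a line break next to the word also counts as a character) are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordEnds {
    static String wordEnds(String str, String word) {
        // Group 1: the char just before the word, group 2: the char just after it.
        // Both sit inside lookarounds, so neither is consumed by the match.
        Pattern p = Pattern.compile(
                "(?<=(.)|^)" + Pattern.quote(word) + "(?=(.)|$)", Pattern.DOTALL);
        Matcher m = p.matcher(str);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            if (m.group(1) != null) out.append(m.group(1)); // null at the start of the input
            if (m.group(2) != null) out.append(m.group(2)); // null at the end of the input
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wordEnds("abcXY123XYijk", "XY")); // c13i
        System.out.println(wordEnds("XY1XY", "XY"));         // 11
        System.out.println(wordEnds("abc1xyz1i1j", "1"));    // cxziij
    }
}
```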
Use regex as follows:
String c = ""; // result; a is the input string and b is the word
Matcher m = Pattern.compile("(.|)" + Pattern.quote(b) + "(?=(.?))").matcher(a);
while (m.find()) c += m.group(1) + m.group(2);
Check this demo.
