I am using Java to process text using regular expressions. I am using the following regular expression
^[\([0-9a-zA-Z]+\)\s]+
to match one or more letters or numbers in parentheses one or more times. For instance, I would like to match
(aaa) (bb) (11) (AA) (iv)
or
(111) (aaaa) (i) (V)
I tested this regular expression on http://java-regex-tester.appspot.com/ and it is working. But when I use it in my code, the code does not compile. Here is my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Tester {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("^[\([0-9a-zA-Z]+\)\s]+");
String[] words = pattern.split("(a) (1) (c) (xii) (A) (12) (ii)");
String w = pattern.
for(String s:words){
System.out.println(s);
}
}
}
I tried to use \\ instead of \, but the regex gave different results than I expected: it matches only one group, like (aaa), not multiple groups like (aaa) (111) (ii).
Two questions:
How can I fix this regex and be able to match multiple groups?
How can I get the individual matches separately (like (aaa) alone and then (111) and so on). I tried pattern.split but did not work for me.
Firstly, you need to escape every backslash inside a Java string literal with another backslash. The regex engine will then see a single backslash. (E.g. write the word-character class \w as "\\w" inside quotation marks.)
Secondly, you have to finish the line that reads:
String w = pattern.
That line explains why it doesn't compile.
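To illustrate the first point, here is a minimal sketch (the class name EscapeDemo and the simplified single-group pattern are assumptions made for illustration). It shows that two backslashes in Java source become the single backslash that the regex engine sees:

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // In Java source "\\(" is the two-character string \( ,
    // which the regex engine reads as a literal opening parenthesis.
    static final String SOURCE = "\\([0-9a-zA-Z]+\\)";

    // True if the whole string is one parenthesised run of letters/digits.
    static boolean isParenGroup(String s) {
        return Pattern.matches(SOURCE, s);
    }

    public static void main(String[] args) {
        System.out.println(SOURCE);                // the pattern text as the engine sees it
        System.out.println(isParenGroup("(xii)")); // true
        System.out.println(isParenGroup("xii"));   // false
    }
}
```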
Here is my final solution to match the individual groups of letters/numbers in brackets that appear at the beginning of a line and ignore the rest
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Tester {
static ArrayList<String> listOfEnums;
public static void main(String[] args) {
listOfEnums = new ArrayList<String>();
Pattern pattern = Pattern.compile("^\\([0-9a-zA-Z]+\\)");
String p = "(a) (1) (c) (xii) (A) (12) (ii) and the good news (1)";
Matcher matcher = pattern.matcher(p);
boolean isMatch = matcher.find();
int index = 0;
//once you find a match, remove it and store it in the arrayList.
while (isMatch) {
String s = matcher.group();
System.out.println(s);
//Store it in an array
listOfEnums.add(s);
//Remove it from the beginning of the string.
p = p.substring(listOfEnums.get(index).length(), p.length()).trim();
matcher = pattern.matcher(p);
isMatch = matcher.find();
index++;
}
}
}
1) Your regex is incorrect. You want to match individual groups of letters / numbers in brackets, and the current regex will match only a single string of one or more such groups. I.e. it will match
(abc) (def) (123)
as a single group rather than three separate groups.
A better regex that matches a single group, up to its closing bracket, would be
\([0-9a-zA-Z]+\)
2) Java requires you to escape every backslash in a string literal with another backslash.
3) The split() method will not do what you want. It finds all matches in your string, throws them away, and returns an array of what is left over. You want to use matcher() instead:
Pattern pattern = Pattern.compile("\\([0-9a-zA-Z]+\\)");
Matcher matcher = pattern.matcher("(a) (1) (c) (xii) (A) (12) (ii)");
while (matcher.find()) {
System.out.println(matcher.group());
}
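As a self-contained sketch of the same idea, the loop can be wrapped in a helper that returns the groups as a list. The class and method names (ParenGroups, extract) are hypothetical, and the pattern assumes each group holds only ASCII letters and digits:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenGroups {
    // One parenthesised run of letters/digits, e.g. (xii)
    private static final Pattern GROUP = Pattern.compile("\\([0-9a-zA-Z]+\\)");

    // Collects every match of GROUP, in order of appearance.
    static List<String> extract(String input) {
        List<String> groups = new ArrayList<>();
        Matcher matcher = GROUP.matcher(input);
        while (matcher.find()) {
            groups.add(matcher.group());
        }
        return groups;
    }

    public static void main(String[] args) {
        String input = "(a) (1) (c) (xii)";
        System.out.println(extract(input));
        // For contrast: split() returns the text between the matches instead.
        System.out.println(Arrays.toString(GROUP.split(input)));
    }
}
```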
I have a string that contains one or more (comma-separated) values, surrounded by quotes and enclosed in parentheses. So it can be of the type os IN ('WIN', 'MAC', 'LNU') (for multiple values) or just os IN ('WIN') for a single value.
I need to extract the values in a List.
I have tried this regex, but it captures all the values into one single list element as one whole String as 'WIN', 'MAC', instead of two String values of WIN and MAC -
List<String> matchList = new ArrayList<>();
Pattern regex = Pattern.compile("\\((.+?)\\)");
Matcher regexMatcher = regex.matcher(processedFilterString);
while (regexMatcher.find()) {//Finds Matching Pattern in String
matchList.add(regexMatcher.group(1));//Fetching Group from String
}
Result:
Input: os IN ('WIN', 'MAC')
Output:
['WIN', 'MAC']
length: 1
In its current form, the regex matches one or more characters surrounded by parentheses and captures them in a group, which is probably why the result is just one string. How can I adapt it to capture each of the values separately?
Edit - Just adding some more details. The input string can have multiple IN clauses containing other criteria, such as id IN ('xxxxxx') AND os IN ('WIN', 'MAC'). Also, the length of the matched characters is not necessarily the same, so it could be - os IN ('WIN', 'MAC', 'LNUX').
You may try splitting the CSV string from the IN clause:
List<String> matchList = null;
Pattern regex = Pattern.compile("\\((.+?)\\)");
Matcher regexMatcher = regex.matcher(processedFilterString);
if (regexMatcher.find()) {
String match = regexMatcher.group(1).replaceAll("^'|'$", "");
String[] terms = match.split("'\\s*,\\s*'");
matchList = Arrays.stream(terms).collect(Collectors.toList());
}
Note that if your input string could contain multiple IN clauses, then the above would need to be modified to use a while loop.
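A possible shape for that while-loop variant is sketched below. The names InClauseValues and extractAll are made up for illustration, and the sketch assumes the quoted values contain no parentheses or embedded single quotes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InClauseValues {
    // Lazily captures whatever sits between one pair of parentheses.
    private static final Pattern IN_LIST = Pattern.compile("\\((.+?)\\)");

    // Returns one list of values per parenthesised IN list found in the filter.
    static List<List<String>> extractAll(String filter) {
        List<List<String>> result = new ArrayList<>();
        Matcher m = IN_LIST.matcher(filter);
        while (m.find()) {
            // Drop the outermost quotes, then split on the quote-comma-quote separators.
            String inner = m.group(1).trim().replaceAll("^'|'$", "");
            result.add(Arrays.asList(inner.split("'\\s*,\\s*'")));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extractAll("id IN ('xxxxxx') AND os IN ('WIN', 'MAC')"));
    }
}
```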
From what I see in the examples in your question, your regular expression needs to find strings of at least three upper-case letters enclosed in single quotes.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Solution {
public static void main(String[] args) {
String s = "os IN ('WIN', 'MAC', 'LNUX')";
Pattern pattern = Pattern.compile("'([A-Z]{3,})'");
Matcher matcher = pattern.matcher(s);
List<String> list = new ArrayList<>();
while (matcher.find()) {
list.add(matcher.group(1));
}
System.out.println(list);
}
}
Running the above code produces the following output:
[WIN, MAC, LNUX]
Objective: for a given term, I want to check whether that term occurs at the start of any word. For example, if the term is 't', then in the sentence:
"This is the difficult one Thats it"
I want it to return "true" because of :
This, the, Thats
so consider:
public class HelloWorld{
public static void main(String []args){
String term = "t";
String regex = "/\\b"+term+"[^\\b]*?\\b/gi";
String str = "This is the difficult one Thats it";
System.out.println(str.matches(regex));
}
}
I am getting following Exception:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal/unsupported escape sequence near index 7
/\bt[^\b]*?\b/gi
^
at java.util.regex.Pattern.error(Pattern.java:1924)
at java.util.regex.Pattern.escape(Pattern.java:2416)
at java.util.regex.Pattern.range(Pattern.java:2577)
at java.util.regex.Pattern.clazz(Pattern.java:2507)
at java.util.regex.Pattern.sequence(Pattern.java:2030)
at java.util.regex.Pattern.expr(Pattern.java:1964)
at java.util.regex.Pattern.compile(Pattern.java:1665)
at java.util.regex.Pattern.<init>(Pattern.java:1337)
at java.util.regex.Pattern.compile(Pattern.java:1022)
at java.util.regex.Pattern.matches(Pattern.java:1128)
at java.lang.String.matches(String.java:2063)
at HelloWorld.main(HelloWorld.java:8)
Also the following does not work:
import java.util.regex.*;
public class HelloWorld{
public static void main(String []args){
String term = "t";
String regex = "\\b"+term+"gi";
//String regex = ".";
System.out.println(regex);
String str = "This is the difficult one Thats it";
System.out.println(str.matches(regex));
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(str);
System.out.println(m.find());
}
}
Example:
{ This , one, Two, Those, Thanks }
for words This Two Those Thanks; result should be true.
Thanks
Since you're using the Java regex engine, you need to write the expressions in a way Java understands. That means removing the leading and trailing slashes and adding flags as (?flags), e.g. (?i), at the beginning of the expression.
Thus you'd need this instead:
String regex = "(?i)\\b"+term+".*?\\b";
Have a look at regular-expressions.info/java.html for more information. A comparison of supported features can be found here (just as an entry point): regular-expressions.info/refbasic.html
In Java we don't surround regex with / so instead of "/regex/flags" we just write regex. If you want to add flags you can do it with (?flags) syntax and place it in regex at position from which flag should apply, for instance a(?i)a will be able to find aa and aA but not Aa because flag was added after first a.
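The effect of the flag position can be checked with a small sketch like the following (the class InlineFlagDemo and its helper found are hypothetical names used only for this illustration):

```java
import java.util.regex.Pattern;

class InlineFlagDemo {
    // True if the regex can be found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // The flag only affects what comes after it in the pattern.
        System.out.println(found("a(?i)a", "aa")); // true
        System.out.println(found("a(?i)a", "aA")); // true
        System.out.println(found("a(?i)a", "Aa")); // false: the first a is still case-sensitive
        System.out.println(found("(?i)aa", "Aa")); // true: the flag covers the whole pattern
    }
}
```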
You can also compile your regex into Pattern like this
Pattern pattern = Pattern.compile(regex, flags);
where regex is a String (again, not enclosed in /) and flags is an integer built from constants of Pattern such as Pattern.DOTALL; when you need several flags, combine them, e.g. Pattern.CASE_INSENSITIVE|Pattern.MULTILINE.
The next thing that may confuse you is the matches method. Many people are misled by its name: they assume it checks whether some part of the string can be matched by the regex, but in reality it checks whether the entire string can be matched.
What you seem to want is a mechanism to test whether a regex can be found at least once in a string. In that case you may either
add .* at the start and end of your regex so that the characters which are not part of the element you want to find can also be matched by the regex engine, but this way matches must iterate over the entire string
use a Matcher object built from a Pattern (representing your regex) and call its find() method, which iterates until it finds a match for the regex or reaches the end of the string. I prefer this approach because it does not need to iterate over the entire string; it stops as soon as a match is found.
So your code could look like
String str = "This is the difficult one Thats it";
String term = "t";
Pattern pattern = Pattern.compile("\\b"+term, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.find());
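To make the difference between matches and find concrete, here is a short comparison sketch. The class and method names (MatchesVsFind, wholeStringMatches, foundSomewhere) are made up for illustration:

```java
import java.util.regex.Pattern;

class MatchesVsFind {
    // String.matches succeeds only if the regex covers the entire input.
    static boolean wholeStringMatches(String regex, String input) {
        return input.matches(regex);
    }

    // Matcher.find succeeds as soon as the regex matches anywhere in the input.
    static boolean foundSomewhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String str = "This is the difficult one Thats it";
        System.out.println(wholeStringMatches("(?i)\\bt", str));     // false
        System.out.println(wholeStringMatches("(?i).*\\bt.*", str)); // true
        System.out.println(foundSomewhere("(?i)\\bt", str));         // true
    }
}
```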
If your term could contain regex special characters but you want the regex engine to treat them as normal characters, you need to make sure they are escaped. To do this you can use the Pattern.quote method, which adds all necessary escapes for you, so instead of
Pattern pattern = Pattern.compile("\\b"+term, Pattern.CASE_INSENSITIVE);
for safety you should use
Pattern pattern = Pattern.compile("\\b"+Pattern.quote(term), Pattern.CASE_INSENSITIVE);
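A small sketch of why quoting matters, using the hypothetical term "th." (the class name QuotedTerm and both helper methods are assumptions for illustration). Unquoted, the dot matches any character; quoted, it matches only a literal dot:

```java
import java.util.regex.Pattern;

class QuotedTerm {
    // Treats the term as literal text at the start of a word.
    static boolean anyWordStartsWith(String term, String text) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    // Same check without quoting, so the term is interpreted as regex syntax.
    static boolean anyWordStartsWithUnquoted(String term, String text) {
        Pattern p = Pattern.compile("\\b" + term, Pattern.CASE_INSENSITIVE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(anyWordStartsWithUnquoted("th.", "This is the one")); // true: "." matched "i"
        System.out.println(anyWordStartsWith("th.", "This is the one"));         // false: no literal "th."
        System.out.println(anyWordStartsWith("th.", "see th. 5"));               // true
    }
}
```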
String regex = "(?i)\\b"+term;
In Java, the modifiers must be inserted between "(?" and ")" and there is a variant for turning them off again: "(?-" and ")".
For finding all words beginning with "T" or "t", you may want to use Matcher's find method repeatedly. If you just need the offset, Matcher's start method returns the offset.
If you need to match the full word, use
String regex = "(?i)\\b"+term + "\\w*";
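Calling find repeatedly, as described above, could look like the sketch below. The class WordsStartingWith and its find helper are hypothetical; Pattern.quote is added as a precaution and is not part of the pattern shown above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordsStartingWith {
    // Returns "offset:word" for every word that starts with the term (case-insensitive).
    static List<String> find(String term, String text) {
        Pattern p = Pattern.compile("(?i)\\b" + Pattern.quote(term) + "\\w*");
        Matcher m = p.matcher(text);
        List<String> hits = new ArrayList<>();
        while (m.find()) {
            hits.add(m.start() + ":" + m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // prints [0:This, 8:the, 26:Thats]
        System.out.println(find("t", "This is the difficult one Thats it"));
    }
}
```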
String str = "This is the difficult one Thats it";
String term = "t";
Pattern pattern = Pattern.compile("^" + Pattern.quote(term) + ".*", Pattern.CASE_INSENSITIVE);
String[] strings = str.split(" ");
for (String s : strings) {
if (pattern.matcher(s).matches()) {
System.out.println(s+"-->"+true);
} else {
System.out.println(s+"-->"+false);
}
}
I am trying to make a regex that grabs all occurrences of a word (let's just say chicken) that are not in brackets. So like
chicken
Would be selected but
[chicken]
Would not. Does anyone know how to do this?
String template = "[chicken]";
String pattern = "\\G(?<!\\[)(\\w+)(?!\\])";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(template);
while (m.find())
{
System.out.println(m.group());
}
It uses a combination of negative look-behind and negative look-aheads and boundary matchers.
(?<!\\[) //negative look behind
(?!\\]) //negative look ahead
(\\w+) //capture group for the word
\\G //is a boundary matcher for marking the end of the previous match
(please read the following edits for clarification)
EDIT 1:
If one needs to account for situations like:
"chicken [chicken] chicken [chicken]"
We can replace the regex with:
String regex = "(?<!\\[)\\b(\\w+)\\b(?!\\])";
EDIT 2:
If one also needs to account for situations like:
"[chicken"
"chicken]"
As in one still wants the "chicken", then you could use:
String pattern = "(?<!\\[)?\\b(\\w+)\\b(?!\\])|(?<!\\[)\\b(\\w+)\\b(?!\\])?";
Which essentially accounts for the two cases of having only one bracket on either side. It accomplishes this through the | which acts as an or, and by using ? after the look-ahead/behinds, where ? means 0 or 1 of the previous expression.
I guess you want something like:
final Pattern UNBRACKETED_WORD_PAT = Pattern.compile("(?<!\\[)\\b\\w+\\b(?!])");
private List<String> findAllUnbracketedWords(final String s) {
final List<String> ret = new ArrayList<String>();
final Matcher m = UNBRACKETED_WORD_PAT.matcher(s);
while (m.find()) {
ret.add(m.group());
}
return Collections.unmodifiableList(ret);
}
Use this:
/(?<![\[\w])\w+(?![\w\]])/
i.e., consecutive word characters with no square bracket or word character before or after.
This needs to check both left and right for both a square bracket and a word character, else for your input of [chicken] it would simply return
hicke
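The difference can be reproduced in Java with a sketch like this one (UnbracketedWords and findAll are hypothetical names; the two patterns are the bracket-only lookarounds and the version above written as Java string literals):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnbracketedWords {
    // Collects every match of the given regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Lookarounds that only exclude brackets let the engine trim one letter off each end.
        System.out.println(findAll("(?<!\\[)\\w+(?!\\])", "[chicken]"));                   // [hicke]
        // Also excluding word characters on both sides closes that loophole.
        System.out.println(findAll("(?<![\\[\\w])\\w+(?![\\w\\]])", "[chicken] chicken")); // [chicken]
    }
}
```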
Without look around:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class MatchingTest
{
private static String x = "pig [cow] chicken bull] [grain";
public static void main(String[] args)
{
Pattern p = Pattern.compile("(\\[?)(\\w+)(\\]?)");
Matcher m = p.matcher(x);
while(m.find())
{
String firstBracket = m.group(1);
String word = m.group(2);
String lastBracket = m.group(3);
if ("".equals(firstBracket) && "".equals(lastBracket))
{
System.out.println(word);
}
}
}
}
Output:
pig
chicken
A bit more verbose, sure, but I find it more readable and easier to understand. Certainly simpler than a huge regular expression trying to handle all possible combinations of brackets.
Note that this won't filter out input like [fence tree grass]; it will indicate that tree is a match. You cannot skip tree in that without a parser. Hopefully, this is not a case you need to handle.
I want to use Pattern and Matcher to return the following string as multiple variables.
ArrayList <Pattern> pArray = new ArrayList <Pattern>();
pArray.add(Pattern.compile("\\[[0-9]{2}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}\\]"));
pArray.add(Pattern.compile("\\[\\d{1,5}\\]"));
pArray.add(Pattern.compile("\\[[a-zA-Z[^#0-9]]+\\]"));
pArray.add(Pattern.compile("\\[#.+\\]"));
pArray.add(Pattern.compile("\\[[0-9]{10}\\]"));
Matcher iMatcher;
String infoString = "[03/12/13 10:00][30][John Smith][5554215445][#Comment]";
for (int i = 0 ; i < pArray.size() ; i++)
{
//out.println(pArray.get(i).toString());
iMatcher = pArray.get(i).matcher(infoString);
while (iMatcher.find())
{
String found = iMatcher.group();
out.println(found.substring(1, found.length()-1));
}
}
}
the program outputs:
[03/12/13 10:00]
[30]
[John Smith]
[#Comment]
[5554215445]
The only thing I need is to have the program not print the brackets and the # character.
I can easily avoid printing the brackets using substrings inside the loop, but I cannot avoid the # character. # is only a comment identifier in the string.
Can this be done inside the loop?
How about this?
public static void main(String[] args) {
String infoString = "[03/12/13 10:00][30][John Smith][5554215445][#Comment]";
final Pattern pattern = Pattern.compile("\\[#?(.+?)\\]");
final Matcher matcher = pattern.matcher(infoString);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
}
You just need to make the .+ non-greedy so that it matches everything between one pair of square brackets. We then use a match group, written (pattern), to grab what we want rather than using the whole match. The #? matches a hash before the match group so that it doesn't get into the group.
The match group is retrieved using matcher.group(1).
Output:
03/12/13 10:00
30
John Smith
5554215445
Comment
Use lookarounds, i.e. replace every \\[ in your regex with a positive lookbehind:
(?<=\\[)
then replace every \\] in your regex with a positive lookahead:
(?=\\])
and finally replace \\[# in your regex with a positive lookbehind:
(?<=\\[#)
(?<=\\[#)
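Combined into a single pattern, the lookaround idea might look like the sketch below. This is one possible arrangement, not the five separate patterns from the question: the class BracketFields, the method extract, and the simplifying assumption that field contents contain neither ] nor # are all mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketFields {
    // Start right after "[#" or "[", take everything up to (not including) the closing "]".
    // Because the brackets and the hash sit in lookarounds, group() excludes them.
    private static final Pattern FIELD =
            Pattern.compile("(?:(?<=\\[#)|(?<=\\[))[^\\]#]+(?=\\])");

    static List<String> extract(String info) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(info);
        while (m.find()) {
            fields.add(m.group());
        }
        return fields;
    }

    public static void main(String[] args) {
        String infoString = "[03/12/13 10:00][30][John Smith][5554215445][#Comment]";
        for (String field : extract(infoString)) {
            System.out.println(field);
        }
    }
}
```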
I'm looking to split space-delimited strings into a series of search terms. However, in doing so I'd like to ignore spaces within parentheses. For example, I'd like to be able to split the string
a, b, c, search:(1, 2, 3), d
into
[[a] [b] [c] [search:(1, 2, 3)] [d]]
Does anyone know how to do this using regular expressions in Java?
Thanks!
This isn't a full regex, but it'll get you there:
(\([^)]*\)|\S)*
This uses a common trick, treating one long string of characters as if it were a single character. On the right side we match non-whitespace characters with \S. On the left side we match a balanced set of parentheses with anything in between.
The end result is that a balanced set of parentheses is treated as if it were a single character, and so the regex as a whole matches a single word, where a word can contain these parenthesized groups.
(Note that because this is a regular expression it can't handle nested parentheses. One set of parentheses is the limit.)
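A rough Java sketch of this approach follows. It uses + instead of * so that empty matches are skipped, and it strips the trailing comma from each token as an extra step that is not part of the regex above; the names ParenAwareTokens and tokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenAwareTokens {
    // A "word" is a run of non-whitespace characters and/or whole (...) groups.
    private static final Pattern WORD = Pattern.compile("(?:\\([^)]*\\)|\\S)+");

    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            // Extra step for this example: drop the comma that trails each term.
            result.add(m.group().replaceAll(",$", ""));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [a, b, c, search:(1, 2, 3), d]
        System.out.println(tokens("a, b, c, search:(1, 2, 3), d"));
    }
}
```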
This problem has another solution that wasn't mentioned, so I'll post it here for completeness. This situation is similar to the question "regex-match a pattern, excluding..." (see the reference at the end).
We can solve this with a beautifully-simple regex:
\([^)]*\)|(\s*,\s*)
The left side of the alternation | matches complete (parentheses). We will ignore these matches. The right side matches and captures commas and surrounding spaces to Group 1, and we know they are the right commas because they were not matched by the expression on the left. We will replace these commas by something distinctive, then split.
This program shows how to use the regex:
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "a, b, c, search:(1, 2, 3), d";
Pattern regex = Pattern.compile("\\([^)]*\\)|(\\s*,\\s*)");
Matcher m = regex.matcher(subject);
StringBuffer b= new StringBuffer();
while (m.find()) {
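if(m.group(1) != null) m.appendReplacement(b, "SplitHere");
// quoteReplacement keeps any $ or \ in the matched text from being read as a group reference
else m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));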
}
m.appendTail(b);
String replaced = b.toString();
String[] splits = replaced.split("SplitHere");
for (String split : splits) System.out.println(split);
} // end main
} // end Program
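As a variation on the program above, the same regex can drive the split directly by remembering where each Group 1 match starts and ends. This avoids the placeholder text, which could collide with real input. It is a sketch under the same one-level-of-parentheses assumption; ParenAwareSplit and split are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParenAwareSplit {
    // Left branch consumes (...) so its commas are skipped; right branch captures real separators.
    private static final Pattern SEPARATOR = Pattern.compile("\\([^)]*\\)|(\\s*,\\s*)");

    static List<String> split(String subject) {
        List<String> parts = new ArrayList<>();
        Matcher m = SEPARATOR.matcher(subject);
        int start = 0;
        while (m.find()) {
            if (m.group(1) != null) {       // a separator outside parentheses
                parts.add(subject.substring(start, m.start()));
                start = m.end();
            }
        }
        parts.add(subject.substring(start)); // text after the last separator
        return parts;
    }

    public static void main(String[] args) {
        for (String part : split("a, b, c, search:(1, 2, 3), d")) {
            System.out.println(part);
        }
    }
}
```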
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...