I need some help to model this regular expression. I think it'll be easier with an example. I need a regular expression that matches a comma, but only if it's not inside this structure: "( )", like this:
,a,b,c,d,"("x","y",z)",e,f,g,
Then the first five and the last four commas should match the expression, the two between xyz and inside the ( ) section shouldn't.
I tried a lot of combinations, but regular expressions are still a little foggy for me.
I want to use it with the split method in Java. The example is short, but the real input can be much longer and have more than one section between "( and )". The split method receives an expression, and any text that matches it (in this case the comma) is treated as the separator.
So, I want to do something like this:
String keys[] = row.split(expr);
System.out.println(keys[0]); // print a
System.out.println(keys[1]); // print b
System.out.println(keys[2]); // print c
System.out.println(keys[3]); // print d
System.out.println(keys[4]); // print "("x","y",z)"
System.out.println(keys[5]); // print e
System.out.println(keys[6]); // print f
System.out.println(keys[7]); // print g
Thanks!
You can do this with a negative lookahead. Here's a slightly simplified problem to illustrate the idea:
String text = "a;b;c;d;<x;y;z>;e;f;g;<p;q;r;s>;h;i;j";
String[] parts = text.split(";(?![^<>]*>)");
System.out.println(java.util.Arrays.toString(parts));
// _ _ _ _ _______ _ _ _ _________ _ _ _
// [a, b, c, d, <x;y;z>, e, f, g, <p;q;r;s>, h, i, j]
Note that instead of ,, the delimiter is now ;, and instead of "( and "), the parentheses are simply < and >, but the idea still works.
On the pattern
The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.
The * repetition specifier can be used to match "zero-or-more times" of the preceding pattern.
The (?!…) is a negative lookahead; it can be used to assert that a certain pattern DOES NOT match, looking ahead (i.e. to the right) of the current position.
The pattern [^<>]*> matches a sequence (possibly empty) of everything except parentheses, finally followed by a parenthesis which is of the closing type.
Putting all of the above together, we get ;(?![^<>]*>), which matches a ;, but only if we can't see the closing parenthesis as the first parenthesis to its right, because witnessing such phenomenon would only mean that the ; is "inside" the parentheses.
This technique, with some modifications, can be adapted to the original problem. Remember to escape regex metacharacters ( and ) as necessary, and of course " as well as \ in a Java string literal must be escaped by preceding with a \.
You can also make the * possessive to try to improve performance, i.e. ;(?![^<>]*+>).
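As a minimal sketch of that adaptation, the lookahead can be applied to the row from the question. The class and method names here are made up for illustration, and the sketch assumes the parentheses are balanced and never nested:

```java
import java.util.Arrays;

class SplitOutsideParens {
    // A comma that is NOT followed by "non-parenthesis characters, then ')'",
    // i.e. the first parenthesis to its right (if any) is not a closing one.
    static final String COMMA_OUTSIDE_PARENS = ",(?![^()]*\\))";

    static String[] splitRow(String row) {
        return row.split(COMMA_OUTSIDE_PARENS);
    }

    public static void main(String[] args) {
        String row = ",a,b,c,d,\"(\"x\",\"y\",z)\",e,f,g,";
        // prints [, a, b, c, d, "("x","y",z)", e, f, g]
        System.out.println(Arrays.toString(splitRow(row)));
    }
}
```

Because the row starts with a comma, split returns an empty string at index 0, so "a" ends up at index 1. The trailing empty string is dropped by split.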
References
regular-expressions.info/Character class, Repetition, Lookarounds, Possessive
Try this one:
(?![^(]*\)),
It worked for me in my testing, grabbed all commas not inside parentheses.
Edit: Gopi pointed out the need to escape the backslashes in Java:
(?![^(]*\\)),
Edit: Alan Moore pointed out some unnecessary complexity. Fixed.
If the parens are paired correctly and cannot be nested, you can split the text first at parens, then process the chunks.
List<String> result = new ArrayList<String>();
String[] chunks = text.split("[()]");
for (int i = 0; i < chunks.length; i++) {
if ((i % 2) == 0) {
String[] atoms = chunks[i].split(",");
for (int j = 0; j < atoms.length; j++)
result.add(atoms[j]);
}
else
result.add(chunks[i]);
}
Well,
After some tests I found an answer that does what I need for now. At the moment, all items inside the "( ... )" block are also inside "", as in "("a", "b", "c")", so the regex ((?<!\"),)|(,(?!\")) works great for what I want!
But I'm still looking for one that can find the commas even if there are no "" around the inner terms.
Thanks for the help, guys.
This should do what you want:
(".*")|([a-z])
I didn't check it in Java, but if you test it with http://www.fileformat.info/tool/regex.htm
the groups $1 and $2 contain the right values, so they match and you should get what you want.
It gets a little trickier if you have values more complex than a-z between the commas.
If I understand split correctly, don't use it; just fill your array with the backreference $0, which holds the values you are looking for.
Maybe a match function would be a better way, since working with the values directly gives you this really simple regexp. The other solutions I've seen so far are very good, no question about that, but they are really complicated, and in two weeks you won't remember exactly what the regexp did.
Inverting the problem often makes it simpler.
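A minimal sketch of that inverted approach in Java, matching the fields instead of the separators. The pattern used here is a different one from the (".*")|([a-z]) above: it is an illustrative assumption that a field is either a whole "( ... )" block without a ')' inside it, or a run of characters containing no comma. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchFields {
    // Either a quoted "( ... )" block, or one or more non-comma characters.
    static final Pattern FIELD = Pattern.compile("\"\\([^)]*\\)\"|[^,]+");

    static List<String> fields(String row) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(row);
        while (m.find()) {
            out.add(m.group()); // group() is the whole match ($0)
        }
        return out;
    }

    public static void main(String[] args) {
        String row = ",a,b,c,d,\"(\"x\",\"y\",z)\",e,f,g,";
        // prints [a, b, c, d, "("x","y",z)", e, f, g]
        System.out.println(fields(row));
    }
}
```

Note that this sketch silently skips empty fields (two commas in a row), which may or may not be what you want.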
I had the same issue. I chose Adam Schmideg's answer and improved it.
I had to deal with these 3 strings, for example:
France (Grenoble, Lyon), Germany (Berlin, Munich)
Italy, Suede, Belgium, Portugal
France, Italy (Torino), Spain (Bercelona, Madrid), Austria
The idea was to have :
France (Grenoble, Lyon)
or Germany (Berlin, Munich)
Italy, Suede, Belgium, Portugal
France, Italy (Torino), Spain (Bercelona, Madrid), Austria
I chose not to use a regex because I wanted to be 100% sure of what I was doing and that it would work in every case.
String[] chunks = input.split("[()]");
for (int i = 0; i < chunks.length; i++) {
if ((i % 2) != 0) {
chunks[i] = "("+chunks[i].replaceAll(",", ";")+")";
}
}
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < chunks.length; i++) {
buffer.append(chunks[i]);
}
String s = buffer.toString();
String[] output = s.split(",");
Related
I found a brilliant RegEx to extract the part of a camelCase or TitleCase expression.
(?<!^)(?=[A-Z])
It works as expected:
value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
For example with Java:
String s = "loremIpsum";
String[] words = s.split("(?<!^)(?=[A-Z])");
// words is now {"lorem", "Ipsum"}
My problem is that it does not work in some cases:
Case 1: VALUE -> V / A / L / U / E
Case 2: eclipseRCPExt -> eclipse / R / C / P / Ext
To my mind, the result should be:
Case 1: VALUE
Case 2: eclipse / RCP / Ext
In other words, given n uppercase chars:
if the n chars are followed by lower case chars, the groups should be: (n-1 chars) / (n-th char + lower chars)
if the n chars are at the end, the group should be: (n chars).
Any idea on how to improve this regex?
The following regex works for all of the above examples:
public static void main(String[] args)
{
for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
System.out.println(w);
}
}
It works by forcing the negative lookbehind to not only ignore matches at the start of the string, but to also ignore matches where a capital letter is preceded by another capital letter. This handles cases like "VALUE".
The first part of the regex on its own fails on "eclipseRCPExt" by failing to split between "RCP" and "Ext". This is the purpose of the second clause: (?<!^)(?=[A-Z][a-z]). This clause allows a split before every capital letter that is followed by a lowercase letter, except at the start of the string.
It seems you are making this more complicated than it needs to be. For camelCase, the split location is simply anywhere an uppercase letter immediately follows a lowercase letter:
(?<=[a-z])(?=[A-Z])
Here is how this regex splits your example data:
value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
VALUE -> VALUE
eclipseRCPExt -> eclipse / RCPExt
The only difference from your desired output is with the eclipseRCPExt, which I would argue is correctly split here.
Addendum - Improved version
Note: This answer recently got an upvote and I realized that there is a better way...
By adding a second alternative to the above regex, all of the OP's test cases are correctly split.
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])
Here is how the improved regex splits the example data:
value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
VALUE -> VALUE
eclipseRCPExt -> eclipse / RCP / Ext
Edit:20130824 Added improved version to handle RCPExt -> RCP / Ext case.
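For reference, a small sketch that runs the improved regex through String.split on the question's test cases. The class and method names are made up for illustration:

```java
import java.util.Arrays;

class CamelCaseSplit {
    // Split between lower and upper case, or before the last capital
    // of an uppercase run when that capital starts a new word.
    static final String BOUNDARY =
            "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    static String[] split(String s) {
        return s.split(BOUNDARY);
    }

    public static void main(String[] args) {
        String[] samples = {"value", "camelValue", "TitleValue", "VALUE", "eclipseRCPExt"};
        for (String s : samples) {
            System.out.println(s + " -> " + Arrays.toString(split(s)));
        }
        // the last line printed is: eclipseRCPExt -> [eclipse, RCP, Ext]
    }
}
```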
Another solution would be to use a dedicated method in commons-lang: StringUtils#splitByCharacterTypeCamelCase
I couldn't get aix's solution to work (and it doesn't work on RegExr either), so I came up with my own that I've tested and seems to do exactly what you're looking for:
((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))
and here's an example of using it:
; Regex Breakdown: This will match against each word in Camel and Pascal case strings, while properly handling acronyms.
; (^[a-z]+) Match against any lower-case letters at the start of the string.
; ([A-Z]{1}[a-z]+) Match against Title case words (one upper case followed by lower case letters).
; ([A-Z]+(?=([A-Z][a-z])|($))) Match against multiple consecutive upper-case letters, leaving the last upper case letter out of the match if it is followed by lower case letters, and including it if it's followed by the end of the string.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))", "$1 ")
newString := Trim(newString)
Here I'm separating each word with a space, so here are some examples of how the string is transformed:
ThisIsATitleCASEString => This Is A Title CASE String
andThisOneIsCamelCASE => and This One Is Camel CASE
This solution above does what the original post asks for, but I also needed a regex to find camel and pascal strings that included numbers, so I also came up with this variation to include numbers:
((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))
and an example of using it:
; Regex Breakdown: This will match against each word in Camel and Pascal case strings, while properly handling acronyms and including numbers.
; (^[a-z]+) Match against any lower-case letters at the start of the string.
; ([0-9]+) Match against one or more consecutive numbers (anywhere in the string, including at the start).
; ([A-Z]{1}[a-z]+) Match against Title case words (one upper case followed by lower case letters).
; ([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))) Match against multiple consecutive upper-case letters, leaving the last upper case letter out of the match if it is followed by lower case letters, and including it if it's followed by the end of the string or a number.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))", "$1 ")
newString := Trim(newString)
And here are some examples of how a string with numbers is transformed with this regex:
myVariable123 => my Variable 123
my2Variables => my 2 Variables
The3rdVariableIsHere => The 3 rdVariable Is Here
12345NumsAtTheStartIncludedToo => 12345 Nums At The Start Included Too
To handle more letters than just A-Z:
s.split("(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})");
Either:
Split after any lowercase letter, that is followed by uppercase letter.
E.g parseXML -> parse, XML.
or
Split after any letter, that is followed by upper case letter and lowercase letter.
E.g. XMLParser -> XML, Parser.
In more readable form:
import java.util.regex.Pattern;
import static java.util.stream.Collectors.joining;

public class SplitCamelCaseTest {
static String BETWEEN_LOWER_AND_UPPER = "(?<=\\p{Ll})(?=\\p{Lu})";
static String BEFORE_UPPER_AND_LOWER = "(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";
static Pattern SPLIT_CAMEL_CASE = Pattern.compile(
BETWEEN_LOWER_AND_UPPER +"|"+ BEFORE_UPPER_AND_LOWER
);
public static String splitCamelCase(String s) {
return SPLIT_CAMEL_CASE.splitAsStream(s)
.collect(joining(" "));
}
@Test
public void testSplitCamelCase() {
assertEquals("Camel Case", splitCamelCase("CamelCase"));
assertEquals("lorem Ipsum", splitCamelCase("loremIpsum"));
assertEquals("XML Parser", splitCamelCase("XMLParser"));
assertEquals("eclipse RCP Ext", splitCamelCase("eclipseRCPExt"));
assertEquals("VALUE", splitCamelCase("VALUE"));
}
}
Brief
Both top answers here provide code using positive lookbehinds, which are not supported by all regex flavours. The regex below will capture both PascalCase and camelCase and can be used in multiple languages.
Note: I do realize this question is regarding Java, however, I also see multiple mentions of this post in other questions tagged for different languages, as well as some comments on this question for the same.
Code
See this regex in use here
([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)
Results
Sample Input
eclipseRCPExt
SomethingIsWrittenHere
TEXTIsWrittenHERE
VALUE
loremIpsum
Sample Output
eclipse
RCP
Ext
Something
Is
Written
Here
TEXT
Is
Written
HERE
VALUE
lorem
Ipsum
Explanation
Match one or more uppercase alpha character [A-Z]+
Or match zero or one uppercase alpha character [A-Z]?, followed by one or more lowercase alpha characters [a-z]+
Ensure what follows is an uppercase alpha character [A-Z] or word boundary character \b
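Since the question is about Java, here is a hedged sketch of how this matching regex could be driven from a Matcher loop rather than split. The class and method names are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CamelCaseWords {
    // An uppercase run, or an optional capital followed by lowercase letters,
    // which must be followed by another capital or a word boundary.
    static final Pattern WORD =
            Pattern.compile("([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\\b)");

    static List<String> words(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("eclipseRCPExt"));     // [eclipse, RCP, Ext]
        System.out.println(words("TEXTIsWrittenHERE")); // [TEXT, Is, Written, HERE]
    }
}
```

The uppercase alternative is greedy, so for "RCPExt" it first grabs "RCPE", fails the lookahead on "x", and backtracks to "RCP".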
You can use StringUtils.splitByCharacterTypeCamelCase("loremIpsum") from Apache Commons Lang.
You can use the expression below for Java:
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?=[A-Z][a-z])|(?<=\\d)(?=\\D)|(?=\\d)(?<=\\D)
Instead of looking for separators that aren't there you might also consider finding the name components (those are certainly there):
String test = "_eclipse福福RCPExt";
Pattern componentPattern = Pattern.compile("_? (\\p{Upper}?\\p{Lower}+ | (?:\\p{Upper}(?!\\p{Lower}))+ \\p{Digit}*)", Pattern.COMMENTS);
Matcher componentMatcher = componentPattern.matcher(test);
List<String> components = new LinkedList<>();
int endOfLastMatch = 0;
while (componentMatcher.find()) {
// matches should be consecutive
if (componentMatcher.start() != endOfLastMatch) {
// do something horrible if you don't want garbage in between
// we're lenient though, any Chinese characters are lucky and get through as group
String startOrInBetween = test.substring(endOfLastMatch, componentMatcher.start());
components.add(startOrInBetween);
}
components.add(componentMatcher.group(1));
endOfLastMatch = componentMatcher.end();
}
if (endOfLastMatch != test.length()) {
// the last find() failed, so calling start() here would throw IllegalStateException
String end = test.substring(endOfLastMatch);
components.add(end);
}
System.out.println(components);
This outputs [eclipse, 福福, RCP, Ext]. Conversion to an array is of course simple.
I can confirm that the regex string ([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b) given by ctwheels, above, works with the Microsoft flavour of regex.
I would also like to suggest the following alternative, based on ctwheels' regex, which handles numeric characters: ([A-Z0-9]+|[A-Z]?[a-z]+)(?=[A-Z0-9]|\b).
This is able to split strings such as:
DrivingB2BTradeIn2019Onwards
to
Driving B2B Trade In 2019 Onwards
A JavaScript Solution
/**
* howToDoThis ===> ["", "how", "To", "Do", "This"]
* @param word word to be split
*/
export const splitCamelCaseWords = (word: string) => {
if (typeof word !== 'string') return [];
return word.replace(/([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)/g, '!$&').split('!');
};
I need a regular expression that will match groups of characters in a string. Here's an example string:
qwwwwwwwwweeeeerrtyyyyyqqqqwEErTTT
It should match
(match group) "result"
(1) "q"
(2) "wwwwwwwww"
(3) "eeeee"
(4) "rr"
(5) "t"
(6) "yyyyy"
(7) "qqqq"
(8) "w"
(9) "EE"
(10) "r"
(11) "TTT"
after doing some research, this is the best I could come up with
/(.)(\1*)/g
The problem I'm having is that the only way to use the \1 back-reference is to capture the character first. If I could reference the result of a non capturing group I could solve this problem but after researching I don't think it's possible.
How about /((.)(\2*))/g? That way, you match the group as a whole (I'm assuming that that's what you want, and that's what's lacking from the solution you found).
Looks like you need to use a Matcher in a loop:
Pattern p = Pattern.compile("((.)\\2*)");
Matcher m = p.matcher("qwwwwwwwwweeeeerrtyyyyyqqqqwEErTTT");
while (m.find()) {
System.out.println(m.group(1));
}
Outputs:
q
wwwwwwwww
eeeee
rr
t
yyyyy
qqqq
w
EE
r
TTT
Assuming what @cruncher said as a premise is true: "we want to catch repeating letter groups without knowing beforehand which letter should be repeating" then:
/((a*?+)|(b*?+)|(c*?+)|(d*?+)|(e*?+)|(f*?+)|(g*?+)|(h*?+))/
The above RegEx should allow the capture of repeating letter groups without hardcoding a particular order in which they would occur.
The ?+ is a reluctant possessive quantifier which helps us not waste RAM by not saving previously valid backtracking cases if the current case is valid.
Since you did tag java, I'll give an alternative non-regex solution(I believe in requirements being the end product, not the method by which you get there).
String repeat = "";
char c = '\0'; // placeholder value; an empty char literal ('') does not compile
for(int i = 0 ; i < s.length() ; i++) {
if(s.charAt(i) == c) {
repeat += c;
} else {
if(!repeat.isEmpty())
doSomething(repeat); //add to an array if you want
c = s.charAt(i);
repeat = "" + c;
}
}
if(!repeat.isEmpty())
doSomething(repeat);
I'm using this regex:
([\w\s]+)(=|!=)([\w\s]+)( (or|and) ([\w\s]+)(=|!=)([\w\s]+))*
to match a string such as this: i= 2 or i =3 and k!=4
When I try to extract values using m.group(index), I get:
(i, =, 2, **and k!=4**, and, k, ,!=, 4).
Expected output: (i, =, 2, or, i, =, 3, and, k , !=, 4)
How do I extract the values correctly?
P.S. m.matches() returns true.
You are trying to match an expression with a regexp. You might want to use a parser instead, because this regexp (once you have it) can't be extended further, whereas a parser can be extended at any time.
For example, consider using ANTLR (ANTLR: Is there a simple example?).
This is because your fourth set of parens (the one that you use for repeating expressions) is what's confusing you. Try using a non-capturing group:
([\w\s]+)(=|!=)([\w\s]+)(?: (or|and) ([\w\s]+)(=|!=)([\w\s]+))*
Description
Why not simplify your expression to match exactly what you're looking for?
!?=|(?:or|and)|\b(?:(?!or|and)[\w\s])+\b
Example
Live Demo hover over the blue bubbles in the text area to see exactly what is matched
Sample Text
i= 2 or i =1234 and k!=4
Matches Found
[0][0] = i
[1][0] = =
[2][0] = 2
[3][0] = or
[4][0] = i
[5][0] = =
[6][0] = 1234
[7][0] = and
[8][0] = k
[9][0] = !=
[10][0] = 4
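In Java, a capturing group that sits inside a repetition only keeps the text of its last iteration, so pulling tokens out one at a time with find() is a common way around that. The sketch below uses a deliberately simpler token pattern than the one above (an assumption for illustration: operators are only = and !=, and everything else is a run of word characters); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConditionTokens {
    // "!=" is listed before "=" so the longer operator wins;
    // \w+ covers identifiers, numbers and the keywords "or" / "and".
    static final Pattern TOKEN = Pattern.compile("!=|=|\\w+");

    static List<String> tokens(String expr) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(expr);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [i, =, 2, or, i, =, 3, and, k, !=, 4]
        System.out.println(tokens("i= 2 or i =3 and k!=4"));
    }
}
```

Unlike m.matches() with the full pattern, this does not validate the overall shape of the expression; it only tokenizes it.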
Everything in parentheses makes a capturing group, which you can later access by index. But you can make a group you do not need non-capturing with (?: ... ); it is then not counted by Matcher.group(int).
I need to extract substrings from a string:
Given string: "< If( ( h == v ) ): { [ < j = (i - f) ;>, < k = (g + t) ;> ] }>"
I need two substrings: "j = (i - f)" and "k = (g + t)".
For this I tried using a regex pattern. Here's my code:
Pattern pattern = Pattern.compile("[<*;>]");
Matcher matcher = pattern.matcher(out.get(i).toString());
while (matcher.find())
{
B2.add(matcher.group());
}
out.get(i).toString() is my input string. B2 is an ArrayList which will contain the two extracted substrings.
But, after running the above code, the output I am getting is : [<, <, ;, >, <, ;, >, >].
My pattern is not working! Your help is very much appreciated.
Thanks in advance!
You can use the expression <([^<]+);>.
This will match things between < and ;>
Pattern pattern = Pattern.compile("<([^<]+);>");
Matcher matcher = pattern.matcher(out.get(i).toString());
while (matcher.find())
{
B2.add(matcher.group(1));
}
You can see the results on regexplanet: http://fiddle.re/5rty6
Your [ and ] are causing you problems. Those symbols mean: "match one among the symbols inside of these". If you remove them, you'll get better results. You'll also have to escape your pointy brackets when you do that.
The next step will be to capture the groups. You normally use ( and ) for that.
You'll also have to worry about nasty artifacts such as that < at the beginning of the string, which will mess with your regex. In order to deal with that, you'll need to exclude those from your regex.
You might end up with
"\<([^<>]*?)\>"
as your regex. Be sure to check the specific java documentation and to escape your \ for a final result of
"\\<([^<>]*?)\\>"
If you want to nest other < and > inside your pointy brackets, regex has a lot of trouble with that kind of thing, and maybe you should try a different method.
Here's a sample regex
I have a regular expression that I use to find matches of a list of comma-separated words between <> inside a string, like "Hello <a1> sqjsjqk <b1,b2> dsjkfjkdsf <c1,c2,c3> ffsd" in the example.
I want to use capturing groups to retain each word between the angle brackets:
Here is my expression: < (\w+) (?: ,(\w+) )* > (spaces are added for readability but are not part of the pattern)
Parentheses create capturing groups; (?: ) creates a non-capturing group, because I don't want to retain the comma.
Here is my test code:
@Test
public void test() {
String patternString = "<(\\w+)(?:,(\\w+))*>";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher("Hello <a1> sqjsjqk <b1,b2> dsjkfjkdsf <c1,c2,c3> ffsd");
while(matcher.find()) {
System.out.println("== Match ==");
MatchResult matchResult = matcher.toMatchResult();
for(int i = 0; i < matchResult.groupCount(); i++) {
System.out.println(" " + matchResult.group(i + 1));
}
}
}
This is the output produced:
== Match ==
a1
null
== Match ==
b1
b2
== Match ==
c1
c3
And here is what I wanted:
== Match ==
a1
== Match ==
b1
b2
== Match ==
c1
c2
c3
From this I understand that there are exactly as many groups as there are capturing groups in my expression, but this is not what I want, because I need all the substrings that were matched by \w+.
Is there any chance to get what I want with a single regexp, or should I finish the job with split(","), trim(), etc.?
As far as I know .NET has the only regex engine out there, that can return multiple captures for a single capturing group. So what you are asking for is not possible in Java (at least not the way you asked for).
In your case this problem can however be solved to a certain extent. If you can be sure that there will never be an unmatched closing >, you can make the stuff you want to capture the full match, and require the correct position through a lookahead:
"\\w+(?=(?:,\\w+)*>)"
This can never match "words" outside of <...>, because they cannot get past the opening < to match the closing >. Of course that makes it hard to distinguish between elements from different sets of <...>.
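A minimal sketch of the lookahead variant applied to the question's input. The class and method names are made up for illustration, and as noted above it assumes there is never an unmatched closing >:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadWords {
    // A word that is followed by zero or more ",word" items and then '>'.
    static final Pattern WORD_BEFORE_CLOSE =
            Pattern.compile("\\w+(?=(?:,\\w+)*>)");

    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD_BEFORE_CLOSE.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hello <a1> sqjsjqk <b1,b2> dsjkfjkdsf <c1,c2,c3> ffsd";
        // prints [a1, b1, b2, c1, c2, c3]
        System.out.println(words(text));
    }
}
```

The result is one flat list, which illustrates the drawback mentioned above: the words are no longer grouped by the <...> they came from.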
Alternatively (and I suppose that is even better, because it's safer, and more readable), go for a two-step algorithm. First match
"<([\\w,]*)>"
Then split every result's first capture at ,.
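The two-step algorithm could look roughly like this; it is a sketch with hypothetical class and method names that keeps the words grouped per <...> section:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepWords {
    // Step 1: capture everything between '<' and '>' made of word chars and commas.
    static final Pattern BRACKETED = Pattern.compile("<([\\w,]*)>");

    static List<List<String>> groups(String text) {
        List<List<String>> out = new ArrayList<>();
        Matcher m = BRACKETED.matcher(text);
        while (m.find()) {
            // Step 2: split the captured list at the commas.
            out.add(Arrays.asList(m.group(1).split(",")));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hello <a1> sqjsjqk <b1,b2> dsjkfjkdsf <c1,c2,c3> ffsd";
        // prints [[a1], [b1, b2], [c1, c2, c3]]
        System.out.println(groups(text));
    }
}
```

One edge case to be aware of: an empty pair <> yields a list containing a single empty string, because splitting "" returns one empty element.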