Merge three regex groups - java

I have three different sentences which contain repetitive parts.
I want to merge three different regex groups into one, and then replace all matches with white space.
How should I merge these groups?
String locked = "LOCKED (center)"; //LOCKED() - always the same part
String idle = "Idle (second)"; // Idle() - always the same part
String OK = "example-OK"; // -OK - always the same part
I've built three regular expressions, but they are separate. How should I merge them?
String forLocked = locked.replaceAll("^LOCKED\\s\\((.*)\\)", "$1");
String forIdle = idle.replaceAll("^Idle\\s\\((.*)\\)", "$1");
String forOK = OK.replaceAll("(.*)\\-OK", "$1");

I think this technically works, but it doesn't "feel great."
private static final String REGEX =
"^((Idle|LOCKED) *)?\\(?([a-z]+)\\)?(-OK)?$";
... your code ...
System.out.println(locked.replaceAll(REGEX, "$3"));
System.out.println(idle.replaceAll(REGEX, "$3"));
System.out.println(OK.replaceAll(REGEX, "$3"));
Output is:
center
second
example
Breaking down the expression:
^((Idle|LOCKED) *)? - Possibly starts with Idle or LOCKED followed by zero or more spaces
\\(?([a-z]+)\\)? - Has a sequence of lowercase letters, possibly embedded inside optional parentheses (this sequence is the part we want to capture)
(-OK)?$ - Possibly ends with the literal -OK.
There are still some issues though. The optional parentheses aren't in any way tied together, for example. Also, this would give false positives for compounds like Idle (second)-OK --> second.
I had a more stringent regex at first, but one of the additional challenges is to keep a consistent match index on the group you want to replace with (here, $3.) In other words, there's a whole set of regex where if you could use, say $k and $j in different situations, it would be easier. But, that goes against the whole point of having a single regex to begin with (if you need some pre-existing knowledge of the input you're about to match.) Better would be to assume that we know nothing about what is inside the identifiers locked, idle, and OK.

You can merge them with | like this:
String regex = "^LOCKED\\s\\((.*)\\)|^Idle\\s\\((.*)\\)|(.*)\\-OK$";
String forLocked = locked.replaceAll(regex, "$1");
String forIdle = idle.replaceAll(regex, "$2");
String forOK = OK.replaceAll(regex, "$3");
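If the call sites should not need to know which alternative matched, one option is to reference all three groups in the replacement. In Java's replaceAll, a group that did not take part in the match contributes nothing to the replacement, so "$1$2$3" yields whichever group did match. A minimal sketch, where the class and method names are made up for illustration:

```java
import java.util.regex.Pattern;

final class MergedStrip {
    // Same alternation as above; only one of the three groups participates in any match.
    private static final Pattern MERGED =
            Pattern.compile("^LOCKED\\s\\((.*)\\)|^Idle\\s\\((.*)\\)|(.*)-OK$");

    // Hypothetical helper: returns the variable part of any of the three formats.
    static String strip(String s) {
        // Groups that did not participate expand to nothing in the replacement.
        return MERGED.matcher(s).replaceAll("$1$2$3");
    }

    public static void main(String[] args) {
        System.out.println(strip("LOCKED (center)")); // center
        System.out.println(strip("Idle (second)"));   // second
        System.out.println(strip("example-OK"));      // example
    }
}
```

Input that matches none of the three alternatives is returned unchanged.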

Related

regex to filter out string

I'm filtering strings using the regex below:
^(?!.*(P1 | P2)).*groupName.*$
Here group name is specific string which I replace at run time. This regex is already running fine.
I have two input strings that need to go through this regex. I can't change the ^(?!.*(P1 | P2)) part: it's a very generic regex that is used in many places, so the only part I can change is groupName. Is there any way to let only string 2 pass this regex?
1) ADMIN-P3-UI-READ-ONLY
2) ADMIN-P3-READ-ONLY
In the regex, groupName is just a variable which is replaced at run time with the required string. In this case I want string 2 to pass, so the groupName part could be replaced with READ-ONLY, but then string 1 passes too.
Can anyone suggest how to make this work?
You could use a negative lookbehind:
(?<!UI-)READ-ONLY
so there must be no UI- before READ-ONLY
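As a rough check of the lookbehind idea, the sketch below substitutes (?<!UI-)READ-ONLY for groupName in the question's pattern and tests both inputs. The class and method names are hypothetical, and the fixed prefix is copied from the question as written, spaces included.

```java
import java.util.regex.Pattern;

final class LookbehindFilter {
    // Fixed prefix from the question + the value substituted for groupName + fixed suffix.
    static final Pattern FILTER =
            Pattern.compile("^(?!.*(P1 | P2)).*" + "(?<!UI-)READ-ONLY" + ".*$");

    static boolean passes(String s) {
        return FILTER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(passes("ADMIN-P3-UI-READ-ONLY")); // false
        System.out.println(passes("ADMIN-P3-READ-ONLY"));    // true
    }
}
```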
You can add another lookahead at the very start of your pattern to further restrict what it matches because your pattern is of the "match-everything-but" type.
So, it may look like
String extraCondition = "^(?!.*UI)";
String regex = "^(?!.*(P1|P2)).*READ-ONLY.*$";
String finalRegex = extraCondition + regex;
The pattern will look like
^(?!.*UI)^(?!.*(P1|P2)).*READ-ONLY.*$
matching
^(?!.*UI) - no UI after any zero or more chars other than line break chars as many as possible from the start of string
^(?!.*(P1|P2)) - no P1 nor P2 after any zero or more chars other than line break chars as many as possible from the start of string
.*READ-ONLY - any zero or more chars other than line break chars as many as possible and then READ-ONLY
.*$ - the rest of the string. Note you may safely remove $ here unless you want to make sure there are no extra lines in the input string.
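The composition described above can be sketched as follows. The helper name buildRegex is an assumption for illustration, and groupName is concatenated unescaped, as in the original pattern. Note that (?!.*UI) rejects any string containing UI anywhere, not only directly before READ-ONLY.

```java
final class LookaheadFilter {
    // Hypothetical helper: prepends an extra condition to the generic pattern.
    static String buildRegex(String extraCondition, String groupName) {
        return extraCondition + "^(?!.*(P1|P2)).*" + groupName + ".*$";
    }

    public static void main(String[] args) {
        String regex = buildRegex("^(?!.*UI)", "READ-ONLY");
        System.out.println(regex); // ^(?!.*UI)^(?!.*(P1|P2)).*READ-ONLY.*$
        System.out.println("ADMIN-P3-UI-READ-ONLY".matches(regex)); // false
        System.out.println("ADMIN-P3-READ-ONLY".matches(regex));    // true
    }
}
```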

Regex matches with multiple patterns

I'm trying to compile one Java Regex pattern but have that pattern look for 3 different matches. I've learned that I can do that using the pipe (|) but I'm having trouble with the actual syntax of the regex.
I'm looking through XML data and trying to pull out 3 matches. The XML will look something like this:
<Element createdOn="1405358703367" updatedOn="1405358718804" url="http://www.someurl.com" />
The regex I'm trying looks like this so far:
((?<="url": ").*(?=")) | (createdOn="(\d)") | (updatedOn="(\d)")
In the end I need to get everything between the quotes in the XML (i.e. 1405358703367, 1405358718804, and http://www.someurl.com).
I had the URL regex working on its own earlier, but with the combined pattern no matches are being made.
Thanks.
Get the matched group from index 2.
(url|createdOn|updatedOn)="([^"]*)"
Here is sample code:
String string = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String patternString = "(url|createdOn|updatedOn)=\"([^\"]*)\"";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(2));
}
output:
1405358703367
1405358718804
http://www.someurl.com
Java doesn't have a library method that extracts matches, but you only need one line:
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
This works by stripping off leading and trailing input up to/from the first/last quote, then splits on quote-nonquote-quote input, leaving the target matches as an array.
In action:
String input = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
System.out.println(Arrays.toString(matches));
Output:
[1405358703367, 1405358718804, http://www.someurl.com]
The pipe (|) is used to find a match that could be some-pattern OR some-other-pattern OR yet-another-pattern. It's not good at finding all occurrences of several patterns. To do that, if the patterns you're looking for aren't necessarily in a fixed order, you'll need to use a loop.
Here's a code example that starts with the pattern you tried, fixes some problems, and uses a loop to find the patterns:
Pattern p = Pattern.compile("((?<=url=\").*(?=\"))|(createdOn=\"(\\d+)\")|(updatedOn=\"(\\d+)\")");
Matcher m = p.matcher(source);
while (m.find()) {
System.out.println("Found: "+m.group());
System.out.println("Group 1: "+m.group(1));
System.out.println("Group 3: "+m.group(3));
System.out.println("Group 5: "+m.group(5));
}
(Some problems with your original pattern: You put space characters before and after each |, which are treated literally and mean the pattern has to match spaces that aren't there. I added + after \\d because you want to match more than one digit. There were some mistakes, like putting : after url instead of =.)
Now the code uses a loop to find each successive pattern that matches one of the patterns you're looking for. It matches either url=... or createdOn=... or updatedOn=..., but by using a loop we will find all of them. (Note that it doesn't care if it sees a url or a createdOn attribute twice in the source. You'll have to check that yourself.)
The group() method with no parameters will return whatever was matched by the pattern. group(1), group(3), and group(5) return certain subsections of the pattern; the numbers are determined by counting wherever you use ( in the pattern except for (?. So group 1 matches something using url as a lookbehind; group 2 starts with createdOn; group 3 is the sequence of digits following createdOn; group 4 starts with updatedOn, etc. The way the pattern is set up, not all of these will have values, since only one of the three alternatives separated by | will match. The rest will be null. As a result, the output of the above code will display null for two of the groups, and a useful value for the other. If you do things this way, you'll need to test for null to see which value actually got returned.
This would also be a case where named capturing groups could be useful. See http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html. Matcher has a group(name) function that takes a group name as a parameter.
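A small sketch of what named groups could look like here. It uses the simpler attribute pattern rather than the lookbehind version, and the group names name and value, like the class name, are my own choice rather than anything from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedGroupsDemo {
    // (?<name>...) and (?<value>...) are named capturing groups (Java 7+).
    private static final Pattern ATTR =
            Pattern.compile("(?<name>url|createdOn|updatedOn)=\"(?<value>[^\"]*)\"");

    static Map<String, String> extract(String source) {
        Map<String, String> found = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(source);
        while (m.find()) {
            // A repeated attribute silently overwrites the earlier value.
            found.put(m.group("name"), m.group("value"));
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\""
                + " url=\"http://www.someurl.com\" />";
        System.out.println(extract(xml));
    }
}
```

With names there is no need to count parentheses or to null-check several numbered groups.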
This is one approach, but there are always multiple approaches to string parsing, and the other answers posted here are valid also. Plus there are already XML parsers to take care of things like this for you.
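For comparison, here is a rough sketch of the XML parser route using the JDK's built-in DOM API. It assumes the element is the root of a well-formed document, which may not hold for the real data, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class XmlAttributesDemo {
    // Parses the string and returns its root element; assumes well-formed XML.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element e = parseRoot("<Element createdOn=\"1405358703367\""
                + " updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />");
        System.out.println(e.getAttribute("createdOn"));
        System.out.println(e.getAttribute("updatedOn"));
        System.out.println(e.getAttribute("url"));
    }
}
```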
NOTE: This answer was meant to point out how | works. I don't recommend actually doing things this way, since it's overly complicated. If you're going to look separately for each attribute, it would be simpler just to set up three patterns and look for each one, one at a time. Or use #braj's suggestion in a loop, and perform a later check to make sure the createdOn and updatedOn values are numeric.

Regex to parse strings that might or might not be delimited by ; into several groups

I have a case where I need to parse a string into several groups depending on a criterion.
For example, the string below:
01%3A%35r%07%01P%88%00;WAP_GPRS
should be 2 groups
%3A%35r%07%01P%88%00
WAP_GPRS
Notice that I don't care about the 01 at the beginning, and there can be 0 or more substrings delimited by ; and I need them all in their own group.
Another one:
01%3A%35r%07%01P%88%00;KPN;A23B
should be 3 groups:
%3A%35r%07%01P%88%00
KPN
A23B
Basically, I don't need to care whether alpha or numeric comes first. The issue is putting the expressions, which can occur 0 or more times, into their own groups. That means the string below
01%3A%35r%07%01P%88%00
should produce also one group of %3A%35r%07%01P%88%00
Why not just split your string on ;?
But before that you need to remove the 01 before the first %, using String#substring, since it does not appear in your required output:
String str = "01%3A%35r%07%01P%88%00;WAP_GPRS";
// Drop everything before the first `%`. (Using str.replace here would also remove the "01" inside "%01P".)
str = str.substring(str.indexOf('%'));
String[] groups = str.split(";");
System.out.println(Arrays.toString(groups));
OUTPUT: -
[%3A%35r%07%01P%88%00, WAP_GPRS]
You don't need a regexp:
String data = "01%3A%35r%07%01P%88%00;KPN;A23B";
String[] groups = data.split(";");
for (String s : groups) {
    System.out.println(s); // I'm printing each separate group
}
The deleting of the first two letters of the original string is another thing that has nothing to do with the group separation and you can do it with the substring method.
So I guess you need a regexp analog of split. That would require a repeated capturing group.
Bad news, some people have looked into a similar problem and haven't found the right answer:
https://stackoverflow.com/a/6836024/1665128
Good news, if you can live with some reasonable limit on the number of groups and you can add some code to identify the empty trailing ones, this may help:
([^;]*);?([^;]*)?;?([^;]*)?;?([^;]*)?;?([^;]*)?;?([^;]*)?;?([^;]*)?;?([^;]*)?;?([^;]*)?;?([^;]*)?;?([^;]*)?;?([^;]*)?;?([^;]*)?;?([^;]*)?;?([^;]*)?
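If a fixed number of groups in one match is not a hard requirement, a find loop avoids the limit altogether: each call to find yields one field in group 1. This is only a sketch under assumptions: the prefix is taken to be the literal 01 followed by at least one more character, empty fields are skipped, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SemicolonGroups {
    // Optionally skip a literal "01" at the very start, then capture a run of non-';' characters.
    private static final Pattern FIELD = Pattern.compile("(?:^01)?([^;]+)");

    static List<String> groups(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            out.add(m.group(1)); // one field per successful find()
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(groups("01%3A%35r%07%01P%88%00;KPN;A23B"));
        System.out.println(groups("01%3A%35r%07%01P%88%00"));
    }
}
```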

Regular expression to select all whitespace that isn't in quotes?

I'm not very good at RegEx, can someone give me a regex (to use in Java) that will select all whitespace that isn't between two quotes? I am trying to remove all such whitespace from a string, so any solution to do so will work.
For example:
(this is a test "sentence for the regex")
should become
(thisisatest"sentence for the regex")
Here's a single regex-replace that works:
\s+(?=([^"]*"[^"]*")*[^"]*$)
which will replace:
(this is a test "sentence for the regex" foo bar)
with:
(thisisatest"sentence for the regex"foobar)
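In Java string-literal form, with the backslashes and quotes escaped, that replacement could look like the sketch below. It assumes the quotes are balanced and unescaped; the class and method names are made up.

```java
final class StripOutsideQuotes {
    // Whitespace followed by an even number of quotes up to the end of the input,
    // i.e. whitespace that is not inside a quoted section.
    static final String OUTSIDE_QUOTES_WS = "\\s+(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    static String strip(String s) {
        return s.replaceAll(OUTSIDE_QUOTES_WS, "");
    }

    public static void main(String[] args) {
        System.out.println(strip("(this is a test \"sentence for the regex\" foo bar)"));
        // (thisisatest"sentence for the regex"foobar)
    }
}
```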
Note that if the quotes can be escaped, the even more verbose regex will do the trick:
\s+(?=((\\[\\"]|[^\\"])*"(\\[\\"]|[^\\"])*")*(\\[\\"]|[^\\"])*$)
which replaces the input:
(this is a test "sentence \"for the regex" foo bar)
with:
(thisisatest"sentence \"for the regex"foobar)
(note that it also works with escaped backslashes: (thisisatest"sentence \\\"for the regex"foobar))
Needless to say (?), this really shouldn't be used to perform such a task: it makes one's eyes bleed, and it performs its task in quadratic time, while a simple linear solution exists.
EDIT
A quick demo:
String text = "(this is a test \"sentence \\\"for the regex\" foo bar)";
String regex = "\\s+(?=((\\\\[\\\\\"]|[^\\\\\"])*\"(\\\\[\\\\\"]|[^\\\\\"])*\")*(\\\\[\\\\\"]|[^\\\\\"])*$)";
System.out.println(text.replaceAll(regex, ""));
// output: (thisisatest"sentence \"for the regex"foobar)
Here is the regex which works for both single & double quotes (assuming that all strings are delimited properly)
\s+(?=(?:[^\'"]*[\'"][^\'"]*[\'"])*[^\'"]*$)
It won't work with strings that have quotes inside.
This just isn't something regexes are good at. Search-and-replace functions with regexes are always a bit limited, and any sort of nesting/containment at all becomes difficult and/or impossible.
I'd suggest an alternate approach: Split your string on quote characters. Go through the resulting array of strings, and strip the spaces from every other substring (whether you start with the first or second depends on whether your string started with a quote or not). Then join them back together, using quotes as separators. That should produce the results you're looking for.
Hope that helps!
PS: Note that this won't handle nested strings, but since you can't make nested strings with the ASCII double-quote character, I'm gonna assume you don't need that behaviour.
PPS: Once you're dealing with your substrings, then it's a good time to use regexes to kill those spaces - no containing quotes to worry about. Just remember to use the /.../g modifier to make sure it's a global replacement and not just the first match (in Java, String.replaceAll is already global).
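The split-and-join approach could be sketched in Java as follows, assuming balanced, unescaped double quotes. The class and method names are made up. Because Java's split keeps a leading empty piece when the string starts with a quote, the even-indexed pieces are always the ones outside quotes.

```java
final class SplitOnQuotes {
    static String strip(String s) {
        // limit -1 keeps trailing empty pieces, so a string ending in a quote survives the round trip
        String[] parts = s.split("\"", -1);
        for (int i = 0; i < parts.length; i += 2) {
            // even-indexed pieces sit outside quotes (assuming the quotes are balanced)
            parts[i] = parts[i].replaceAll("\\s+", "");
        }
        return String.join("\"", parts);
    }

    public static void main(String[] args) {
        System.out.println(strip("(this is a test \"sentence for the regex\")"));
        // (thisisatest"sentence for the regex")
    }
}
```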
Groups of whitespace outside of quotes are separated by stuff that's a) not whitespace, or b) inside quotes.
Perhaps something like:
(\s+)([^ "]+|"[^"]*")*
The first part matches a sequence of spaces; the second part matches non-spaces (and non-quotes), or some stuff in quotes, either repeated any number of times. The second part is the separator.
This will give you two groups for each item in the result; just ignore the second element. (We need the parentheses for precedence rather than match grouping there.) Or you could, say, concatenate all the second elements, though you need to match the first non-space word too, or, as in this example, make the spaces optional. Note that a repeated capturing group only keeps its last iteration, so the code below wraps the whole repetition in group 2:
StringBuilder b = new StringBuilder();
// Group 2 wraps the whole repetition, so it holds every separator chunk, not just the last one.
Pattern p = Pattern.compile("(\\s+)?((?:[^\\s\"]+|\"[^\"]*\")*)");
Matcher m = p.matcher("this is \"a test\"");
while (m.find()) {
    b.append(m.group(2)); // empty when only whitespace (or nothing) matched
}
System.out.println(b); // thisis"a test"
(I haven't done much regex in Java so expect bugs.)
Finally, this is how I'd do it if regexes were compulsory. ;-)
As well as Xavier's technique, you could simply do it the way you'd do it in C: just iterate over the input characters, and copy each to the new string if either it's non-space, or you've counted an odd number of quotes up to that point.
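A minimal sketch of that single pass, tracking quote parity with a boolean instead of a counter. It assumes unescaped double quotes; the class and method names are made up.

```java
final class LinearStrip {
    static String strip(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean insideQuotes = false; // true after an odd number of quotes
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"') {
                insideQuotes = !insideQuotes;
            }
            // keep everything inside quotes, and every non-whitespace character outside
            if (insideQuotes || !Character.isWhitespace(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("(this is a test \"sentence for the regex\")"));
        // (thisisatest"sentence for the regex")
    }
}
```

This runs in time linear in the input length and does not backtrack.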
If there is only one set of quotes, you can do this:
String s = "(this is a test \"sentence for the regex\") a b c";
Matcher matcher = Pattern.compile("^[^\"]+|[^\"]+$").matcher(s);
while (matcher.find())
{
String group = matcher.group();
s = s.replace(group, group.replaceAll("\\s", ""));
}
System.out.println(s); // (thisisatest"sentence for the regex")abc
This isn't an exact solution, but you can accomplish your goal by doing the following:
STEP 1: Match the two segments
\(([a-zA-Z ]*)"([a-zA-Z ]*)"\)
STEP 2: remove spaces
temp = $1 replace " " with ""
STEP 3: rebuild your string
(temp"$2")
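The three steps above can be sketched in Java like this. It only handles input of exactly the shape the pattern describes (letters and spaces, one quoted segment, wrapped in parentheses) and returns anything else unchanged; the class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ThreeStepStrip {
    // STEP 1: group 1 is the text before the quotes, group 2 the quoted text.
    private static final Pattern SEGMENTS =
            Pattern.compile("\\(([a-zA-Z ]*)\"([a-zA-Z ]*)\"\\)");

    static String strip(String s) {
        Matcher m = SEGMENTS.matcher(s);
        if (!m.matches()) {
            return s; // shape not recognised: leave the input untouched
        }
        String temp = m.group(1).replace(" ", "");     // STEP 2: remove spaces
        return "(" + temp + "\"" + m.group(2) + "\")"; // STEP 3: rebuild the string
    }

    public static void main(String[] args) {
        System.out.println(strip("(this is a test \"sentence for the regex\")"));
        // (thisisatest"sentence for the regex")
    }
}
```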

Java: Change a var's value (String) according to the value of a regex

I would like to know if it is possible (and if so, how I can implement it) to manipulate a String value (Java) using one regex.
For example:
String content = "111122223333";
String regex = "?";
Expected result: "1111 2222 3333 ##";
With one regex only, I don't think it is possible. But you can:
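first, replace (?<=(.))(?!\1|$) with a space (the |$ keeps the end of the input, where nothing follows the last character, from matching and producing a trailing space);
then, use a string append to append " ##".
i.e.:
Pattern p = Pattern.compile("(?<=(.))(?!\\1|$)");
String ret = p.matcher(input).replaceAll(" ") + " ##";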
If what you meant was to separate all groups, then drop the second operation.
Explanation: (?<=...) is a positive lookbehind, and (?!...) a negative lookahead. Here, you are telling that you want to find a position where there is one character behind, which is captured, and where the same character should not follow. And if so, replace with a space. Lookaheads and lookbehinds are anchors, and like all anchors (including ^, $, \A, etc), they do not consume characters, this is why it works.
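A quick sketch to see the lookaround pair in action. The pattern here carries an extra |$ inside the negative lookahead, so that the position at the very end of the input, where no character follows, is not treated as a boundary. The class and method names are made up.

```java
final class SeparateRuns {
    // A position preceded by some character (captured as group 1) and not followed by
    // that same character; "|$" also rules out the position at the end of the input.
    static final String RUN_BOUNDARY = "(?<=(.))(?!\\1|$)";

    static String separate(String input) {
        return input.replaceAll(RUN_BOUNDARY, " ");
    }

    public static void main(String[] args) {
        System.out.println(separate("111122223333") + " ##"); // 1111 2222 3333 ##
        System.out.println(separate("aabbb"));                // aa bbb
    }
}
```

Nothing is consumed by the match, so the replacement only inserts spaces and never removes characters.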
OK, since the OP has redefined the problem (i.e., a group of 12 digits which should be separated into 3 groups of 4, then followed by ##), the solution becomes this:
Pattern p = Pattern.compile("(?<=\\d)(?=(?:\\d{4})+$)");
String ret = p.matcher(input).replaceAll(" ") + " ##";
The regex changes quite a bit:
(?<=\d): there should be one digit behind;
(?=(?:\d{4})+$): there should be one or more groups of 4 digits afterwards, until the end of line (the (?:...) is a non capturing grouping -- not sure it really makes a difference for Java).
Validating that the input is 12 digits long can easily be done with methods which are not regex-related at all. And this validation is, in fact, necessary: unfortunately, this regex will also turn 12345 into 1 2345, but there is no way around that, for the reason that lookbehinds cannot match arbitrary length regexes... Except with the .NET languages. With them, you could have written:
(?<=^(?:\d{4})+)(?=(?:\d{4})+$)
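Back in Java, the length check and the grouping regex can be combined roughly as below. This is a sketch: the validation is deliberately plain, non-regex code as suggested above, and the class and method names are made up.

```java
final class GroupDigits {
    // A position with a digit behind it and a whole number of 4-digit groups ahead.
    static final String GROUP_BOUNDARY = "(?<=\\d)(?=(?:\\d{4})+$)";

    // Non-regex validation: exactly 12 characters, all ASCII digits.
    static boolean isTwelveDigits(String input) {
        return input.length() == 12
                && input.chars().allMatch(c -> c >= '0' && c <= '9');
    }

    static String format(String input) {
        if (!isTwelveDigits(input)) {
            throw new IllegalArgumentException("expected exactly 12 digits: " + input);
        }
        return input.replaceAll(GROUP_BOUNDARY, " ") + " ##";
    }

    public static void main(String[] args) {
        System.out.println(format("111122223333")); // 1111 2222 3333 ##
    }
}
```

Without the check, an input such as 12345 would be accepted by the regex and turned into 1 2345.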
