Regular expression query (runtime customizable) - java

I have a special requirement: my regular expression pattern will be determined at run time. For example, I have a date and would like to check it against mm-dd-yyyy, mm/dd/yyyy or d.mm.yyyy. Basically I would be feeding in a pattern such as NN-NN-TTTT, where N means a number and T means a letter, and the pattern can be anything. Can we write a regular expression that will work for this kind of requirement?
My form will look like the one at http://jsfiddle.net/E2EHZ/; the data will be matched against the pattern specified in the text box.
T - letter
N - Numeric
A - Alphanum

So essentially you would have your users enter a pattern containing T, N or A as placeholders with other characters that need to match literally in between? If so, then it's rather easy: Just replace your placeholders by appropriate character classes, quote the rest (so regex metacharacters are escaped) and use the result as a regex.
First escape everything that is not A, N or T. How to do this varies by language, but essentially you'd replace [^ANT]+ by an escaped version of the match. In C# it might look like this:
Regex.Replace(s, "[^ANT]+", m => Regex.Escape(m.Value));
or in Java:
s.replaceAll("[^ANT]+", "\\\\Q$0\\\\E"); // backslashes are doubled because \ is also an escape character in Java replacement strings
The translations to perform then are easy:
T → [a-zA-Z]
N → [0-9]
A → [0-9a-zA-Z]
That is, assuming ASCII-only. For Unicode you might want
T → \p{L}
N → \p{Nd}
A → [\p{L}\p{Nd}]
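instead. Also note that if you perform simple string replacements, the order matters: with the ASCII versions replace A first (the replacement for T contains an A), and with the Unicode variants replace N first (\p{Nd} contains an N), so that you don't rewrite characters introduced by an earlier replacement.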
In the end you might want to prefix your string with ^ and suffix it with $ if you want to match complete strings.
A sample implementation in C# (with a tiny optimisation):
string CreateRegex(string pattern) {
string result = Regex.Replace(pattern, "[^ANT]+", m => Regex.Escape(m.Value));
result = Regex.Replace(result, "A+", m => "[0-9a-zA-Z]" + (m.Length > 1 ? "{"+m.Length+"}" : ""));
result = Regex.Replace(result, "T+", m => "[a-zA-Z]" + (m.Length > 1 ? "{"+m.Length+"}" : ""));
result = Regex.Replace(result, "N+", m => "[0-9]" + (m.Length > 1 ? "{"+m.Length+"}" : ""));
return "^" + result + "$";
}
which for example results in the following:
NN-NN-TTTT → ^[0-9]{2}-[0-9]{2}-[a-zA-Z]{4}$
*(#&#^(&%(# AA-AA-NN-TTTTTTTT lreglig → \*\(#&\#\^\(&%\(#\ \ [0-9a-zA-Z]{2}-[0-9a-zA-Z]{2}-[0-9]{2}-[a-zA-Z]{8}\ lreglig
Or in Java (without said optimisation, because I cannot figure out how to use a function as replacement):
String createRegex(String pattern) {
String result = pattern.replaceAll("[^ANT]+", "\\\\Q$0\\\\E"); // doubled backslashes: \ is an escape character in replacement strings
result = result.replaceAll("A", "[0-9a-zA-Z]");
result = result.replaceAll("T", "[a-zA-Z]");
result = result.replaceAll("N", "[0-9]");
return "^" + result + "$";
}
The resulting regexes will be a bit longer because the code above won't use repetition for identical tokens.
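For what it's worth, the repetition optimisation can also be done in Java by walking the pattern with a Matcher instead of using a replacement function. The following is only a sketch under that assumption; the class name PlaceholderRegex and the TOKEN pattern are made up for illustration, and literal runs are escaped with Pattern.quote rather than a hand-written \Q...\E.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderRegex {
    // A run of one placeholder letter, or a run of anything else (literal text).
    private static final Pattern TOKEN = Pattern.compile("A+|N+|T+|[^ANT]+");

    public static String createRegex(String pattern) {
        StringBuilder sb = new StringBuilder("^");
        Matcher m = TOKEN.matcher(pattern);
        while (m.find()) {
            String run = m.group();
            String cls;
            switch (run.charAt(0)) {
                case 'A': cls = "[0-9a-zA-Z]"; break;
                case 'N': cls = "[0-9]"; break;
                case 'T': cls = "[a-zA-Z]"; break;
                default:
                    // Literal text: quote it so regex metacharacters match themselves.
                    sb.append(Pattern.quote(run));
                    continue;
            }
            sb.append(cls);
            // Collapse repeated placeholders into a counted repetition, e.g. NN -> [0-9]{2}.
            if (run.length() > 1) {
                sb.append('{').append(run.length()).append('}');
            }
        }
        return sb.append('$').toString();
    }

    public static void main(String[] args) {
        String regex = createRegex("NN-NN-TTTT");
        System.out.println("12-31-abcd".matches(regex)); // true
        System.out.println("12/31/abcd".matches(regex)); // false
    }
}
```

Because each token is handled once, the replacement order problem mentioned above does not arise here.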

Related

Partially mask data of a group of number using regex

I would like to partially mask data using regex. Here is the input :
123-12345-1234567
And here is what I'd like as output :
1**-*****-*****67
I figured out how to do the replacement for the last group, but I don't know how to do it for the rest of the data.
String s = "123-12345-1234567";
System.out.println(s.replaceAll("\\d(?=\\d{2})", "*")); // output is *23-***45-*****67
Also, I'd like to use only regex because I have different type of data, so different type of mask. I don't want to create functions for each type of data.
For example :
AAAAAAAAA // becomes *******AA
12334567 // becomes 123*****
Thanks for your help !
We can use the following regex replacement approach:
String input = "123-12345-1234567";
String output = input.substring(0, 1) +
input.substring(1, input.length()-2).replaceAll("\\d", "*") +
input.substring(input.length()-2);
System.out.println(output); // 1**-*****-*****67
Here we concatenate together the first digit, followed by the middle portion with all digits replaced by *, along with the final two digits.
Edit: A pure regex solution, which, however, is more lines of code than the above and might be less performant.
String input = "123-12345-1234567";
String pattern = "^(\\d)(.*)(\\d{2})$";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(input);
if (m.find()) {
String output = m.group(1) + m.group(2).replaceAll("\\d", "*") + m.group(3);
System.out.println(output); // 1**-*****-*****67
}
Java supports a finite quantifier in a lookbehind, so if you must use a regex only, you can use a pattern with an alternation to account for the different scenarios.
Using the lookarounds you can select a single character to be replaced by *
Note that this is hard to maintain, and it would be a better option to write separate functions for the different data formats using separate patterns or string functions (perhaps accompanied by unit tests)
(?<=^\d{3,7})\d(?=\d*$)|(?<=^[A-Z]{0,6})[A-Z](?=[A-Z]*$)|\d(?<=^\d{2,3})(?=\d?-\d{5}-\d{7}$)|\d(?<=^\d{3}-\d{1,5}(?:-\d{1,5})?)
The separate parts match:
(?<=^\d{3,7})\d(?=\d*$) Match a digit asserting 3-7 digits to the left and only digits to the right
| Or
(?<=^[A-Z]{0,6})[A-Z](?=[A-Z]*$) Match A-Z asserting 0-6 chars to the left and only chars A-Z to the right
| Or
\d(?<=^\d{2,3})(?=\d?-\d{5}-\d{7}$) Match a digit asserting 2-3 digits to the left and optional digit, - with 5 digits and - with 7 digits to the right
| Or
\d(?<=^\d{3}-\d{1,5}(?:-\d{1,5})?) Match a digit asserting 3 digits to the left followed - and 1-5 digits and optionally - with 1-5 digits
Regex demo | Java demo
String regex = "(?<=^\\d{3,7})\\d(?=\\d*$)|(?<=^[A-Z]{0,6})[A-Z](?=[A-Z]*$)|\\d(?<=^\\d{2,3})(?=\\d?-\\d{5}-\\d{7}$)|\\d(?<=^\\d{3}-\\d{1,5}(?:-\\d{1,5})?)";
String s1 = "123-12345-1234567";
String s2 = "AAAAAAAAA";
String s3 = "12334567";
System.out.println(s1.replaceAll(regex, "*"));
System.out.println(s2.replaceAll(regex, "*"));
System.out.println(s3.replaceAll(regex, "*"));
Output
1**-*****-*****67
*******AA
123*****
public static void main(String[] args) {
System.out.println("123-12345-1234567".replaceAll("(?<=.{1,})\\d(?=.{3,})", "*"));
System.out.println("AAAAAAAAA".replaceAll(".(?=.{2,})", "*"));
System.out.println("12334567".replaceAll("(?<=.{3,}).", "*"));
}
output:
1**-*****-*****67
*******AA
123*****
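If a single parameterised helper is acceptable instead of one regex per format, the three masks above can be expressed without any regex. This is a sketch, not something taken from the answers: the name mask and its keepStart/keepEnd parameters are hypothetical, and it assumes both values are non-negative and that only letters and digits should be masked.

```java
public class Masker {
    /**
     * Replaces letters and digits with '*', leaving the first keepStart and
     * the last keepEnd characters untouched. Separators such as '-' are kept.
     */
    public static String mask(String s, int keepStart, int keepEnd) {
        StringBuilder sb = new StringBuilder(s);
        for (int i = keepStart; i < s.length() - keepEnd; i++) {
            if (Character.isLetterOrDigit(s.charAt(i))) {
                sb.setCharAt(i, '*');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(mask("123-12345-1234567", 1, 2)); // 1**-*****-*****67
        System.out.println(mask("AAAAAAAAA", 0, 2));         // *******AA
        System.out.println(mask("12334567", 3, 0));          // 123*****
    }
}
```

The mask shape then becomes data (two numbers per data type) rather than a separate function or pattern per type.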

Java: String.replaceAll(regex, replacement);

I have a string of comma-separated user-ids and I want to eliminate/remove specific user-id from a string.
I have the following possible strings and the expected result:
int elimiateUserId = 11;
String css1 = "11,22,33,44,55";
String css2 = "22,33,11,44,55";
String css3 = "22,33,44,55,11";
// The expected result in all cases, after replacement, should be:
// "22,33,44,55"
I tried the following:
String result = css#.replaceAll("," + elimiateUserId, ""); // # = 1 or 2 or 3
result = css#.replaceAll(elimiateUserId + "," , "");
This logic fails in case of css3. Please suggest me a proper solution for this issue.
Note: I'm working with Java 7
I checked around the following posts, but could not find any solution:
Java String.replaceAll regex
java String.replaceAll regex question
Java 1.3 String.replaceAll() , replacement
You can use the Stream API in Java 8:
int elimiateUserId = 11;
String css1 = "11,22,33,44,55";
String css1Result = Stream.of(css1.split(","))
.filter(value -> !String.valueOf(elimiateUserId).equals(value))
.collect(Collectors.joining(","));
// css1Result = 22,33,44,55
If you want to use regex, you may use (remember to properly escape as java string literal)
,\b11\b|\b11\b,
This will ensure that 11 won't be matched as part of another number due to the word boundaries and only one comma (if two are present) is matched and removed.
You may build a regex like
^11,|,11\b
that will match 11, at the start of a string (^11,) or (|) ,11 not followed with any other word char (,11\b).
See the regex demo.
int elimiate_user_id = 11;
String pattern = "^" + elimiate_user_id + ",|," + elimiate_user_id + "\\b";
System.out.println("11,22,33,44,55,111".replaceAll(pattern, "")); // => 22,33,44,55,111
System.out.println("22,33,11,44,55,111".replaceAll(pattern, "")); // => 22,33,44,55,111
System.out.println("22,33,44,55,111,11".replaceAll(pattern, "")); // => 22,33,44,55,111
See the Java demo
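Try the expression (^(11)(?:,))|((?<=,)(11)(?:,))|(,11$) with replaceAll:
// String.valueOf keeps MessageFormat from applying locale number formatting (e.g. 1,234) to the id
final String regexp = MessageFormat.format("(^({0})(?:,))|((?<=,)({0})(?:,))|(,{0}$)", String.valueOf(elimiateUserId));
String result = css#.replaceAll(regexp, ""); // for all cases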
Here is an example:
https://regex101.com/r/LwJgRu/3
try this:
String result = css#.replaceAll("," + elimiateUserId, "")
.replaceAll(elimiateUserId + "," , "");
You can use two replace in one shot like :
int elimiateUserId = 11;
String result = css#.replace("," + elimiateUserId , "").replace(elimiateUserId + ",", "");
If your string contains ,11 the first replace will replace it with an empty string
If your string contains 11, the second replace will replace it with an empty string
result
11,22,33,44,55 -> 22,33,44,55
22,33,11,44,55 -> 22,33,44,55
22,33,44,55,11 -> 22,33,44,55
ideone demo
String result = css#.replaceAll("," + eliminate_user_id + "\\b|\\b" + eliminate_user_id + ",", "");
The regular expression here is:
, A leading comma.
eliminate_user_id I assumed the missing 'n' here was a typo.
\b Word boundary: word/number characters end here.
| OR
\b Word boundary: word/number characters begin here.
eliminate_user_id again.
, A trailing comma.
The word boundary marker, matching the beginning or end of a "word", is the magic here. It means that the 11 will match in these strings:
11,22,33,44,55
22,33,11,44,55
22,33,44,55,11
But not these strings:
111,112,113,114
411,311,211,111
There's a cleaner way, though:
String result = css#.replaceAll("(,?)\\b" + eliminate_user_id + "\\b(?(1)|,)", "");
The regular expression here is:
( A capturing group - what's in here, is in group 1.
,? An optional leading comma.
) End the capturing group.
\b Word boundary: word/number characters begin here.
eliminate_user_id I assumed the missing 'n' here was a typo.
\b Word boundary: word/number characters end here.
(?(1) If there's something in group 1, then require...
| ...nothing, but if there was nothing, then require...
, A trailing comma.
) end the if.
The "if" part here is a little unusual - you can find a little more information on regex conditionals here: http://www.regular-expressions.info/conditional.html
Unfortunately, java.util.regex does not support regex conditionals (see also Conditional Regular Expression in Java?), so this pattern will not compile in Java :( Use the alternation version above instead.
Side-note: for performance, if the list is VERY long and there are VERY many removals to be performed, the most obvious option is to just run the above line for each number to be removed:
String css = "11,22,33,44,55,66,77,88,99,1010,1111,1212,...";
String[] removals = {"11", "33", "55", "77", "99", "1212"};
for (String removal : removals) {
css = css.replaceAll("," + removal + "\\b|\\b" + removal + ",", "");
}
(code not tested: don't have access to a Java compiler here)
This will be fast enough (worst case scales with about O(m*n) for m removals from a string of n ids), but we can maybe do better.
One is to build the regex to be \b(11|42|18|13|123|...etc)\b - that is, make the regex search for all ids to be removed at the same time. In theory this scales a little worse, scaling with O(m*n) in every case rather than just the worst case, but in practice should be considerably faster.
String css = "11,22,33,44,55,66,77,88,99,1010,1111,1212,...";
String[] removals = {"11", "33", "55", "77", "99", "1212"};
String removalsStr = "(?:" + String.join("|", removals) + ")"; // group the alternation so \b and the commas apply to every id
css = css.replaceAll("," + removalsStr + "\\b|\\b" + removalsStr + ",", "");
But another approach might be to build a hashtable of the ids in the long string, then remove all the ids from the hashtable, then concatenate the remaining hashtable keys back into a string. Since hashtable lookups are effectively O(1) for sparse hashtables, that makes this scale with O(n). The tradeoff here is the extra memory for that hashtable, though.
(I don't think I can do this version without a java compiler handy. I would not recommend this approach unless you have a VAST (many thousands) list of IDs to remove, anyway, as it will be much uglier and more complex code).
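A rough sketch of the hash-based idea follows. It is a variation rather than exactly what is described above: instead of hashing the ids of the long string, it puts the ids to remove into a HashSet and filters the split list against it, which also keeps the original order. The class and method names are invented for illustration, and the input is assumed to be a well-formed comma-separated list without empty entries.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class IdRemover {
    public static String removeIds(String csv, String... idsToRemove) {
        // Set lookups are effectively O(1), so the whole pass is O(n + m).
        Set<String> removals = new HashSet<>(Arrays.asList(idsToRemove));
        StringBuilder sb = new StringBuilder();
        for (String id : csv.split(",")) {
            if (removals.contains(id)) {
                continue; // whole-id comparison, so 11 never matches 111 or 211
            }
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(id);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeIds("11,22,33,44,55", "11")); // 22,33,44,55
        System.out.println(removeIds("22,33,11,44,55", "11")); // 22,33,44,55
        System.out.println(removeIds("22,33,44,55,11", "11")); // 22,33,44,55
    }
}
```

This compiles on Java 7 as well, and it turned out less ugly than feared because no comma bookkeeping is needed in a regex.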
I think it's safer to maintain a whitelist and then use it as a reference to make further changes.
List<String> whitelist = Arrays.asList("22", "33", "44", "55");
String s = "22,33,44,55,11";
String[] sArr = s.split(",");
StringBuilder ids = new StringBuilder();
for (String id : sArr) {
if (whitelist.contains(id)) {
ids.append(id).append(", ");
}
}
String r = ids.substring(0, ids.length() - 2);
System.out.println(r);
If you need a solution with regex, then the following works for the sample inputs (it assumes no other id contains 11, such as 111 or 211).
int elimiate_user_id = 11;
String css1 = "11,22,33,44,55";
String css2 = "22,33,11,44,55";
String css3 = "22,33,44,55,11";
String resultCss=css1.replaceAll(elimiate_user_id+"[,]*", "").replaceAll(",$", "");
It works with all the inputs listed above.
This should work
replaceAll("(11,|,11)", "")
At least as long as you can guarantee that there is no id such as 311 or 113 in the list.

How can I split a String based on capitalization scheme? [duplicate]

I found a brilliant RegEx to extract the part of a camelCase or TitleCase expression.
(?<!^)(?=[A-Z])
It works as expected:
value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
For example with Java:
String s = "loremIpsum";
String[] words = s.split("(?<!^)(?=[A-Z])");
// words is now {"lorem", "Ipsum"}
My problem is that it does not work in some cases:
Case 1: VALUE -> V / A / L / U / E
Case 2: eclipseRCPExt -> eclipse / R / C / P / Ext
To my mind, the result should be:
Case 1: VALUE
Case 2: eclipse / RCP / Ext
In other words, given n uppercase chars:
if the n chars are followed by lower case chars, the groups should be: (n-1 chars) / (n-th char + lower chars)
if the n chars are at the end, the group should be: (n chars).
Any idea on how to improve this regex?
The following regex works for all of the above examples:
public static void main(String[] args)
{
for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
System.out.println(w);
}
}
It works by forcing the negative lookbehind to not only ignore matches at the start of the string, but to also ignore matches where a capital letter is preceded by another capital letter. This handles cases like "VALUE".
The first part of the regex on its own fails on "eclipseRCPExt" by failing to split between "RCP" and "Ext". This is the purpose of the second clause: (?<!^)(?=[A-Z][a-z]). This clause allows a split before every capital letter that is followed by a lowercase letter, except at the start of the string.
It seems you are making this more complicated than it needs to be. For camelCase, the split location is simply anywhere an uppercase letter immediately follows a lowercase letter:
(?<=[a-z])(?=[A-Z])
Here is how this regex splits your example data:
value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
VALUE -> VALUE
eclipseRCPExt -> eclipse / RCPExt
The only difference from your desired output is with the eclipseRCPExt, which I would argue is correctly split here.
Addendum - Improved version
Note: This answer recently got an upvote and I realized that there is a better way...
By adding a second alternative to the above regex, all of the OP's test cases are correctly split.
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])
Here is how the improved regex splits the example data:
value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
VALUE -> VALUE
eclipseRCPExt -> eclipse / RCP / Ext
Edit:20130824 Added improved version to handle RCPExt -> RCP / Ext case.
Another solution would be to use a dedicated method in commons-lang: StringUtils#splitByCharacterTypeCamelCase
I couldn't get aix's solution to work (and it doesn't work on RegExr either), so I came up with my own that I've tested and seems to do exactly what you're looking for:
((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))
and here's an example of using it:
; Regex Breakdown: This will match against each word in Camel and Pascal case strings, while properly handling acronyms.
; (^[a-z]+) Match against any lower-case letters at the start of the string.
; ([A-Z]{1}[a-z]+) Match against Title case words (one upper case followed by lower case letters).
; ([A-Z]+(?=([A-Z][a-z])|($))) Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))", "$1 ")
newString := Trim(newString)
Here I'm separating each word with a space, so here are some examples of how the string is transformed:
ThisIsATitleCASEString => This Is A Title CASE String
andThisOneIsCamelCASE => and This One Is Camel CASE
This solution above does what the original post asks for, but I also needed a regex to find camel and pascal strings that included numbers, so I also came up with this variation to include numbers:
((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))
and an example of using it:
; Regex Breakdown: This will match against each word in Camel and Pascal case strings, while properly handling acronyms and including numbers.
; (^[a-z]+) Match against any lower-case letters at the start of the command.
; ([0-9]+) Match against one or more consecutive numbers (anywhere in the string, including at the start).
; ([A-Z]{1}[a-z]+) Match against Title case words (one upper case followed by lower case letters).
; ([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))) Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string or a number.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))", "$1 ")
newString := Trim(newString)
And here are some examples of how a string with numbers is transformed with this regex:
myVariable123 => my Variable 123
my2Variables => my 2 Variables
The3rdVariableIsHere => The 3 rdVariable Is Here
12345NumsAtTheStartIncludedToo => 12345 Nums At The Start Included Too
To handle more letters than just A-Z:
s.split("(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})");
Either:
Split after any lowercase letter, that is followed by uppercase letter.
E.g parseXML -> parse, XML.
or
Split after any letter, that is followed by upper case letter and lowercase letter.
E.g. XMLParser -> XML, Parser.
In more readable form:
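// Imports assume JUnit 4; adjust for other test frameworks.
import static java.util.stream.Collectors.joining;
import static org.junit.Assert.assertEquals;
import java.util.regex.Pattern;
import org.junit.Test;
public class SplitCamelCaseTest {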
static String BETWEEN_LOWER_AND_UPPER = "(?<=\\p{Ll})(?=\\p{Lu})";
static String BEFORE_UPPER_AND_LOWER = "(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";
static Pattern SPLIT_CAMEL_CASE = Pattern.compile(
BETWEEN_LOWER_AND_UPPER +"|"+ BEFORE_UPPER_AND_LOWER
);
public static String splitCamelCase(String s) {
return SPLIT_CAMEL_CASE.splitAsStream(s)
.collect(joining(" "));
}
@Test
public void testSplitCamelCase() {
assertEquals("Camel Case", splitCamelCase("CamelCase"));
assertEquals("lorem Ipsum", splitCamelCase("loremIpsum"));
assertEquals("XML Parser", splitCamelCase("XMLParser"));
assertEquals("eclipse RCP Ext", splitCamelCase("eclipseRCPExt"));
assertEquals("VALUE", splitCamelCase("VALUE"));
}
}
Brief
Both top answers here provide code using positive lookbehinds, which are not supported by all regex flavours. The regex below will capture both PascalCase and camelCase and can be used in multiple languages.
Note: I do realize this question is regarding Java, however, I also see multiple mentions of this post in other questions tagged for different languages, as well as some comments on this question for the same.
Code
See this regex in use here
([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)
Results
Sample Input
eclipseRCPExt
SomethingIsWrittenHere
TEXTIsWrittenHERE
VALUE
loremIpsum
Sample Output
eclipse
RCP
Ext
Something
Is
Written
Here
TEXT
Is
Written
HERE
VALUE
lorem
Ipsum
Explanation
Match one or more uppercase alpha character [A-Z]+
Or match zero or one uppercase alpha character [A-Z]?, followed by one or more lowercase alpha characters [a-z]+
Ensure what follows is an uppercase alpha character [A-Z] or word boundary character \b
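Since the question is about Java, here is one way the pattern above might be used there: collect the matches with Matcher.find() rather than splitting. This is a sketch; the class name CamelWords and the method words are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CamelWords {
    // Same pattern as above; only the backslash in \b is doubled for the Java string literal.
    private static final Pattern WORD =
            Pattern.compile("([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\\b)");

    public static List<String> words(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            out.add(m.group(1)); // each match is one word of the identifier
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("eclipseRCPExt")); // [eclipse, RCP, Ext]
        System.out.println(words("VALUE"));         // [VALUE]
        System.out.println(words("loremIpsum"));    // [lorem, Ipsum]
    }
}
```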
You can use StringUtils.splitByCharacterTypeCamelCase("loremIpsum") from Apache Commons Lang.
You can use the expression below for Java:
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?=[A-Z][a-z])|(?<=\\d)(?=\\D)|(?=\\d)(?<=\\D)
Instead of looking for separators that aren't there you might also consider finding the name components (those are certainly there):
String test = "_eclipse福福RCPExt";
Pattern componentPattern = Pattern.compile("_? (\\p{Upper}?\\p{Lower}+ | (?:\\p{Upper}(?!\\p{Lower}))+ \\p{Digit}*)", Pattern.COMMENTS);
Matcher componentMatcher = componentPattern.matcher(test);
List<String> components = new LinkedList<>();
int endOfLastMatch = 0;
while (componentMatcher.find()) {
// matches should be consecutive
if (componentMatcher.start() != endOfLastMatch) {
// do something horrible if you don't want garbage in between
// we're lenient though, any Chinese characters are lucky and get through as group
String startOrInBetween = test.substring(endOfLastMatch, componentMatcher.start());
components.add(startOrInBetween);
}
components.add(componentMatcher.group(1));
endOfLastMatch = componentMatcher.end();
}
if (endOfLastMatch != test.length()) {
String end = test.substring(endOfLastMatch); // start() would throw here because the last find() failed
components.add(end);
}
System.out.println(components);
This outputs [eclipse, 福福, RCP, Ext]. Conversion to an array is of course simple.
I can confirm that the regex string ([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b) given by ctwheels, above, works with the Microsoft flavour of regex.
I would also like to suggest the following alternative, based on ctwheels' regex, which handles numeric characters: ([A-Z0-9]+|[A-Z]?[a-z]+)(?=[A-Z0-9]|\b).
This is able to split strings such as:
DrivingB2BTradeIn2019Onwards
to
Driving B2B Trade In 2019 Onwards
A JavaScript Solution
/**
* howToDoThis ===> ["", "how", "To", "Do", "This"]
* @param word word to be split
*/
export const splitCamelCaseWords = (word: string) => {
if (typeof word !== 'string') return [];
return word.replace(/([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)/g, '!$&').split('!');
};

Subtracting characters in a back reference from a character class in java.util.regex.Pattern

Is it possible to subtract the characters in a Java regex back reference from a character class?
e.g., I want to use String#matches(regex) to match either:
any group of characters that are [a-z'] that are enclosed by "
Matches: "abc'abc"
Doesn't match: "1abc'abc"
Doesn't match: 'abc"abc'
any group of characters that are [a-z"] that are enclosed by '
Matches: 'abc"abc'
Doesn't match: '1abc"abc'
Doesn't match: "abc'abc"
The following regex won't compile because [^\1] isn't supported:
(['"])[a-z'"&&[^\1]]*\1
Obviously, the following will work:
'[a-z"]*'|"[a-z']*"
But, this style isn't particularly legible when a-z is replaced by a much more complex character class that must be kept the same in each side of the "or" condition.
I know that, in Java, I can just use String concatenation like the following:
String charClass = "a-z";
String regex = "'[" + charClass + "\"]*'|\"[" + charClass + "']*\"";
But, sometimes, I need to specify the regex in a config file, like XML, or JSON, etc., where java code is not available.
I assume that what I'm asking is almost definitely not possible, but I figured it wouldn't hurt to ask...
One approach is to use a negative look-ahead to make sure that every character in between the quotes is not the quotes:
(['"])(?:(?!\1)[a-z'"])*+\1
^^^^^^
(I also make the quantifier possessive, since there is no use for backtracking here)
This approach is, however, rather inefficient, since the pattern will check for the quote character for every single character, on top of checking that the character is one of the allowed character.
The alternative with 2 branches in the question '[a-z"]*'|"[a-z']*" is better, since the engine only checks for the quote character once and goes through the rest by checking that the current character is in the character class.
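To make the behaviour of the look-ahead version concrete, here is a small sketch that runs it against the examples from the question. The class name QuotedMatch and the helper isQuoted are invented for the example; the pattern is the one shown above, written as a Java string literal.

```java
import java.util.regex.Pattern;

public class QuotedMatch {
    // Group 1 captures the opening quote; (?!\1) rejects that same quote inside the body.
    public static final Pattern QUOTED =
            Pattern.compile("(['\"])(?:(?!\\1)[a-z'\"])*+\\1");

    public static boolean isQuoted(String s) {
        return QUOTED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isQuoted("\"abc'abc\""));  // true
        System.out.println(isQuoted("\"1abc'abc\"")); // false
        System.out.println(isQuoted("'abc\"abc'"));   // true
        System.out.println(isQuoted("'1abc\"abc'"));  // false
    }
}
```

Only the character class [a-z'"] would need to change for a more complex set, and because the pattern is a single string it can live in an XML or JSON config file.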
You could use two patterns in one OR-separated pattern, expressing both your cases:
// | case 1: [a-z'] enclosed by "
// | | OR
// | | case 2: [a-z"] enclosed by '
Pattern p = Pattern.compile("(?<=\")([a-z']+)(?=\")|(?<=')([a-z\"]+)(?=')");
String[] test = {
// will match group 1 (for case 1)
"abcd\"efg'h\"ijkl",
// will match group 2 (for case 2)
"abcd'efg\"h'ijkl",
};
for (String t: test) {
Matcher m = p.matcher(t);
while (m.find()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
}
}
Output
efg'h
null
null
efg"h
Note
There is nothing stopping you from specifying the enclosing characters or the character class itself somewhere else, then building your Pattern with components unknown at compile-time.
Something in the lines of:
// both strings are emulating unknown-value arguments
String unknownEnclosingCharacter = "\"";
String unknownCharacterClass = "a-z'";
// probably want to catch a PatternSyntaxException here for potential
// issues with the given arguments
Pattern p = Pattern.compile(
String.format(
"(?<=%1$s)([%2$s]+)(?=%1$s)",
unknownEnclosingCharacter,
unknownCharacterClass
)
);
String[] test = {
"abcd\"efg'h\"ijkl",
"abcd'efg\"h'ijkl",
};
for (String t: test) {
Matcher m = p.matcher(t);
while (m.find()) {
// note: only main group here
System.out.println(m.group());
}
}
Output
efg'h

I need a Java regular expression

I am currently using the following regular expression:
^[a-zA-Z]{0,}(\\*?)?[a-zA-Z0-9]{0,}
to check that a string starts with an alpha character, ends with alphanumeric characters and has an asterisk (*) anywhere in the string, but at most once. The problem is that a string still passes if it starts with a number and doesn't have an *, which should fail. How can I rework the regex to fail this case?
ex.
TE - pass
*TE - pass
TE* - pass
T*E - pass
*9TE - pass
*TE* - fail (multiple asterisk)
9E - fail (starts with number)
EDIT:
Sorry to introduce a late edit but I also need to ensure that the string is 8 characters or less, can I include that in the regex as well? Or should I just check the string length after the regex validation?
This passes your example:
"^([a-zA-Z]+\\*?|\\*)[a-zA-Z0-9]*$"
It says:
start with: [a-zA-Z]+\\*? (a letter and maybe a star)
| (or)
\\* a single star
and end with [a-zA-Z0-9]* (zero or more alphanumeric characters)
Code to test it:
public static void main(final String[] args) {
final Pattern p = Pattern.compile("^([a-zA-Z]+\\*?|\\*)\\w*$");
System.out.println(p.matcher("TE").matches());
System.out.println(p.matcher("*TE").matches());
System.out.println(p.matcher("TE*").matches());
System.out.println(p.matcher("T*E").matches());
System.out.println(p.matcher("*9TE").matches());
System.out.println(p.matcher("*TE*").matches());
System.out.println(p.matcher("9E").matches());
}
Per Stargazer, if you allow alphanumeric before the star, then use this:
^([a-zA-Z][a-zA-Z0-9]*\\*?|\\*)\\w*$
One possible way is to separate into 2 conditions:
^(?=[^*]*\*?[^*]*$)[a-zA-Z*][a-zA-Z0-9*]*$
The (?=[^*]*\*?[^*]*$) part ensures there is at most one * in the string.
The [a-zA-Z*][a-zA-Z0-9*]* part ensures it starts with an alphabet or a *, and followed by only alphanumerals or *.
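The length requirement from the question's edit can be folded into this approach as one more look-ahead. The sketch below assumes "8 characters or less" means 1 to 8 characters including the asterisk; the class name CodeValidator and the method isValid are hypothetical.

```java
import java.util.regex.Pattern;

public class CodeValidator {
    private static final Pattern VALID = Pattern.compile(
            "^(?=.{1,8}$)"              // assumed limit: 1 to 8 characters in total
          + "(?=[^*]*\\*?[^*]*$)"       // at most one asterisk
          + "[a-zA-Z*][a-zA-Z0-9*]*$"); // letter or * first, then alphanumerics or *

    public static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] tests = {"TE", "*TE", "TE*", "T*E", "*9TE", "*TE*", "9E", "ABCDEFGHI"};
        for (String test : tests) {
            System.out.println(test + " " + (isValid(test) ? "pass" : "fail"));
        }
    }
}
```

Checking the length with String.length() after the regex validation would work just as well and may be easier to read.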
It might be easier to develop and maintain later if you just break your regular expressions into a few pieces, e.g., one for the start and end, and one for the asterisk. I am not sure what the overall performance effect would be, you would have simpler expressions but have to run a few of them.
This is Python, it'll need some massaging for Java:
>>> import re
>>> p = re.compile('^([a-z][^*]*[*]?[^*]*[a-z0-9]|[*][^*]*[a-z0-9]|[a-z][^*]*[*])$', re.I)
>>> for test in ['TE', '*TE', 'TE*', 'T*E', '*9TE', '*TE*', '9E']:
... if p.match(test):
... print(test, 'pass')
... else:
... print(test, 'fail')
...
TE pass
*TE pass
TE* pass
T*E pass
*9TE pass
*TE* fail
9E fail
Hope I didn't miss anything.
How about this, it's easier to read:
boolean pass = input.replaceFirst("\\*", "").matches("^[a-zA-Z].*\\w$");
Assuming I read right, you want to:
Start with an alpha character
End with an alphanumeric character
Allow up to one * anywhere
At most one asterisk, alphabetic characters anywhere and numbers anywhere but at start.
String alpha = "[a-zA-Z]";
String alnum = "[a-zA-Z0-9]";
String asteriskNone = "^" + alpha + "+" + alnum + "*";
String asteriskStart = "^\\*" + alnum + "*";
String asteriskInside = "^" + alpha + "+" + alnum + "*\\*" + alnum + "*";
String yourRegex = asteriskNone + "|" + asteriskStart + "|"
+ asteriskInside;
String[] tests = {"TE","*TE","TE*","T*E","*9TE","*TE*", "9E"};
for (String test : tests)
System.out.println(test + " " + (test.matches(yourRegex)?"PASS":"FAIL"));
Look for two possible patterns, one starting with *, and one with an alpha char:
^[a-zA-Z][a-zA-Z0-9]*(\\*?)?[a-zA-Z0-9]*|\*[a-zA-Z0-9]*
^([a-zA-Z][a-zA-Z0-9]*\*|\*|[a-zA-Z])([a-zA-Z0-9])*$
the parenthesis around the second half are for clarity and can be safely excluded.
This was a tough one (liked the challenge), but here it is:
^(\*[a-zA-Z0-9]+|[a-zA-Z]+[\*]{1}[a-zA-Z]*)$
In order to comply with T9*Z, as pointed out on another post with StarGazer712, I had to change it to:
^(\*[a-zA-Z0-9]+|[a-zA-Z]{1}[a-zA-Z0-9]*[\*]{1}[a-zA-Z0-9]*)$
