Java Regex Handling - java

Looking for a Java regex function to 1 - return true if special characters are present before or after any of my array elements
2 - return false if alpha characters are present before and after any of my array elements
My array elements wordsList = {"TIN","tin"}
Input:
1-CardTIN is 1111
2-Card-TIN:2222
3-CardTINis3333
4-Card#TIN#4444
5-CardTIN#5555
6-TINis9999
Expected Output:
1-True
2-True
3-False
4-True
5-True
6-False
I have tried regex function to cover these cases
Arrays.stream(wordsList).anyMatch(word -> Pattern.compile("([^a-zA-Z0-9])" + Pattern.quote(word) + "([^a-zA-Z0-9])").matcher(string).find())
But the case CardTIN#5555 does not give the expected result.
Kindly help with these cases.

You can use a negative lookahead to assert that the line does not consist solely of alphanumeric characters around tin or TIN:
(?i)^(?![a-zA-Z0-9]*(?:TIN|tin)[a-zA-Z0-9]*$).*(?:TIN|tin).*
(?i) Case insensitive match
^ Start of string
(?![a-zA-Z0-9]*(?:TIN|tin)[a-zA-Z0-9]*$) Assert that TIN or tin (as it is case insensitive, it does not matter for this example) does not occur between alpha numeric chars (no special characters so to say)
.*(?:TIN|tin).* Match the word in the line
You might add word boundaries \\b(?:TIN|tin)\\b for a more precise match.
Example for a single line:
String s = "CardTIN is 1111";
String[] wordsList = {"TIN","tin"};
String alt = "(?:" + String.join("|", wordsList) + ")";
String regex = "(?i)^(?![a-zA-Z0-9]*" + alt + "[a-zA-Z0-9]*$).*" + alt + ".*";
System.out.println(s.matches(regex));
Output
true
You can also join the list of words on | and then filter the list:
String strings[] = { "CardTIN is 1111", "Card-TIN:2222", "CardTINis3333", "Card#TIN#4444", "CardTIN#5555", "TINis9999", "test", "Card Tin is 1111" };
String[] wordsList = {"TIN","tin"};
String alt = "(?:" + String.join("|", wordsList) + ")";
String regex = "(?i)^(?![a-zA-Z0-9]*" + alt + "[a-zA-Z0-9]*$).*" + alt + ".*";
List<String> result = Arrays.stream(strings)
        .filter(word -> word.matches(regex))
        .collect(Collectors.toList());
for (String res : result)
    System.out.println(res);
Output
CardTIN is 1111
Card-TIN:2222
Card#TIN#4444
CardTIN#5555
Card Tin is 1111
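As a rough check against the six inputs from the question, the pattern above could be wrapped in a small helper. This is only a sketch: the class name TinCheck and the method buildPattern are made up for illustration, and Pattern.quote is added on the assumption that the words in the list should be matched literally.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class TinCheck {
    // Hypothetical helper (not part of the answer): builds the lookahead pattern
    // from a word list, quoting each word so that it is matched literally.
    static Pattern buildPattern(String... words) {
        String alt = Arrays.stream(words)
                .map(Pattern::quote)
                .collect(Collectors.joining("|", "(?:", ")"));
        String regex = "^(?![a-zA-Z0-9]*" + alt + "[a-zA-Z0-9]*$).*" + alt + ".*";
        // CASE_INSENSITIVE replaces the inline (?i) flag
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        Pattern p = buildPattern("TIN");
        String[] inputs = {
            "CardTIN is 1111", "Card-TIN:2222", "CardTINis3333",
            "Card#TIN#4444", "CardTIN#5555", "TINis9999"
        };
        for (String input : inputs) {
            // Prints true, true, false, true, true, false for the inputs above
            System.out.println(input + " -> " + p.matcher(input).matches());
        }
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex for every line, which String.matches does on each call.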

I am not sure whether your requirements can be expressed as a regex. Even if they can, you are heading into special cases that will become hard to maintain.
Therefore you might be better off using a lexer/scanner combination, which usually makes up a complete parser. Check out ANTLR.

Related

Java: String.replaceAll(regex, replacement);

I have a string of comma-separated user-ids and I want to eliminate/remove specific user-id from a string.
I have the following possible strings and the expected result:
int elimiateUserId = 11;
String css1 = "11,22,33,44,55";
String css2 = "22,33,11,44,55";
String css3 = "22,33,44,55,11";
// The expected result in all cases, after replacement, should be:
// "22,33,44,55"
I tried the following:
String result = css#.replaceAll("," + elimiateUserId, ""); // # = 1 or 2 or 3
result = css#.replaceAll(elimiateUserId + "," , "");
This logic fails in case of css3. Please suggest me a proper solution for this issue.
Note: I'm working with Java 7
I checked around the following posts, but could not find any solution:
Java String.replaceAll regex
java String.replaceAll regex question
Java 1.3 String.replaceAll() , replacement
You can use the Stream API in Java 8:
int elimiateUserId = 11;
String css1 = "11,22,33,44,55";
String css1Result = Stream.of(css1.split(","))
.filter(value -> !String.valueOf(elimiateUserId).equals(value))
.collect(Collectors.joining(","));
// css1Result = 22,33,44,55
If you want to use regex, you may use (remember to properly escape as java string literal)
,\b11\b|\b11\b,
This will ensure that 11 won't be matched as part of another number due to the word boundaries and only one comma (if two are present) is matched and removed.
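Written as a Java string literal, each backslash in that pattern has to be doubled. A minimal sketch, assuming a hypothetical helper named removeId (the class and method names are not from the answer):

```java
class CsvIdRemover {
    // Hypothetical helper: removes one id together with exactly one adjacent comma.
    static String removeId(String csv, int id) {
        // Regex ,\b11\b|\b11\b, written with doubled backslashes for Java
        String regex = ",\\b" + id + "\\b|\\b" + id + "\\b,";
        return csv.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        System.out.println(removeId("11,22,33,44,55", 11)); // 22,33,44,55
        System.out.println(removeId("22,33,11,44,55", 11)); // 22,33,44,55
        System.out.println(removeId("22,33,44,55,11", 11)); // 22,33,44,55
        System.out.println(removeId("111,11,211", 11));     // 111,211
    }
}
```

A string that consists of the single id only ("11") has no comma to match, so it is returned unchanged; that case would need separate handling.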
You may build a regex like
^11,|,11\b
that will match 11, at the start of a string (^11,) or (|) ,11 not followed with any other word char (,11\b).
int elimiate_user_id = 11;
String pattern = "^" + elimiate_user_id + ",|," + elimiate_user_id + "\\b";
System.out.println("11,22,33,44,55,111".replaceAll(pattern, "")); // => 22,33,44,55,111
System.out.println("22,33,11,44,55,111".replaceAll(pattern, "")); // => 22,33,44,55,111
System.out.println("22,33,44,55,111,11".replaceAll(pattern, "")); // => 22,33,44,55,111
Try the expression (^(11)(?:,))|((?<=,)(11)(?:,))|(,11$) with replaceAll:
// String.valueOf avoids MessageFormat's locale-specific number formatting (e.g. 1,000)
final String regexp = MessageFormat.format("(^({0})(?:,))|((?<=,)({0})(?:,))|(,{0}$)", String.valueOf(elimiateUserId));
String result = css#.replaceAll(regexp, ""); // for all cases
Here is an example:
https://regex101.com/r/LwJgRu/3
Try this:
String result = css#.replaceAll("," + elimiateUserId, "")
.replaceAll(elimiateUserId + "," , "");
You can chain two replace calls in one statement, like:
int elimiateUserId = 11;
String result = css#.replace("," + elimiateUserId , "").replace(elimiateUserId + ",", "");
If your string contains ,11 then the first replace removes it.
If your string contains 11, then the second replace removes it.
result
11,22,33,44,55 -> 22,33,44,55
22,33,11,44,55 -> 22,33,44,55
22,33,44,55,11 -> 22,33,44,55
String result = css#.replaceAll("," + eliminate_user_id + "\\b|\\b" + eliminate_user_id + ",", "");
The regular expression here is:
, A leading comma.
eliminate_user_id I assumed the missing 'n' here was a typo.
\b Word boundary: word/number characters end here.
| OR
\b Word boundary: word/number characters begin here.
eliminate_user_id again.
, A trailing comma.
The word boundary marker, matching the beginning or end of a "word", is the magic here. It means that the 11 will match in these strings:
11,22,33,44,55
22,33,11,44,55
22,33,44,55,11
But not these strings:
111,112,113,114
411,311,211,111
There's a cleaner way, though:
String result = css#.replaceAll("(,?)\\b" + eliminate_user_id + "\\b(?(1)|,)", "");
The regular expression here is:
( A capturing group - what's in here, is in group 1.
,? An optional leading comma.
) End the capturing group.
\b Word boundary: word/number characters begin here.
eliminate_user_id I assumed the missing 'n' here was a typo.
\b Word boundary: word/number characters end here.
(?(1) If there's something in group 1, then require...
| ...nothing, but if there was nothing, then require...
, A trailing comma.
) end the if.
The "if" part here is a little unusual - you can find a little more information on regex conditionals here: http://www.regular-expressions.info/conditional.html
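Note that Java does not support regex conditionals: java.util.regex.Pattern rejects (?(1)...) with a PatternSyntaxException, as the posts under "Conditional Regular Expression in Java?" also suggest. The pattern above therefore only works in engines that have conditionals, such as PCRE :(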
Side-note: for performance, if the list is VERY long and there are VERY many removals to be performed, the most obvious option is to just run the above line for each number to be removed:
String css = "11,22,33,44,55,66,77,88,99,1010,1111,1212,...";
String[] removals = {"11", "33", "55", "77", "99", "1212"};
for (int i = 0; i < removals.length; i++) {
    css = css.replaceAll("," + removals[i] + "\\b|\\b" + removals[i] + ",", "");
}
(code not tested: don't have access to a Java compiler here)
This will be fast enough (worst case scales with about O(m*n) for m removals from a string of n ids), but we can maybe do better.
One is to build the regex to be \b(11|42|18|13|123|...etc)\b - that is, make the regex search for all ids to be removed at the same time. In theory this scales a little worse, scaling with O(m*n) in every case rather than just the worst case, but in practice it should be considerably faster.
String css = "11,22,33,44,55,66,77,88,99,1010,1111,1212,...";
String[] removals = {"11", "33", "55", "77", "99", "1212"};
// Group the alternation so that the comma and \b apply to every id
String removalsStr = "(?:" + String.join("|", removals) + ")";
css = css.replaceAll("," + removalsStr + "\\b|\\b" + removalsStr + ",", "");
But another approach might be to build a hashtable of the ids in the long string, then remove all the ids from the hashtable, then concatenate the remaining hashtable keys back into a string. Since hashtable lookups are effectively O(1) for sparse hashtables, that makes this scale with O(n). The tradeoff here is the extra memory for that hashtable, though.
(I don't think I can do this version without a java compiler handy. I would not recommend this approach unless you have a VAST (many thousands) list of IDs to remove, anyway, as it will be much uglier and more complex code).
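The hashtable idea described above might look roughly like the following. It is a sketch under assumptions, not a tested drop-in: it needs Java 8 or later for String.join, the names BulkRemove and removeAll are invented for illustration, and a LinkedHashSet is used so that the original order of the ids is preserved (which also means duplicate ids are collapsed).

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

class BulkRemove {
    // Hypothetical helper: removes every id in 'removals' from a comma-separated list.
    static String removeAll(String csv, Collection<String> removals) {
        // Insertion-ordered hash set of the ids in the long string
        Set<String> ids = new LinkedHashSet<>(Arrays.asList(csv.split(",")));
        // Copy the removals into a HashSet so each membership check is O(1) on average
        ids.removeAll(new HashSet<>(removals));
        // Concatenate the remaining ids back into a string
        return String.join(",", ids);
    }

    public static void main(String[] args) {
        String css = "11,22,33,44,55,66,77,88,99,1010,1111,1212";
        String result = removeAll(css, Arrays.asList("11", "33", "55", "77", "99", "1212"));
        System.out.println(result); // 22,44,66,88,1010,1111
    }
}
```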
I think it's safer to maintain a whitelist and then use it as a reference to make further changes.
List<String> whitelist = Arrays.asList("22", "33", "44", "55");
String s = "22,33,44,55,11";
String[] sArr = s.split(",");
StringBuilder ids = new StringBuilder();
for (String id : sArr) {
if (whitelist.contains(id)) {
ids.append(id).append(", ");
}
}
String r = ids.substring(0, ids.length() - 2);
System.out.println(r);
If you need a solution with regex, the following works for the three inputs above.
int elimiate_user_id = 11;
String css1 = "11,22,33,44,55";
String css2 = "22,33,11,44,55";
String css3 = "22,33,44,55,11";
String resultCss=css1.replaceAll(elimiate_user_id+"[,]*", "").replaceAll(",$", "");
It works for the inputs above, but note that it also strips 11 out of longer ids such as 111 or 211.
This should work:
replaceAll("(11,|,11)", "")
At least as long as you can guarantee that there is no 311, or ,113 or so.

Regex extract last numbers from String

I have some strings which are indexed and are dynamic.
For example:
name01,
name02,
name[n]
now I need to separate name from index.
I've come up with this regex which works OK to extract index.
([0-9]+(?!.*[0-9]))
But there are some exceptions among these names. Some of them may have a number appended which is not the index. (These strings are limited and I know them, meaning I can add them as "exceptions" in the regex.)
For example,
panLast4[01]
Here the last '4' is not part of the index, so I need to distinguish.
So I tried:
[^panLast4]([0-9]+(?!.*[0-9]))
Which works for panLast4[123] but not panLast4[43]
Note: the "[" and "]" is for explanation purposes only, it's not present in the strings
What is wrong?
Thanks
You can use the split method with this pattern:
(?<!^panLast(?=4)|^nm(?=14)|^nm1(?=4))(?=[0-9]+$)
The idea is to find the position that is followed only by digits until the end of the string: (?=[0-9]+$). The match succeeds only if the negative lookbehind allows it, which excludes particular names (panLast4 and nm14 here) that end with digits. When one of these particular names is found, the regex engine must move on to the next position to obtain a match.
Example:
String s ="panLast412345";
String[] res = s.split("(?<!^panLast(?=4)|^nm(?=14)|^nm1(?=4))(?=[0-9]+$)", 2);
if ( res.length==2 ) {
System.out.println("name: " + res[0]);
System.out.println("ID: " + res[1]);
}
Another method uses matches() with a lazy quantifier as the last alternative:
Pattern p = Pattern.compile("(panLast4|nm14|.*?)([0-9]+)");
String s = "panLast42356";
Matcher m = p.matcher(s);
if ( m.matches() && m.group(1).length()>0 ) {
System.out.println("name: "+ m.group(1));
System.out.println("ID: "+ m.group(2));
}
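To try the matches() variant on several names at once, it could be wrapped as below. This is a sketch only: the class IndexedName, the method split and the EXCEPTIONS array are hypothetical names, and the exception list holds just the two names mentioned in this answer.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class IndexedName {
    // Hypothetical list of known names that legitimately end in digits
    private static final String[] EXCEPTIONS = { "panLast4", "nm14" };

    // (panLast4|nm14|.*?)([0-9]+) built from the list above
    private static final Pattern NAME_AND_INDEX = Pattern.compile(
            "(" + Arrays.stream(EXCEPTIONS).map(Pattern::quote).collect(Collectors.joining("|"))
                    + "|.*?)([0-9]+)");

    // Returns {name, index}, or null when the input does not end in an index.
    static String[] split(String s) {
        Matcher m = NAME_AND_INDEX.matcher(s);
        if (m.matches() && !m.group(1).isEmpty()) {
            return new String[] { m.group(1), m.group(2) };
        }
        return null;
    }

    public static void main(String[] args) {
        for (String s : new String[] { "name01", "panLast4123", "panLast443", "nm1401" }) {
            String[] parts = split(s);
            // Prints: name / 01, panLast4 / 123, panLast4 / 43, nm14 / 01
            System.out.println(parts[0] + " / " + parts[1]);
        }
    }
}
```

One caveat of this approach: an exception name with no index at all (plain "panLast4") falls through to the lazy alternative and is reported as name "panLast" with index "4".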

RegEx Split on / Except when Surrounded by []

I am trying to split a string in Java on / but I need to ignore any instances where / is found between []. For example if I have the following string
/foo/bar[donkey=King/Kong]/value
Then I would like to return the following in my output
foo
bar[donkey=King/Kong]
value
I have seen a couple other similar posts, but I haven't found anything that fits exactly what I'm trying to do. I've tried the String.split() method and as follows and have seen weird results:
Code: value.split("/[^/*\\[.*/.*\\]]")
Result: [, oo, ar[donkey=King, ong], alue]
What do I need to do in order to get back the following:
Desired Result: [, foo, bar[donkey=King/Kong], value]
Thanks,
Jeremy
You need to split on each / that is followed by 0 or more balanced pairs of brackets:
String str = "/foo/bar[donkey=King/Kong]/value";
String[] arr = str.split("/(?=([^\\[\\]]*\\[[^\\[\\]]*\\])*[^\\[\\]]*$)");
System.out.println(Arrays.toString(arr));
Output:
[, foo, bar[donkey=King/Kong], value]
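The same lookahead can be precompiled and reused when many paths are split. A minimal sketch, assuming brackets are never nested; the names BracketSplit and splitOutsideBrackets are invented here, and a non-capturing group is used in place of the capturing one since the group content is not needed.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class BracketSplit {
    // A '/' counts as a separator only if the rest of the string consists of
    // complete [...] pairs plus bracket-free text (assumes no nested brackets).
    private static final Pattern SEPARATOR =
            Pattern.compile("/(?=(?:[^\\[\\]]*\\[[^\\[\\]]*\\])*[^\\[\\]]*$)");

    // Hypothetical helper around Pattern.split
    static String[] splitOutsideBrackets(String path) {
        return SEPARATOR.split(path);
    }

    public static void main(String[] args) {
        String[] parts = splitOutsideBrackets("/foo/bar[donkey=King/Kong]/value");
        System.out.println(Arrays.toString(parts)); // [, foo, bar[donkey=King/Kong], value]
    }
}
```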
More User friendly explanation
String[] arr = str.split("(?x)/" + // Split on `/`
"(?=" + // Followed by
" (" + // Start a capture group
" [^\\[\\]]*" + // 0 or more non-[, ] character
" \\[" + // then a `[`
" [^\\]\\[]*" + // 0 or more non-[, ] character
" \\]" + // then a `]`
" )*" + // 0 or more repetition of previous pattern
" [^\\[\\]]*" + // 0 or more non-[, ] characters
"$)"); // till the end
In the following string, the regex below will match foo and bar, but not fox and baz, because they're followed by a close bracket. Study up on negative lookahead.
fox]foo/bar/baz]
Regex:
\b(\w+)\b(?!])

finding text until end of line regex

I'm trying to use regex to find a particular starting character and then getting the rest of the text on that particular line.
For example the text can be like ...
V: mid-voice
T: tempo
I want to use regex to grab "V:" and the text behind it.
Is there any good, quick way to do this using regular expressions?
If your starting character were fixed, you would create a pattern like:
Pattern vToEndOfLine = Pattern.compile("(V:[^\\n]*)");
and use find() rather than matches().
If your starting character is dynamic, you can always write a method to return the desired pattern:
Pattern getTailOfLinePatternFor(String start) {
    // Pattern.quote makes the start text match literally
    return Pattern.compile("(" + Pattern.quote(start) + "[^\\n]*)");
}
These can be worked on a little bit depending on your needs.
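For your example, try this pattern (compile it with Pattern.MULTILINE if the text has several lines, so that $ matches at the end of each line):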
V:.*$
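A small sketch of that pattern with find() and the MULTILINE flag follows. The class LineTail and the method tailOf are hypothetical names, and the capturing group plus the skipping of blanks after the colon are additions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineTail {
    // Hypothetical helper: returns the text after "key:" on the first line that
    // starts with it, or null if there is no such line.
    static String tailOf(String text, String key) {
        // MULTILINE makes ^ and $ match at line boundaries; '.' never crosses a newline
        Pattern p = Pattern.compile("^" + Pattern.quote(key) + ":[ \\t]*(.*)$", Pattern.MULTILINE);
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "V: mid-voice\nT: tempo";
        System.out.println(tailOf(text, "V")); // mid-voice
        System.out.println(tailOf(text, "T")); // tempo
    }
}
```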
Here's the best, cleanest and easiest (ie one-line) way:
String[] parts = str.split("(?<=^\\w+): ");
Explanation:
The regex uses a positive look behind to break on the ": " after the first word (in this case "V") and capture both halves.
Here's a test:
String str = "V: mid-voice T: tempo";
String[] parts = str.split("(?<=^\\w+): ");
System.out.println("key = " + parts[0] + "\nvalue = " + parts[1]);
Output:
key = V
value = mid-voice T: tempo

Java Split not working as expected

I am trying to use a simple split to break up the following string: 00-00000
My expression is: ^([0-9][0-9])(-)([0-9])([0-9])([0-9])([0-9])([0-9])
And my usage is:
String s = "00-00000";
String pattern = "^([0-9][0-9])(-)([0-9])([0-9])([0-9])([0-9])([0-9])";
String[] parts = s.split(pattern);
If I play around with the Pattern and Matcher classes I can see that my pattern does match and the matcher tells me my groupCount is 7 which is correct. But when I try and split them I have no luck.
String.split does not use capturing groups as its result. It finds whatever matches and uses that as the delimiter. So the resulting String[] are substrings in between what the regex matches. As it is the regex matches the whole string, and with the whole string as a delimiter there is nothing else left so it returns an empty array.
If you want to use regex capturing groups you will have to use Matcher.group(); String.split() will not do.
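To make the difference concrete, here is a short sketch that runs the pattern from the question both ways. The class SplitVsGroups and the helper groups are names made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitVsGroups {
    // Hypothetical helper: returns capturing groups 1..n, or null if the
    // whole input does not match the regex.
    static String[] groups(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i); // group 0 is the whole match, so start at 1
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "00-00000";
        String pattern = "^([0-9][0-9])(-)([0-9])([0-9])([0-9])([0-9])([0-9])";
        // The whole string is the delimiter, so nothing is left over
        System.out.println(s.split(pattern).length); // 0
        // The capturing groups hold the seven pieces
        System.out.println(String.join(" ", groups(s, pattern))); // 00 - 0 0 0 0 0
    }
}
```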
For your example, you could simply do this:
String s = "00-00000";
String pattern = "-";
String[] parts = s.split(pattern);
I can not be sure, but I think what you are trying to do is to get each matched group into an array.
Matcher matcher = Pattern.compile(pattern).matcher(s);
if (matcher.matches()) {
    String[] groups = new String[matcher.groupCount()];
    for (int i = 0; i < matcher.groupCount(); i++) {
        groups[i] = matcher.group(i + 1); // group 0 is the whole match
    }
}
From the documentation:
String[] split(String regex) -- Returns: the array of strings computed by splitting this string around matches of the given regular expression
Essentially the regular expression is used to define delimiters in the input string. You can use capturing groups and backreferences in your pattern (e.g. for lookarounds), but ultimately what matters is what and where the pattern matches, because that defines what goes into the returned array.
If you want to split your original string into 7 parts using regular expression, then you can do something like this:
String s = "12-3456";
String[] parts = s.split("(?!^)");
System.out.println(parts.length); // prints "7"
for (String part : parts) {
System.out.println("Part [" + part + "]");
} // prints "Part [1]", "Part [2]", "Part [-]", "Part [3]", "Part [4]", "Part [5]", "Part [6]", one per line
This splits on the zero-length assertion (?!^), which matches anywhere except before the first character in the string. This prevents the empty string from being the first element in the array, and the trailing empty string is already discarded because we use the default limit parameter of split.
Using a regular expression to get the individual characters of a string like this is overkill, though. If you have only a few characters, then the most concise option is to use foreach on the toCharArray():
for (char ch : "12-3456".toCharArray()) {
System.out.print("[" + ch + "] ");
}
This is not the most efficient option if you have a longer string.
Splitting on -
This may also be what you're looking for:
String s = "12-3456";
String[] parts = s.split("-");
System.out.println(parts.length); // prints "2"
for (String part : parts) {
System.out.print("[" + part + "] ");
} // prints "[12] [3456] "
