Java regex exact match with question mark and word boundary

In Java, I am trying to determine whether a user-supplied string (meaning I do not know in advance what the input will be) is contained exactly within another string, on word boundaries. So an input of "the" should not match in the text "there is no match". I am running into issues when the input string contains punctuation, however, and could use some help.
With no punctuation, this works just fine:
String input = "string contain";
Pattern p = Pattern.compile("\\b" + Pattern.quote(input) + "\\b");
//both should and do match
System.out.println(p.matcher("does this string contain the input").find());
System.out.println(p.matcher("does this string contain? the input").find());
However when the input has a question mark in it, the matching with the word boundary doesn't seem to work:
String input = "string contain?";
Pattern p = Pattern.compile("\\b" + Pattern.quote(input) + "\\b");
//should not match - doesn't
System.out.println(p.matcher("does this string contain the input").find());
//expected match - doesn't
System.out.println(p.matcher("does this string contain? the input").find());
//should not match - doesn't
System.out.println(p.matcher("does this string contain?fail the input").find());
Any help would be appreciated.

There's no word boundary between ? and the space that follows it, because neither of them is a word character; that's why your pattern doesn't match. You can change it to this:
Pattern.compile("(^|\\W)" + Pattern.quote(input) + "($|\\W)");
That matches: beginning of input or a non-word character, then the pattern, then end of input or a non-word character. Or, better, use a negative lookbehind and a negative lookahead, which do not consume the surrounding characters:
Pattern p = Pattern.compile("(?<!\\w)" + Pattern.quote(input) + "(?!\\w)");
This means, before and after your pattern there must not be a word character.
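As a minimal sketch of the lookaround version, the helper below wraps the quoted input in (?<!\w) and (?!\w) and runs it against the strings from the question. The class name ExactPhraseMatcher and the method containsExactly are made up for this illustration.

```java
import java.util.regex.Pattern;

class ExactPhraseMatcher {
    // Hypothetical helper: true if 'input' occurs in 'text' with no word
    // character immediately before or after it.
    static boolean containsExactly(String text, String input) {
        Pattern p = Pattern.compile("(?<!\\w)" + Pattern.quote(input) + "(?!\\w)");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String input = "string contain?";
        System.out.println(containsExactly("does this string contain the input", input));      // false
        System.out.println(containsExactly("does this string contain? the input", input));     // true
        System.out.println(containsExactly("does this string contain?fail the input", input)); // false
        System.out.println(containsExactly("there is no match", "the"));                       // false
    }
}
```

Because Pattern.quote treats the input literally, punctuation such as ? needs no extra escaping.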

You can use:
Pattern p = Pattern.compile("(\\s|^)" + Pattern.quote(input) + "(\\s|$)");
//---------------------------^^^^^^^----------------------------^^^^^^^
For these strings you will get:
does this string contain the input -> false
does this string contain? the input -> true
does this fail the input string contain? -> true
does this string contain?fail the input -> false
string contain? the input -> true
The idea is to match strings in which your input is surrounded by whitespace, or sits at the start or end of the string.

You are matching using word boundaries: \b.
Java's regex implementation treats the following characters as word characters:
\w := [a-zA-Z_0-9]
Non-word characters are simply those outside the above group:
[^\w] := [^a-zA-Z_0-9]
A word boundary is a transition from [a-zA-Z_0-9] to [^a-zA-Z_0-9] or vice versa (the start and end of the string count as the non-word side).
For the input "does this string contain? the input" and the literal pattern \\b\\Qstring contain?\\E\\b, the final \\b would have to fall between ? and the whitespace that follows it. Neither of those is a word character, so this is neither a word-to-non-word nor a non-word-to-word transition, which means it is not a word boundary.
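To see where \b actually matches, a small sketch can list every boundary position in a string. The class and method names here are hypothetical; only Pattern and Matcher from java.util.regex are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Hypothetical helper: returns every index at which \b matches in 'text'.
    static List<Integer> boundaryPositions(String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(text);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Indexes: c=0 ... n=6, ?=7, space=8, t=9, h=10, e=11
        System.out.println(boundaryPositions("contain? the")); // [0, 7, 9, 12]
    }
}
```

Index 8, the position between ? and the space, is missing from the result, which is exactly where the trailing \b in the question's pattern would need to match.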

Related

Java regex, replace certain characters except if it matches a pattern

I have this string "person","hobby","key" and I want to remove the double quotes from every word except key, so that the output is person,hobby,"key"
String str = "\"person\",\"hobby\",\"key\"";
System.out.println(str+"\n");
str=str.replaceAll("/*regex*/","");
System.out.println(str); //person,hobby,"key"

You may use the following pattern:
\"(?!key\")(.+?)\"
And replace with $1
Details:
\" - Match a double quotation mark character.
(?!key\") - Negative Lookahead (not followed by the word "key" and another double quotation mark).
(.+?) - Match one or more characters (lazy) and capture them in group 1.
\" - Match another double quotation mark character.
Substitution: $1 - back reference to whatever was matched in group 1.
Here's a full example:
String str = "\"person\",\"hobby\",\"key\"";
String pattern = "\"(?!key\")(.+?)\"";
String result = str.replaceAll(pattern, "$1");
System.out.println(result); // person,hobby,"key"

Masking using regular expressions for below format

I am trying to write a regular expression to mask the below string. Example below.
Input
A1../D//FASDFAS--DFASD//.F
Output (skip the first five and last two alphanumerics)
A1../D//FA***********D//.F
I am trying the regex below
([A-Za-z0-9]{5})(.*)(.{2})
Any help would be highly appreciated.

You can solve this by using Pattern and Matcher with a regex that matches multiple groups:
String str = "A1../D//FASDFAS--DFASD//.F";
Pattern pattern = Pattern.compile("(.*?\\/\\/..)(.*?)(.\\/\\/.*)");
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
str = matcher.group(1)
+ matcher.group(2).replaceAll(".", "*")
+ matcher.group(3);
}
Detail
(.*?\\/\\/..) first group: matches everything up to and including the first //, plus the two characters after it
(.*?) second group: matches everything between groups one and three
(.\\/\\/.*) third group: matches the character just before the next //, then everything to the end of the string
Outputs
A1../D//FA***********D//.F
I think this solution is more readable.

If you want to do that with a single regex you may use
text = text.replaceAll("(\\G(?!^|(?:[0-9A-Za-z][^0-9A-Za-z]*){2}$)|^(?:[^0-9A-Za-z]*[0-9A-Za-z]){5}).", "$1*");
Or, using the POSIX character class Alnum:
text = text.replaceAll("(\\G(?!^|(?:\\p{Alnum}\\P{Alnum}*){2}$)|^(?:\\P{Alnum}*\\p{Alnum}){5}).", "$1*");
If you plan to replace any code point rather than a single code unit with an asterisk, replace . with \P{M}\p{M}*+ ("\\P{M}\\p{M}*+").
To make . match line break chars, add (?s) at the start of the pattern.
Details
(\G(?!^|(?:[0-9A-Za-z][^0-9A-Za-z]*){2}$)|^(?:[^0-9A-Za-z]*[0-9A-Za-z]){5}) - capturing group 1, which matches either of the following:
\G(?!^|(?:[0-9A-Za-z][^0-9A-Za-z]*){2}$) - the end of the previous successful match, provided that this position is not the start of the string and is not followed by exactly 2 occurrences of an alphanumeric char followed by 0 or more non-alphanumeric chars up to the end of the string
| - or
^(?:[^0-9A-Za-z]*[0-9A-Za-z]){5} - start of string, followed by five occurrences of 0 or more non-alphanumeric chars followed by an alphanumeric char
. - any code unit other than line break characters (if you use \P{M}\p{M}*+ - any code point).
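A minimal runnable sketch of the single-regex approach, wrapped in a hypothetical AlnumMasker.mask helper. Only the question's sample input is checked here; how the pattern behaves on inputs with very few alphanumeric characters is not covered by this sketch.

```java
class AlnumMasker {
    // Hypothetical helper: applies the \G-based pattern described above and
    // replaces each matched character with '*', keeping group 1 (the prefix).
    static String mask(String text) {
        return text.replaceAll(
            "(\\G(?!^|(?:[0-9A-Za-z][^0-9A-Za-z]*){2}$)|^(?:[^0-9A-Za-z]*[0-9A-Za-z]){5}).",
            "$1*");
    }

    public static void main(String[] args) {
        System.out.println(mask("A1../D//FASDFAS--DFASD//.F")); // A1../D//FA***********D//.F
    }
}
```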

Usually, masking of characters in the middle of a string can be done using negative lookbehind (?<!) and positive lookahead groups (?=).
But in this case a lookbehind group can't be used, because it has no obvious maximum length due to the unpredictable number of non-alphanumeric characters among the first five alphanumeric characters (the . and / in A1../D//FA).
A substring method can be used as a workaround for the inability to use a negative lookbehind group:
String str = "A1../D//FASDFAS--DFASD//.F";
int start = str.replaceAll("^((?:\\W{0,}\\w{1}){5}).*", "$1").length();
String maskedStr = str.substring(0, start) +
str.substring(start).replaceAll(".(?=(?:\\W{0,}\\w{1}){2})", "*");
System.out.println(maskedStr);
// A1../D//FA***********D//.F
But the most straightforward way is to use java.util.regex.Pattern and java.util.regex.Matcher:
String str = "A1../D//FASDFAS--DFASD//.F";
Pattern pattern = Pattern.compile("^((?:\\W{0,}\\w{1}){5})(.+)((?:\\W{0,}\\w{1}){2})");
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
String maskedStr = matcher.group(1) +
"*".repeat(matcher.group(2).length()) +
matcher.group(3);
System.out.println(maskedStr);
// A1../D//FA***********D//.F
}
\W{0,} - 0 or more non-word characters
\w{1} - exactly 1 word character (a letter, digit, or underscore)
(\W{0,}\w{1}){5} - 5 word characters, each optionally preceded by any number of non-word characters
(?:\W{0,}\w{1}){5} - the same, but without capturing it as a group
^((?:\\W{0,}\\w{1}){5})(.+)((?:\\W{0,}\\w{1}){2}) - substring with the first five alphanumeric characters (group 1), everything in between (group 2), substring with the last 2 alphanumeric characters (group 3)

How can I strip all non digits in a string except the first character?

I have a string that I want to make sure that the format is always a + followed by digits.
The following would work:
String parsed = inputString.replaceAll("[^0-9]+", "");
if(inputString.charAt(0) == '+') {
result = "+" + parsed;
}
else {
result = parsed;
}
But is there a way to have a regex in the replaceAll that would keep the + (if exists) in the beginning of the string and replace all non digits in the first line?

The following statement with the given regex would do the job:
String result = inputString.replaceAll("(^\\+)|[^0-9]", "$1");
(^\\+) matches a plus sign at the beginning of the string and captures it in group 1 ($1),
| or
[^0-9] matches any character which is not a digit,
$1 replaces each match with the contents of group 1: the plus sign if the leading + was matched, otherwise nothing.
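A short sketch of that replacement in action. The class name PhoneDigits, the method name normalize, and the sample inputs are invented for illustration.

```java
class PhoneDigits {
    // Hypothetical helper: keeps a leading '+' (if any) and strips every
    // other non-digit character.
    static String normalize(String input) {
        return input.replaceAll("(^\\+)|[^0-9]", "$1");
    }

    public static void main(String[] args) {
        System.out.println(normalize("+1 (555) 010-9999")); // +15550109999
        System.out.println(normalize("1+555 abc"));         // 1555
        System.out.println(normalize("++12"));              // +12
    }
}
```

When the second alternative matches, group 1 did not participate, and Java appends nothing for $1 in that case, so the character is simply dropped.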

You can use this expression:
String r = s.replaceAll("((?<!^)[^0-9]|^[^0-9+])", "");
The idea is to replace any non-digit when it is not the initial character of the string (that's the (?<!^)[^0-9] part with a lookbehind) or any character that is not a digit or plus that is the initial character of the string (the ^[^0-9+] part).

What about just
(?!^)\D+
Java string:
"(?!^)\\D+"
\D matches a character that is not a digit, i.e. [^0-9]
(?!^) uses a negative lookahead to check that the current position is not the start of the string
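One thing worth checking with this pattern is that it protects whatever the first character is, not only a +. The sketch below (class and method names are hypothetical) shows both the intended case and that side effect.

```java
class SkipFirstChar {
    // Hypothetical helper: removes runs of non-digits unless the run starts
    // at index 0, where the (?!^) lookahead blocks the match.
    static String strip(String input) {
        return input.replaceAll("(?!^)\\D+", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("+1 (555) 010")); // +1555010
        System.out.println(strip("12-34"));        // 1234
        System.out.println(strip("a1b2"));         // a12  (the leading 'a' survives)
    }
}
```

If the input may start with something other than a digit or +, one of the other patterns in this thread may be a better fit.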

Yes you can use this kind of replacement:
String parsed = inputString.replaceAll("^[^0-9+]*(\\+)|[^0-9]+", "$1");
If a + is present before the first digit in the string, it is captured in group 1. For example: dfd+sdfd12+sdf12 returns +1212 (the second + is removed since its position is after the first digit).

Try this:
1- This allows negative and positive numbers: it matches every character other than digits and ., except a - or + in the first position.
(?!^[-+])[^0-9.]
2- If you only want to allow + at first position
(?!^[+])[^0-9.]

java regex to accept any word other than none

I need a regular expression to match any string other than none.
I tried using
regular exp ="^[^none]$",
But it does not work.

If you are matching a String against a specific word in Java you should use equals(). In this case you want to invert the match so your logic becomes:
if(!theString.equals("none")) {
// do stuff here
}
Much less resource hungry, and much more intuitive.
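If you need to match a String which contains the word "none", you are probably looking for something like:
if(Pattern.compile("\\bnone\\b").matcher(theString).find()) {
/* find() succeeds if the substring "none" is enclosed between
* “word boundaries”, so it will not match for example: "nonetheless".
* (theString.matches("\\bnone\\b") would only accept the exact string
* "none", because matches() must match the whole input.)
*/
}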
Or if you can be fairly certain that “word boundaries” mean a specific delimiter you can still evade regular expressions by using the indexOf() method:
int i = theString.indexOf("none");
if(i > -1) {
if(i > 0) {
// check theString.charAt(i - 1) to see if it is a word boundary
// e.g.: whitespace
}
// the 4 is because of the fact that "none" is 4 characters long.
if((theString.length() - i - 4) > 0) {
// check theString.charAt(i + 4) to see if it is a word boundary
// e.g.: whitespace
}
}
else {
// not found.
}
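The outline above can be filled in roughly as follows. This is only a sketch: the names NoneFinder and containsWord are hypothetical, and treating whitespace as the only delimiter is an assumption. It also keeps searching after a rejected hit, which the outline does not do.

```java
class NoneFinder {
    // Hypothetical helper: true if 'word' occurs in 'text' delimited by
    // whitespace or by the start/end of the string (assumed delimiters).
    static boolean containsWord(String text, String word) {
        if (word.isEmpty()) {
            return false; // avoid looping forever on an empty search term
        }
        int i = text.indexOf(word);
        while (i > -1) {
            int end = i + word.length();
            boolean startOk = i == 0 || Character.isWhitespace(text.charAt(i - 1));
            boolean endOk = end == text.length() || Character.isWhitespace(text.charAt(end));
            if (startOk && endOk) {
                return true;
            }
            // A rejected hit such as "nonetheless" may be followed by a real one.
            i = text.indexOf(word, i + 1);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsWord("there is none here", "none")); // true
        System.out.println(containsWord("nonetheless", "none"));        // false
        System.out.println(containsWord("nonetheless none", "none"));   // true
    }
}
```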

You can use the regular expression (?!^none$).*. See this question for details: Regex inverse matching on specific string?
The reason "^[^none]$" doesn't work is that [^none] is a character class, so the pattern matches only strings consisting of exactly one character that is not "n", "o", or "e".
Of course, it would be easier to just use String.equals like so: !"none".equals(testString).
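A quick sketch comparing the two patterns; the class name NotNone and the method accepts are made up for this example.

```java
class NotNone {
    // Hypothetical helper: true for every string except exactly "none".
    static boolean accepts(String s) {
        return s.matches("(?!^none$).*");
    }

    public static void main(String[] args) {
        System.out.println(accepts("none"));        // false
        System.out.println(accepts("nonetheless")); // true
        System.out.println(accepts(""));            // true

        // The character-class attempt only accepts one-character strings.
        System.out.println("x".matches("^[^none]$"));   // true
        System.out.println("n".matches("^[^none]$"));   // false
        System.out.println("cat".matches("^[^none]$")); // false
    }
}
```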

Actually this is the regex to match all words except "word":
Pattern regex = Pattern.compile("\\b(?!word\\b)\\w+\\b");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
// matched text: regexMatcher.group()
// match start: regexMatcher.start()
// match end: regexMatcher.end()
}
You must use word boundaries so that "word" is not contained in other words.
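As a self-contained sketch, the loop can collect the matches into a list. The names WordsExcept and wordsExceptWord and the sample sentence are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordsExcept {
    // Hypothetical helper: returns every word in 'text' except the exact
    // word "word"; longer words that merely start with it are kept.
    static List<String> wordsExceptWord(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile("\\b(?!word\\b)\\w+\\b").matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordsExceptWord("a word in wordy words")); // [a, in, wordy, words]
    }
}
```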
Explanation:
"
\b # Assert position at a word boundary
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
word # Match the characters “word” literally
\b # Assert position at a word boundary
)
\w # Match a single character that is a “word character” (letters, digits, etc.)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
"

This is the regex you are looking for:
String s = "your string";
Pattern p = Pattern.compile("^(?!none$).*$");
Matcher m = p.matcher(s);
System.out.println(s + ": " + (m.matches() ? "Match" : "NO Match"));
That said, if you are not forced to use a regex that matches everything but "none", the simplest, fastest, and clearest approach is this:
Pattern p = Pattern.compile("^none$");
Then, you just exclude the matches.
String s = "your string";
Matcher m = p.matcher(s);
System.out.println(s + ": " + (m.matches() ? "NO Match" : "Match"));

Android - Java - Regular Expression question - consecutive words not being matched

For my example I am trying to replace ALL cases of "the" and "a" in a string with a space.
Including cases where these words are next to characters such as quotes and other punctuation.
String oldString = "A test of the exp.";
Pattern p = Pattern.compile("(((\\W|\\A)the(\\W|\\Z))|((\\W|\\A)a(\\W|\\Z)))",Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(oldString);
String newString = m.replaceAll(" ");
"A test of the exp." returns "test of exp." - Yeah!
"A test of the a exp." returns "test of a exp." - Boooo!
"The a in this test is a the." returns "a in this test is the." - DoubleBoooo!
Any help would be greatly appreciated.
Thanks!

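String resultString = subjectString.replaceAll("(?i)\\b(?:a|the)\\b", " ");
(?i) makes the match case-insensitive, like Pattern.CASE_INSENSITIVE in the question.
\b matches at a word boundary (i.e. at the start or end of a word, where "word" is a sequence of letters, digits, and underscores). Because \b is zero-width, it does not consume the neighboring space, so consecutive words such as "the a" can both be matched.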
(?:...) is a non-capturing group, needed to separate the alternative words (in this case a and the) from the surrounding word boundary anchors.
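A minimal sketch using the word-boundary pattern on the question's test strings. The class ArticleRemover is hypothetical, and the whitespace-collapsing step is an extra added here only to make the output easy to compare.

```java
class ArticleRemover {
    // Hypothetical helper: replaces standalone "a" and "the" (any case) with
    // a space, then collapses whitespace (an addition for this sketch).
    static String removeArticles(String text) {
        String replaced = text.replaceAll("(?i)\\b(?:a|the)\\b", " ");
        return replaced.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(removeArticles("A test of the exp."));           // test of exp.
        System.out.println(removeArticles("A test of the a exp."));         // test of exp.
        System.out.println(removeArticles("The a in this test is a the.")); // in this test is .
    }
}
```

The question's original pattern consumes the \W character on each side of a match, so the space shared by two adjacent words is no longer available to the second one; \b avoids this because it consumes nothing.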

Or, per a simplified version of Robokop's solution:
Pattern.compile("(\\b(the|a)\\b)",Pattern.CASE_INSENSITIVE);
Note that Java string literals require double quotes and doubled backslashes, so the single-quoted form '\b(the|a)\b' is not valid Java.

Pattern.compile("(\\bthe\\b)|(\\ba\\b)",Pattern.CASE_INSENSITIVE);
