Regular expression \b in Java and JavaScript

Is there any difference in how the regular expression \b works in Java and JavaScript?
I tried the following test.
In JavaScript:
console.log(/\w+\b/.test("test中文"));//true
In Java:
String regEx = "\\w+\\b";
String text = "test中文";
Pattern pattern = Pattern.compile(regEx);
Matcher matcher = pattern.matcher(text);
while(matcher.find()) {
System.out.println("matched");//never executed
}
Why are the results of the two examples above not the same?

That is because, by default, Java supports Unicode for \b but not for \w, while JavaScript doesn't support Unicode for either.
So \w can only match [a-zA-Z0-9_] characters (in our case, test), but \b does not accept the position marked with | below
test|中文
as a boundary between a word character and a non-word character, because Unicode considers both t and 中 to be alphabetic characters.
If you want a \b that ignores Unicode, you can use the look-around mechanism and rewrite it as (?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)), or, in the case of this example, simply using (?!\\w) instead of \\b will also work.
If you want \w to support Unicode as well, compile your pattern with the Pattern.UNICODE_CHARACTER_CLASS flag (which can also be written as the flag expression (?U)).
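As a rough sketch of the two workarounds above (the class and helper names are made up for illustration), the following compares the look-around variants with the Unicode-aware flag. It deliberately avoids relying on the default behaviour of \b, since that is reported to differ between JDK releases:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Look-around emulation of a word boundary that only knows ASCII \w.
    static final String ASCII_BOUNDARY = "(?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w))";

    // Returns the first match of regex in text, or null when nothing matches.
    static String firstMatch(String regex, int flags, String text) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "test\u4E2D\u6587"; // the string "test" followed by the two CJK characters

        // ASCII-only boundary via look-arounds: prints test
        System.out.println(firstMatch("\\w+" + ASCII_BOUNDARY, 0, text));

        // Shorter variant for this particular pattern: prints test
        System.out.println(firstMatch("\\w+(?!\\w)", 0, text));

        // Unicode-aware \w and \b: prints the whole string
        System.out.println(firstMatch("\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS, text));

        // Same as above, using the embedded flag expression
        System.out.println(firstMatch("(?U)\\w+\\b", 0, text));
    }
}
```

With the flag set, \w and \b agree on what a word character is, so the whole string is one word.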

The Java regex looks for a sequence of word characters, i.e. [a-zA-Z_0-9]+, preceding a word boundary. But 中文 doesn't fit \w. If you use \\b alone, you'll find two matches: the beginning and the end of the string.
As has been pointed out by georg, JavaScript doesn't interpret characters the same way as Java's regex engine.

Related

Regex (?U)\p{Punct} is missing some Unicode punctuation signs in Java

First of all, I want to remove all punctuation signs in a String. I wrote the following code.
Pattern pattern = Pattern.compile("\\p{Punct}");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（hello）");
if (matcher.find())
System.out.println(matcher.replaceAll(""));
After replacement I got this output: （hello）.
So the pattern matches any one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~, which agrees with the official docs.
But I want to remove "（" (Fullwidth Left Parenthesis, U+FF08) and "）" (Fullwidth Right Parenthesis, U+FF09) as well, so I changed my code to this:
Pattern pattern = Pattern.compile("(?U)\\p{Punct}");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（）");
if (matcher.find())
System.out.println(matcher.replaceAll(""));
After replacement, I got this output: $+<=>^`|~
It did indeed match "（" (Fullwidth Left Parenthesis, U+FF08) and "）" (Fullwidth Right Parenthesis, U+FF09), but it missed $+<=>^`|~.
I am so confused. Why did that happen? Can anyone give some help?
Unicode (used when you specify (?U)) and POSIX (used when you don't) disagree on what counts as punctuation.
When you don't use (?U), \p{Punct} matches the POSIX punctuation character class, which is just
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
When you use (?U), \p{Punct} matches the Unicode Punctuation category, which does not include some of the characters in the above list, namely:
$+<=>^`|~
For example, the Unicode category for $ is "Symbol, Currency", or Sc.
If you want to match $+<=>^`|~ plus all Unicode punctuation, you can put both in a character class. You can also use the Unicode category "P" directly, rather than turning on Unicode mode with (?U).
Pattern pattern = Pattern.compile("[\\p{P}$+<=>^`|~]");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（）");
// you don't need "find" first
System.out.println(matcher.replaceAll(""));
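To see the three variants side by side, here is a small sketch (the class name, the helper and the sample input are assumptions for illustration). The fullwidth parentheses are written as \uFF08 and \uFF09 escapes so the source stays ASCII-only:

```java
import java.util.regex.Pattern;

public class PunctuationDemo {
    // ASCII punctuation, then "hello" wrapped in fullwidth parentheses (U+FF08, U+FF09).
    static final String INPUT =
            "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~\uFF08hello\uFF09";

    // Removes every match of regex from INPUT.
    static String strip(String regex) {
        return Pattern.compile(regex).matcher(INPUT).replaceAll("");
    }

    public static void main(String[] args) {
        // POSIX class: ASCII punctuation removed, fullwidth parentheses survive.
        System.out.println(strip("\\p{Punct}"));
        // Unicode category P: fullwidth parentheses removed, symbols such as $ and + survive.
        System.out.println(strip("(?U)\\p{Punct}"));
        // Union of both sets: only "hello" is left.
        System.out.println(strip("[\\p{P}$+<=>^`|~]"));
    }
}
```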

Why is there no special regular expression construct for the backspace character ("\b"), like \\t, \\n, \\r, and \\f, in Java?

I’m wondering why special regular-expression constructs are provided for the following characters:
\t - The tab character ('\u0009')
\n - The newline (line feed) character ('\u000A')
\r - The carriage-return character ('\u000D')
\f - The form-feed character ('\u000C')
while, on the other hand, none is provided for the backspace character (\b).
As shown in this question, there is definitely a difference between "\\n" and "\n", or between "\\t" and "\t", when the Pattern.COMMENTS flag is used, but I don't think that explains why there is no regular-expression construct for the backspace character.
Is there no possible use case for a regular-expression construct for the backspace character, not only when the Pattern.COMMENTS flag is set, but perhaps in other cases that I don't know about yet? Why is the backspace character treated differently from the other whitespace characters listed above, leading to the decision not to provide a regular-expression construct for it?
Java regex originated from Perl regex, where most shorthand classes had already been defined. Since Perl regex users were accustomed to "\\b" as a word boundary, it made little sense to change an already accepted and well-known shorthand. "\\b" in Perl regex matches a word boundary, and it came to Java regex with this meaning. See the Java regex documentation:
The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary.
Currently, you can't even make "\\b" act as a backspace inside a character set (as you can in some other languages, e.g. Python). This is done specifically to avoid human error when writing patterns. According to the latest specs:
It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language.
If you have to use a regex escape for a backspace, use the Unicode regex escape "\\u0008".
Java demo:
String s = "word1 and\bword2";
System.out.println(Arrays.toString(s.split("\\b"))); // WB
// => [word1, , and, , word2]
System.out.println(Arrays.toString(s.split("\b"))); // BS
// => [word1 and, word2]
System.out.println(Arrays.toString(s.split("[\b]"))); // BS in a char set
// => [word1 and, word2]
System.out.println(Arrays.toString(s.split("\\u0008"))); // BS as a Unicode regex escape
// => [word1 and, word2]
System.out.println(Arrays.toString(s.split("[\\b]")));// WB NOT treated as BS in a char set
// => java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence near index 2

Java Regex for username

I'm looking for a regex in Java (java.util.regex) that accepts only letters, ’, -, and ., plus a range of Unicode characters such as umlauts, eszett, diacritics, and other valid letters from European languages.
What I don't want are numbers, spaces as in “ ” or “ Tom”, or special characters like !”£$% etc.
So far I'm finding it very confusing.
I started with this
[A-Za-z.\\s\\-\\.\\W]+$
And ended up with this:
[A-Za-z.\\s\\-\\.\\D[^!\"£$%\\^&*:;##~,/?]]+$
using the caret to exclude the characters in the inner square brackets, according to the documentation.
Anyone have any suggestions for a new regex or reasons why the above isn't working?
For my answer, I want to use a simpler regex similar to yours: [A-Z[^!]]+, which means "At least once: (a character from A to Z) or (a character that is not '!')".
Note that "not '!'" already includes A to Z, so everything else in the outer character group ([A-Z...) is pointless.
Try [\p{Alpha}'.-]+ and compile the regex with the Pattern.UNICODE_CHARACTER_CLASS flag. (The hyphen goes last so that it is taken literally; written as '-. it would form a range from ' to . that also admits characters such as ( ) * + and ,.)
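A minimal sketch of that suggestion follows. The class and method names are hypothetical, and it assumes the whole input must consist of letters, apostrophes, dots and hyphens, with the hyphen placed last in the class so it stays literal:

```java
import java.util.regex.Pattern;

public class UsernameDemo {
    // Unicode letters (via the flag), apostrophe, dot and hyphen.
    // The hyphen is last in the class so it is a literal, not a range operator.
    static final Pattern NAME =
            Pattern.compile("[\\p{Alpha}'.-]+", Pattern.UNICODE_CHARACTER_CLASS);

    // True only when the entire string is made of allowed characters.
    static boolean isValid(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("M\u00FCller"));    // u with umlaut: true
        System.out.println(isValid("Stra\u00DFe"));    // eszett: true
        System.out.println(isValid("O'Brien-Smith"));  // true
        System.out.println(isValid("Tom1"));           // digit: false
        System.out.println(isValid(" Tom"));           // leading space: false
    }
}
```

Using matches() rather than find() means no ^ or $ anchors are needed.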
Use (?=.*[##$%&\s]): find() returns true when the username contains at least one special character from the set, or a whitespace character.
You can add more special characters as required. For example:
String str = "k$shor";
String regex = "(?=.*[##$%&\\s])";
Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.find()); // => true

Word boundary \b in Java RegEx

I am having difficulty with using \b as a word delimiter in Java Regex.
For
String text = "/* sql statement */ INSERT INTO someTable";
Pattern.compile("(?i)\binsert\b"); // no match found
Pattern insPtrn = Pattern.compile("\bINSERT\b"); // no match found
but
Pattern insPtrn = Pattern.compile("INSERT"); // finds a match
Any idea what I am doing wrong?
When writing regular expressions in Java, you need to be sure to escape all of the backslashes, so the regex \bINSERT\b becomes "\\bINSERT\\b" as a Java string.
If you do not escape the backslash, then the \b in the string literal is interpreted as a backspace character.
Use this instead:
Pattern insPtrn = Pattern.compile("\\bINSERT\\b");
You need to escape \b with an extra backslash.
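The difference can be checked with a short sketch like the one below (the class and helper names are made up; the text is the one from the question):

```java
import java.util.regex.Pattern;

public class EscapedBoundaryDemo {
    static final String TEXT = "/* sql statement */ INSERT INTO someTable";

    // True when regex matches somewhere in TEXT.
    static boolean found(String regex) {
        return Pattern.compile(regex).matcher(TEXT).find();
    }

    public static void main(String[] args) {
        // "\b" in a Java string literal is a backspace character, so this
        // looks for backspace + INSERT + backspace: false
        System.out.println(found("\bINSERT\b"));

        // "\\b" reaches the regex engine as \b, a word boundary: true
        System.out.println(found("\\bINSERT\\b"));

        // Case-insensitive variant with escaped boundaries: true
        System.out.println(found("(?i)\\binsert\\b"));
    }
}
```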

How to match \Q and \E in Java regex?

I want to match \Q and \E in a Java regex.
I am writing a program that computes the length of the string matched by a pattern (the program assumes the regex contains no quantifier other than {some number}, so the length of the string is uniquely defined), and I first want to delete all expressions like \Qsome text\E.
But regex like this:
"\\Q\\Q\\E\\Q\\E\\E"
obviously doesn't work.
Use Pattern.quote(...):
String s = "\\Q\\Q\\E\\Q\\E\\E";
String escaped = Pattern.quote(s);
Just escape the backslashes. The sequence \\\\ matches a literal backslash, so to match a literal \Q:
"\\\\Q"
and to match a literal \E:
"\\\\E"
You can make it more readable for a maintainer by making it obvious that each sequence matches a single character using [...] as in:
"[\\\\][Q]"

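Putting the escaped forms together, deleting every \Q...\E span from a pattern string could look roughly like this. It is only a sketch: the class and method names are hypothetical, and it assumes every \Q has a closing \E and that a quoted span ends at the first \E that follows it:

```java
import java.util.regex.Pattern;

public class QuoteStripDemo {
    // A literal \Q, then as few characters as possible, then a literal \E.
    static final Pattern QUOTED = Pattern.compile("\\\\Q.*?\\\\E", Pattern.DOTALL);

    // Removes every \Q...\E span from the given regex source text.
    static String stripQuoted(String regexSource) {
        return QUOTED.matcher(regexSource).replaceAll("");
    }

    public static void main(String[] args) {
        // Regex source: ab\Qsome text\Ec{3}  -> prints abc{3}
        System.out.println(stripQuoted("ab\\Qsome text\\Ec{3}"));

        // Regex source from the question: \Q\Q\E\Q\E\E  -> prints \E
        System.out.println(stripQuoted("\\Q\\Q\\E\\Q\\E\\E"));
    }
}
```

The reluctant .*? keeps each match from running past the first \E that follows its \Q.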