Regex (?U)\p{Punct} is missing some Unicode punctuation signs in Java - java

First of all, I want to remove all punctuation signs in a String. I wrote the following code.
Pattern pattern = Pattern.compile("\\p{Punct}");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（hello）");
if (matcher.find())
    System.out.println(matcher.replaceAll(""));
After replacement I got this output: （hello）.
So the pattern matches any one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~, which agrees with the official docs.
But I want to remove "（" (Fullwidth Left Parenthesis, U+FF08) and "）" (Fullwidth Right Parenthesis, U+FF09) as well, so I changed my code to this:
Pattern pattern = Pattern.compile("(?U)\\p{Punct}");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（）");
if (matcher.find())
    System.out.println(matcher.replaceAll(""));
After replacement, I got this output: $+<=>^`|~
It did match "（" (Fullwidth Left Parenthesis, U+FF08) and "）" (Fullwidth Right Parenthesis, U+FF09), but it missed $+<=>^`|~.
I am so confused. Why did that happen? Can anyone give some help?

Unicode (that is, when you use (?U)) and POSIX (when you don't use (?U)) disagree on what counts as punctuation.
When you don't use (?U), \p{Punct} matches the POSIX punctuation character class, which is just
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
When you use (?U), \p{Punct} matches the Unicode Punctuation category, which does not include some of the characters in the above list, namely:
$+<=>^`|~
For example, the Unicode category for $ is "Symbol, Currency", or Sc.
If you want to match $+<=>^`|~ plus all the Unicode punctuation characters, you can put them both in a character class. You can also just use the Unicode category "P" directly, rather than turning on Unicode mode with (?U).
Pattern pattern = Pattern.compile("[\\p{P}$+<=>^`|~]");
Matcher matcher = pattern.matcher("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~（）");
// you don't need "find" first
System.out.println(matcher.replaceAll(""));
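To see the three behaviours side by side, here is a minimal sketch. The class name PunctDemo and the helper strip are made up for illustration, and the fullwidth parentheses are written as \uFF08 and \uFF09 escapes so the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the helper are illustrative only.
final class PunctDemo {

    // Removes every match of the given regex from the input.
    static String strip(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // All ASCII punctuation, then "hello" wrapped in fullwidth parentheses.
        String input = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~\uFF08hello\uFF09";

        // POSIX class: strips the ASCII punctuation, keeps the fullwidth parentheses.
        System.out.println(strip("\\p{Punct}", input));

        // Unicode mode: strips category P only, so $+<=>^`|~ (symbols) survive.
        System.out.println(strip("(?U)\\p{Punct}", input));

        // Category P plus the nine ASCII symbols: only "hello" is left.
        System.out.println(strip("[\\p{P}$+<=>^`|~]", input));
    }
}
```

If other symbols (currency signs, math operators beyond ASCII) should go as well, adding \p{S} to the character class is one option, at the cost of removing more than the POSIX list.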

Related

Regex to match words after forward slash or in between

I have this code that needs to get words after / or in between this character.
Pattern pattern = Pattern.compile("\\/([a-zA-Z0-9]{0,})"); // Regex: \/([a-zA-Z0-9]{0,})
Matcher matcher = pattern.matcher(path);
if(matcher.matches()){
return matcher.group(0);
}
The regex \/([a-zA-Z0-9]{0,}) works elsewhere but not in Java. What could be the reason?
You need to get the value of Group 1 and use find to get a partial match:
Pattern pattern = Pattern.compile("/([a-zA-Z0-9]*)");
Matcher matcher = pattern.matcher(path);
if(matcher.find()){
return matcher.group(1); // Here, use Group 1 value
}
Matcher.matches requires the whole string to match, so only use it when the entire input must match the pattern. Otherwise, use Matcher.find.
Since the value you need is captured into Group 1 (([a-zA-Z0-9]*), the subpattern enclosed with parentheses), you need to return that part.
You needn't escape the / in Java regex. Also, {0,} functions the same way as * quantifier (matches zero or more occurrences of the quantified subpattern).
Also, [a-zA-Z0-9] can be replaced with \p{Alnum} to match the same range of characters (see the Java regex syntax reference). The pattern declaration will look like
"/(\\p{Alnum}*)"

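As a rough sketch of the find versus matches difference described above, the following collects every segment after a slash. The class and method names (PathSegments, firstSegment, allSegments, matchesWhole) are hypothetical and not from the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class PathSegments {
    // A slash followed by zero or more ASCII letters or digits (Group 1).
    private static final Pattern SEGMENT = Pattern.compile("/(\\p{Alnum}*)");

    // Returns the first segment after a slash, or null if there is no slash.
    static String firstSegment(String path) {
        Matcher m = SEGMENT.matcher(path);
        return m.find() ? m.group(1) : null;
    }

    // Returns every segment; find() resumes after the previous match.
    static List<String> allSegments(String path) {
        List<String> out = new ArrayList<>();
        Matcher m = SEGMENT.matcher(path);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    // True only when the whole input is a single "/segment".
    static boolean matchesWhole(String path) {
        return SEGMENT.matcher(path).matches();
    }

    public static void main(String[] args) {
        System.out.println(firstSegment("/foo/bar")); // foo
        System.out.println(allSegments("/foo/bar"));  // [foo, bar]
        System.out.println(matchesWhole("/foo/bar")); // false
        System.out.println(matchesWhole("/foo"));     // true
    }
}
```
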
regular expression \b in java and javascript

Is there any difference in how the regular expression \b works in Java and JS?
I tried below test:
in javascript:
console.log(/\w+\b/.test("test中文"));//true
in java:
String regEx = "\\w+\\b";
String text = "test中文";
Pattern pattern = Pattern.compile(regEx);
Matcher matcher = pattern.matcher(text);
while(matcher.find()) {
System.out.println("matched");//never executed
}
Why are the results of the two examples above not the same?
That is because, by default, Java supports Unicode for \b but not for \w, while JavaScript doesn't support Unicode for either.
So \w can only match [a-zA-Z0-9_] characters (in our case test), but \b does not accept the position marked with | in
test|中文
as a boundary between a word and a non-word character, because Unicode considers both t and 中 alphabetic characters.
If you want a \b that ignores Unicode, you can use look-arounds and rewrite it as (?:(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)); in this example, a simple (?!\\w) instead of \\b will also work.
If you want \w to also support Unicode, compile your pattern with the Pattern.UNICODE_CHARACTER_CLASS flag (which can also be written as the flag expression (?U)).
The Java regex looks for a sequence of word characters, i.e. [a-zA-Z_0-9]+, preceding a word boundary. But 中文 doesn't fit \w. If you use \\b alone, you'll find two matches: the beginning and the end of the string.
As has been pointed out by georg, JavaScript isn't interpreting characters the same way as Java's regex engine.
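The two workarounds can be tried with a small sketch like the one below. The class name WordBoundaryDemo and the helper firstMatch are made up for illustration. The default behaviour of a bare \b has differed between JDK releases (as far as I know, it became ASCII-only by default in JDK 19), so the sketch deliberately avoids relying on it and only uses the look-around and the (?U) variants.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class WordBoundaryDemo {

    // Returns the first match of the regex in the text, or null if none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // "test" followed by two CJK characters (U+4E2D, U+6587).
        String text = "test\u4E2D\u6587";

        // ASCII-only boundary via a negative lookahead: stops after "test".
        System.out.println(firstMatch("\\w+(?!\\w)", text));

        // Unicode-aware \w and \b: the whole string is one word.
        System.out.println(firstMatch("(?U)\\w+\\b", text));
    }
}
```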

Pattern Matching for java using regex

I have a long string that I have to parse for different keywords. For example, I have the String:
"==References== This is a reference ==Further reading== *{{cite book|editor1-last=Lukes|editor1-first=Steven|editor2-last=Carrithers|}} * ==External links=="
And my keywords are
'==References==' '==External links==' '==Further reading=='
I have tried a lot of regex combinations, but I am not able to recover all the strings.
The code I have tried:
Pattern pattern = Pattern.compile("\\=+[A-Za-z]\\=+");
Matcher matcher = pattern.matcher(textBuffer.toString());
while (matcher.find()) {
System.out.println(matcher.group(0));
}
You don't need to escape the = sign. You should also include a space inside your character class.
Apart from that, you need a quantifier on your character class to match multiple occurrences. Try this regex:
Pattern pattern = Pattern.compile("=+[A-Za-z ]+=+");
You can also increase the flexibility to accept any characters between two =='s by using .+? (you need a reluctant quantifier with . to stop it from matching everything up to the last ==) or [^=]+:
Pattern pattern = Pattern.compile("=+[^=]+=+");
If the number of ='s is the same on both sides, you need to modify your regex to use a capture group and a backreference:
"(=+)[^=]+\\1"

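Here is a hedged sketch of the backreference approach applied to the sample string from the question. One caveat: the sample also contains single = signs inside the cite template (editor1-last=Lukes|editor1-first=...), and a bare =+ can latch onto those, so this sketch assumes headings use at least two = signs and writes the opening run as ={2,}. That restriction, the second capture group, and the names HeadingFinder and headings are my additions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class HeadingFinder {
    // Group 1: the opening run of at least two '='; Group 2: the title;
    // \1 requires the closing run to repeat the opening run exactly.
    private static final Pattern HEADING = Pattern.compile("(={2,})([^=]+)\\1");

    // Returns the titles of all headings found in the text.
    static List<String> headings(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = HEADING.matcher(text);
        while (m.find()) {
            out.add(m.group(2).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "==References== This is a reference ==Further reading== "
                + "*{{cite book|editor1-last=Lukes|editor1-first=Steven|editor2-last=Carrithers|}} * "
                + "==External links==";
        System.out.println(headings(text)); // [References, Further reading, External links]
    }
}
```
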
Matching several URLs in a string using regex

I'm trying to match a URL in a string, using regex from here: Regular expression to match URLs in Java
It works fine with one URL, but when I have two URLs in the string, it only matched the latter.
Here's the code:
Pattern pat = Pattern.compile(".*((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])", Pattern.DOTALL);
Matcher matcher = pat.matcher("asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe");
// now matcher.groupCount() == 2, not 4
Edit: stuff I've tried:
// .* removed, now doesn't match anything // Another edit: actually works, see below
Pattern pat = Pattern.compile("((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])", Pattern.DOTALL);
// .* made lazy, still only matches one
Pattern pat = Pattern.compile(".*?((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])", Pattern.DOTALL);
Any ideas?
It's because .* is greedy. It will consume as much as possible (the whole string) and then backtrack, that is, it will give back one character at a time until the remaining characters can make up a URL. Hence the first URL will already be matched, but not captured. And unfortunately, matches cannot overlap.
The fix should be simple: remove the .* at the beginning of your pattern. Then you can also remove the outer parentheses from your pattern, since there is no need to capture anything any more: the whole match will be the URL you are looking for.
Pattern pat = Pattern.compile("(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]", Pattern.DOTALL);
Matcher matcher = pat.matcher("asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe");
while (matcher.find()) {
System.out.println(matcher.group());
}
By the way, matcher.groupCount() does not tell you anything, because it gives you the number of groups in your pattern and not the number of captures in your target string. That's why your second approach (using .*?) did not help: you still have two capturing groups in the pattern. Before calling find or anything else, the matcher does not know how many matches it will find in total.
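Putting both points together, a minimal sketch might collect every URL with repeated find() calls and show that groupCount() depends only on the pattern. The names UrlFinder, findAll and groupsInPattern are hypothetical, and the @ inside the character classes is an assumption about the originally cited pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class UrlFinder {
    // No leading .* and no outer group: the whole match is the URL.
    private static final Pattern URL = Pattern.compile(
            "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    // Collects every non-overlapping URL match, left to right.
    static List<String> findAll(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    // groupCount() is a property of the pattern, not of the input.
    static int groupsInPattern() {
        return URL.matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(findAll("asdasd http://www.asd.as/asd/123 or http://qwe.qw/qwe"));
        System.out.println(groupsInPattern()); // 1, whatever the input is
    }
}
```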

Java Regex for username

I'm looking for a regex in Java, java.util.regex, to accept only letters, ’, -, and ., plus a range of Unicode characters such as umlauts, eszett, diacritics and other valid letters from European languages.
What I don't want is numbers, spaces like “ ” or “ Tom”, or special characters like !”£$% etc.
So far I'm finding it very confusing.
I started with this
[A-Za-z.\\s\\-\\.\\W]+$
And ended up with this:
[A-Za-z.\\s\\-\\.\\D[^!\"£$%\\^&*:;@#~,/?]]+$
Using the caret to negate everything in the inner square brackets, according to the documentation.
Anyone have any suggestions for a new regex or reasons why the above isn't working?
For my answer, I want to use a simpler regex similar to yours: [A-Z[^!]]+, which means "at least once: (a character from A to Z) or (a character that is not '!')".
Note that "not '!'" already includes A to Z, so everything else in the outer character class ([A-Z...) is pointless.
Try [\p{Alpha}'.-]+ (with the hyphen last, so that it is a literal hyphen rather than the range from ' to .) and compile the regex with the Pattern.UNICODE_CHARACTER_CLASS flag.
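A minimal sketch of that suggestion follows, using matches() so the whole username has to consist of allowed characters. The class and method names are hypothetical. The hyphen is placed last in the character class so it is read as a literal; written as '-. it would be a range that also admits ( ) * + and ,. Only the ASCII apostrophe is allowed here; accepting the typographic ’ as well is an assumption you would need to add yourself.

```java
import java.util.regex.Pattern;

// Hypothetical validator; names are illustrative only.
final class UsernameValidator {
    // Unicode letters plus apostrophe, dot and hyphen; nothing else.
    private static final Pattern NAME = Pattern.compile(
            "[\\p{Alpha}'.-]+", Pattern.UNICODE_CHARACTER_CLASS);

    // matches() requires the entire input to fit the pattern.
    static boolean isValid(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("M\u00FCller-O'Neil")); // umlaut, hyphen, apostrophe
        System.out.println(isValid("Stra\u00DFe"));        // eszett
        System.out.println(isValid("Tom1"));               // digit
        System.out.println(isValid(" Tom"));               // leading space
    }
}
```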
Use (?=.*[@#$%&\s]) to detect invalid input: find returns true when the username contains at least one special character from the set or any whitespace.
You can add more special characters as per your requirements. For example:
String str = "k$shor";
String regex = "(?=.*[@#$%&\\s])";
Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.find()); // prints true
