regex pattern java symbols

I am looking for a regex pattern in Java that matches all characters except the letters a to z.
In other words, I want a regex pattern that matches symbols such as
!"#¤%&/()=?`´\}}][{€$#
Or some way to trim a string into letters only.
As an example, let's consider the following string:
"one!#"¤%()=) two}]}[()\ three[{€$"
to:
"one two three"

The Unicode version would be
\PL
\PL matches all Unicode code points that do not have the property "Letter".
\pL is the counterpart: all Unicode code points that do have the property "Letter".
Maybe you can find some properties on regular-expressions.info that match your needs better.
You can also combine them into character classes, the same way you would handle predefined classes, e.g.
[^\pL\pN]
This would match any character that is neither a letter nor a numeric character in Unicode.

As an example, let's consider the following string:
"one!#"¤%()=) two}]}[()\ three[{€$"
to:
"one two three"
The pattern needed is to match everything that is neither a letter nor a separator. Otherwise you would end up with "onetwothree" instead of the "one two three" you asked for.
[^\pL\pZ]
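As a rough sketch of how that pattern could be applied in Java: the class and method names below are made up for illustration, and the ¤ and € signs from the example string are written as Unicode escapes so the source file encoding does not matter.

```java
import java.util.regex.Pattern;

class LettersOnly {
    // Matches every character that is neither a Unicode letter (\pL)
    // nor a Unicode separator (\pZ, which includes the plain space).
    private static final Pattern NOT_LETTER_OR_SEPARATOR =
            Pattern.compile("[^\\pL\\pZ]");

    // Removes everything except letters and separators.
    static String keepLettersAndSeparators(String input) {
        return NOT_LETTER_OR_SEPARATOR.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // \u00A4 is the currency sign, \u20AC is the euro sign.
        String input = "one!#\"\u00A4%()=) two}]}[()\\ three[{\u20AC$";
        System.out.println(keepLettersAndSeparators(input)); // prints "one two three"
    }
}
```

Note that the backslashes have to be doubled inside the Java string literal.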

[^a-zA-Z] is a character class that matches every character apart from the letters a to z in lower or upper case.

The simplest form: [^a-z]
It could also be [^a-zA-Z] if uppercase letters should be excluded from the match as well (that is, kept when you remove whatever the pattern matches).

Related

Java regex matches diacritics for the Latin corresponding characters

I am trying to use a regex such as [ăâîșțĂÂÎȘȚ] to match Romanian alphabet diacritics (ISO 8859-16/Windows-1250). The problem is that the regex also matches a, i, s, t, A, I, S, T (the plain Latin letters corresponding to the diacritics above), and I don't want this.
I didn't try comparing the strings character by character because of the performance cost.
Is there any way I can make the regex match exactly these characters?

If your regex exists as literal rendered text, it has already been combined
and should exist as a different code point.
000074 t LATIN SMALL LETTER T
+
000326 ̦ COMBINING COMMA BELOW
=
00021B ț LATIN SMALL LETTER T WITH COMMA BELOW
Just in case, you should use a hex code point to represent them, i.e. \u021B
Is it possible the Java engine is stripping the combining character off of the regex,
so that x21B becomes x74? It might be that.
Meanwhile if you expect the letters in the source are not rendered, you could
use a regex like \p{Script=Latin}\p{Block=Combining_Diacritical_Marks}
to get those.
updated info :
While searching around for a de facto solution, I came across this Java info
from http://www.regular-expressions.info/unicode.html.
In Java, the regex token \uFFFF only matches the specified
code point, even when you turned on canonical equivalence.
However, the same syntax \uFFFF is also used to insert
Unicode characters into literal strings in the Java source
code. Pattern.compile("\u00E0") will match both the
single-code-point and double-code-point encodings of à,
while Pattern.compile("\\u00E0") matches only the
single-code-point version. Remember that when writing a
regex as a Java string literal, backslashes must be escaped.
The former Java code compiles the regex à, while the latter
compiles \u00E0. Depending on what you're doing, the
difference may be significant.
So, by entering a duality literal inside a class, it looks like Pattern.compile("[à]")
will actually match
000061 a LATIN SMALL LETTER A
or
000300 ̀ COMBINING GRAVE ACCENT
or
0000E0 à LATIN SMALL LETTER A WITH GRAVE
This smacks of the same problem when putting surrogate pairs inside classes.
There is a solution.
Avoid entering those literals inside of a class.
Instead, put them as a series of alternations
(?:à|_|_|_)
Doing this forces it to match either
000061 a LATIN SMALL LETTER A
000300 ̀ COMBINING GRAVE ACCENT
or
0000E0 à LATIN SMALL LETTER A WITH GRAVE
It won't match an a independent of the grave, as you see now.
Note: if you just use "[\\u00E0]" you'd miss the a + grave sequence,
which is valid.

I believe this is happening because those characters are being treated as two Unicode code points. I would recommend trying to specifically match the code points using syntax like \uFFFF where FFFF is the code point. The exact syntax will depend on the regex implementation you are using.
Keep in mind that Unicode characters can be encoded as single code points or as multiple, so you'll want to account for that. Example: à encoded as U+0061 U+0300 and also U+00E0.
I hope this helps!

As already mentioned, in Unicode one has two alternatives.
'\u0061' 'a' LATIN SMALL LETTER A
'\u0300' ̀ COMBINING GRAVE ACCENT
or
'\u00E0' 'à' LATIN SMALL LETTER A WITH GRAVE
There is a Normalizer that can "normalize" to either form (and deal with ligatures):
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;
String regex = "(?u)[ăâîșțĂÂÎȘȚ]";
regex = Normalizer.normalize(regex, Form.NFC); // Composed form
Pattern pattern = Pattern.compile(regex);
Using "(?u)", or passing the corresponding flag (Pattern.UNICODE_CASE) to Pattern.compile, might already solve the problem. But using the composed Unicode form, without a separate Latin base letter ('a'), will certainly do.
The normalizer should especially be applied on the searched-through string.
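A minimal sketch of applying the normalizer to the searched-through string, assuming the goal is just to detect whether one of the ten letters occurs. The class and method names are invented for the example, and the letters are written as \u escapes (a-breve, a-circumflex, i-circumflex, s-comma, t-comma, lowercase then uppercase) to avoid source encoding issues.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class RomanianDiacritics {
    // The ten precomposed letters from the question, as single code points.
    private static final Pattern DIACRITICS = Pattern.compile(
            "[\u0103\u00E2\u00EE\u0219\u021B\u0102\u00C2\u00CE\u0218\u021A]");

    // Composes the input first (NFC), so that "t" + COMBINING COMMA BELOW
    // becomes the single code point U+021B before the regex sees it.
    static boolean containsDiacritic(String text) {
        String composed = Normalizer.normalize(text, Normalizer.Form.NFC);
        return DIACRITICS.matcher(composed).find();
    }

    public static void main(String[] args) {
        System.out.println(containsDiacritic("t\u0326")); // decomposed t-comma
        System.out.println(containsDiacritic("\u021B"));  // precomposed t-comma
        System.out.println(containsDiacritic("t"));       // plain Latin t
    }
}
```

With this approach the plain letters a, i, s, t are not reported, because the class contains only the composed code points and the input is composed before matching.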

How to allow set of special characters in existing regex?

We have regex to validate password with one digit, one upper case letter and one lower case letter. Regex is:
^\w*(?=\w*\d)(?=\w*[a-z])(?=\w*[A-Z])\w*$
This regex will not allow any special characters. I need to change it to allow a list of special characters, without requiring that at least one special character be present. Only [-!$%^&*()_+|~=`{}\[\]:";'<>?,.\/] should be allowed as special characters.
I tried:
^\w*(?=\w*\d)(?=\w*[a-z])(?=\w*[A-Z])(?=\w*[-!$%^&*()_+|~=`{}\[\]:";'<>?,.\/]*)\w*$
but that seems to be wrong. Could someone please help?

This is because of the \w* before $. You are specifically matching zero or more word characters, so nothing else is allowed. Try this:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])[-\w!$%^&*()_+|~=`{}\[\]:";'<>?,.\/]+$
OR
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])[\w\p{Punct}]+$
\p{Punct} is a predefined character class equivalent to [!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]
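Here is a small sketch of how the second pattern might be used from Java. The class and method names are only illustrative, and keep in mind that \p{Punct} allows a few characters (such as @ and #) that are not in the list from the question.

```java
import java.util.regex.Pattern;

class PasswordCheck {
    // Three lookaheads require a digit, a lowercase and an uppercase letter;
    // the final class restricts every character to \w or ASCII punctuation.
    private static final Pattern PASSWORD = Pattern.compile(
            "^(?=.*\\d)(?=.*[a-z])(?=.*[A-Z])[\\w\\p{Punct}]+$");

    static boolean isValid(String candidate) {
        return PASSWORD.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Passw0rd"));   // true: no special character needed
        System.out.println(isValid("Passw0rd!"));  // true: special character allowed
        System.out.println(isValid("password1"));  // false: no uppercase letter
        System.out.println(isValid("Pass w0rd"));  // false: space is not allowed
    }
}
```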

Java regex match all characters except

What is the correct syntax for matching all characters except specific ones.
For example I'd like to match everything but letters [A-Z] [a-z] and numbers [0-9].
I have
string.matches("[^[A-Z][a-z][0-9]]")
Is this incorrect?

Yes, you don't need nested [] like that. Use this instead:
"[^A-Za-z0-9]"
It's all one character class.

If you want to match anything but letters, you should have a look into Unicode properties.
\p{L} is any kind of letter from any language
Using an uppercase "P" instead it is the negation, so \P{L} would match anything that is not a letter.
\d or \p{Nd} is matching digits
So your expression in modern Unicode style would look like this
Either using a negated character class
[^\p{L}\p{Nd}]
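or an intersection of negated properties (a plain union of the two would match every character, since no character is both a letter and a digit)
[\P{L}&&\P{Nd}]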
The next thing is that matches() tests the expression against the complete string, so your expression is only true for a string of exactly one character. So you would need to add a quantifier (and double the backslashes inside a Java string literal):
string.matches("[^\\p{L}\\p{Nd}]+")
This returns true when the complete string consists only of non-alphanumerics and contains at least one of them.
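To make the difference between whole-string matching and searching concrete, here is a short sketch. The helper names are hypothetical; the second helper uses Matcher.find() as an alternative when you only need to know whether such a character occurs anywhere.

```java
import java.util.regex.Pattern;

class NonAlphanumericMatch {
    private static final Pattern NON_ALNUM = Pattern.compile("[^\\p{L}\\p{Nd}]");

    // True only if the whole string consists of one or more non-alphanumerics.
    static boolean onlyNonAlphanumerics(String s) {
        return s.matches("[^\\p{L}\\p{Nd}]+");
    }

    // True if at least one non-alphanumeric occurs anywhere in the string.
    static boolean containsNonAlphanumeric(String s) {
        return NON_ALNUM.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(onlyNonAlphanumerics("!?#"));      // true
        System.out.println(onlyNonAlphanumerics("a!"));       // false
        System.out.println(containsNonAlphanumeric("a!"));    // true
        System.out.println("!!".matches("[^A-Za-z0-9]"));     // false: two characters
    }
}
```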

Almost right. What you want is:
string.matches("[^A-Za-z0-9]")
Here's a good tutorial

string.matches("[^A-Za-z0-9]")

Let's say that you want to make sure that no Strings have the _ symbol in them; then you would simply use something like this.
Pattern pattern = Pattern.compile("_");
Matcher matcher = pattern.matcher(stringName);
if(!matcher.find()){
System.out.println("Valid String");
}else{
System.out.println("Invalid String");
}

You can negate character classes:
"[^abc]" // matches any character except a, b, or c (negation).
"[^a-zA-Z0-9]" // matches non-alphanumeric characters.

Regex to find all variants of a certain character inside a text

I am trying to find Unicode variants of a user-entered character in a text in order to highlight them. E.g. if the user enters "Beyonce", I'd like to highlight all text with variants like "Beyoncé" or "Beyônce" or "Bèyönce" in the text. Currently the only idea I have is creating a regex by replacing the input string with a set of character groups like this:
"Beyonce" => "B[eêéè]y[oóòôö]nc[eéèê]"
But this seems to be a very tedious and error prone way of doing it. What I am basically looking for is a regex character group that matches all variants of a given input character, something like \p{M} but with the possibility to specify the base letter. Is there something available like this in java regex? And if not, how could the regex creation process be improved? I don't think that specifying all variants by hand is going to work in the long run.

There are several ways an accented character can be represented. There's a good example in the javadoc of java.text.Normalizer:
For example, take the character A-acute. In Unicode, this can be encoded
as a single character (the "composed" form):
U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
or as two separate characters (the "decomposed" form):
U+0041 LATIN CAPITAL LETTER A
U+0301 COMBINING ACUTE ACCENT
The second form would make it relatively easy to access the non-accentuated character, and fortunately Normalizer can help you here:
Normalizer.normalize(text, Form.NFD); // NFD = "Canonical decomposition"
You can then use a regex to ignore (or remove) any non-ASCII characters from the string, based on:
[^\p{ASCII}]
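Put together, the two steps might look like the sketch below. The class and method names are invented here; also note that this only folds the text to its base letters, so for highlighting you would still have to map match positions back to the original string, and non-ASCII characters that have no decomposition are simply dropped.

```java
import java.text.Normalizer;

class AccentFolding {
    // Decomposes accented characters (NFD) and then drops everything
    // outside ASCII, which removes the combining marks.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // \u00E9 = e-acute, \u00E8 = e-grave, \u00F6 = o-diaeresis
        System.out.println(stripAccents("Beyonc\u00E9"));      // prints "Beyonce"
        System.out.println(stripAccents("B\u00E8y\u00F6nce")); // prints "Beyonce"
    }
}
```

Comparing stripAccents(userInput) against stripAccents(candidate) then treats all accent variants of a word as equal.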

Regular expression to match simple "id" values?

I need regex for a line that starts with two characters followed by 2-4 digits or 2-4 digits followed by "-" and followed by 2-4 digits.
Examples:
AB125
AC123-25
BT1-2535
It seems simple, but I got stuck with it ...

Regular expressions always seem simple, right up to the point where you try to use them :-)
This particular one can be done with something along the lines of:
^[A-Z]{2}([0-9]{2,4}-)?[0-9]{2,4}$
That's:
2 alpha (uppercase) characters.
an optional 2-to-4-digit and hyphen sequence.
a mandatory 2-to-4-digit sequence.
start and end markers.
That last one, BT1-2535, doesn't match your textual specification by the way since it only has one digit before the hyphen. I'm assuming that was a typo. You will also have to change the character bit to use [A-Za-z] if you want to allow lowercase as well.
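If you want to check this against the three examples from Java, a quick sketch could look like the following (the class and method names are just placeholders):

```java
import java.util.regex.Pattern;

class IdPattern {
    // Two uppercase letters, an optional "2-4 digits plus hyphen" group,
    // then a mandatory run of 2-4 digits.
    private static final Pattern ID =
            Pattern.compile("^[A-Z]{2}([0-9]{2,4}-)?[0-9]{2,4}$");

    static boolean isId(String s) {
        return ID.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isId("AB125"));    // true
        System.out.println(isId("AC123-25")); // true
        System.out.println(isId("BT1-2535")); // false: only one digit before the hyphen
    }
}
```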

How about:
^[A-Z]{2}\d{2,4}(?:-\d{2,4})?
This matches two uppercase letters followed by 2-4 digits, followed by (optionally) a hyphen and another 2-4 digits.
