validate special characters by negating unicode letters with regex pattern?

validate special characters by negating unicode letters with regex pattern? - java

This regex: \p{L}+ matches these characters "ASKJKSDJKDSJÄÖÅüé" of the example string "ASKJKSDJK_-.;,DSJÄÖÅ!”#€%&/()=?`¨’<>üé" which is great but is the exact opposite of what I want. Which leads me to negating regexes.
Goal:
I want to match any and all characters that are not a letter nor a number in multiple languages.
Could a negative regex be a natural direction for this?
I should mention one intended use for the regex I'd like to find is to validate passwords for the rule:
that it needs to contain at least one special character, which I
define as not being a number nor a letter.
It would seem defining ranges of special characters should be avoided if possible, because why limit the possibilities? Thus my definition. I assume there could be some problems with such a wide definition, but it is a first step.
If you have some suggestions for a better solution I'm giving below or just have some thoughts on the subject, I'm sure I'm not the only one that would like to learn about it. Thanks.
Note I'm using double \\ in the Java code. Platform is Java 11.

You can shove those \\p things in []. And thus, use the fact that you can negate chargroups. This is all you need:
Pattern p = Pattern.compile("[^\\p{L}]");
Matcher m = p.matcher("ASKJKSDJK_-.;,DSJÄÖÅ!”#€%&/()=?`¨’<>üé");
while (m.find()) System.out.print(m.group(0));
That prints:
_-.;,!”#€%&/()=?`¨’<>
Which is exactly what you're looking for, no?
No need to mess with lookaheads here.

So after having read similar, though not identical questions and some equally great answers, I came up with this solution: (?=\P{L})(?=\P{N}) meaning match both not letters and not numbers. Even if I'm asserting numbers separately I need to negate both to meet the specification of special characters (See question).
This is making use of a non-consuming regular expression with the parentheses and the?=, first matching the expression in the first parenthesis and after that continue to match the whole in the second. Thanks to #Jason Cohen for this detail in the Regular Expressions: Is there an AND operator? discussion.
The upper case P in \P{L} and \P{N} expresses the "not belonging to a category" in Unicode Categories, where the uppercase P means "not", i e the opposite of a lowercase p.
It's not perfect for a real world solution, but works as a starting point at least. Note I'm using double \\ in the Java code. Platform is Java 11.

Related

Java regex not matching German "Umlaut" OR underscore

I'm trying to "play around" with some REST APIs and Java code.
As I am using German language mainly, I already managed it to get the Apache HTTP Client to work with UTF-8 encoding to make sure "Umlaut" are handled the right way.
Still I can't get my regex to match my words correctly.
I try to find words/word combinations like "Büro_Licht" from string like ..."type":"Büro_Licht"....
Using regex expression ".*?type\":\"(\\w+).*?" returns "B" for me, as it doesn't recognize the "ü" as a word character. Clearly, as \w is said to be [a-z A-Z 0-9]. Within strings with no special characters I get the full "Office_Light" meanwhile.
So I tried another hint mentioned here in like nearly the same question (which I could not comment, because I lack of reputation points).
Using regex expression ".*?type\":\"(\\p{L}).*?" returns "Büro" for me. But here again it cuts on the underscore for a reason I don't understand.
Is there a nice way to combine both expressions to get the "full" word including underscores and special characters?

If you have to keep using regex, which is not a great tool for parsing JSON, try \p{L}_. In your case it would be:
String regex = ".*?type\":\"[\\p{L}_]+\"";
With on-line example: https://regex101.com/r/57oFD5/2
\p{L} matches any kind of letter from any language
_ matches the character _ literally (case sensitive)
This will get hectic if you need to support other languages, whitespaces and various other UTF code points. For example do you need to support random number of white spaces around :? Take a look at this answer on removing emojis, there are many corner cases.

Extract last occurrence of a string [duplicate]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
This question's answers are a community effort. Edit existing answers to improve this post. It is not currently accepting new answers or interactions.
I don't really understand regular expressions. Can you explain them to me in an easy-to-follow manner? If there are any online tools or books, could you also link to them?

The most important part is the concepts. Once you understand how the building blocks work, differences in syntax amount to little more than mild dialects. A layer on top of your regular expression engine's syntax is the syntax of the programming language you're using. Languages such as Perl remove most of this complication, but you'll have to keep in mind other considerations if you're using regular expressions in a C program.
If you think of regular expressions as building blocks that you can mix and match as you please, it helps you learn how to write and debug your own patterns but also how to understand patterns written by others.
Start simple
Conceptually, the simplest regular expressions are literal characters. The pattern N matches the character 'N'.
Regular expressions next to each other match sequences. For example, the pattern Nick matches the sequence 'N' followed by 'i' followed by 'c' followed by 'k'.
If you've ever used grep on Unix—even if only to search for ordinary looking strings—you've already been using regular expressions! (The re in grep refers to regular expressions.)
Order from the menu
Adding just a little complexity, you can match either 'Nick' or 'nick' with the pattern [Nn]ick. The part in square brackets is a character class, which means it matches exactly one of the enclosed characters. You can also use ranges in character classes, so [a-c] matches either 'a' or 'b' or 'c'.
The pattern . is special: rather than matching a literal dot only, it matches any character†. It's the same conceptually as the really big character class [-.?+%$A-Za-z0-9...].
Think of character classes as menus: pick just one.
Helpful shortcuts
Using . can save you lots of typing, and there are other shortcuts for common patterns. Say you want to match a digit: one way to write that is [0-9]. Digits are a frequent match target, so you could instead use the shortcut \d. Others are \s (whitespace) and \w (word characters: alphanumerics or underscore).
The uppercased variants are their complements, so \S matches any non-whitespace character, for example.
Once is not enough
From there, you can repeat parts of your pattern with quantifiers. For example, the pattern ab?c matches 'abc' or 'ac' because the ? quantifier makes the subpattern it modifies optional. Other quantifiers are
* (zero or more times)
+ (one or more times)
{n} (exactly n times)
{n,} (at least n times)
{n,m} (at least n times but no more than m times)
Putting some of these blocks together, the pattern [Nn]*ick matches all of
ick
Nick
nick
Nnick
nNick
nnick
(and so on)
The first match demonstrates an important lesson: * always succeeds! Any pattern can match zero times.
A few other useful examples:
[0-9]+ (and its equivalent \d+) matches any non-negative integer
\d{4}-\d{2}-\d{2} matches dates formatted like 2019-01-01
Grouping
A quantifier modifies the pattern to its immediate left. You might expect 0abc+0 to match '0abc0', '0abcabc0', and so forth, but the pattern immediately to the left of the plus quantifier is c. This means 0abc+0 matches '0abc0', '0abcc0', '0abccc0', and so on.
To match one or more sequences of 'abc' with zeros on the ends, use 0(abc)+0. The parentheses denote a subpattern that can be quantified as a unit. It's also common for regular expression engines to save or "capture" the portion of the input text that matches a parenthesized group. Extracting bits this way is much more flexible and less error-prone than counting indices and substr.
Alternation
Earlier, we saw one way to match either 'Nick' or 'nick'. Another is with alternation as in Nick|nick. Remember that alternation includes everything to its left and everything to its right. Use grouping parentheses to limit the scope of |, e.g., (Nick|nick).
For another example, you could equivalently write [a-c] as a|b|c, but this is likely to be suboptimal because many implementations assume alternatives will have lengths greater than 1.
Escaping
Although some characters match themselves, others have special meanings. The pattern \d+ doesn't match backslash followed by lowercase D followed by a plus sign: to get that, we'd use \\d\+. A backslash removes the special meaning from the following character.
Greediness
Regular expression quantifiers are greedy. This means they match as much text as they possibly can while allowing the entire pattern to match successfully.
For example, say the input is
"Hello," she said, "How are you?"
You might expect ".+" to match only 'Hello,' and will then be surprised when you see that it matched from 'Hello' all the way through 'you?'.
To switch from greedy to what you might think of as cautious, add an extra ? to the quantifier. Now you understand how $(.+?)$, the example from your question works. It matches the sequence of a literal left-parenthesis, followed by one or more characters, and terminated by a right-parenthesis.
If your input is '(123) (456)', then the first capture will be '123'. Non-greedy quantifiers want to allow the rest of the pattern to start matching as soon as possible.
(As to your confusion, I don't know of any regular-expression dialect where ((.+?)) would do the same thing. I suspect something got lost in transmission somewhere along the way.)
Anchors
Use the special pattern ^ to match only at the beginning of your input and $ to match only at the end. Making "bookends" with your patterns where you say, "I know what's at the front and back, but give me everything between" is a useful technique.
Say you want to match comments of the form
-- This is a comment --
you'd write ^--\s+(.+)\s+--$.
Build your own
Regular expressions are recursive, so now that you understand these basic rules, you can combine them however you like.
Tools for writing and debugging regexes:
RegExr (for JavaScript)
Perl: YAPE: Regex Explain
Regex Coach (engine backed by CL-PPCRE)
RegexPal (for JavaScript)
Regular Expressions Online Tester
Regex Buddy
Regex 101 (for PCRE, JavaScript, Python, Golang, Java 8)
I Hate Regex
Visual RegExp
Expresso (for .NET)
Rubular (for Ruby)
Regular Expression Library (Predefined Regexes for common scenarios)
Txt2RE
Regex Tester (for JavaScript)
Regex Storm (for .NET)
Debuggex (visual regex tester and helper)
Books
Mastering Regular Expressions, the 2nd Edition, and the 3rd edition.
Regular Expressions Cheat Sheet
Regex Cookbook
Teach Yourself Regular Expressions
Free resources
RegexOne - Learn with simple, interactive exercises.
Regular Expressions - Everything you should know (PDF Series)
Regex Syntax Summary
How Regexes Work
JavaScript Regular Expressions
Footnote
†: The statement above that . matches any character is a simplification for pedagogical purposes that is not strictly true. Dot matches any character except newline, "\n", but in practice you rarely expect a pattern such as .+ to cross a newline boundary. Perl regexes have a /s switch and Java Pattern.DOTALL, for example, to make . match any character at all. For languages that don't have such a feature, you can use something like [\s\S] to match "any whitespace or any non-whitespace", in other words anything.

Using regular expression to find at least one repeating character in a string

So I have a bonus task assigned and it asks to write a program which returns true if in a given string at least one character is repeated.
I am relatively new to regular expressions but to my knowledge this should work:
String input = "wool";
return input.matches(".*(.)/1+.*");
This should return true, because the '.*' at the beginning and the end express that there could be prefices or suffices. And the '(.)/1+' is a repeating pattern of any character.
As I said I'm relatively new to the regex stuff but I'm very interested in learning and understanding it.

Almost perfect, just / looks the wrong way around (should be \).
Also, you don't need .* for prefixes and suffixes - regexp will find a match anywhere in the string, so (.)\1 suffices. This is not an error, just an optimisation (although in other cases it might, and does, make a difference).
One more issue is that backslashes are special characters in Java strings, so when you write a regexp in Java, you need to double up on backslashes. This gives you:
return input.matches(".*(.)\\1.*");
EDIT: I forgot, you don't need + because if something repeats 3 times, it also repeats 2 times, so you will find it just by searching for a two-character repetition. Again, not an error, just not needed here.
And Kita has a good point that your task is not well-defined, as it does not say whether you are looking for the repeating characters next to each other or anywhere in the string. My solution is for the adjacent characters; if you need the repetition anywhere, use his.
EDIT2 after comments: Forgot the semantics of .matches. You guys are quite correct, edited appropriately.

If the task "in a given string at least one character is repeated" includes the following pattern:
abcbd (b is repeated)
then the regex pattern would be:
(.).*\1
This pattern assumes that other characters could be in between the repeating characters. Otherwise
(.)\1
will do.
Note that the task is to capture "at least one character is repeated", which means identifying a single occurrence is enough for the task, so \1 does not have to have + quantifier.
The code:
return input.matches("(.).*\\1");
or
return input.matches("(.)\\1");

Alternative solution would be adding the elements to hashset. Then checking the length of the string and the hashset.

Java regex alternation operator "|" behavior seems broken

Trying to write a regex matcher for roman numerals. In sed (which I think is considered 'standard' for regex?), if you have multiple options delimited by the alternation operator, it will match the longest. Namely, "I|II|III|IV" will match "IV" for "IV" and "III" for "III"
In Java, the same pattern matches "I" for "IV" and "I" for "III". Turns out Java chooses between alternation matches left-to-right; that is, because "I" appears before "III" in the regex, it matches. If I change the regex to "IV|III|II|I", the behavior is corrected, but this obviously isn't a solution in general.
Is there a way to make Java choose the longest match out of an alternation group, instead of choosing the 'first'?
A code sample for clarity:
public static void main(String[] args)
{
Pattern p = Pattern.compile("six|sixty");
Matcher m = p.matcher("The year was nineteen sixty five.");
if (m.find())
{
System.out.println(m.group());
}
else
{
System.out.println("wtf?");
}
}
This outputs "six"

No, it's behaving correctly. Java uses an NFA, or regex-directed flavor, like Perl, .NET, JavaScript, etc., and unlike sed, grep, or awk. An alternation is expected to quit as soon as one of the alternatives matches, not hold out for the longest match.
You can force it to continue by adding a condition after the alternation that can't be met until the whole token has been consumed. What that condition might be depends on the context; the simplest option would be an anchor ($) or a word boundary (\b).
"\\b(I|II|III|IV)\\b"
EDIT: I should mention that, while grep, sed, awk and others traditionally use text-directed (or DFA) engines, you can also find versions of some of them that use NFA engines, or even hybrids of the two.

I think a pattern that will work is something like
IV|I{1,3}
See the "greedy quantifiers" section at http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
Edit: in response to your comment, I think the general problem is that you keep using alternation when it is not the right thing to use. In your new example, you are trying to match "six" or "sixty"; the right pattern to use is six(ty)?, not six|sixty. In general, if you ever have two members of an alternation group such that one is a prefix of another, you should rewrite the regular expression to eliminate it. Otherwise, you can't really complain that the engine is doing the wrong thing, since the semantics of alternation don't say anything about a longest match.
Edit 2: the literal answer to your question is no, it can't be forced (and my commentary is that you shouldn't ever need this behavior).
Edit 3: thinking more about the subject, it occurred to me that an alternation pattern where one string is the prefix of another is undesirable for another reason; namely, it will be slower unless the underlying automaton is constructed to take prefixes into account (and given that Java picks the first match in the pattern, I would guess that this is not the case).

two regex patterns, can they be one?

I have two regular expressions that I use to validate Colorado driver's license formats.
[0-9]{2}[-][0-9]{3}[-][0-9]{4}
and
[0-9]{9}
We have to allow for only 9 digits but the user is free to enter it in as 123456789 or 12-345-6789.
Is there a way I can combine these into one? Like a regex conditional statement of sorts? Right now I am simply enumerating through all the available formats and breaking out once one is matched. I could always strip the hyphens out before I do the compare and only use [0-9]{9}, but then I won't be learning anything new.

For a straight combine,
(?:[0-9]{2}[-][0-9]{3}[-][0-9]{4}|[0-9]{9})
or to merge the logic (allowing dashes in one position without the other, which may not be desired),
[0-9]{2}-?[0-9]{3}-?[0-9]{4}
(The brackets around the hyphens in your first regex aren't doing anything.)
Or merging the logic so that both hyphens are required if one is present,
(?:\d{2}-\d{3}-|\d{5})\d{4}
(Your [0-9]s can also be replaced with \ds.)

How about using a backreference to match the second hyphen only if the first is given:
\d{2}(-?)\d{3}\1\d{4}
Although I've never used regexes in Java so if it's supported the syntax might be different. I've just tried this out in Ruby.

A neat version which will allow either dashes or not dashes is:
\d{2}(-?)\d{3}\1\d{4}
The capture (the brackets) will capture either '-' or nothing. The \1 will match again whatever was captured.

I think this should work:
\d{2}-?\d{3}-?\d{4}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.