Regex matching only prefix or only suffix (XOR)

I have a pattern, which I'll refer to as Z (actual pattern is a bit long and not important to the question). Simply put, I want to be able to match either \*\sZ, or Z\:, but not both nor neither.
I attempted using lookaheads (similar to below), however because of the pattern between the prefix and suffix they wouldn't work.
(\*\s(?!\:))Z((?<!\*)\:)
Is there a way of accomplishing this without having to duplicate the pattern (e.g. (\*\sZ|Z\:))?
A quick note about my pattern: there is no \* in Z, only in the prefix. Likewise there is no \: in Z; it only counts as the suffix when it immediately follows Z, not after any other characters (there's a .* capture after the suffix).

Is there a way of accomplishing this without having to duplicate the
pattern?
The answer is no. AND and OR are fundamental operations in regular expressions: you can build an AND expression with concatenation and an OR expression with |. XOR has no such operator, so it has to be assembled from the other two.
If you still want to get the job done, here is what I suggest.
First, you already have two patterns here:
\*\sZ
and
Z\:
As you said, these two patterns must not both match at the same time.
From the definition of XOR:
A xor B = (A & ~B)|(~A & B).
Applying this with negative lookarounds for the ~ parts, we get
\*\sZ(?!\:)|(?<!\*\s)Z\:
See a DEMO
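As a rough sketch of how this could look in Java: the real Z isn't shown, so the code below assumes a stand-in Z of exactly three digits (\d{3}), and the class and method names are made up for illustration. One caveat: if Z can match a variable number of characters, the first alternative may backtrack to a shorter Z just to satisfy (?!\:), so a fixed-length (or possessive) Z is the safer assumption.

```java
import java.util.regex.Pattern;

class XorPrefixSuffix {
    // Hypothetical stand-in for Z (the real pattern is not given): exactly three digits.
    private static final String Z = "\\d{3}";

    // ("* " + Z, not followed by ':')  OR  (Z + ':', not preceded by "* ")
    private static final Pattern XOR = Pattern.compile(
            "\\*\\s" + Z + "(?!:)|(?<!\\*\\s)" + Z + ":");

    // True when the input contains Z with exactly one of prefix/suffix.
    static boolean hasExactlyOne(String input) {
        return XOR.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(hasExactlyOne("* 123"));   // prefix only: true
        System.out.println(hasExactlyOne("123:"));    // suffix only: true
        System.out.println(hasExactlyOne("* 123:"));  // both: false
        System.out.println(hasExactlyOne("123"));     // neither: false
    }
}
```

Z still appears twice in the compiled regex, but building it from a single constant keeps the source free of duplication.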

Related

validate special characters by negating unicode letters with regex pattern?

This regex: \p{L}+ matches these characters "ASKJKSDJKDSJÄÖÅüé" of the example string "ASKJKSDJK_-.;,DSJÄÖÅ!”#€%&/()=?`¨’<>üé" which is great but is the exact opposite of what I want. Which leads me to negating regexes.
Goal:
I want to match any and all characters that are not a letter nor a number in multiple languages.
Could a negative regex be a natural direction for this?
I should mention one intended use for the regex I'd like to find is to validate passwords for the rule:
that it needs to contain at least one special character, which I
define as not being a number nor a letter.
It would seem defining ranges of special characters should be avoided if possible, because why limit the possibilities? Thus my definition. I assume there could be some problems with such a wide definition, but it is a first step.
If you have some suggestions for a better solution I'm giving below or just have some thoughts on the subject, I'm sure I'm not the only one that would like to learn about it. Thanks.
Note I'm using double \\ in the Java code. Platform is Java 11.

You can put those \\p{...} classes inside [], which means you can negate them like any other character group. This is all you need:
Pattern p = Pattern.compile("[^\\p{L}]");
Matcher m = p.matcher("ASKJKSDJK_-.;,DSJÄÖÅ!”#€%&/()=?`¨’<>üé");
while (m.find()) System.out.print(m.group(0));
That prints:
_-.;,!”#€%&/()=?`¨’<>
Which is exactly what you're looking for, no?
No need to mess with lookaheads here.
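The snippet above negates letters only. For the password rule in the question (at least one character that is neither a letter nor a number), a minimal sketch might add \p{N} to the same negated class. The class and method names here are hypothetical, and note that under this definition whitespace also counts as "special".

```java
import java.util.regex.Pattern;

class SpecialCharCheck {
    // One character that is neither a Unicode letter nor a Unicode number.
    private static final Pattern SPECIAL = Pattern.compile("[^\\p{L}\\p{N}]");

    // True if the password contains at least one such character.
    static boolean containsSpecial(String password) {
        return SPECIAL.matcher(password).find();
    }

    public static void main(String[] args) {
        System.out.println(containsSpecial("ASKJKSDJK123"));                         // false
        System.out.println(containsSpecial("DSJ\u00C4\u00D6\u00C5\u00FC\u00E9"));    // false (all letters)
        System.out.println(containsSpecial("abc_123"));                              // true
    }
}
```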

After reading similar, though not identical, questions and some equally great answers, I came up with this solution: (?=\P{L})(?=\P{N}), meaning "match where the next character is both not a letter and not a number". Even though numbers are checked separately, both have to be negated to meet the specification of special characters (see question).
This uses non-consuming lookaheads, written with parentheses and ?=. The engine first checks the expression in the first lookahead and then, from the same position, checks the second. Thanks to @Jason Cohen for this detail in the "Regular Expressions: Is there an AND operator?" discussion.
The uppercase P in \P{L} and \P{N} means "not belonging to the category" in Unicode categories; it is the opposite of the lowercase \p.
It's not perfect for a real world solution, but works as a starting point at least. Note I'm using double \\ in the Java code. Platform is Java 11.
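To illustrate what "non-consuming" means in practice, here is a small sketch (names are hypothetical) that compares the lookahead pattern with a negated character class. Both find the same position, but the lookahead match itself is empty, while the character class returns the special character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadAndDemo {
    // Zero-width: asserts the next char is not a letter AND not a number.
    private static final Pattern LOOKAHEADS = Pattern.compile("(?=\\P{L})(?=\\P{N})");
    // Consuming equivalent: one char that is neither a letter nor a number.
    private static final Pattern NEGATED = Pattern.compile("[^\\p{L}\\p{N}]");

    // Index of the first special character, or -1 if there is none.
    static int firstSpecialIndex(String s) {
        Matcher m = LOOKAHEADS.matcher(s);
        return m.find() ? m.start() : -1;
    }

    // Text matched by the lookahead-only pattern (always empty when it matches).
    static String lookaheadMatchText(String s) {
        Matcher m = LOOKAHEADS.matcher(s);
        return m.find() ? m.group() : null;
    }

    // Text matched by the negated class: the special character itself.
    static String firstSpecial(String s) {
        Matcher m = NEGATED.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstSpecialIndex("a1_b"));              // 2
        System.out.println("[" + lookaheadMatchText("a1_b") + "]"); // []
        System.out.println(firstSpecial("a1_b"));                   // _
    }
}
```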

Java Regex Lookahead Conditional

I have a regex that works, but unfortunately not in Java, because Java does not support this kind of conditional construct.
I have already read about this topic, e.g. here:
Java support for conditional lookahead
Java Regex Pattern compilation error
My regex:
(?(?=\d{1,2}[.]\d{1,2}[.]\d{2,4})somerandomtextwhichisnotinthetext|^((($|EUR)? ?[-+]?(\d{1,8}[.,])*\d+([.,]\d+)?)|([-+]?(\d{1,8}[.,])*\d+([.,]\d+)? ?($|€|EUR)?))$)
I also tried a lookbehind, but the text it would have to match has a variable length, and that is unfortunately not supported.
The regex should match all of the following (a full match is needed, i.e. matcher.group(0)):
123.342,22
123,233.22
232,11
232.2
232.2 €
but not this:
06.01.99
And it needs to be implemented in Java.
But still I have no solution...
Thanks for your help!!!

The point here is that you need to use the first part as a negative lookahead to add an exception to the other pattern:
^(?!\d{1,2}[.]\d{1,2}[.]\d{2,4}$)((($|EUR)? ?[-+]?(\d{1,8}[.,])*\d+([.,]\d+)?)|([-+]?(\d{1,8}[.,])*\d+([.,]\d+)? ?($|€|EUR)?))$
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
See the regex demo
So, rather than requiring an exception pattern and then failing to match a fake string, it makes sense to simply use a valid match pattern and add an exception at the start.
I also see ($|€|EUR)?, you probably wanted to match a dollar symbol here. If I am right, replace it with ([$€]|EUR)?. Also, ($|EUR)? might also need replacing with ([$€]|EUR)?.
Also, consider using non-capturing groups rather than capturing ones, since you say you are only interested in full match values.
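Putting these suggestions together, a possible Java version might look like the sketch below. It assumes the dollar sign was indeed meant literally, uses non-capturing groups throughout, and writes the euro sign as \u20AC to avoid source-encoding issues. The class name and the NUMBER/CURRENCY helper constants are made up for illustration.

```java
import java.util.regex.Pattern;

class AmountValidator {
    // Optional sign, optional grouped digits, then digits with an optional fraction.
    private static final String NUMBER = "[-+]?(?:\\d{1,8}[.,])*\\d+(?:[.,]\\d+)?";
    // Assumption: "$" was meant as a literal dollar sign; \u20AC is the euro sign.
    private static final String CURRENCY = "(?:[$\u20AC]|EUR)?";

    private static final Pattern AMOUNT = Pattern.compile(
            // Exception first: reject anything that looks like a dd.mm.yy(yy) date.
            "^(?!\\d{1,2}[.]\\d{1,2}[.]\\d{2,4}$)"
            // Then the normal pattern: currency before or after the number.
            + "(?:" + CURRENCY + " ?" + NUMBER
            + "|" + NUMBER + " ?" + CURRENCY + ")$");

    static boolean isAmount(String s) {
        return AMOUNT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"123.342,22", "123,233.22", "232,11", "232.2",
                            "232.2 \u20AC", "06.01.99"};
        for (String s : samples) {
            System.out.println(s + " -> " + isAmount(s));
        }
    }
}
```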

Why is this regex not matching URLs?

I have the following regex:
^(?=\w+)(-\w+)(?!\.)
Which I'm attempting to match against the following text:
www-test1.examples.com
The regex should match only the -test1 part of the string, and only if it comes before the first . and after the start of the expression. www can be any string, but it should not be part of the match.
My pattern is not matching the -test1 part. What am I missing?

Java is one of the only languages that support non-fixed-length look-behinds (which basically means you can use quantifiers), so you can technically use the following:
(?<=^\w+)(-\w+)
This will match -test1 without consuming the preceding text. However, non-fixed-length look-behinds are generally not advisable: they are not perfect, not very efficient, and not portable to other languages. Having said that, this is a simple pattern, so if you don't care about portability, go for it.
The better solution though is to group what you want to capture, and reference the captured group (in this case, group 1):
^\w+(-\w+)
p.s. - \w will not match a dot, so no need to look ahead for it.
p.p.s. - to answer your question about why your original pattern ^(?=\w+)(-\w+)(?!\.) doesn't match. There are 2 reasons:
1) You start with a start-of-string assertion, then use a lookahead to check that one or more word characters follow. But lookaheads are zero-width assertions: no characters are consumed, so the match position does not advance. The engine sees that "www" satisfies the lookahead and moves on to the next part of the pattern, but it is still at the start of the string. It then tries to match (-\w+) there. Your string doesn't start with "-", so the pattern fails.
2) (?!\.) is a negative lookahead, and your example string has a dot immediately after the "-test1" part. So even if #1 didn't fail, this would.
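For reference, the capture-group approach could be used from Java roughly as follows. This is only a sketch; the class and method names are hypothetical. It also runs the original pattern to show that it finds nothing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HostSuffixDemo {
    // Consume the leading word chars, capture the "-xxx" part in group 1.
    private static final Pattern GROUPED = Pattern.compile("^\\w+(-\\w+)");
    // The pattern from the question, kept only for comparison.
    private static final Pattern ORIGINAL = Pattern.compile("^(?=\\w+)(-\\w+)(?!\\.)");

    // Returns the "-xxx" part after the leading word, or null if absent.
    static String suffixOf(String host) {
        Matcher m = GROUPED.matcher(host);
        return m.find() ? m.group(1) : null;
    }

    static boolean originalFinds(String host) {
        return ORIGINAL.matcher(host).find();
    }

    public static void main(String[] args) {
        System.out.println(suffixOf("www-test1.examples.com"));      // -test1
        System.out.println(suffixOf("www.examples.com"));            // null
        System.out.println(originalFinds("www-test1.examples.com")); // false
    }
}
```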

The problem you're having is the lookahead. In this case, it's inappropriate if you want to capture what's between the - and the first .. The pattern you want is something like this:
(-\w+)(?=\.)
In this case, the contents of capture group 1 will contain the text you want.
Demo on Regex101

Try this:
(?<=www)\-\w+(?=\.)
Demo: https://regex101.com/r/xEpno7/1

How to use two types of regex in single regex?

I have a string field that should accept either a UUID string or a string of digits.
I want to validate the value passed to it using a regex.
sample :
stringField = "1af6e22e-1d7e-4dab-a31c-38e0b88de807";
stringField = "123654";
For UUID I can use,
"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
For digits I can use
"\\d+"
Is there any way to combine the above two patterns into a single regex?

Yes, you can use | (OR) between those two regexes; it sits just before \\d+ below:
[\\da-f]{8}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{12}|\\d+

try:
"(?:[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})|(?:\\d+)"

You can group regular expressions with () and use | to allow alternatives.
So this will work:
(([0-9a-fA-F]){8}-([0-9a-fA-F]){4}-([0-9a-fA-F]){4}-([0-9a-fA-F]){4}-([0-9a-fA-F]){12})|(\\d+)
Note that I've adjusted your UUID regular expression a little to allow for upper case letters.

How are you applying the regex? If you use the matches() method, all you have to do is OR them together as @Anirudh said:
return myString.matches(
"[\\da-f]{8}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{12}|\\d+");
This works because matches() acts as if the regex were enclosed in a non-capturing group and anchored at both ends, like so:
"^(?:[\\da-f]{8}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{12}|\\d+)$"
If you use Matcher's find() method, you have to add the group and the anchors yourself. That's because find() returns a positive result if any substring of the string matches the regex. For example, "xyz123<>&&" would match because the "123" matches the "\\d+" in your regex.
But I recommend you add the explicit group and anchors anyway, no matter what method you use. In fact, you probably want to add the inline modifier for case-insensitivity:
"(?i)^(?:[\\da-f]{8}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{12}|\\d+)$"
This way, anyone who looks at the regex will be able to tell exactly what it's meant to do. They won't have to notice that you're using the matches() method and remember that matches() automatically anchors the match. (This will be especially helpful for people who learned regexes in a non-Java context. Almost every other regex flavor in the world uses the find() semantics by default, and has no equivalent for Java's matches(); that's what anchors are for.)
In case you're wondering, the group is necessary because alternation (the | operator) has the lowest precedence of all the regex constructs. Without the group, the anchors bind to one alternative each, so the regex below would match a string that starts with something that looks like a UUID or ends with one or more digits.
"^[\\da-f]{8}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{12}|\\d+$" // WRONG
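The difference between these variants can be checked with a short sketch like the one below. The class and method names are invented for the example; the regexes are the ones discussed above.

```java
import java.util.regex.Pattern;

class UuidOrDigits {
    private static final String UUID =
            "[\\da-f]{8}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{4}-[\\da-f]{12}";

    // No group, no anchors: only safe with matches().
    private static final Pattern LOOSE = Pattern.compile(UUID + "|\\d+");
    // Explicit group and anchors, case-insensitive: safe with find() too.
    private static final Pattern STRICT = Pattern.compile("(?i)^(?:" + UUID + "|\\d+)$");
    // Anchors without the group: each anchor applies to one alternative only.
    private static final Pattern WRONG = Pattern.compile("^" + UUID + "|\\d+$");

    static boolean looseMatches(String s) { return LOOSE.matcher(s).matches(); }
    static boolean looseFind(String s)    { return LOOSE.matcher(s).find(); }
    static boolean strictFind(String s)   { return STRICT.matcher(s).find(); }
    static boolean wrongFind(String s)    { return WRONG.matcher(s).find(); }

    public static void main(String[] args) {
        String junk = "xyz123<>&&";
        System.out.println(looseMatches(junk)); // false: whole string must match
        System.out.println(looseFind(junk));    // true: "123" is a substring match
        System.out.println(strictFind(junk));   // false
        System.out.println(wrongFind("xyz123")); // true: ends with digits
        System.out.println(strictFind("1AF6E22E-1D7E-4DAB-A31C-38E0B88DE807")); // true
    }
}
```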

Java regex alternation operator "|" behavior seems broken

Trying to write a regex matcher for roman numerals. In sed (which I think is considered 'standard' for regex?), if you have multiple options delimited by the alternation operator, it will match the longest. Namely, "I|II|III|IV" will match "IV" for "IV" and "III" for "III"
In Java, the same pattern matches "I" for "IV" and "I" for "III". Turns out Java chooses between alternation matches left-to-right; that is, because "I" appears before "III" in the regex, it matches. If I change the regex to "IV|III|II|I", the behavior is corrected, but this obviously isn't a solution in general.
Is there a way to make Java choose the longest match out of an alternation group, instead of choosing the 'first'?
A code sample for clarity:
public static void main(String[] args)
{
    Pattern p = Pattern.compile("six|sixty");
    Matcher m = p.matcher("The year was nineteen sixty five.");
    if (m.find())
    {
        System.out.println(m.group());
    }
    else
    {
        System.out.println("wtf?");
    }
}
This outputs "six"

No, it's behaving correctly. Java uses an NFA, or regex-directed flavor, like Perl, .NET, JavaScript, etc., and unlike sed, grep, or awk. An alternation is expected to quit as soon as one of the alternatives matches, not hold out for the longest match.
You can force it to continue by adding a condition after the alternation that can't be met until the whole token has been consumed. What that condition might be depends on the context; the simplest option would be an anchor ($) or a word boundary (\b).
"\\b(I|II|III|IV)\\b"
EDIT: I should mention that, while grep, sed, awk and others traditionally use text-directed (or DFA) engines, you can also find versions of some of them that use NFA engines, or even hybrids of the two.
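A quick way to see the effect of the trailing condition is a sketch like this one, using the asker's "six|sixty" example. The helper name firstMatch is made up, and a non-capturing group is used where the answer's pattern has a capturing one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryAlternation {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "The year was nineteen sixty five.";
        // Leftmost alternative wins as soon as it matches.
        System.out.println(firstMatch("six|sixty", text));              // six
        // The trailing \b fails after "six" (next char is 't'), so the
        // engine backtracks and tries the "sixty" alternative.
        System.out.println(firstMatch("\\b(?:six|sixty)\\b", text));    // sixty
        System.out.println(firstMatch("\\b(I|II|III|IV)\\b", "III"));   // III
        System.out.println(firstMatch("\\b(I|II|III|IV)\\b", "IV"));    // IV
    }
}
```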

I think a pattern that will work is something like
IV|I{1,3}
See the "greedy quantifiers" section at http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
Edit: in response to your comment, I think the general problem is that you keep using alternation when it is not the right thing to use. In your new example, you are trying to match "six" or "sixty"; the right pattern to use is six(ty)?, not six|sixty. In general, if you ever have two members of an alternation group such that one is a prefix of another, you should rewrite the regular expression to eliminate it. Otherwise, you can't really complain that the engine is doing the wrong thing, since the semantics of alternation don't say anything about a longest match.
Edit 2: the literal answer to your question is no, it can't be forced (and my commentary is that you shouldn't ever need this behavior).
Edit 3: thinking more about the subject, it occurred to me that an alternation pattern where one string is the prefix of another is undesirable for another reason; namely, it will be slower unless the underlying automaton is constructed to take prefixes into account (and given that Java picks the first match in the pattern, I would guess that this is not the case).
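The prefix-factoring rewrites suggested here can be tried out with a small sketch such as the following (the class and helper names are hypothetical). Because the quantifiers are greedy, the longer form is attempted first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixFactoring {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "The year was nineteen sixty five.";
        // "six" with an optional, greedy "ty" instead of six|sixty.
        System.out.println(firstMatch("six(ty)?", text));  // sixty
        // "IV" listed first, then one to three I's matched greedily.
        System.out.println(firstMatch("IV|I{1,3}", "IV"));  // IV
        System.out.println(firstMatch("IV|I{1,3}", "III")); // III
        System.out.println(firstMatch("IV|I{1,3}", "II"));  // II
    }
}
```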
