Regex with at least one of two options but in sequence - java

I am trying to write a regex (for use in a Java Pattern) that will match strings that possibly have a letter that is possibly followed by a space then number, but must have at least one of them. For example, the following strings should be matched:
"a 5"
"b 9"
" 8"
However, it should not match an empty string ("").
Furthermore, I would like to make each of the component parts a named capture group.
The following works, but allows the empty string.
"(?<let>\\p{Alpha})?( (?<num>\\p{Digit}))?"

To ensure that there is at least one of them, you can use lookahead (?=\\p{Alpha}| \\p{Digit}) at the beginning:
"(?=\\p{Alpha}| \\p{Digit})(?<let>\\p{Alpha})?( (?<num>\\p{Digit}))?"
In general, to avoid empty strings you can use (?=.).
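A minimal sketch of how the lookahead version behaves with matches() and the named groups. The class name LetterNumberDemo and the helper parse are made up for illustration; only the pattern comes from the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterNumberDemo {
    // The lookahead demands a letter, or a space followed by a digit, at the start,
    // which rules out the empty string while both parts stay optional.
    static final Pattern P = Pattern.compile(
            "(?=\\p{Alpha}| \\p{Digit})(?<let>\\p{Alpha})?( (?<num>\\p{Digit}))?");

    // Hypothetical helper: describes the named groups, or returns null if the input is rejected.
    static String parse(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null;
        }
        // A group that did not participate in the match is reported as null.
        return "let=" + m.group("let") + ",num=" + m.group("num");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a 5", "b 9", " 8", "a", ""}) {
            System.out.println("\"" + s + "\" -> " + parse(s));
        }
    }
}
```

Here "a 5" yields both groups, " 8" yields only num, "a" yields only let, and the empty string is rejected.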

You can use a negative lookahead to avoid empty input and keep your regex as:
^(?!$)(?<let>\p{L})?(?:\h+(?<num>\p{N}))?$
(?!$) is a negative lookahead that fails the match for empty strings.

You can solve the problem with:
([a-z]? \d)|([a-z] \d?)
This covers your test cases. It is fairly basic regular expression knowledge, so it is worth learning more about regular expressions; there are plenty of good tutorials on the web.

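You can use | for alternation, then repeat the whole group to match any sequence of the components, like this:
((?<let>[A-Za-z])|(?<num>\d)\s*)+
That lets you match any number of the named patterns in any order. Write the letter class as [A-Za-z]; the range [A-z] also matches the punctuation characters that sit between Z and a in ASCII, such as [ and _.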


Why is this regex not matching URLs?

I have the following regex:
^(?=\w+)(-\w+)(?!\.)
Which I'm attempting to match against the following text:
www-test1.examples.com
The regex should match only the -test1 part of the string, and only if it is before the first . and after the start of the expression. www can be any string, but it should not be matched.
My pattern is not matching the -test1 part. What am I missing?
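Java is one of the few languages that support variable-length look-behinds to some degree (which basically means you can use quantifiers inside them), so you can technically use the following, although an unbounded quantifier such as + inside a look-behind may be rejected depending on the Java version: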
(?<=^\w+)(-\w+)
This will match -test1 without capturing the preceding stuff. However, it's generally not advisable to use non-fixed-length look-behinds, as they are not perfect, nor are they very efficient, nor are they portable across other languages. Having said that, this is a simple pattern, so if you don't care about portability, sure, go for it.
The better solution though is to group what you want to capture, and reference the captured group (in this case, group 1):
^\w+(-\w+)
p.s. - \w will not match a dot, so no need to look ahead for it.
p.p.s. - to answer your question about why your original pattern ^(?=\w+)(-\w+)(?!\.) doesn't match. There are 2 reasons:
1) you start out with a start of string assertion, and then use a lookahead to see if what follows is one or more word chars. But lookaheads are zero-width assertions, meaning no characters are actually consumed in the match, so the pointer doesn't move forward to the next chars after the match. So it sees that "www" matches it, and moves on to the next part of the pattern, but the actual pointer hasn't moved past the start of string. So, it next tries to match your (-\w+) part. Well your string doesn't start with "-" so the pattern fails.
2) (?!\.) is a negative lookahead. Your example string has a dot as the very next thing after the "-test1" part, so even if #1 didn't fail it, this would.
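The difference can be checked with a small sketch like the one below. The class LookaheadDemo and the helper firstGroup are hypothetical names; the two patterns and the host name are the ones discussed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Hypothetical helper: returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String host = "www-test1.examples.com";

        // The lookahead consumes nothing, so "-" is compared against the leading "w" and fails.
        System.out.println(firstGroup("^(?=\\w+)(-\\w+)(?!\\.)", host));

        // Consuming the prefix with \w+ lets the group start at the hyphen.
        System.out.println(firstGroup("^\\w+(-\\w+)", host));
    }
}
```

The first call finds no match at all, while the second returns the text captured in group 1.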
The problem you're having is the lookahead. In this case, it's inappropriate if you want to capture what's between the - and the first .. The pattern you want is something like this:
(-\w+)(?=\.)
In this case, the contents of capture group 1 will contain the text you want.
Try this:
(?<=www)\-\w+(?=\.)
Demo: https://regex101.com/r/xEpno7/1

Regular expression non-greedy but still

I have some larger text which in essence looks like this:
abc12..manycharshere...hi - abc23...manyothercharshere...jk
Obviously there are two items, each starting with "abc", the numbers (12 and 23) are interesting as well as the "hi" and "jk" at the end.
I would like to create a regular expression which allows me to parse out the numbers, but only if the two characters at the end match, i.e. I am looking for the number related to "jk", but the following regular expression matches the whole string and thus returns "12", not "23" even when non-greedy matching the area with the following:
abc([0-9]+).*?jk
Is there a way to construct a regular expression which matches text like the one above, i.e. retrieving "23" for items ending in "jk"?
Basically I would need something like "match abc followed by a number, but only if there is "jk" at the end before another instance of abc followed by a number appears".
Note: the texts/matches are an abstraction here. The actual text is more complicated, especially the things that can appear as "manyothercharshere"; I simplified it to show the underlying problem more clearly.
Use a regex like this:
.*abc([0-9]+).*?jk
I think you want something like this,
abc([0-9]+)(?=(?:(?!jk|abc[0-9]).)*jk)
You need to use negative lookahead here to make it work:
abc(?!.*?abc)([0-9]+).*?jk
Here (?!.*?abc) is a negative lookahead that only lets abc match where it is NOT followed by another abc, thus making sure the closest string between abc and jk is matched.
Being non-greedy does not change the rule, that the first match is returned. So abc([0-9]+).*?jk will find the first jk after “abcnumber” rather than the last one, but still match the first “abcnumber”.
One way to solve this is to tell that the dot should not match abc([0-9]+):
abc([0-9]+)((?!abc([0-9]+)).)*jk
If it is not important to have the entire pattern being an exact match you can do it simpler:
.*(abc([0-9]+).*?jk)
In this case, it’s group 1 which contains your intended match. The pattern uses a greedy matchall to ensure that the last possible “abcnumber” is matched within the group.
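As a rough check of the three patterns against the sample text from the question, something like the following can be used. LastItemDemo and the helper group are illustrative names, not part of any answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastItemDemo {
    static final String TEXT =
            "abc12..manycharshere...hi - abc23...manyothercharshere...jk";

    // Hypothetical helper: returns the given capture group of the first match, or null.
    static String group(String regex, int index) {
        Matcher m = Pattern.compile(regex).matcher(TEXT);
        return m.find() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        // Reluctant .*? only shortens the tail; the match still starts at the first "abc12".
        System.out.println(group("abc([0-9]+).*?jk", 1));

        // The guarded dot refuses to step over another "abc<digits>", so "abc12" cannot reach "jk".
        System.out.println(group("abc([0-9]+)((?!abc([0-9]+)).)*jk", 1));

        // Greedy .* in front pushes the match to the last "abc<digits>"; the number is now group 2.
        System.out.println(group(".*(abc([0-9]+).*?jk)", 2));
    }
}
```

The first pattern reports 12; the other two report 23.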
Assuming that hyphen separates "items", this regex will capture the numbers from the target item:
abc([0-9]+)[^-]*?jk

How check that string is NOT blank by java regular expression?

There is a regular expression for finding a blank string, and I want only its negation. I have also seen this question, but its answer does not work for Java (see the examples). The solution there does not work for me either (see the third line of the example).
For example
Pattern.compile("/^$|\\s+/").matcher(" ").matches() - false
Pattern.compile("/^$|\\s+/").matcher(" a").matches()- false
Pattern.compile("^(?=\\s*\\S).*$").matcher("\t\n a").matches() - false
These return false in every case.
P.S. If something is not clear ask me questions.
UPDATED
I want to use this regular expression in a @Pattern annotation without creating a custom annotation and a programmatic validator for it. That's why I want a "plain" regexp solution that does not rely on the find function.
It's not clear what you mean by negation.
If you mean "a string that contains at least one non-blank character," then you can use this:
Pattern.compile("\\S").matcher(str).find()
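If it's really necessary to use matches, then you can do it with this:
Pattern.compile("(?s)\\A\\s*\\S.*\\z").matcher(str).matches()
This matches zero or more whitespace characters, followed by a non-whitespace character, followed by any characters at all up to the end of the string. The (?s) flag lets . match line terminators too, so input that continues on further lines is still accepted.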
If you mean "a string that is all non-blank with at least one such character," then you can use this:
Pattern.compile("\\A\\S+\\Z").matcher(str).matches()
You need to study the Java regex syntax. In Java, regular expressions are compiled from strings, so there's no need for special delimiters like /.../ or %r{...} as you'll see in other languages.
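A short sketch comparing the find() and matches() approaches on the inputs from the question. The class NotBlankDemo and its method names are invented for this example, and Pattern.DOTALL is used here as an assumption so that . also matches line terminators.

```java
import java.util.regex.Pattern;

class NotBlankDemo {
    // find() succeeds as soon as any non-whitespace character exists anywhere in the input.
    static final Pattern ANY_NON_BLANK = Pattern.compile("\\S");

    // Whole-input form for matches(); DOTALL lets '.' cross line terminators.
    static final Pattern WHOLE_INPUT = Pattern.compile("\\s*\\S.*", Pattern.DOTALL);

    static boolean notBlankWithFind(String s) {
        return ANY_NON_BLANK.matcher(s).find();
    }

    static boolean notBlankWithMatches(String s) {
        return WHOLE_INPUT.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", " ", " a", "\t\n a", "a\nb"}) {
            System.out.println(notBlankWithFind(s) + " " + notBlankWithMatches(s));
        }
    }
}
```

Both methods agree on every input. The matches() form is the one that fits a plain regexp attribute, since validators of that kind typically test the whole value.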
How about this:
if(!string.trim().isEmpty()) {
// do something
}
Use the regex \s, which matches a single whitespace character: [ \t\n\x0B\f\r].
Pattern.compile("\\s")

Regex index 0 how it exactly works

By compiling the following:
System.out.println(Pattern.matches(".?(\\d)$","3"));
It returns true because there is nothing before 3, and ? checks for one or zero occurrences.
However, 3 is already the first character of the input, which starts at 0 and ends at 1. How can the JVM recognize that there is nothing before 3?
For example the following.
System.out.println(Pattern.matches(".*","hello"));
It returns true as well but only the very last character gets matched with "nothing".
There should not be a "nothing" character at the beginning of a string, only at the end of it right?
This is not really about the JVM. This is about Java regular expressions.
The regular expression ".*" means "match 0 or more characters". It's easy to satisfy this, since an empty string has 0 characters. Whether the match is the empty string or the entire string depends on the kind of quantifier. If you read this excellent writeup (http://docs.oracle.com/javase/tutorial/essential/regex/quant.html) you can see that patterns like ".*" in Java are "greedy" quantifiers and will take as much as possible, while ".*?" is the "reluctant" form, which prefers to take as little as possible.
Based on the information in that writeup, you can see that a pattern like ".{0,}" is just another greedy way of writing ".*". Either way, matches() requires the whole input to match, so the greedy and reluctant forms alike return true for "hello".
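To see what each kind of quantifier actually takes, a small sketch using find() can help. QuantifierDemo and firstMatch are made-up names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: returns the text of the first match found, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: takes as many characters as possible.
        System.out.println("[" + firstMatch(".*", "hello") + "]");
        // Same meaning as .* written with an explicit lower bound.
        System.out.println("[" + firstMatch(".{0,}", "hello") + "]");
        // Reluctant: the first match is the empty string at index 0.
        System.out.println("[" + firstMatch(".*?", "hello") + "]");
        // matches() forces the whole input to be consumed, even for the reluctant form.
        System.out.println(Pattern.matches(".*?", "hello"));
    }
}
```

With find(), the greedy forms return the whole word and the reluctant form returns an empty match, which is different from matching some "nothing" character.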
You are not interpreting your regex correctly. There is no such thing as a "nothing character". Rather, your pattern reads: any character followed by a digit at the end of the string, OR just a digit at the end of the string.
And surely, "3" fits the second description very well.
The matches method tries to match the entire input, so there's no need to use ^ or $.
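The two points above (the optional character and the implicit anchoring of matches) can be checked with a sketch along these lines. OptionalPrefixDemo and digitIndex are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalPrefixDemo {
    // Hypothetical helper: index at which group 1 (the digit) starts, or -1 if the input is rejected.
    static int digitIndex(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        // The digit sits at index 0, so .? matched zero characters.
        System.out.println(digitIndex(".?(\\d)$", "3"));
        // Here .? matched "a" and the digit sits at index 1.
        System.out.println(digitIndex(".?(\\d)$", "a3"));
        // Two characters before the digit are too many for .?
        System.out.println(digitIndex(".?(\\d)$", "ab3"));
        // Dropping $ changes nothing, because matches() already requires the whole input.
        System.out.println(digitIndex(".?(\\d)", "ab3"));
    }
}
```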

Why doesn't this Java regular expression work?

I need to create a regular expression that allows a string to contain any number of:
alphanumeric characters
spaces
(
)
&
.
No other characters are permitted. I used RegexBuddy to construct the following regex, which works correctly when I test it within RegexBuddy:
\w* *\(*\)*&*\.*
Then I used RegexBuddy's "Use" feature to convert this into Java code, but it doesn't appear to work correctly using a simple test program:
public class RegexTest
{
public static void main(String[] args)
{
String test = "(AT) & (T)."; // Should be valid
System.out.println("Test string matches: "
+ test.matches("\\w* *\\(*\\)*&*\\.*")); // Outputs false
}
}
I must admit that I have a bit of a blind spot when it comes to regular expressions. Can anyone explain why it doesn't work please?
That regular expression tests for any number of word characters, followed by any number of spaces, followed by any number of open parens, followed by any number of close parens, followed by any number of ampersands, followed by any number of periods, in exactly that order.
What you want is...
test.matches("[\\w \\(\\)&\\.]*")
As mentioned by mmyers, this allows the empty string. If you do not want to allow the empty string...
test.matches("[\\w \\(\\)&\\.]+")
Though that will also allow a string that is only spaces, or only periods, etc.. If you want to ensure at least one alpha-numeric character...
test.matches("[\\w \\(\\)&\\.]*\\w+[\\w \\(\\)&\\.]*")
So you understand what the regular expression is saying... anything within the square brackets ("[]") indicates a set of characters. So, where "a*" means 0 or more a's, [abc]* means 0 or more characters, all of which being a's, b's, or c's.
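A compact sketch contrasting the sequence pattern with the character-class versions. The class AllowedCharsDemo and its constant names are invented here, and the class is written as [\w ()&.], which should be equivalent to the escaped form above because (, ), & and . need no escaping inside square brackets.

```java
class AllowedCharsDemo {
    // Original pattern: each kind of character may only appear in this fixed order.
    static final String SEQUENCE = "\\w* *\\(*\\)*&*\\.*";

    // Character class: any mix of the allowed characters, including none at all.
    // Note that \w also admits the underscore, not just letters and digits.
    static final String ANY_ORDER = "[\\w ()&.]*";

    // Same class, but at least one word character is required somewhere.
    static final String NEEDS_WORD_CHAR = "[\\w ()&.]*\\w+[\\w ()&.]*";

    public static void main(String[] args) {
        String test = "(AT) & (T).";
        System.out.println(test.matches(SEQUENCE));
        System.out.println(test.matches(ANY_ORDER));
        System.out.println(test.matches(NEEDS_WORD_CHAR));
        System.out.println(" . ".matches(NEEDS_WORD_CHAR));
        System.out.println("a#b".matches(ANY_ORDER));
    }
}
```

Only the character-class patterns accept "(AT) & (T).", and the last two checks show that punctuation-only input and disallowed characters are still rejected.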
Maybe I'm misunderstanding your description, but aren't you essentially defining a class of characters without an order rather than a specific sequence? Shouldn't your regexp have a structure of [xxxx]+, where xxxx are the actual characters you want ?
The difference between your Java code snippet and the Test tab in RegexBuddy is that the matches() method in Java requires the regular expression to match the whole string, while the Test tab in RegexBuddy allows partial matches. If you use your original regex in RegexBuddy, you'll see multiple blocks of yellow and blue highlighting. That indicates RegexBuddy found multiple partial matches in your string. To get a regex that works as intended with matches(), you need to edit it until the whole test subject is highlighted in yellow, or if you turn off highlighting, until the Find First button selects the whole text.
Alternatively, you can use the anchors \A and \Z at the start and the end of your regex to force it to match the whole string. When you do that, your regex always behaves in the same way, whether you test it in RegexBuddy, or whether you use matches() or another method in Java. Only matches() requires a full string match. All other Matcher methods in Java allow partial matches.
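The difference between whole-string and partial matching can be reproduced in plain Java with a sketch like this one. PartialMatchDemo and its method names are illustrative only.

```java
import java.util.regex.Pattern;

class PartialMatchDemo {
    static final String REGEX = "\\w* *\\(*\\)*&*\\.*";
    static final String TEST = "(AT) & (T).";

    // matches() must consume the entire input.
    static boolean matchesWhole() {
        return Pattern.compile(REGEX).matcher(TEST).matches();
    }

    // find() is satisfied by any substring, here already by the leading "(".
    static boolean findsSomewhere() {
        return Pattern.compile(REGEX).matcher(TEST).find();
    }

    // Wrapping the regex in \A ... \Z makes find() behave like a whole-string test.
    static boolean findsAnchored() {
        return Pattern.compile("\\A(?:" + REGEX + ")\\Z").matcher(TEST).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole());
        System.out.println(findsSomewhere());
        System.out.println(findsAnchored());
    }
}
```

Only the unanchored find() succeeds, which mirrors the partial highlights a regex tester shows for the same input.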
the regex
\w* *\(*\)*&*\.*
will give you the items you described, but only in the order you described, and each one can be as many as wanted. So "skjhsklasdkjgsh((((())))))&&&&&....." works, but not mixing the characters.
You want a regex like this:
[\w ()&.]+
which will allow a mix of all characters.
edit: my regex knowledge is limited, so the above syntax may not be perfect.
