Why does the pattern ignore the space inside character class - java

I am trying to match some codes that are short strings with simple structure:
5 digits
Colon
Some letters
Space or underscore
Some digits.
I want to use the Pattern.COMMENTS option to format my pattern:
String pat = "(?x) ([0-9]{5}) : ([a-zA-Z]+ [_ ] [0-9]+) ";
This pattern works fine at https://regex101.com/r/oW8vQ4/1.
However, in Java, this line:
"31500:STR 200".matches(pat)
yields false.
Why does it return false here? Shouldn't [_ ] match the space even when Pattern.COMMENTS is enabled, since the space is inside a character class?

I think the problem is that you need to escape the space inside the character class. From http://www.regular-expressions.info/freespacing.html
Java, however, does not treat a character class as a single token in free-spacing mode. Java does ignore whitespace and comments inside character classes. So in Java's free-spacing mode, [abc] is identical to [ a b c ]. To add a space to a character class, you'll have to escape it with a backslash. But even in free-spacing mode, the negating caret must appear immediately after the opening bracket. [ ^ a b c ] matches any of the four characters ^, a, b or c just like [abc^] would. With the negating caret in the proper place, [^ a b c ] matches any character that is not a, b or c.
Give it a try with the pattern below. I just added \\ before the space, but I didn't test this myself.
String pat = "(?x) ([0-9]{5}) : ([a-zA-Z]+ [_\\ ] [0-9]+) ";
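A minimal sketch to check the behaviour described above, comparing the original pattern with the escaped-space version. The class and constant names are made up for illustration.

```java
class FreeSpacingDemo {
    // Original pattern: in Java's comments mode the bare space in [_ ] is ignored,
    // so the class effectively becomes [_].
    static final String UNESCAPED = "(?x) ([0-9]{5}) : ([a-zA-Z]+ [_ ] [0-9]+) ";

    // Same pattern with the space escaped, so it stays part of the class.
    static final String ESCAPED = "(?x) ([0-9]{5}) : ([a-zA-Z]+ [_\\ ] [0-9]+) ";

    public static void main(String[] args) {
        System.out.println("31500:STR 200".matches(UNESCAPED)); // false
        System.out.println("31500:STR_200".matches(UNESCAPED)); // true
        System.out.println("31500:STR 200".matches(ESCAPED));   // true
        System.out.println("31500:STR_200".matches(ESCAPED));   // true
    }
}
```

If the backslash-space looks too easy to miss, writing the class as [_\\x20] should be an equivalent alternative.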

Related

Java Regexp to match words only (', -, space)

What is the Java Regular expression to match all words containing only :
From a to z and A to Z
The ' - Space Characters but they must not be in the beginning or the
end.
Examples
test'test match
test' doesn't match
'test doesn't match
-test doesn't match
test- doesn't match
test-test match
You can use the following pattern: ^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$
Below are the examples:
String s1 = "abc";
String s2 = " abc";
String s3 = "abc ";
System.out.println(s1.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
System.out.println(s2.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
System.out.println(s3.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
If you want to allow the space character as well, the class is: [a-zA-Z ]
It checks that your string contains only a-z (lowercase), A-Z (uppercase) and space characters. If not, the test will fail.
Here's my solution:
/(\w{2,}(-|'|\s)\w{2,})/g
You can take it for a spin on Regexr.
It first checks for a word with \w, then for any of the three separators with "or" logic using |, and then for another word. The braces {} make sure the words on either end are at least 2 characters long, so contractions like don't aren't captured. You could set that to any value to require longer words, or omit the braces entirely.
Caveat: \w also looks for _ underscores. If you don't want that you could replace it with [a-zA-Z] like so:
/([a-zA-Z]{2,}(-|'|\s)[a-zA-Z]{2,})/g
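Note that the first pattern in this thread only accepts letters, so it rejects test'test as well. Here is a rough sketch of one way to express the original requirement in Java, assuming that separators (', - or space) may only appear singly between runs of letters. The helper name isValid is hypothetical.

```java
class WordSeparatorDemo {
    // Letters, optionally followed by groups of "one separator + letters".
    // The separator can therefore never be the first or last character.
    static boolean isValid(String s) {
        return s.matches("[a-zA-Z]+(?:[' -][a-zA-Z]+)*");
    }

    public static void main(String[] args) {
        String[] samples = {"test'test", "test'", "'test", "-test", "test-", "test-test"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

With this sketch test'test and test-test are accepted and the other four samples are rejected. Doubled separators such as test--test are also rejected, which may or may not be what you want.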

Validating name string with dashes and singlequotes

I am trying to validate a string with the following specification:
"Non-empty string that contains only letters, dashes, or single quotes"
I'm using String.matches("[a-zA-Z|-|']*") but it's not catching the - characters correctly. For example:
Test          Result   Should Be
==============================
shouldpass    true     true
fail3         false    false
&fail         false    false
pass-pass     false    true
pass'again    true     true
-'-'-pass     false    true
So "pass-pass" and "-'-'-pass" are failing. What am I doing wrong with my regex?
You should use the following regex:
[a-zA-Z'-]+
Your regex allows a literal |, and it contains a range from | to |. The hyphen must be placed at the end or beginning of the character class, or escaped in the middle, if you want to match a literal hyphen. The + quantifier at the end ensures the string is non-empty.
Another alternative is to include all Unicode letters:
[\p{L}'-]+
Java string: "[\\p{L}'-]+".
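To see the difference on the test cases from the question, something like the following could be used. The class and constant names are made up for this sketch.

```java
class NameValidationDemo {
    static final String ORIGINAL = "[a-zA-Z|-|']*"; // pattern from the question
    static final String FIXED = "[a-zA-Z'-]+";      // ASCII letters, ' and -
    static final String UNICODE = "[\\p{L}'-]+";    // any Unicode letter, ' and -

    public static void main(String[] args) {
        String[] tests = {"shouldpass", "fail3", "&fail", "pass-pass", "pass'again", "-'-'-pass"};
        for (String t : tests) {
            System.out.println(t + " original=" + t.matches(ORIGINAL) + " fixed=" + t.matches(FIXED));
        }
        // \u00e9 is e with an acute accent: a letter, but outside a-z.
        System.out.println("Jos\u00e9".matches(FIXED));   // false
        System.out.println("Jos\u00e9".matches(UNICODE)); // true
    }
}
```

The original pattern also accepts the empty string and strings containing |, which the fixed versions reject.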
Possible solution:
[a-zA-Z-']+
Problems with your regex:
If you don't want to accept empty strings, change * to + to accept one or more characters instead of zero or more.
Characters in character class are implicitly separated by OR operator. For instance:
regex [abc] is equivalent of this regex a|b|c.
So as you see regex engine doesn't need OR operator there, which means that | will be treated as simple pipe literal:
[a|b] represents a OR | OR b characters
You seem to know that - has special meaning in character class, which is to create range of characters like a-z. This means that |-| will be treated by regex engine as range of characters between | and | (which effectively is only one character: |) which looks like main problem of your regex.
To create - literal we either need to
escape it \-
place it where - can't be interpreted as a range. To be more precise, we need to place it somewhere it has no access to characters which could be used as the left and right range indicators l-r, like:
at start of character class [- ...] (no left range character)
at end of character class [... -] (no right range character)
right after another range like A-Z-x - Z was already used as the character representing the end of range A-Z, so it can't be reused in a Z-x range.
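The placement rules above can be checked with a small sketch like this one. The helper hyphenIsLiteral is hypothetical and simply asks whether a given character class matches a lone hyphen. It covers the escaped, start and end positions only.

```java
class HyphenPlacementDemo {
    // True when the given character class matches the single character "-".
    static boolean hyphenIsLiteral(String charClass) {
        return "-".matches(charClass);
    }

    public static void main(String[] args) {
        System.out.println(hyphenIsLiteral("[a\\-z]")); // escaped: true
        System.out.println(hyphenIsLiteral("[-az]"));   // at start: true
        System.out.println(hyphenIsLiteral("[az-]"));   // at end: true
        System.out.println(hyphenIsLiteral("[a-z]"));   // range a..z: false
        System.out.println(hyphenIsLiteral("[|-|]"));   // range |..|: false
        System.out.println("|".matches("[|-|]"));       // the range holds only |: true
    }
}
```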
This will work:
[a-zA-Z'-]+
Writing |-| inside the class creates a range from | to |; you just want the literal hyphen character.
Tested Here
try {
if (subjectString.matches("(?i)([a-z'-]+)")) {
// String matched entirely
} else {
// Match attempt failed
}
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
EXPLANATION:
(?i)([a-z'-]+)
----------
Options: Case insensitive; Exact spacing; Dot doesn't match line breaks; ^$ don't match at line breaks; Default line breaks
Match the regex below and capture its match into backreference number 1 «([a-z'-]+)»
Match a single character present in the list below «[a-z'-]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
A character in the range between “a” and “z” (case insensitive) «a-z»
The literal character “'” «'»
The literal character “-” «-»

RegExp to match string formed with a limited set of characters without reusing any character

I have a bunch of characters like this: A B B C D
And I have a few spaces like this: _ _ _
Is there a way to use regular expression to match any string that can be formed by "dragging" the available characters into the empty spaces?
So in the example, these are some valid matches:
A B C
A B B
B C B
D A B
But these are invalid:
A A B // Only one 'A' is available in the set
B B B // Only two 'B's are available in the set
Sorry if it has already been asked before.
vks's solution would work properly; here it is, optimised, with additions to fulfill the "_ _ _" rule:
^(?!(?:[^A]*A){2})(?!(?:[^B]*B){3})(?!(?:[^C]*C){2})(?!(?:[^D]*D){2})(?:[ABCD](?:\s|$)){3}
Here is a regex demo.
Changes from original regex:
Capturing groups are removed since we're in Java (the Java regex implementation spends time recording captured groups during matching).
The anchor ^ is moved in front for readability of the regex.
Regex explanation:
^ Asserts position at the start of the match.
(?! Negative lookahead - Asserts that our position does not match the following, without moving the pointer:
(?:[^A]*A){2} Two "A"s (literal character), with non-"A"s rolled over in an optimal way.
) Closes the group.
(?!(?:[^B]*B){3}) Same as the above group - Asserts that there are not three "B"s in the match.
(?!(?:[^C]*C){2}) Asserts that there are not two "C"s in the match.
(?!(?:[^D]*D){2}) Asserts that there are not two "D"s in the match.
(?: Non-capturing group: Matches the following without capturing.
[ABCD] Any character from the list "A", "B", "C", or "D".
(?:\s|$) A whitespace, or the end of string.
){3} Three times - Must match the sequence exactly three times to fulfill the "_ _ _" rule.
To use the regex:
boolean fulfillsRule(String str) {
Pattern tripleRule = Pattern.compile("^(?!(?:[^A]*A){2})(?!(?:[^B]*B){3})(?!(?:[^C]*C){2})(?!(?:[^D]*D){2})(?:[ABCD](?:\\s|$)){3}");
return tripleRule.matcher(str).find();
}
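As a quick sanity check, here is a self-contained sketch that runs the regex above against the examples from the question. The class name is made up, and the backslash in \s is doubled because the pattern sits in a Java string literal.

```java
import java.util.regex.Pattern;

class LimitedSetDemo {
    static final Pattern TRIPLE_RULE = Pattern.compile(
        "^(?!(?:[^A]*A){2})(?!(?:[^B]*B){3})(?!(?:[^C]*C){2})(?!(?:[^D]*D){2})"
        + "(?:[ABCD](?:\\s|$)){3}");

    static boolean fulfillsRule(String str) {
        return TRIPLE_RULE.matcher(str).find();
    }

    public static void main(String[] args) {
        String[] samples = {"A B C", "A B B", "B C B", "D A B", "A A B", "B B B"};
        for (String s : samples) {
            System.out.println(s + " -> " + fulfillsRule(s));
        }
    }
}
```

The first four samples are accepted and the last two rejected. Because find() is used and the pattern has no end anchor, a longer input such as "A B C D" is accepted too. If that matters, appending $ to the pattern is one way to tighten it.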
(?!(.*?A){2,})(?!(.*?B){3,})(?!((.*?C){2,}))(?!((.*?D){2,}))^[ABCD]*$
You can use something like this. See demo.
http://regex101.com/r/uH3fV3/1
Interesting problem, this is my idea:
(?m)^(?!.*([ACD]).*\1)(?!(?>.*?B){3})(?>[A-D] ){2}[A-D]$
Used (?m) MULTILINE modifier where ^ matches line start and $ line end.
Test at regexplanet (click on Java); regex101 (non Java)
If I understood it right, the available character-pot is A,B,B,C,D. A string should be valid, if it contains 0 or 1 of each [ACD] or 0-2 of B in your example. My pattern consists of 3 parts:
(?!.*([ACD]).*\1) Used at line-start ^ a negative lookahead to assure, that [ACD] occurs at most one time, by capturing [ACD] to \1 and checking, it does not occur twice anywhere.
(?!(?>.*?B){3}) Using a negative lookahead, to assure, B occurs at most 2x.
finally (?>[A-D] ){2}[A-D]$ determines the total usable character set, assures the formatting, where each letter must be preceded by space or start, and checks the length.
This can be easily modified to other needs. Also see SO Regex FAQ
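Since this pattern uses (?m), it can pick the valid lines out of a multi-line input. A small sketch, assuming one candidate per line; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineSetDemo {
    static final Pattern LINE_RULE = Pattern.compile(
        "(?m)^(?!.*([ACD]).*\\1)(?!(?>.*?B){3})(?>[A-D] ){2}[A-D]$");

    // Collects every line of the input that satisfies the rule.
    static List<String> validLines(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = LINE_RULE.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "A B C\nA A B\nB C B\nB B B\nD A B";
        System.out.println(validLines(input)); // [A B C, B C B, D A B]
    }
}
```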

Java regex "[.]" vs "."

I'm trying to use some regex in Java and I came across this when debugging my code.
What's the difference between [.] and .?
I was surprised that .at would match "cat" but [.]at wouldn't.
[.] matches a dot (.) literally, while . matches any character except a line terminator such as \n (unless you use DOTALL mode).
You can also use \. ("\\." if you use java string literal) to literally match dot.
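A short sketch of the difference, using the .at example from the question. The class name is arbitrary.

```java
import java.util.regex.Pattern;

class DotVsClassDemo {
    public static void main(String[] args) {
        System.out.println("cat".matches(".at"));    // true: . matches 'c'
        System.out.println("cat".matches("[.]at"));  // false: [.] needs a literal dot
        System.out.println(".at".matches("[.]at"));  // true
        System.out.println(".at".matches("\\.at"));  // true: escaped dot
        System.out.println("\nat".matches(".at"));   // false: . skips line terminators
        System.out.println(
            Pattern.compile(".at", Pattern.DOTALL).matcher("\nat").matches()); // true
    }
}
```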
The [ and ] are metacharacters that let you define a character class. Most regex metacharacters enclosed in square brackets are interpreted literally. You can include multiple characters as well:
[.=*&^$] // Matches any single character from the list '.','=','*','&','^','$'
There are two specific things you need to know about the [...] syntax:
The ^ symbol at the beginning of the group has a special meaning: it inverts what's matched by the group. For example, [^.] matches any character except a dot .
Dash - in between two characters means any code point between the two. For example, [A-Z] matches any single uppercase letter. You can use dash multiple times - for example, [A-Za-z0-9] means "any single upper- or lower-case letter or a digit".
The two constructs above (^ and -) are common to nearly all regex engines; some engines (such as Java's) define additional syntax specific only to these engines.
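The points above can be illustrated roughly like this. The intersection syntax && in the last two lines is one of the Java-specific additions; the class name is just for the example.

```java
class CharClassSyntaxDemo {
    public static void main(String[] args) {
        // Metacharacters listed inside a class are taken literally.
        System.out.println("^".matches("[.=*&^$]"));       // true
        // A leading ^ negates the class.
        System.out.println("x".matches("[^.]"));           // true
        System.out.println(".".matches("[^.]"));           // false
        // A dash between two characters forms a range.
        System.out.println("Q".matches("[A-Z]"));          // true
        System.out.println("q".matches("[A-Z]"));          // false
        System.out.println("7".matches("[A-Za-z0-9]"));    // true
        // Java-specific intersection: lowercase letters that are not vowels.
        System.out.println("b".matches("[a-z&&[^aeiou]]")); // true
        System.out.println("e".matches("[a-z&&[^aeiou]]")); // false
    }
}
```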
regular-expression constructs
. => Any character (may or may not match line terminators)
and to match the dot . use the following
[.] => matches a dot
\\. => matches a dot
NOTE: Character classes in Java regular expressions are defined using square brackets "[ ]"; such a subexpression matches a single character from the specified set of possible characters.
Example: in the string address, replace every "." with "[.]"
public static void main(String[] args) {
String address = "1.1.1.1";
System.out.println(address.replaceAll("[.]","[.]"));
}
if anything is missed please add :)

Regex excluding square brackets

I am new to regex. I have this regex:
\[(.*[^(\]|\[)].*)\]
Basically it should take this:
[[a][b][[c]]]
And be able to replace with:
[dd[d]]
abc, d are unrelated. Needless to say the regex bit isn't working. it replaces the entire string with "d" in this case.
Any explanation or aid would be great!
EDIT:
I tried another regex,
\[([^\]]{0})\]
This one worked for the case where brackets contain no inner brackets and nothing else inside. But it doesn't work for the described case.
You need to know that the dot . is a special character which represents "any character except a line terminator", and that * is greedy, so it will try to find the longest possible match.
In your regex \[(.*[^(\]|\[)].*)\] the first .* grabs the longest possible run of characters after the opening [, and the rest, [^(\]|\[)].*\], can be read as: one character that is not [ or ] (nor (, | or ), since they are literal inside the class), optional other characters .* and finally ]. So this regex will match your entire input.
To get rid of that problem, remove both .* from your regex. Also, you don't need | or ( ) inside [^...].
System.out.println("[[a][b][[c]]]".replaceAll("\\[[^\\]\\[]\\]", "d"));
Output: [dd[d]]
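For comparison, a small sketch that runs the greedy pattern from the question next to the corrected one. The third constant, which adds + so that longer contents are handled, is an extension assumed here and not part of the answer above. All names are illustrative.

```java
class BracketReplaceDemo {
    static final String GREEDY = "\\[(.*[^(\\]|\\[)].*)\\]";   // pattern from the question
    static final String INNERMOST = "\\[[^\\]\\[]\\]";         // one non-bracket char inside [ ]
    static final String INNERMOST_MULTI = "\\[[^\\]\\[]+\\]";  // one or more non-bracket chars

    public static void main(String[] args) {
        String input = "[[a][b][[c]]]";
        System.out.println(input.replaceAll(GREEDY, "d"));    // d
        System.out.println(input.replaceAll(INNERMOST, "d")); // [dd[d]]
        System.out.println("[[ab][cd][[ef]]]".replaceAll(INNERMOST_MULTI, "d")); // [dd[d]]
    }
}
```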
\[(\[a\])(\[b\])\[(\[c\])\]\]
If you need to double backslashes in the current context (such as you are placing it in a "" style string):
\\[(\\[a\\])(\\[b\\])\\[(\\[c\\])\\]\\]
An example replacement for a, b and c is [^\]]*, or if you need to escape backslashes [^\\]]*.
Now you can replace capture one, capture two and capture three each with d.
If the string you are replacing in is not exactly of that format, then you want to do a global replacement with
(\[a\])
replacing a,
(\[[^\]]*\])
doubling backslashes,
(\\[[^\\]]*\\])
Try this:
System.out.println("[[a][b][[c]]]".replaceAll("\\[[^]\\[]]", "d"));
if a,b,c are in real world more than one character, use this:
System.out.println("[[a][b][[c]]]".replaceAll("\\[[^]\\[]++]", "d"));
The idea is to use a character class that contains all characters but [ and ]. The class is: [^]\\[] and other square brackets in the pattern are literals.
Note that a literal closing square bracket doesn't need to be escaped at the first position in a character class or outside a character class.
