regex for single space and/or apostrophe in java


I'm having issues with my regex. Could anyone please help me with this? Requirement: the string should be alphabetic and may include a single apostrophe and/or a single space (the minimum length is 2).
Valid strings:
1. 'abc
2.' abc
3.abc '
4.abc'
5.a 'bc
6.a' bc
I have used the regex below. It works for scenarios 2, 4 and 6 but doesn't work for scenarios 1, 3 and 5.
Regex:
"(([a-zA-Z][a-zA-Z])| " +
"([a-zA-Z]*\\s\\'[a-zA-Z]*)|" +
"([a-zA-Z]*\\'\\s[a-zA-Z]*)|"+
"[a-zA-Z]*|" +
"[a-zA-Z]\\s|" +
"[a-zA-Z]\\'|" +
"\\s[a-zA-Z]|" +
"\\'[a-zA-Z]|"+
"\\s[a-zA-Z]*|"+
"\\'[a-zA-Z]*|" +
"[a-zA-Z]*\\s|"+
"[a-zA-Z]*\\')"

Code
Note: The link includes \r\n in the regex since the input is multiline
See regex in use here
^(?!(?:[^']*'){2})(?!(?:[^ ]* ){2})[a-z ']{2,}$
Results
Input
'abc
' abc
abc '
abc'
a 'bc
a' bc
abc
'
ab
a
a'' bc
a  bc
Output
Below are only strings that match requirements.
Note: The second to last string sample is ' (apostrophe and space), which, according to the OP's requirements, should match.
'abc
' abc
abc '
abc'
a 'bc
a' bc
abc
'
ab
Explanation
^ Assert position at the start of the line
(?!(?:[^']*'){2}) Negative lookahead ensuring what follows doesn't include 2 apostrophes '
(?!(?:[^ ]* ){2}) Negative lookahead ensuring what follows doesn't include 2 spaces
[a-z ']{2,} Match two or more of the characters in the set
$ Assert position at the end of the line
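The pattern above can be tried from Java with a small sketch like the one below. The class name SingleSpaceApostropheCheck and the method isValid are hypothetical, and the CASE_INSENSITIVE flag is an assumption made so that [a-z] also accepts uppercase letters.

```java
import java.util.regex.Pattern;

class SingleSpaceApostropheCheck {
    // No second apostrophe, no second space, then 2+ letters/spaces/apostrophes.
    // CASE_INSENSITIVE is an assumption so that [a-z] also accepts A-Z.
    private static final Pattern VALID = Pattern.compile(
            "^(?!(?:[^']*'){2})(?!(?:[^ ]* ){2})[a-z ']{2,}$",
            Pattern.CASE_INSENSITIVE);

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The last three samples break the length, apostrophe and space rules.
        String[] samples = {"'abc", "abc '", "a 'bc", "ab", "a", "a'' bc", "a  bc"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isValid(s));
        }
    }
}
```

Because matches() already requires the whole input to match, the ^ and $ anchors are redundant here; they are kept only so the pattern stays identical to the one explained above.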

Based on the explanation provided, there is a shorter pattern that fits the stated requirement:
^( |'|[a-zA-Z]){2,}
The leading ^ anchors the match at the start of the string, so it cannot start in the middle of the input.
The parentheses enclose all the allowed alternatives: a space, an apostrophe, or an alphabetic character.
Finally, {2,} requires at least 2 characters. If you want to limit the maximum length, add a second number on the right-hand side, e.g. {2,10}.
Note that this pattern does not limit how many spaces or apostrophes appear, and without a trailing $ it does not constrain the end of the string unless it is used with matches().

Related

Java regex shortest match

I have the following string, (a.1) (b.2) (c.3) (d.4). I want to change it to (1) (2) (3) (4). I use the following method.
str.replaceAll("\(.*[.](.*)\)","($1)"). And I only get (4). What is the correct method?
Thanks

Couple things here. First, your escapes for the parentheses are incorrect. In Java string literals, backslash itself is an escape character, meaning you need to use \\( to represent \( in regex.
I think your question is how to do non-greedy matches in regex. Use ? to specify non-greedy matching; e.g. *? means "zero or more times, but as few times as possible".
This doesn't negate other answers, but they depend on your test input being as simple as it is in your question. This gives me the correct output without changing the spirit of your original regex (that only the parentheses and dot delimiter are known to be present):
String test = "(a.1) (b.2) (c.3) (d.4)";
String replaced = test.replaceAll("\\(.*?[.](.*?)\\)", "($1)");
System.out.println(replaced); // "(1) (2) (3) (4)"

Root cause
You want to match ()-delimited substrings, but are using .* greedy dot pattern that can match any 0 or more chars (other than line break chars). The \(.*[.](.*)\) pattern will match the first ( in (a.1) (b.2) (c.3) (d.4), then .* will grab the whole string, and backtracking will start trying to accommodate text for the subsequent obligatory subpatterns. [.] will find the last . in the string, the one before the last digit, 4. Then, (.*) will again grab all the rest of the string, but since the ) is required right after, due to backtracking the last (.*) will only capture 4.
Why is lazy / reluctant .*? not a solution?
Even if you use \(.*?[.](.*?)\), if there are (xxx) like substrings inside the string, they will get matched together with expected matches, as . matches any char but line break chars.
Solution
.replaceAll("\\([^()]*\\.([^()]*)\\)", "($1)")
See the regex demo. The [^()] will only match any char BUT a ( and ).
Details
\( - a ( char
[^()]* - a negated character class matching 0 or more chars other than ( and )
\. - a dot
([^()]*) - Group 1 (its value is later referred to with $1 from the replacement pattern): any 0+ chars other than ( and )
\) - a ) char.
Java demo:
List<String> strs = Arrays.asList("(a.1) (b.2) (c.3) (d.4)", "(a.1) (xxxx) (b.2) (c.3) (d.4)");
for (String str : strs)
System.out.println("\"" + str.replaceAll("\\([^()]*\\.([^()]*)\\)", "($1)") + "\"");
Output:
"(1) (2) (3) (4)"
"(1) (xxxx) (2) (3) (4)"

Try this one; it matches any letter, . or " and replaces each of them with an empty string "":
str.replaceAll("[a-zA-Z\\.\"]", "")
Edit:
You can use also [^\\d)(\\s] to match all characters that are not number, space and )( and replace them all with empty "" string
String str = "(a.1) (b.2) (c.3) (d.4)";
System.out.println(str.replaceAll("[^\\d)(\\s]",""));

Try this
str.replaceAll("[A-Za-z0-9]+\\.","");
[A-Za-z0-9] will match the upper case, lower case and digits. If you want to match anything before the dot(.) you can use .+ or .* in the place of [A-Za-z0-9]+

Optimization of a regex for a Java identifier. Separating the number in the ending and the other part

I need to read a string as valid Java identifier and to get separately the number in the ending (if there is any) and the start part.
a1 -> a,1
a -> a,
a123b -> a123b,
ab123 -> ab, 123
a123b456 -> a123b, 456
a123b456c789 -> a123b456c, 789
_a123b456c789 -> _a123b456c, 789
I had written a pair of regex that I have tested on http://www.regexplanet.com/advanced/java/index.html, and they look to work OK
([a-zA-Z_][a-zA-Z0-9_]*[a-zA-Z_]|[a-zA-Z_])(\d+)$
([a-zA-Z_](?:[a-zA-Z0-9_]*[a-zA-Z_])?)(\d+)$
How can I shorten them? Or can you advise another regex?
I can't replace [a-zA-Z_] with \w, because the latter matches digits, too.
(These are the regex strings BEFORE replacing \ with \\ for Java/Groovy.)

The Incremental Java says:
Each identifier must have at least one character.
The first character must be picked from: alpha, underscore, or dollar sign. The first character can not be a digit.
The rest of the characters (besides the first) can be from: alpha, digit, underscore, or dollar sign. In other words, it can be any valid identifier character.
Put simply, an identifier is one or more characters selected from alpha, digit, underscore, or dollar sign. The only restriction is the first character can't be a digit.
And the Java docs also add:
The convention, however, is to always begin your variable names with a letter, not "$" or "_". Additionally, the dollar sign character, by convention, is never used at all.
You may use this one that can be used to match any valid variable and put the starting chunk of chars into one group and all the trailing digits into another group:
^(?!\d)([$\w]+?)(\d*)$
See the regex demo
Or this one that will only match the identifiers that follow the convention:
^(?![\d_])(\w+?)(\d*)$
See this regex demo
Details:
^ - start of string
(?!\d) - the first char cannot be a digit ((?![\d_]) will fail the match if the first char is digit or _)
([$\w]+?) - Group 1: one or more word or $ chars (the (\w+?) will just match letters/digit/_ chars), as few as possible (as the +? is a lazy quantifier) up to the first occurrence of...
(\d*)$ - Group 2: zero or more digits at the end of string ($).
Groovy demo:
// Non-convention Java identifier
def res = 'a123b$456_c789' =~ /^(?!\d)([$\w]+?)(\d*)$/
print("${res[0][1]} : ${res[0][2]}") // => a123b$456_c : 789
// Convention Java identifier
def res2 = 'a123b456_c' =~ /^(?!\d)([$\w]+?)(\d*)$/
print("${res2[0][1]} : ${res2[0][2]}") // => a123b456_c :
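For readers using Java rather than Groovy, the first pattern can be exercised with a sketch like this. The class name IdentifierSplitter and the method split are hypothetical. It also assumes ASCII-only identifiers, since \w in Java matches only [a-zA-Z_0-9] unless UNICODE_CHARACTER_CLASS is set.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdentifierSplitter {
    // Group 1: lazy head of the identifier; group 2: trailing digits (may be empty).
    private static final Pattern ID = Pattern.compile("^(?!\\d)([$\\w]+?)(\\d*)$");

    // Returns {head, trailingDigits}, or null when the input is not matched.
    static String[] split(String identifier) {
        Matcher m = ID.matcher(identifier);
        return m.matches() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        String[] samples = {"a1", "a", "a123b", "ab123", "a123b456c789", "_a123b456c789"};
        for (String id : samples) {
            String[] parts = split(id);
            System.out.println(id + " -> " + parts[0] + "," + parts[1]);
        }
    }
}
```

The lazy +? makes group 1 stop as early as possible, so the trailing run of digits always ends up in group 2.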

EDIT: I tried to make my solution as simple as I could but I didn't think about it long enough so it is incorrect. Just look at the accepted answer
I believe you can shorten it to ^([a-zA-Z_][a-zA-Z_\d]*[^\d])(\d*)$ - match all possible characters with not a number at the end, and a number.

Match consecutive single characters as whole word

While filtering a list of strings, I want to match consecutive single characters as a whole word,
e.g. for the strings below:
'm g road'
'some a b c d limited'
in first case should match if user types
"mg" or "m g" or "m g road" or "mg road"
in second case should match if user types
"some abcd" or "some a b c d" or "abcd" or "a b c d"
How can I do that? Can I achieve this using regex?
I can already handle the order of whole words by searching for the words one by one, but I'm not sure how to treat consecutive single characters as a single word,
e.g. "mg road" or "road mg" I can handle by searching for "mg" and "road" one by one.
EDIT
For making requirement more clear, below is my test case
@Test
public void testRemoveSpaceFromConsecutiveSingleCharacters() throws Exception {
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("some a b c d limited").equals("some abcd limited"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("m g road").equals("mg road"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("bank a b c").equals("bank abc"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("bank a b c limited n a").equals("bank abc limited na"));
    Assert.assertTrue(Main.removeSpaceFromConsecutiveSingleCharacters("c road").equals("c road"));
}

1.) Strip out spaces within space-surrounded single letters from stringtocheck and userinput.
.replaceAll("(?<=\\b\\w) +(?=\\w\\b)","")
(?<=\b\w) look behind to check if preceded by \b word boundary, \w word character
(?=\w\b) look ahead to check if followed by \w word character, \b word boundary
See demo at regexplanet (click Java)
2.) Check if stringtocheck .contains userinput.
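The two steps above can be sketched as follows. The method name removeSpaceFromConsecutiveSingleCharacters comes from the question's test case; the class name SingleCharJoiner and the helper containsNormalized are hypothetical.

```java
class SingleCharJoiner {
    // Step 1: drop spaces that sit between two single-character words.
    static String removeSpaceFromConsecutiveSingleCharacters(String s) {
        return s.replaceAll("(?<=\\b\\w) +(?=\\w\\b)", "");
    }

    // Step 2: normalize both strings, then do a plain substring check.
    static boolean containsNormalized(String stringToCheck, String userInput) {
        return removeSpaceFromConsecutiveSingleCharacters(stringToCheck)
                .contains(removeSpaceFromConsecutiveSingleCharacters(userInput));
    }

    public static void main(String[] args) {
        System.out.println(removeSpaceFromConsecutiveSingleCharacters("some a b c d limited"));
        System.out.println(containsNormalized("m g road", "mg road"));
    }
}
```

Normalizing both sides means "a b c d" and "abcd" compare equal without building a separate regex per search term.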

It sounds like you simply want to ignore whitespace. You can easily do this by stripping whitespace from both the target string and the user input before looking for a match.

You're basically wanting each search term to be modified to allow intervening spaces, so
"abcd" becomes regex "\ba ?b ?c ?d\b"
To achieve this, do this to each word before matching:
word = "\\b" + word.replaceAll("(?<=.)(?=.)", " ?") + "\\b";
The word breaks \b are necessary to stop matching "comma bcd" or "abc duck".
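A minimal sketch of this approach, assuming each search word contains only letters or digits (regex metacharacters in the word would need escaping first). The class name SpacedWordPattern and its methods are hypothetical.

```java
import java.util.regex.Pattern;

class SpacedWordPattern {
    // Inserts an optional space between every pair of adjacent characters
    // and wraps the result in word boundaries: "abcd" -> \ba ?b ?c ?d\b
    static Pattern forWord(String word) {
        String regex = "\\b" + word.replaceAll("(?<=.)(?=.)", " ?") + "\\b";
        return Pattern.compile(regex);
    }

    static boolean found(String word, String text) {
        return forWord(word).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(forWord("abcd").pattern());
        System.out.println(found("abcd", "some a b c d limited"));
        System.out.println(found("abcd", "comma bcd"));
    }
}
```

Unlike stripping all whitespace, this keeps the target string untouched and only relaxes the search term.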

This regex will match all single characters separated by one or more spaces
(^(\w\s+)+)|(\s+\w)+$|((\s+\w)+\s+)

The following regex (in multiline mode) could help you out:
^(?<first>\w+)(?<chars>(?:.(?!(?:\b\w{2,}\b)))*)
# assure that it is the beginning of the line
# capture as many word characters as possible in the first group "first"
# the construction afterwards consumes everything up to (not including)
# a word which has at least two characters...
# ... and saves it to the group called "chars"
You would only need to replace the whitespaces in the second group (aka "chars").
See a demo on regex101.com

str = str.replaceAll("\\s","");

Unique regex for first name and last name

I have a single input where users should enter their first name and surname. I need to validate it with a regex. Here is the list of requirements:
The name should start with a capital letter (not a space)
There can't be consecutive spaces
It must support names made of two or more words, so that everyone is able to enter their full name. Example:
John Smith
and
Armirat Bair Hossan
The last character shouldn't be a space.
Please help.
At the moment I have a regex like
^\\p{L}\\[p{L} ,.'-]+$
but it denies ALL input, which is not good
Thanks for helping me
UPDATE:
CORRECT INPUT:
"John Smith"
"Alberto del Muerto"
INCORRECT
" John Smith "
" John Smith"

You can use
^[\p{Lu}\p{M}][\p{L}\p{M},.'-]+(?: [\p{L}\p{M},.'-]+)*$
or
^\p{Lu}\p{M}*+(?:\p{L}\p{M}*+|[,.'-])++(?: (?:\p{L}\p{M}*+|[,.'-])++)*+$
See the regex demo and demo 2
Java declaration:
if (str.matches("[\\p{Lu}\\p{M}][\\p{L}\\p{M},.'-]+(?: [\\p{L}\\p{M},.'-]+)*")) { ... }
// or if (str.matches("\\p{Lu}\\p{M}*+(?:\\p{L}\\p{M}*+|[,.'-])++(?: (?:\\p{L}\\p{M}*+|[,.'-])++)*+")) { ... }
The first regex breakdown:
^ - start of string (not necessary with matches() method)
[\p{Lu}\p{M}] - 1 Unicode letter (incl. precomposed ones as \p{M} matches diacritics and \p{Lu} matches any uppercase Unicode base letter)
[\p{L}\p{M},.'-]+ - matches 1 or more Unicode letters, diacritics, or ,, ., ' or - characters (if 1-letter names are valid, replace + with * at the end here)
(?: [\p{L}\p{M},.'-]+)* - 0 or more sequences of
  a literal space, followed by
  [\p{L}\p{M},.'-]+ - 1 or more characters that are Unicode letters, commas, periods, apostrophes or hyphens.
$ - end of string (not necessary with matches() method)
NOTE: Sometimes, names contain curly apostrophes, you can add them to the character classes ([‘’]).
The 2nd regex is less efficient but more accurate, as it will only match diacritics after base letters. See more about matching Unicode letters at regular-expressions.info:
To match a letter including any diacritics, use \p{L}\p{M}*+.
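As a quick check, the first regex can be run against the samples from the question with a sketch like this. The class name FullNameValidator and the method isValid are hypothetical.

```java
class FullNameValidator {
    // First regex from the answer; matches() anchors it at both ends implicitly.
    private static final String NAME =
            "[\\p{Lu}\\p{M}][\\p{L}\\p{M},.'-]+(?: [\\p{L}\\p{M},.'-]+)*";

    static boolean isValid(String s) {
        return s.matches(NAME);
    }

    public static void main(String[] args) {
        String[] samples = {"John Smith", "Alberto del Muerto", " John Smith ", " John Smith"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isValid(s));
        }
    }
}
```

Because the space is only allowed inside the optional group and must be followed by at least one name character, leading, trailing and doubled spaces are all rejected.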

Try this one
^[^- '](?=(?![A-Z]?[A-Z]))(?=(?![a-z]+[A-Z]))(?=(?!.*[A-Z][A-Z]))(?=(?!.*[- '][- ']))[A-Za-z- ']{2,}$
There is also an interactive Demo of this pattern available at an external website.

You made a typo: the second \\ should be in front of p.
However even then there is a check missing for a trailing space
"^\\p{L}[\\p{L} ,.'-]+$"
For a .matches the following would suffice
"\\p{L}[\\p{L} ,.'-]*[\\p{L}.]"
Names like "del Rey, Hidalgo" do not require an initial capital.
Also, I would advise simply calling .trim() on the input; imagine a user staring at input that was rejected because of a stray blank.

Try this
^[A-Z][a-z]+(([\s][A-Z])?[a-z]+){1,2}$
but use \\ instead of \ in a Java string literal

Correct existing regular expression / create a new one

I am trying to learn regular expressions and am trying to replace punctuation in a string with whitespace, so that I can feed the result into a tokenizer. The string might contain a lot of punctuation. However, I do not want to replace an apostrophe or hyphen that occurs inside a word.
For example,
six-pack => six-pack
He's => He's
This,that => This That
I tried to replace all the punctuations with whitespace initially but that would not work.
I tried to replace only those punctuations by specifying the wordboundaries as in
\B[^\p{L}\p{N}\s]+\B|\b[^\p{L}\p{N}\s]+\B|\B[^\p{L}\p{N}\s]+\b
But, I am not able to exclude the hyphen and apostrophe from them.
My guess is that the above regex is also very cumbersome and there should be a better way. Is there any?
So, all I am trying to do is:
Replace all punctuations with whitespace
Do not do the above if they are hyphen/apostrophe
Do replace if the hyphen/apostrophe does occur at start/end of a word.
Any help is appreciated.

You can probably work out a set of punctuation characters that are ok between words, and another set that isn't, then define your regular expression based on that.
For instance:
String[] input = {
    "six-pack",  // => six-pack
    "He's",      // => He's
    "This,that"  // => This that
};
for (String s : input) {
    System.out.println(s.replaceAll("(?<=\\w)[\\p{Punct}&&[^'-]](?=\\w)", " "));
}
Output
six-pack
He's
This that
Note
Here I'm defining the Pattern using a character class that contains all POSIX punctuation characters (\p{Punct}) except ' and -, and requiring the match to be preceded and followed by a word character.

You can use this lookahead based regex:
(?!((?!^)['-].))\\p{Punct}
RegEx Demo

You could use negative lookahead assertion like below,
String s = "six-pack\n"
+ "He's\n"
+ "This,that";
System.out.println(s.replaceAll("(?m)^['-]|['-]$|(?!['-])\\p{Punct}", " "));
Output:
six-pack
He's
This that
Explanation:
(?m) Multiline Mode
^['-] Matches ' or - which are at the start.
| OR
['-]$ Matches ' or - which are at the end of the line.
| OR
(?!['-])\\p{Punct} Matches all the punctuations except these two ' or - . It won't touch the matched [-'] symbols (ie, at the start and end).
RegEx Demo
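A small sketch wrapping the replacement above in a method, with one extra input that has a hyphen and an apostrophe at the edges. The class name PunctuationCleaner and the method clean are hypothetical. Note that, as the explanation says, "start" and "end" here mean the start and end of a line, not of every word.

```java
class PunctuationCleaner {
    // Replaces ' or - at the start/end of a line, and any other punctuation, with a space.
    static String clean(String s) {
        return s.replaceAll("(?m)^['-]|['-]$|(?!['-])\\p{Punct}", " ");
    }

    public static void main(String[] args) {
        String[] samples = {"six-pack", "He's", "This,that", "-six-pack'"};
        for (String s : samples) {
            System.out.println("[" + clean(s) + "]");
        }
    }
}
```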
