regex and java : Generating a regex for a String [duplicate]

regex and java : Generating a regex for a String [duplicate] - java

I have a String like - "Bangalore,India=Karnataka". From this String I would like to extract only the substring "Bangalore". In this case the regex can be - (.+),.*=.*. But the problem is, the String can sometimes come like only "Bangalore". Then in that case the above regex wont work. What will be the regex to get the substring "Bangalore" whatever the String be ?

Try this one:
^(.+?)(?:,.*?)?=.*$
Explanation:
^ # Begining of the string
( # begining of capture group 1
.+? # one or more any char non-greedy
) # end of group 1
(?: # beginig of NON capture group
, # a comma
.*? # 0 or more any char non-greedy
)? # end of non capture group, optional
= # equal sign
.* # 0 or more any char
$ # end of string
Updated:
I thougth OP have to match Bangalore,India=Karnataka or Bangalore=Karnataka but as farr as I understand it is Bangalore,India=Karnataka or Bangalore so the regex is much more simpler :
^([^,]+)
This will match, at the begining of the string, one or more non-comma character and capture them in group 1.

matcher.matches()
tries to match against the entire input string. Look at the javadoc for java.util.regex.Matcher. You need to use:
matcher.find()

Are you somehow forced to solve this using one regexp and nothing else? (Stupid interview question? Extremely inflexible external API?) In general, don't try to make regexes do what plain old programming constructs do better. Just use the obvious regex, and it it doesn't match, return the entire string instead.

Try this regex, This will grab any grouping of characters at the start followed by a comma but not the comma itself.
^.*(?=,)

If you are only interested to check that "Bangalore" is contained in the string then you don't need a regexp for this.
Python:
In [1]: s = 'Bangalorejkdjiefjiojhdu'
In [2]: 'Bangalore' in s
Out[2]: True

Related

Regex pattern matching with multiple strings

Forgive me. I am not familiarized much with Regex patterns.
I have created a regex pattern as below.
String regex = Pattern.quote(value) + ", [NnoneOoff0-9\\-\\+\\/]+|[NnoneOoff0-9\\-\\+\\/]+, "
+ Pattern.quote(value);
This regex pattern is failing with 2 different set of strings.
value = "207e/160";
Use Case 1 -
When channelStr = "207e/160, 149/80"
Then channelStr.matches(regex), returns "true".
Use Case 2 -
When channelStr = "207e/160, 149/80, 11"
Then channelStr.matches(regex), returns "false".
Not able to figure out why? As far I can understand it may be because of the multiple spaces involved when more than 2 strings are present with separated by comma.
Not sure what should be correct pattern I should write for more than 2 strings.
Any help will be appreciated.

If you print your pattern, it is:
\Q207e/160\E, [NnoneOoff0-9\-\+\/]+|[NnoneOoff0-9\-\+\/]+, \Q207e/160\E
It consists of an alternation | matching a mandatory comma as well on the left as on the right side.
Using matches(), should match the whole string and that is the case for 207e/160, 149/80 so that is a match.
Only for this string 207e/160, 149/80, 11 there are 2 comma's, so you do get a partial match for the first part of the string, but you don't match the whole string so matches() returns false.
See the matches in this regex demo.
To match all the values, you can use a repeating pattern:
^[NnoeOf0-9+/-]+(?:,\h*[NnoeOf0-90+/-]+)*$
^ Start of string
[NnoeOf0-9\\+/-]+
(?: Non capture group
,\h* Match a comma and optional horizontal whitespace chars
[NnoeOf0-90-9\\+/-]+ Match 1+ any of the listed in the character class
)* Close the non capture group and optionally repeat it (if there should be at least 1 comma, then the quantifier can be + instead of *)
$ End of string
Regex demo
Example using matches():
String channelStr1 = "207e/160, 149/80";
String channelStr2 = "207e/160, 149/80, 11";
String regex = "^[NnoeOf0-9+/-]+(?:,\\h*[NnoeOf0-90+/-]+)*$";
System.out.println(channelStr1.matches(regex));
System.out.println(channelStr2.matches(regex));
Output
true
true
Note that in the character class you can put - at the end not having to escape it, and the + and / also does not have to be escaped.

You can use regex101 to test your RegEx. it has a description of everything that's going on to help with debugging. They have a quick reference section bottom right that you can use to figure out what you can do with examples and stuff.
A few things, you can add literals with \, so \" for a literal double quote.
If you want the pattern to be one or more of something, you would use +. These are called quantifiers and can be applied to groups, tokens, etc. The token for a whitespace character is \s. So, one or more whitespace characters would be \s+.
It's difficult to tell exactly what you're trying to do, but hopefully pointing you to regex101 will help. If you want to provide examples of the current RegEx you have, what you want to match and then the strings you're using to test it I'll be happy to provide you with an example.

^(?:[NnoneOoff0-9\\-\\+\\/]+ *(?:, *(?!$)|$))+$
^ Start
(?: ... ) Non-capturing group that defines an item and its separator. After each item, except the last, the separator (,) must appear. Spaces (one, several, or none) can appear before and after the comma, which is specified with *. This group can appear one or more times to the end of the string, as specified by the + quantifier after the group's closing parenthesis.
Regex101 Test

Regex match certain IDs

I need help with some tricky regex to solve (for me!) and hope I can learn something to write some myself in the future.
I need to match all of the following IDs:
#1
#12
#123
#1234
#5069
#316&
#316.
#316;
and do not want to match leading zeros and numbers that end with ] or [ or are between ().
#0155
#0000155
#1123]
#1123[
(#1125)
I have come up with something like this: (#[1-9]\d{0,}), but it matches all of the above. So, I tried a different stuff like:
"(#[1-9]\\d{0,})([\\s,<\\.:&;\\)])"
"(#[1-9]+)([\\s,<\\.])"
"(?m)(#[1-9]+)(.,\(,\))"
But what I really want to do is (#[1-9]\d{0,}) to match all numbers BUT NOT FOLLOWING [ OR ] OR ( OR ).
How can I express something like this in a regex?
P.S.: The Regex needs to be used in Java.
Maybe someone can help to solve this, even better if he can explain how he got the way to the solution, so i can learn something new and help others when they struggle with the same problem.
kind regards!

You can use the following solution:
#[1-9]\d*(?![\[\])])\b[&.;]?
See demo
REGEX:
# - Matches # literally
[1-9] - 1 digit from 1 to 9
\d* - 0 or more digits
(?![\[\])]) - A negative lookahead making sure there is no [, ] or ) after the digits
\b - A word boundary
[&.;]? - An optional (?) character group matching &, . or ; literally.
Sample code:
String str = "#1\n#12\n#123\n#1234\n#5069\n#316&\n#316.\n#316;\nand not matches (leading zeros) and numbers that end with ] or [ or are between ().\n\n#0155\n#0000155\n#1123]\n#1123[\n(#1125)";
String rx = "#[1-9]\\d*(?![\\[\\])])\\b[&.;]?";
Pattern ptrn = Pattern.compile(rx);
Matcher m = ptrn.matcher(str);
while (m.find()) {
System.out.println(m.group(0));
}
See IDEONE demo
UPDATE
You can achieve the expected results with atomic grouping that prevents the regex engine from backtracking into it.
String rx = "#(?>[1-9]\\d*)(?![\\[\\])])[^\\w&&[^\n]]?";
In plain words, the check for brackets will be performed only after the last digit matched. See updated demo.
The [^\\w&&[^\n]]? pattern optionally matches any non-alphanumeric character but a newline. The newline is excluded from the character class using a character class intersection technique.

You may use possesive quantifier.
"#[1-9]\\d*+(?![\\[\\])])"
\\d*+ matches all zero or more character greedily and the + eixts after * won't let the regex engine to backtrack.
Add an optional \\W, if you want to match also the following non-word character.
"#[1-9]\\d*+(?![\\[\\])])\\W?"
DEMO

I am not able to test this in Java at the moment, but how about
"^#[1-9][0-9]*[&.;]?$"
(Any string starting with a '#', then a character from 1-9, then zero or more characters from 0-9, then a '&', '.' or ';' or nothing, end string)
EDIT: This only works if every id to check is in its own string, otherwise you'd need one of examples from other answers.

How to match the middle character in a string with regex?

In an odd number length string, how could you match (or capture) the middle character?
Is this possible with PCRE, plain Perl or Java regex flavors?
With .NET regex you could use balancing groups to solve it easily (that could be a good example). By plain Perl regex I mean not using any code constructs like (??{ ... }), with which you could run any code and of course do anything.
The string could be of any odd number length.
For example in the string 12345 you would want to get the 3, the character at the center of the string.
This is a question about the possibilities of modern regex flavors and not about the best algorithm to do that in some other way.

With PCRE and Perl (and probably Java) you could use:
^(?:.(?=.*?(?(1)(?=.\1$))(.\1?$)))*(.)
which would capture the middle character of odd length strings in the 2nd capturing group.
Explained:
^ # beginning of the string
(?: # loop
. # match a single character
(?=
# non-greedy lookahead to towards the end of string
.*?
# if we already have captured the end of the string (skip the first iteration)
(?(1)
# make sure we do not go past the correct position
(?= .\1$ )
)
# capture the end of the string +1 character, adding to \1 every iteration
( .\1?$ )
)
)* # repeat
# the middle character follows, capture it
(.)

Hmm, maybe someone can come up with a pure regex solution, but if not you could always dynamically build the regex like this:
public static void main(String[] args) throws Exception {
String s = "12345";
String regex = String.format(".{%d}3.{%d}", s.length() / 2, s.length() / 2);
Pattern p = Pattern.compile(regex);
System.out.println(p.matcher(s).matches());
}

Regexp: Specific characters in the text

My goal is to validate specific characters (*,^,+,?,$,[],[^]) in the some text, like:
?test.test => true
test.test => false
test^test => true
test:test => false
test-test$ => true
test-test => false
I've already created regex regarding to requirment above, but I am not sure in this.
^(.*)([\[\]\^\$\?\*\+])(.*)$
Will be good to know whether it can be optimized in such way.

Your regex is already optimized one as its very simple. You can make is much simpler or readable only.
Also if you use the matches() method of Java's String class then you'll not require the ^ and $ at the both ends.
.*([\\[\\]^$?*+]).*
Double slashes(\\) for Java, otherwise please use single slash(\).
Look, I have removed the captures () along with escape character \ for the characters ^$?*+ as they are inside the character class [].

TL;DR
The quickest regex to do the job is
# ^[^\]\[^$?*+]*([\]\[^$?*+])
^ #start of the string
[^ #any character BUT...
\]\[^$?*+ #...these ones (^$?*+ aren't special inside a character class)
]*+ #zero or more times (possessive quantifier)
([ #capture any of...
\]\[^$?*+ #...these characters
])
Be careful that in a java string, you need to escape the \ as well, so you should transform every \ into \\.
Discussion
At first two regex come in mind:
[\]\[^$?*+], which will match only the character you want inside the string.
^.*[\]\[^$?*+], which will match your string up to the desired character.
It's actually important performance-wise to understand the difference between the case with .* at the beginning and the one with no wildcard at all.
When searching for the pattern, the first .* will make the regex engine eat all the string, then backtrack character by character to see if it's a match for your character range [...]. So the regex will actually search from the end of the string.
This is an advantage when your wanted sign if near the end, a disadvantage when it is at the beginning.
On the other case, the regex engine will try every character, beginning from the left, until it matches what you want.
You can see what I mean with these two examples from the excellent regex101.com:
with the .*, match is found in 26 steps when near the beginning, 8 when it's near the beginning: http://regex101.com/r/oI3pS1/#debugger
without it, it is found in 5 steps when near the beginning and 23 when near the end
Now, if you want to combine these two approaches you can use the tl;dr answer: you eat everything that isn't your character, then you match your character (or fail if there isn't one).
On our example, it takes 7 steps wherever your character is in the string (and 7 steps even if there is no character, thanks to the possessive quantifier).

That should also work:
String regex = ".*[\\[\\]^$?*+].*";
String test1 = "?test.test";
String test2 = "test.test";
String test3 = "test^test";
String test4 = "test:test";
String test5 = "test-test$";
String test6 = "test-test";
System.out.println(test1.matches(regex));
System.out.println(test2.matches(regex));
System.out.println(test3.matches(regex));
System.out.println(test4.matches(regex));
System.out.println(test5.matches(regex));
System.out.println(test6.matches(regex));

Simple regex to match strings containing <n> chars

I'm writing this regexp as i need a method to find strings that does not have n dots,
I though that negative look ahead would be the best choice, so far my regexp is:
"^(?!\\.{3})$"
The way i read this is, between start and end of the string, there can be more or less then 3 dots but not 3.
Surprisingly for me this is not matching hello.here.im.greetings
Which instead i would expect to match.
I'm writing in Java so its a Perl like flavor, i'm not escaping the curly braces as its not needed in Java
Any advice?

You're on the right track:
"^(?!(?:[^.]*\\.){3}[^.]*$)"
will work as expected.
Your regex means
^ # Match the start of the string
(?!\\.{3}) # Make sure that there aren't three dots at the current position
$ # Match the end of the string
so it could only ever match the empty string.
My regex means:
^ # Match the start of the string
(?! # Make sure it's impossible to match...
(?: # the following:
[^.]* # any number of characters except dots
\\. # followed by a dot
){3} # exactly three times.
[^.]* # Now match only non-dot characters
$ # until the end of the string.
) # End of lookahead
Use it as follows:
Pattern regex = Pattern.compile("^(?!(?:[^.]*\\.){3}[^.]*$)");
Matcher regexMatcher = regex.matcher(subjectString);
foundMatch = regexMatcher.find();

Your regular expression only matches 'not' three consecutive dots. Your example seems to show you want to 'not' match 3 dots anywhere in the sentence.
Try this: ^(?!(?:.*\\.){3})
Demo+explanation: http://regex101.com/r/bS0qW1
Check out Tims answer instead.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

regex and java : Generating a regex for a String [duplicate] - java

matcher.matches() tries to match against the entire input string. Look at the javadoc for java.util.regex.Matcher. You need to use: matcher.find()

Try this regex, This will grab any grouping of characters at the start followed by a comma but not the comma itself. ^.*(?=,)

If you are only interested to check that "Bangalore" is contained in the string then you don't need a regexp for this. Python: In [1]: s = 'Bangalorejkdjiefjiojhdu' In [2]: 'Bangalore' in s Out[2]: True

Related

Regex pattern matching with multiple strings

Regex match certain IDs

How to match the middle character in a string with regex?

Regexp: Specific characters in the text

Simple regex to match strings containing <n> chars

Categories

Resources