Regex pattern loses last character - java

I have the following regex to extract a domain from a URL:
"^(http:\\/\\/|https:\\/\\/)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-zA-Z0-9]+)?$"
When I read the 3rd group, the domain is missing its last character. For example, facebook becomes faceboo.
I'm using Java 8.
The regex works fine when the path (group 4) doesn't contain any digits.
If I put a digit into the 4th group, it cuts off the domain's last character.

You need to escape the dot characters:
"^(http:\\/\\/|https:\\/\\/)?(www\\.)?([a-zA-Z0-9]+)\\.[a-zA-Z0-9]*\\.[a-z]{3}\\.?([a-zA-Z0-9]+)?$"
An unescaped dot is a special character in regex that means "any character", so it matches a literal dot but also any letter, digit, or other character.
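To see how the unescaped dots produce the truncated group, here is a minimal sketch. `ORIGINAL` is the pattern from the question. `SIMPLIFIED` is a hypothetical variant written only for this illustration (it is not the pattern from the answer): it uses literal dots and an explicit `/` in front of the optional path, because once every dot is escaped nothing in the answer's pattern can consume that slash. The class and method names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DomainGroupDemo {
    // The pattern from the question, with unescaped dots.
    static final Pattern ORIGINAL = Pattern.compile(
        "^(http:\\/\\/|https:\\/\\/)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-zA-Z0-9]+)?$");

    // Hypothetical simplified variant (an assumption, not from the answer):
    // literal dots, and an explicit "/" in front of the optional path.
    static final Pattern SIMPLIFIED = Pattern.compile(
        "^(?:https?://)?(?:www\\.)?([a-zA-Z0-9]+)\\.[a-z]{2,}(?:/([a-zA-Z0-9]+))?$");

    // Returns the requested group, or null when the URL does not match.
    static String group(Pattern p, String url, int group) {
        Matcher m = p.matcher(url);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        // Letters-only path: the unescaped dots happen to line up.
        System.out.println(group(ORIGINAL, "http://www.facebook.com/path", 3)); // facebook
        // Digit path: one unescaped dot has to consume the "k" for a match.
        System.out.println(group(ORIGINAL, "http://www.facebook.com/1", 3));    // faceboo
        // With literal dots the domain group stays intact.
        System.out.println(group(SIMPLIFIED, "http://www.facebook.com/1", 1));  // facebook
    }
}
```

In the second call the engine can only succeed by letting the first `.` match the `k`, which is exactly the lost character.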

Why does Java regex match underscore?

I was trying to match the URL pattern string.string. for any number of repetitions of string. As a first attempt I used ^([^\\W_]+.)([^\\W_]+.)$, which works for matching two consecutive parts. But when I generalize it to ^([^\\W_]+.)+$, it stops working and matches the wrong input "string.str_ing.".
Do you know what is incorrect with the second version?

You need to escape your . character; otherwise it will match any character, including _.
^([^\\W_]+\\.?)+$
This can be your generalised regex.

With ^([^\\W_]+.)([^\\W_]+.)$ you match exactly two words built from the restricted character set. Although the . is not escaped, it still works for your input: the first group matches string and then any single character (that is what an unescaped . means), and the second group does the same.
In the generalized version, the unescaped dot (.) is part of a capturing group that repeats one or more times (because of the +), so it allows any character as a separator. In other words, string.str_ing. is read as:
string as the 1st word
str as the 2nd word
ing as the 3rd word
This works because the unescaped dot (.) accepts any separator (both a literal . and _).
Escape the dot to make the regex work as intended:
^([^\\W_]+\\.)+$
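The difference between the two generalized patterns can be checked with a short sketch. The class and method names are illustrative only; the patterns are the ones discussed above, written as Java string literals.

```java
import java.util.regex.Pattern;

class DotEscapeDemo {
    // Unescaped dot: any character may separate the words.
    static final Pattern UNESCAPED = Pattern.compile("^([^\\W_]+.)+$");
    // Escaped dot: only a literal "." may separate the words.
    static final Pattern ESCAPED = Pattern.compile("^([^\\W_]+\\.)+$");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(UNESCAPED, "string.str_ing.")); // true: "_" is taken as a separator
        System.out.println(matches(ESCAPED, "string.str_ing."));   // false
        System.out.println(matches(ESCAPED, "string.string."));    // true
    }
}
```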

[^\W] seems a weird choice: it matches "not a non-word character", which is equivalent to \w, i.e. a word character.
"Word characters" are uppercase letters, lowercase letters, digits, and underscore, so [^\W] or \w on its own would match underscores. Note, however, that the class in the question is [^\W_], which explicitly removes the underscore again, so the class itself is not what lets string.str_ing. through.
If you only care about ASCII, you probably want [a-z]+ or maybe [A-Za-z0-9]+, which states the intent more directly.
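A quick way to compare these character classes is to test them against a single underscore. This is only a sketch; the helper name is made up for the example.

```java
class WordClassDemo {
    // True when the whole input matches the given regex.
    static boolean isMatch(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(isMatch("\\w", "_"));     // true: underscore is a word character
        System.out.println(isMatch("[^\\W]", "_"));  // true: same set as \w
        System.out.println(isMatch("[^\\W_]", "_")); // false: underscore is excluded
        System.out.println(isMatch("[^\\W_]", "a")); // true
    }
}
```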

Some more regex criteria in existing regex

I want to extend the regex below so that it also satisfies the following criteria:
^[\p{L}\d'][ \p{L}\d'-]*[\p{L}\d'-']$
It should start with a letter (A-Z or a-z) only.
It should also accept a single letter.
It should accept a hyphen (-), space, or dot (.) in the middle or at the end of the string (no other special characters).
It should accept digits in the middle and at the end of the string.
It should also keep the criteria that the existing regex already enforces.
E.g.
Expected -
t, T, test, test123, te12st, te-st, te.st, te st, éééééé, ṪỲɎɆḂɃɀȿȸȺȔȐȳɊÉâÇë, Επίθετο
Not Expected -
12test, 1, .test, -test, , tes*t (no special characters other than hyphen, dot and space)

To match the expected strings and reject the unexpected ones, while still accepting a single letter, you could match \pL at the start of the string, then repeat 0+ times any of the characters listed in [\d\pL .-], and then assert the end of the string.
Note that not all of your expected strings start with a-zA-Z.
^\pL[\d\pL .-]*$
In Java
String regex = "^\\pL[\\d\\pL .-]*$";
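As a rough check of this pattern against the samples from the question, a sketch like the following can be used. The class and method names are illustrative; the non-ASCII samples are written with Unicode escapes only so that the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class NameFormatDemo {
    // A letter first, then any mix of digits, letters, space, dot, or hyphen.
    static final Pattern NAME = Pattern.compile("^\\pL[\\d\\pL .-]*$");

    static boolean isValid(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] expected = {
            "t", "T", "test", "test123", "te12st", "te-st", "te.st", "te st",
            "\u00e9\u00e9\u00e9\u00e9\u00e9\u00e9",                 // éééééé
            "\u0395\u03c0\u03af\u03b8\u03b5\u03c4\u03bf"            // Επίθετο
        };
        String[] notExpected = {"12test", "1", ".test", "-test", "", "tes*t"};

        for (String s : expected) {
            System.out.println("'" + s + "' -> " + isValid(s));    // all true
        }
        for (String s : notExpected) {
            System.out.println("'" + s + "' -> " + isValid(s));    // all false
        }
    }
}
```

Note that `\d` in `java.util.regex` matches only ASCII digits unless `Pattern.UNICODE_CHARACTER_CLASS` is set.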

^[A-Za-z]+[\p{L}\d-.\s]*$
This is a possible solution: it 1) accepts one or more of A-Za-z, then 2) zero or more letters, digits, hyphens, whitespace characters, and periods in any combination. However, these expected test strings conflict with your first requirement, because they do not start with A-Z or a-z: éééééé, ṪỲɎɆḂɃɀȿȸȺȔȐȳɊÉâÇë, Επίθετο.
If you want it to also accept those three test strings, then this is a possible solution:
^[\p{L}]+[\p{L}\d-.\s]*$

Get all unique file names

To preface, I am a beginner with regex. I have a string that looks something like:
my_folder/foo.xml::someextracontent
my_folder/foo.xml::someextracontent
another_folder/foo.xml::someextracontent
my_folder/bar.xml::someextracontent
my_folder/bar.xml::someextracontent
my_folder/hello.xml::someextracontent
I want to return unique XML files which are part of my_folder. So the regex will return:
my_folder/foo.xml
my_folder/bar.xml
my_folder/hello.xml
I've taken a look at Extract All Unique Lines which is close to what I need but I am not sure where to go from there.
The closest attempt I got was (?sm)(my_folder\/.*?.xml)(?=.*\1) which gets all the duplicates but I want the opposite, so I tried doing a negative lookahead instead (?sm)(my_folder\/.*?.xml)(?!.*\1) but the capture groups are totally wrong.
What am I missing here in my regex? Here's a link to the regex: https://regex101.com/r/ggY2RB/1

This regex might help you find the unique strings you are looking for:
/(\w+\/\w+\.xml)(?![\s\S]*\1)/s
If you only wish to match my_folder, you might try this:
/(my_folder\/\w+\.xml)(?![\s\S]*\1)/s

Instead of using a positive lookahead (?=, you could use a negative lookahead (?! to get the unique strings: it asserts that what is on the right does not contain what you have captured in group 1.
In your pattern you make the dot match a newline using (?s) and use a non-greedy .*?, but you might instead use a negated character class that matches any character except a newline or a forward slash.
If the folder can also contain nested folders, you might use a pattern that repeats 0+ times 1+ characters other than a forward slash or newline, followed by a forward slash.
(?s)(my_folder/(?:[^/\n]+/)*[^/\n]+\.xml)::(?!.*\1)
(?s)
( Capture group
my_folder/ Match literally
(?:[^/\n]+/)* Repeat 0+ times not a forward slash or a newline followed by a forward slash
[^/\n]+\.xml Match 1+ times not a forward slash or a newline, followed by .xml
) Close capture group
::(?!.*\1) Match :: followed by asserting what is on the right does not contain what is captured in group 1
In Java
String regex = "(?s)(my_folder/(?:[^/\\n]+/)*[^/\\n]+\\.xml)::(?!.*\\1)";
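A minimal sketch of how this pattern could be applied in Java, collecting group 1 from every match. The class name `UniqueXmlFiles` and the method `uniqueFiles` are hypothetical; the input is the sample from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UniqueXmlFiles {
    // Matches a my_folder XML path only at its last occurrence in the input.
    static final Pattern LAST_OCCURRENCE = Pattern.compile(
        "(?s)(my_folder/(?:[^/\\n]+/)*[^/\\n]+\\.xml)::(?!.*\\1)");

    static List<String> uniqueFiles(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = LAST_OCCURRENCE.matcher(input);
        while (m.find()) {
            result.add(m.group(1)); // the path without the trailing "::"
        }
        return result;
    }

    public static void main(String[] args) {
        String input = String.join("\n",
            "my_folder/foo.xml::someextracontent",
            "my_folder/foo.xml::someextracontent",
            "another_folder/foo.xml::someextracontent",
            "my_folder/bar.xml::someextracontent",
            "my_folder/bar.xml::someextracontent",
            "my_folder/hello.xml::someextracontent");
        // [my_folder/foo.xml, my_folder/bar.xml, my_folder/hello.xml]
        System.out.println(uniqueFiles(input));
    }
}
```

Because the negative lookahead rejects every occurrence that is followed by a later copy, each path is reported once, at the position of its last occurrence.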

RegEx of underscore delimited string

I have a string with 5 pieces of data delimited by underscores:
AAA_BBB_CCC_DDD_EEE
I want a different regex for each component.
The regex needs to return just the one component.
For example, the first would return just AAA, the second for BBB, etc.
I am able to parse out AAA with the following:
^([^_]*)?
I see that I can do a look-around like this to find:
(?<=[^_]*_).*
BBB_CCC_DDD_EEE
But the following can not find just BBB
(?<=[^_]*_)[^_]*(?=_)

Mixing lookbehind and lookahead
^([^_]+)? // 1st
(?<=_)[^_]+ // 2nd
(?<=_)[^_]+(?=_[^_]+_[^_]+$) // 3rd
(?<=_)[^_]+(?=_[^_]+$) // 4th
[^_]+$ // 5th
If the lengths of the strings between the "_" are known, it can also be written like this:
1st match
^([^_]+)?
2nd match
(?<=_)\K[^_]+
3rd match
(?<=_[A-Za-z]{3}_)\K[^_]+
4th match
(?<=_[A-Za-z]{3}_[A-Za-z]{3}_)\K[^_]+
5th match
(?<=_[A-Za-z]{3}_[A-Za-z]{3}_[A-Za-z]{3}_)\K[^_]+
Each {3} expresses the length of the string between the "_".
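The question does not say which regex engine is used. Assuming Java, `\K` is not available in `java.util.regex`, so the sketch below tries only the first set of patterns (the ones that mix lookbehind and lookahead) and takes the first match of each. The class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnderscoreLookarounds {
    // One pattern per component, in order (1st to 5th).
    static final String[] PATTERNS = {
        "^([^_]+)?",
        "(?<=_)[^_]+",
        "(?<=_)[^_]+(?=_[^_]+_[^_]+$)",
        "(?<=_)[^_]+(?=_[^_]+$)",
        "[^_]+$"
    };

    // Returns the first match of the regex in the input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String regex : PATTERNS) {
            // prints AAA, BBB, CCC, DDD, EEE (one per line)
            System.out.println(firstMatch(regex, "AAA_BBB_CCC_DDD_EEE"));
        }
    }
}
```

The 2nd pattern only yields BBB because the first match is taken; repeated calls to `find()` would also return the later components.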

If your string always uses underscores, you might use a single regex that captures the value in a capturing group by repeating the pattern of what comes before it (in this case, one or more non-underscore characters followed by an underscore) with a quantifier that you can change, like {3}.
This way the quantifier specifies how many times the preceding pattern is repeated before your match is captured. For your example string AAA_BBB_CCC_DDD_EEE you could use {0}, {1}, {2}, {3} or {4}.
^(?:[^_\n]+_){3}([0-9A-Za-z]+)(?:_[^_\n]+)*$
That would match:
^ Assert position at start of the line
(?:[^_\n]+_){3} In a non-capturing group (?:, match NOT an underscore or a newline one or more times [^_\n]+ followed by an underscore, and repeat that n times (in this example n is 3)
([0-9A-Za-z]+) Capture your characters in a group using, for example, a character class (or use [^_]+ to match not an underscore, but that will also match whitespace characters)
(?:_[^_\n]+)* After your captured value, repeat a non-capturing group that matches an underscore followed by NOT an underscore or a newline one or more times, and repeat that group zero or more times to get a full match
$ Assert position at the end of the line
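Assuming Java, the quantifier can be filled in from a parameter. The sketch below builds the pattern for a zero-based component index; the class name and the `component` helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnderscoreComponent {
    // Returns the component at the given zero-based index, or null if the
    // input has no such component or does not fit the pattern.
    static String component(String input, int index) {
        Pattern p = Pattern.compile(
            "^(?:[^_\\n]+_){" + index + "}([0-9A-Za-z]+)(?:_[^_\\n]+)*$");
        Matcher m = p.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "AAA_BBB_CCC_DDD_EEE";
        System.out.println(component(s, 0)); // AAA
        System.out.println(component(s, 3)); // DDD
        System.out.println(component(s, 4)); // EEE
        System.out.println(component(s, 5)); // null
    }
}
```

If a regex is not strictly required, `s.split("_")[index]` returns the same components for this input.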

Java regular expression to validate numeric comma separated number and hyphen

Valid1: 2
Valid2: 3-5
Valid3: 2,4-6
Valid4: 2,4,5
Valid5: 2-7,8-9
Valid6: 2,5-7,9-13,15,17-20
All of the expressions above should be valid against a single regex.
The number on the left side of a hyphen should be smaller than the one on the right side.

First, as @MikeFHay suggested above, regexes were not made to check whether one number is bigger than another (for that you'll have to parse the expression). If we ignore that requirement, the rest can be achieved via the following regex:
((\d\,(?=\d))|(\d\-(?=\d))|\d)+
in Java:
"((\\d\\,(?=\\d))|(\\d\\-(?=\\d))|\\d)+"
Explanation:
This regex uses a lookahead to validate that each comma or dash is preceded and followed by a digit, e.g. (\d\,(?=\d)), so that each "substring" that contains a dash or comma has to be in the format digit,digit or digit-digit.
Of course, a number that doesn't contain commas or dashes is also valid, hence the rightmost side of the alternation, which is simply \d.
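Since the regex only checks the format, the ordering requirement needs a parsing step. The following is a sketch of one way to combine the two, assuming every number fits in an `int`; the class and method names are made up for the example. It also rejects chained ranges such as 2-3-4, which the regex alone lets through.

```java
import java.util.regex.Pattern;

class RangeListValidator {
    // Format check from the answer: digits, with each comma or dash
    // preceded and followed by a digit.
    static final Pattern FORMAT =
        Pattern.compile("((\\d\\,(?=\\d))|(\\d\\-(?=\\d))|\\d)+");

    static boolean isValid(String s) {
        if (!FORMAT.matcher(s).matches()) {
            return false;
        }
        for (String item : s.split(",")) {
            String[] bounds = item.split("-");
            if (bounds.length > 2) {
                return false; // chained range such as "2-3-4"
            }
            // Assumption: the numbers are small enough for Integer.parseInt.
            if (bounds.length == 2
                    && Integer.parseInt(bounds[0]) >= Integer.parseInt(bounds[1])) {
                return false; // left side must be smaller than right side
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("2,5-7,9-13,15,17-20")); // true
        System.out.println(isValid("7-5"));                 // false
        System.out.println(isValid("2,"));                  // false
    }
}
```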
