JAVA REGEX :: Could you explain this?

My pattern is [a-z][\\*\\+\\-_\\.\\,\\|\\s]?\\b
My Result:
a__
not matched
a_.
pattern matched = a_
a._
pattern matched = a.
a..
pattern matched = a
Why is only my first input not matched?
Thanks in advance.
[ PS: got the same result with [a-z][\\*\\+\\-\\_\\.\\,\\|\\s]?\\b ]

Because, unlike the period ., the underscore _ is considered a word character; so a_ is one word, but a. is a word followed by punctuation.
So a__ matches a, then matches _, then fails to match a word boundary (since the next _ is part of the same word). Backtracking does not help: with the optional class skipped, there is still no word boundary between a and _.
a.. matches a, skips the optional character class, then matches the word boundary between the word a and the punctuation that follows.
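A minimal sketch that reproduces the four results from the question. The class and method names (WordBoundaryDemo, firstMatch) are made up for illustration; only java.util.regex is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordBoundaryDemo {
    // Same pattern as in the question, without the redundant escapes.
    static final Pattern PATTERN = Pattern.compile("[a-z][*+\\-_.,|\\s]?\\b");

    // Returns the first match in the input, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = PATTERN.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a__", "a_.", "a._", "a.."}) {
            String r = firstMatch(s);
            System.out.println(s + " -> " + (r == null ? "not matched" : r));
        }
    }
}
```

Only a__ fails, because both candidate positions for \b (after a and after the first _) sit between two word characters.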

With the regex rewritten in a "proper way", that is:
"[a-z][*+\\-_.,|\\s]?\\b"
Or, in an "unquoted", canonical way:
[a-z][*+\-_.,|\s]?\b
that your first input does not match is expected; a character class will only ever match one character. After it matches the first underscore, it looks for a word boundary, but cannot find one: for the Java regex engine, _ is a character which can be part of a word. Hence the result.

Related

regex match between two characters with or condition

I might be thinking of this wrong, but I'm trying to match everything between "_" characters, and I also need the last item (the datetime stamp). Here is the string:
StringText_62_590285_20200324082238.xml
Here is the regex I have started with (java):
\_(.*?)\_
but this only matches to: "_62_"
The result I'm trying to get to is to have 3 matches (62, 590285, 20200324082238)
Now that I'm thinking about this, am I approaching this wrong? This input string is going to be very consistent and maybe just match all strings that are numbers?
For the example provided, you may use this regex:
(?<=_)[^_.]+
RegEx Details:
(?<=_): Lookbehind to assert that we have a _ before current position
[^_.]+: Match 1+ of any character that is not a _ and not a dot
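A short Java sketch of the lookbehind pattern above, applied to the file name from the question. The class and method names (LookbehindDemo, findAll) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookbehindDemo {
    // (?<=_) requires an underscore just before the match; [^_.]+ stops at the next _ or dot.
    static final Pattern PATTERN = Pattern.compile("(?<=_)[^_.]+");

    static List<String> findAll(String input) {
        List<String> results = new ArrayList<>();
        Matcher m = PATTERN.matcher(input);
        while (m.find()) {
            results.add(m.group());
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(findAll("StringText_62_590285_20200324082238.xml"));
    }
}
```

The extension xml is not returned because it is preceded by a dot rather than an underscore.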
You can use the word boundaries with _ excluded:
(?<![^\W_])\d+(?![^\W_])
Details:
(?<![^\W_]) - immediately on the left, there can't be a letter or digit
\d+ - one or more digits
(?![^\W_]) - immediately on the right, there can't be a letter or digit.
See the Java demo:
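import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "StringText_62_590285_20200324082238.xml";
Pattern pattern = Pattern.compile("(?<![^\\W_])\\d+(?![^\\W_])");
Matcher matcher = pattern.matcher(s);
List<String> results = new ArrayList<>();
// Collect every run of digits that is not glued to a letter or another digit
while (matcher.find()) {
    results.add(matcher.group(0));
}
System.out.println(results); // => [62, 590285, 20200324082238]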
I actually suggest not using a regex in this case but two splits: first split on "_", which gives 4 chunks, take the last three, and then split the last element on ".".
This regex does the work anyways:
\d[0-9]*
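The two-split approach suggested in this answer could look roughly like the sketch below. It assumes the first chunk is always the non-numeric prefix and that only the last chunk carries a file extension; the class and method names (SplitDemo, extract) are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitDemo {
    static List<String> extract(String fileName) {
        // First split: "StringText", "62", "590285", "20200324082238.xml"
        String[] chunks = fileName.split("_");
        List<String> out = new ArrayList<>();
        // Skip the first chunk (assumed to be the text prefix).
        for (int i = 1; i < chunks.length; i++) {
            out.add(chunks[i]);
        }
        // Second split: drop everything from the first dot in the last element.
        int last = out.size() - 1;
        if (last >= 0) {
            out.set(last, out.get(last).split("\\.", 2)[0]);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(extract("StringText_62_590285_20200324082238.xml"));
    }
}
```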
Modifying your example, you can use something like this:
(?<=_)(.*?)(?=_|\.)
This will basically mean:
Match everything that is preceded by _
and followed by _ or .

Java Regexp to match words only (', -, space)

What is the Java Regular expression to match all words containing only :
From a to z and A to Z
The characters ', - and space, but these must not appear at the beginning or the end.
Examples
test'test match
test' doesn't match
'test doesn't match
-test doesn't match
test- doesn't match
test-test match
You can use the following pattern: ^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$
Below are the examples:
String s1 = "abc";
String s2 = " abc";
String s3 = "abc ";
System.out.println(s1.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
System.out.println(s2.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
System.out.println(s3.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
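Note that with String.matches the pattern above accepts only strings made up entirely of letters, so test'test and test-test are rejected as well. One possible alternative, sketched under the assumption that every ', - or space must sit between two runs of letters (so doubled separators are rejected too), is shown below. The class and method names (WordCheck, isValid) are hypothetical.

```java
import java.util.regex.Pattern;

final class WordCheck {
    // One or more letters, then any number of (separator + letters) groups.
    // A separator can therefore never be the first or last character.
    static final Pattern VALID = Pattern.compile("[a-zA-Z]+(?:[-' ][a-zA-Z]+)*");

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"test'test", "test'", "'test", "-test", "test-", "test-test"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```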
If you mean the space character, add it to the character class: [a-zA-Z ]
This checks that a character is a lowercase letter (a-z), an uppercase letter (A-Z), or a space. Any other character makes the test fail.
Here's my solution:
/(\w{2,}(-|'|\s)\w{2,})/g
You can take it for a spin on Regexr.
It first matches a word with \w, then any of the three separators using | as "or" logic, and then another word. The {2,} quantifiers make sure the words on either end are at least 2 characters long, so contractions like don't aren't captured. You could change the bounds to restrict the word lengths further, or omit them entirely.
Caveat: \w also matches digits and the underscore _. If you don't want that, you could replace it with [a-zA-Z], like so:
/([a-zA-Z]{2,}(-|'|\s)[a-zA-Z]{2,})/g
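The pattern above is written in JavaScript literal syntax. A rough Java equivalent, using Matcher.find() in a loop in place of the /g flag, might look like this. The class and method names (CompoundFinder, findAll) are made up, and the alternation (-|'|\s) is written as the equivalent character class [-'\s].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CompoundFinder {
    // Two runs of at least two letters, joined by a hyphen, apostrophe or whitespace.
    static final Pattern PATTERN = Pattern.compile("[a-zA-Z]{2,}[-'\\s][a-zA-Z]{2,}");

    static List<String> findAll(String input) {
        List<String> results = new ArrayList<>();
        Matcher m = PATTERN.matcher(input);
        while (m.find()) {
            results.add(m.group());
        }
        return results;
    }

    public static void main(String[] args) {
        // "don't" is skipped because only one letter follows the apostrophe.
        System.out.println(findAll("don't test-test well-known"));
    }
}
```

Because whitespace is one of the allowed separators, two ordinary words separated by a space are also reported as one match, which may or may not be what you want.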

Extract string without last char if vowel

With regular expressions, how can I extract the whole word except the last character if it is a vowel?
Inputs:
ansia
bello
ansid
Expected output for each:
ansi
bell
ansid
This is what I tried, but it only works if I have a single vowel at the end:
^(.*[^aeiou])
Similar to what @Sotirios Delimanolis wrote in his comment, but using word boundaries so that it works when there are multiple words on a line.
\b(\w+?)[aeiou]?\b
This works in the following way :
1) \b matches the start of a word. This will work for the first word on a line or a word preceded by a non-word character (a word character is any letter, digit, or underscore).
2) (\w+?) matches and captures the part of the word you care about.
2a) \w matches any word character.
2b) + makes the \w be matched one or more times
2c) ? makes the + match as few characters as possible. This is important because if there is a vowel at the end of the word we do not want to match it in the capturing group but instead let (3) take care of it.
3) [aeiou]? matches but does not capture a vowel character if one is present
3a) [aeiou] matches a vowel
3b) ? makes the [aeiou] be matched zero or one times
4) \b matches the end of the word. This will work for a word at the end of a line or a word followed by a non-word character.
You said that the tool you are using uses the Java regex implementation and ansid isn't working for you with my regex. I have tested it with pure Java and it seems to be working for me:
Pattern pattern = Pattern.compile("\\b(\\w+?)[aeiou]?\\b");
Matcher matcher = pattern.matcher("ansia ansid cake cat dog");
while (matcher.find()) {
System.out.println(matcher.group(1));
}
prints
ansi
ansid
cak
cat
dog
Try the regex (\b[a-zA-Z]+?(?=[aeiou]\b))|(\b[a-zA-Z]+?[^aeiou]\b). This captures either a word ending in a consonant OR a word ending in a vowel, and omits the vowel at the end.
This pattern worked for me:
^(.*?)(?=[aeiou]$|$)
If the input can contain several words on one line, as pointed out below, use this pattern instead:
\b([a-z]+?)(?=[aeiou]\b|\b)
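A small Java sketch of the word-boundary variant above, run against the three inputs from the question placed on one line. The class and method names (VowelTrim, stems) are hypothetical, and the pattern as given only handles lowercase letters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VowelTrim {
    // Lazy capture of the word, stopping before a final vowel or at the word end.
    static final Pattern PATTERN = Pattern.compile("\\b([a-z]+?)(?=[aeiou]\\b|\\b)");

    static List<String> stems(String line) {
        List<String> results = new ArrayList<>();
        Matcher m = PATTERN.matcher(line);
        while (m.find()) {
            results.add(m.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(stems("ansia bello ansid"));
    }
}
```

The trailing vowel itself is never consumed, and it cannot start a new match because there is no word boundary in front of it.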

What is the responsibility of (.*) in the Java String?

What is the responsibility of (.*) in the third line and how it works?
String Str = new String("Welcome to Tutorialspoint.com");
System.out.print("Return Value :" );
System.out.println(Str.matches("(.*)Tutorials(.*)"));
.matches() tests whether the whole of Str matches the regex provided.
Regex, or regular expressions, are a way of describing patterns in strings. In the example provided, the pattern matches any string that contains the word "Tutorials". (.*) simply means "a group of zero or more of any character".
This page is a good regex reference (for very basic syntax and examples).
Your expression matches any string that contains the word Tutorials, preceded and followed by any characters. .* means any character occurring any number of times, including zero.
The . is a regular expression metacharacter that means any character.
The * is a regular expression quantifier that means 0 or more occurrences of the expression it follows.
matches takes a regular expression string as its parameter, and (.*) means capture any character zero or more times, greedily.
.* means a group of zero or more of any character
In Regex:
.
Wildcard: Matches any single character except \n
for example pattern a.e matches ave in nave and ate in water
*
Matches the previous element zero or more times
for example pattern \d*\.\d matches .0, 19.9, 219.9
There is no reason to put parentheses around the .*, nor is there a reason to instantiate a String if you've already got a literal String. But worse is the fact that the matches() method is out of place here.
It greedily matches any characters from the start to the end of the String, then backtracks until it finds "Tutorials", after which it again matches any characters (except newlines).
It's better and more clear to use the find method. The find method simply finds the first "Tutorials" within the String, and you can remove the "(.*)" parts from the pattern.
As a one liner for convenience:
System.out.printf("Return value : %b%n", Pattern.compile("Tutorials").matcher("Welcome to Tutorialspoint.com").find());
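To make the difference between the two methods concrete, here is a small sketch comparing them on the string from the question. The class and helper names (MatchesVsFind, wholeMatch, containsMatch) are invented for illustration; String.matches and Matcher.find are the standard library calls discussed above.

```java
import java.util.regex.Pattern;

final class MatchesVsFind {
    // String.matches succeeds only if the regex matches the entire string.
    static boolean wholeMatch(String s, String regex) {
        return s.matches(regex);
    }

    // Matcher.find succeeds if the regex matches anywhere in the string.
    static boolean containsMatch(String s, String regex) {
        return Pattern.compile(regex).matcher(s).find();
    }

    public static void main(String[] args) {
        String text = "Welcome to Tutorialspoint.com";
        System.out.println(wholeMatch(text, "(.*)Tutorials(.*)")); // whole string covered
        System.out.println(wholeMatch(text, "Tutorials"));         // not the whole string
        System.out.println(containsMatch(text, "Tutorials"));      // found inside the string
    }
}
```

For a fixed word like this one, text.contains("Tutorials") would do the same job without any regex at all.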

Matching '_' and '-' in java regexes

I had this regex in java that matched either an alphanumeric character or the tilde (~)
^([a-z0-9])+|~$
Now I also have to add the characters - and _. I've tried a few combinations, none of which work, for example:
^([a-zA-Z0-9_-])+|~$
^([a-zA-Z0-9]|-|_)+|~$
Sample input strings that must match:
woZOQNVddd
00000
ncnW0mL14-
dEowBO_Eu7
7MyG4XqFz-
A8ft-y6hDu
~
Any clues / suggestion?
- is a special character within square brackets: it indicates a range. If it's not at either end of the character class, it needs to be escaped by putting a \ before it.
It's worth pointing out a shortcut: \w is equivalent to [0-9a-zA-Z_] so I think this is more readable:
^([\w-])+|~$
You need to escape the -, like \-, since it is a special character (the range operator). _ is ok.
So ^([a-z0-9_\-])+|~$.
Edit: your last input String will not match because the regular expression you are using matches a string of alphanumeric characters (plus - and _) OR a tilde (because of the pipe). But not both. If you want to allow an optional tilde on the end, change to:
^([a-z0-9_\-])+(~?)$
If you put the - first, it won't be interpreted as the range indicator.
^([-a-zA-Z0-9_])+|~$
This matches all of your examples except the last one using the following code:
String str = "A8ft-y6hDu ~";
System.out.println("Result: " + str.matches("^([-a-zA-Z0-9_])+|~$"));
That last example won't match because it doesn't fit your description. The regex will match any combination of alphanumerics, -, and _, OR a ~ character.
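A minimal sketch that checks the sample inputs from the question against a character class with the hyphen placed first, and against one with the hyphen placed last, to show that both positions are treated literally. The class and method names (HyphenClassDemo, isValid, isValidHyphenLast) are hypothetical. The ^ and $ anchors are omitted because Matcher.matches() already requires the whole input to match.

```java
import java.util.regex.Pattern;

final class HyphenClassDemo {
    // Hyphen first in the class: literal, not a range.
    static final Pattern HYPHEN_FIRST = Pattern.compile("[-a-zA-Z0-9_]+|~");
    // Hyphen last in the class: also literal.
    static final Pattern HYPHEN_LAST = Pattern.compile("[a-zA-Z0-9_-]+|~");

    static boolean isValid(String s) {
        return HYPHEN_FIRST.matcher(s).matches();
    }

    static boolean isValidHyphenLast(String s) {
        return HYPHEN_LAST.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"woZOQNVddd", "00000", "ncnW0mL14-", "dEowBO_Eu7",
                            "7MyG4XqFz-", "A8ft-y6hDu", "~", "A8ft-y6hDu ~"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s) + " / " + isValidHyphenLast(s));
        }
    }
}
```

Only the final sample, which mixes the alphanumeric form with a space and a tilde, is rejected by both patterns.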
