How do I change the following pattern to find all possible matches? - java

I have a Java pattern here
String patternString = "(#)(.+?)([\\s,#.])";
I basically want to find all words beginning with a '#' in a given text string. The pattern matches all words except the last one if it is followed by an end line. I am using a hash map to store the values.
int x = 0;
HashMap<Integer, String> values = new HashMap<>();
while(matcher.find()) {
values.put(x++, matcher.group(2));
}
I have tried putting a '$' symbol in the third group to match the end of the line, but it doesn't seem to work. How do I tweak the pattern so that it matches all words beginning with a '#', including the last one?

Unless I misunderstood your requirements, it can be much simpler. I'd suggest using the following pattern:
(#)([^\s]+)
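It matches a # followed by as many non-whitespace characters as possible. With the pattern as written, group 1 captures the # and group 2 still holds the word, so your code can keep using group 2. If you drop the parentheses around the # (that is, #([^\s]+)), the word moves to group 1.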
Depending on your exact requirements you can also use \w instead of [^\s] to match any word character (equivalent to [a-zA-Z0-9_]).
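As a rough illustration of the suggestion above, here is a minimal sketch. It assumes the '#' is left uncaptured, so the word ends up in group 1; the class name HashtagFinder and the method findTags are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashtagFinder {
    // '#' followed by one or more non-whitespace characters.
    // Only the word is captured, so it is available as group 1.
    private static final Pattern TAG = Pattern.compile("#(\\S+)");

    static List<String> findTags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(text);
        while (matcher.find()) {
            tags.add(matcher.group(1));
        }
        return tags;
    }

    public static void main(String[] args) {
        // The last tag is found even though nothing follows it.
        System.out.println(findTags("some #java text with #regex and #last"));
    }
}
```

Because the pattern no longer requires a delimiter after the word, a tag at the very end of the input is matched as well. Swapping \\S+ for \\w+ would stop each tag at the first non-word character, such as a comma or a period.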

Related

How to match all combinations of numbers in a string that do not start with an English letter in regular matching in Java

I have a String like
String str = "305556710S or 100596269C OR CN111111111";
I just want to match the tokens in this string that start with digits, or that start with digits and end with English letters,
and then prefix each matched token with two "?" characters.
I wrote a Pattern like
Pattern pattern = Pattern.compile("^[0-9]{1,10}[A-Z]{0,1}", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
int start = matcher.start();
int end = matcher.end();
String matchStr = matcher.group();
System.err.println(matchStr);
}
But it can only match the first token, "305556710S".
But if I modify the Pattern:
Pattern pattern = Pattern.compile("[0-9]{1,10}[A-Z]{0,1}", Pattern.CASE_INSENSITIVE);
it matches "305556710S", "100596269C" and "111111111". But "111111111" is prefixed by the English letters "CN", which is not my goal.
I only want to match "305556710S" and "100596269C" and add two "?" characters before each match. Can somebody help me?
First, you should avoid the ^ in this particular regexp. As you noticed, it can't return more than one result, because ^ is an instruction for "match at the beginning of the string".
Using \b can be a solution, but you may get invalid results. For example, given
305556710S or -100596269C OR CN111111111
the regexp "\\b[0-9]{1,10}[A-Z]{0,}\\b" will match 100596269C (because the hyphen is not a word character, so there is a word boundary between - and 1).
The following regexp matches exactly what you want: all numbers, that may be followed by some English chars, either at the beginning of the string or after a space, and either followed by a space or at the end of the string.
(?<=^| )[0-9]{1,10}[A-Z]*(?= |$)
Explanations:
(?<=^| ) is a lookbehind. It makes sure that there is either ^ (string start) or a space behind the current position. Note that lookbehinds don't add matching chars to the result: the space won't be part of the result.
[0-9]{1,10}[A-Z]* matches digits (at least one, up to ten), then zero or more letters.
(?= |$) is a lookahead. It makes sure that there is either a space or $ (end of string) after this match. Like lookbehinds, the chars aren't added to the result and the position remains the same: the space read here, for example, can also be read by the lookbehind of the next match.
Examples: 305556710S or 100596269C OR CN111111111 matches at index 0 [305556710S] and at index 14 [100596269C]; 100596269C123 does not match.
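The question also asks for a "??" prefix on each match. One possible way to do that with the lookaround pattern is sketched below; the class name NumberPrefixer and the method addPrefix are hypothetical, and the pattern is used without CASE_INSENSITIVE, so only upper-case letters are accepted.

```java
class NumberPrefixer {
    // 1 to 10 digits, optional trailing upper-case letters, delimited by
    // string start or a space on the left and a space or string end on the right.
    private static final String TOKEN = "(?<=^| )[0-9]{1,10}[A-Z]*(?= |$)";

    static String addPrefix(String input) {
        // "$0" in the replacement refers to the whole match;
        // '?' has no special meaning in a replacement string.
        return input.replaceAll(TOKEN, "??$0");
    }

    public static void main(String[] args) {
        System.out.println(addPrefix("305556710S or 100596269C OR CN111111111"));
        // prints: ??305556710S or ??100596269C OR CN111111111
    }
}
```

Prepending (?i) to the pattern would be one way to accept lower-case letters too, if that is required.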
I think you need to use word boundaries \b. Try this changed pattern:
"\\b[0-9]{1,10}[A-Z]{0,1}\\b"
This prints out:
305556710S
100596269C
Why it works:
The difference here is that it will check only those character sequences that are within a pair of word boundaries. In the earlier pattern you used, a character sequence even from the middle of a word may be used to match against the pattern due to which even 11111... from CN1111... was matched against the pattern and it passed.
A word boundary also matches the end of the string input. So, even if a candidate word appears at the end of the line, it will get picked up.
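To see how \b behaves next to punctuation, the following sketch runs the word-boundary pattern and the space-delimited lookaround pattern quoted earlier in this thread over the same input. The class name BoundaryComparison and the helper findAll are hypothetical, and the input with a hyphen is just a test string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryComparison {
    // Collects every match of the given regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "305556710S or -100596269C OR CN111111111";

        // \b sees a boundary between '-' and '1', so the second number is accepted.
        System.out.println(findAll("\\b[0-9]{1,10}[A-Z]?\\b", input));
        // prints: [305556710S, 100596269C]

        // The lookarounds insist on a space or the string edge, so it is rejected.
        System.out.println(findAll("(?<=^| )[0-9]{1,10}[A-Z]?(?= |$)", input));
        // prints: [305556710S]
    }
}
```

Which behaviour is right depends on whether a number that touches punctuation should count as a match.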
If more than one English letter can come at the end, then remove the max occurrence indicator, 1 in this case:
"\\b[0-9]{1,10}[A-Z]{0,}\\b"

regex break string into words in dictionary

I want to create a regex in order to break a string into words from a dictionary. If the string matches, I can iterate over each group and make some changes. Some of the words are prefixes of others. However, a regex like /(HH|HH12)+/ will not match the string HH12HH. What's wrong with the regex? Shouldn't it match the first HH12 and then HH in the string?
You want to match an entire string in Java that should only contain HH12 or HH substrings. It is much easier to do in 2 steps: 1) check if the string meets the requirements (here, with matches("(?:HH12|HH)+")), 2) extract all tokens (here, with HH12|HH or HH(?:12)?, since the first alternative in an unanchored alternation group "wins" and the rest are not considered).
String str = "HH12HH";
Pattern p = Pattern.compile("HH12|HH");
List<String> res = new ArrayList<>();
if (str.matches("(?:HH12|HH)+")) { // If the whole string consists of the defined values
Matcher m = p.matcher(str);
while (m.find()) {
res.add(m.group());
}
}
System.out.println(res); // => [HH12, HH]
An alternative is a regex that will check if a string meets the requirements with a lookahead at the beginning, and then will match consecutive tokens with a \G operator:
String str = "HH12HH";
Pattern p = Pattern.compile("(\\G(?!^)|^(?=(?:HH12|HH)+$))(?:HH12|HH)");
List<String> res = new ArrayList<>();
Matcher m = p.matcher(str);
while (m.find()) {
res.add(m.group());
}
System.out.println(res);
Details:
(\\G(?!^)|^(?=(?:HH12|HH)+$)) - the end of the previous successful match (\\G(?!^)) or (|) start of string (^) that is followed with 1+ sequences of HH12 or HH ((?:HH12|HH)+) up to the end of string ($)
(?:HH12|HH) - either HH12 or HH.
In the string HH12HH, the regex (HH|HH12)+ will work this way:
HH12HH
^ - both options work, continue
HH12HH
 ^ - first alternative is entirely satisfied, mark it as a match
HH12HH
  ^ - no match
HH12HH
   ^ - no match
As you set the A flag, which anchors the pattern at the start of the string, the rest will not produce a match. If you remove it, the pattern will match HH both at the start and at the end.
In this case, you have three options:
Put the longest pattern first: /(HH12|HH)/Ag. This is the one I prefer.
Factor out the shared part and use an optional group: /(HH(?:12)?)/Ag.
Put a $ at the end, like so: /(HH|HH12)$/Ag
The problem you are having is entirely related to the way the regex engine decides what to match.
Some regex flavors pick the longest alternative, but you're not using one. Java's regex engine is the other type: the first alternative that matches is used.
Your regex works a lot like this code:
if(bool1){
// This is where `HH` matches
} else if (bool1 && bool2){
// This is where `HH12` would match, but this code will never execute
}
The best way to fix this is to order your words in reverse, so that HH12 occurs before HH.
Then, you can just match with an alternation:
HH12|HH
It should be pretty obvious what matches, since you can get the results of each match.
(You could also put each word in its own capture group, but that's a bit harder to work with.)
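The effect of alternative order can be checked with a small sketch like the one below. The class AlternationOrder and its helpers are hypothetical names; sorting the dictionary by length in descending order is one assumed way of making sure that a word always comes before any of its prefixes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrder {
    // Builds an alternation with the longest words first, quoting each word
    // so that regex metacharacters in the dictionary are taken literally.
    static Pattern longestFirst(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingInt(String::length).reversed());
        StringJoiner alternation = new StringJoiner("|");
        for (String word : sorted) {
            alternation.add(Pattern.quote(word));
        }
        return Pattern.compile(alternation.toString());
    }

    // Collects every match found by repeated find() calls.
    static List<String> tokens(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Shorter alternative first: HH wins and "12" is skipped over.
        System.out.println(tokens(Pattern.compile("HH|HH12"), "HH12HH")); // [HH, HH]
        // Longest first: HH12 is tried before HH.
        System.out.println(tokens(longestFirst(Arrays.asList("HH", "HH12")), "HH12HH")); // [HH12, HH]
    }
}
```

Note that find() silently skips characters that match no word, so a separate whole-string check is still needed if the input must consist of dictionary words only.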

Regex in Java: Capture last {n} words

Hi I am trying to do regex in java, I need to capture the last {n} words. (There may be a variable num of whitespaces between words). Requirement is it has to be done in regex.
So e.g. in
The man is very tall.
For n = 2, I need to capture
very tall.
So I tried
(\S*\s*){2}$
But this does not match in java because the initial words have to be consumed first. So I tried
^(.*)(\S*\s*){2}$
But .* consumes everything, and the last 2 words are ignored.
I have also tried
^\S?\s?(\S*\s*){2}$
Anyone know a way around this please?
You had almost got it in your first attempt.
Just change + to *.
The plus sign means at least one character; because there wasn't any space, the match failed.
The asterisk, on the other hand, means zero or more, so it will work.
Look it live here: (?:\S*\s*){2}$
Using replaceAll method, you could try this regex: ((?:\\S*\\s*){2}$)|.
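For a variable n, a pattern of this shape can be built at run time and applied with find(). The sketch below is a variation rather than the exact pattern above: it uses a possessive \S++ and a (?<!\S) lookbehind (both my additions) so that a match always starts at the beginning of a word and words are never split. The class LastWords and the method lastWords are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastWords {
    // Returns the last n whitespace-separated words of text (with the
    // whitespace between them), or "" if the text has fewer than n words.
    static String lastWords(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        // (?<!\S)  : the match must not start in the middle of a word
        // \S++\s*  : one whole word (possessive) plus any trailing whitespace
        // {n}$     : exactly n such words, ending at the end of the input
        Pattern pattern = Pattern.compile("(?<!\\S)(?:\\S++\\s*){" + n + "}$");
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(lastWords("The man is very tall.", 2)); // very tall.
    }
}
```

Any whitespace after the final word is kept in the result, so a trim() may be wanted on top.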
Your regex contains, as you already mention, a greedy subpattern that eats up the whole string, and since (\S*\s*){2} can match an empty string, it matches an empty location at the end of the input string.
Lazy dot matching (changing .* to .*?) won't do the whole job since the capturing group is quantified, and the Matcher.group(1) will be set to the last captured non-whitespaces with optional whitespaces. You need to set the capturing group around the quantified group.
Since you most likely are using Matcher#matches, you can use
String str = "The man is very tall.";
Pattern ptrn = Pattern.compile("(.*?)((?:\\S*\\s*){2})"); // no need for `^`/`$` with matches()
Matcher matcher = ptrn.matcher(str);
if (matcher.matches()) { // Group 2 contains the last 2 "words"
System.out.println(matcher.group(2)); // => very tall.
}

capture all characters between match character (single or repeated) on string

I'm trying to extract the string preceding a specific character, even when the character is repeated (e.g. underscore '_'), like this:
this_is_my_example_line_0
this_is_my_example_line_1_
this_is_my_example_line_2___
_this_is_my_ _example_line_3_
__this_is_my___example_line_4__
and after running my regex I should get this (the regex should ignore any instances of the matching character in the middle of the string):
this_is_my_example_line_0
this_is_my_example_line_1
this_is_my_example_line_2
this_is_my_ _example_line_3
this_is_my___example_line_4
In other words I'm trying to 'trim' the matched character(s) at the beginning and end of string.
I'm trying to use a Regex in Java to accomplish this, my idea is to capture the group of characters between the special character(s) at the end or beginning of the line.
So far I can only do this successfully for example 3 with this regexp:
/[^_]+|_+(.*)[_$]+|_$+/
[^_]+ not 'underscore' once or more
| OR
_+ underscore once or more
(.*) capture all characters
[_$]+ not 'underscore' once or more followed by end of line
|_$+ OR 'underscore' once or more followed by end of line
I just realized that this excludes the first word of the message in examples 0, 1 and 2, since the string doesn't start with an underscore and it only starts matching after finding an underscore.
Is there an easier way not involving regex?
I don't really care about the first character (although it would be nice); I only need to ignore the repeating character at the end. It looks like (according to this regex tester) just doing this would work: /()_+$/. The empty parentheses match anything before a single or repeated match at the end of the line. Would that be correct?
Thank you!
There are a couple of options here: you could either replace matches of ^_+|_+$ with an empty string, or extract the contents of the first capture group from the match of ^_*(.*?)_*$. Note that if your strings may span multiple lines and you want to perform the replacement on each line, then you will need to use the Pattern.MULTILINE flag for either approach. If your strings may span multiple lines and you only want the replacement to occur at the very beginning and end, don't use Pattern.MULTILINE but use Pattern.DOTALL for the second approach.
For example: http://regexr.com?355ff
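Both options could look roughly like this in Java. It is a sketch under the assumption that the replace variant should trim every line of a multi-line string (hence Pattern.MULTILINE) and that the capture variant receives a single line; the class UnderscoreTrimmer and its method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnderscoreTrimmer {
    // Underscores at the start or end of each line (MULTILINE makes ^ and $ per-line).
    private static final Pattern EDGES = Pattern.compile("^_+|_+$", Pattern.MULTILINE);
    // Leading underscores, the lazily matched core in group 1, trailing underscores.
    private static final Pattern CORE = Pattern.compile("^_*(.*?)_*$");

    // Option 1: delete the leading and trailing underscores on every line.
    static String trimByReplace(String text) {
        return EDGES.matcher(text).replaceAll("");
    }

    // Option 2: extract group 1 from a single line; returns the input
    // unchanged if it does not match (e.g. it contains a line break).
    static String trimByCapture(String line) {
        Matcher matcher = CORE.matcher(line);
        return matcher.matches() ? matcher.group(1) : line;
    }

    public static void main(String[] args) {
        System.out.println(trimByReplace("this_is_my_example_line_2___\n__this_is_my___example_line_4__"));
        System.out.println(trimByCapture("_this_is_my_ _example_line_3_"));
    }
}
```

Underscores in the middle of a line are left alone by both variants, because only the anchored runs are removed or skipped.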
How about this pattern: [^_\n\r](.*[^_\n\r])?
Demo
String data=
"this_is_my_example_line_0\n" +
"this_is_my_example_line_1_\n" +
"this_is_my_example_line_2___\n" +
"_this_is_my_ _example_line_3_\n" +
"__this_is_my___example_line_4__";
Pattern p=Pattern.compile("[^_\n\r](.*[^_\n\r])?");
Matcher m=p.matcher(data);
while(m.find()){
System.out.println(m.group());
}
output:
this_is_my_example_line_0
this_is_my_example_line_1
this_is_my_example_line_2
this_is_my_ _example_line_3
this_is_my___example_line_4

Need regex to separate comma separated values (Interface list from a router query output)

I have an input like this
RX Only : Gi1/0/15,Gi1/0/20,Gi1/0/17
I want to capture 1/0/15, 1/0/20, 1/0/17 from this. But this input changes. Sometimes there are only 2 comma separated values, sometimes 1 sometimes more than 3.
The regex I came up with only captures the first group. If I use the non-greedy operator, then it captures last. What regex should I use to capture all these groups separately.
The language used would be Java.
It's often easier to just write the regex for the substrings you are interested in, then repeatedly use Matcher.find(), as opposed to trying to write a regex that matches the entire string and pulling what you want from a complex arrangement of groups.
Assuming what you are looking for are triples of numbers separated by "/", then:
Pattern p = Pattern.compile("\\d+/\\d+/\\d+");
Matcher m = p.matcher(inputString);
while (m.find()) {
// your triple is in group 0
System.out.println(m.group(0));
}
Give a man a fish ... or
http://gskinner.com/RegExr/
Do you really have to use regex here? If the data formats are quite similar, you can just use the indexOf function combined with substring. You will have to find the : character and start looking for commas from the next character. Then you check the position of \n and use the smaller index in order to retrieve a substring.
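A regex-free version along those lines might look like the sketch below. It assumes a single input line with the list after the first ':' and drops any leading non-digit characters (such as "Gi") from each item; the class InterfaceListParser and the method parsePorts are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class InterfaceListParser {
    // Splits the part after the first ':' on commas and strips the
    // interface-type prefix (any leading non-digit characters) from each item.
    static List<String> parsePorts(String line) {
        List<String> ports = new ArrayList<>();
        int colon = line.indexOf(':');
        if (colon < 0) {
            return ports; // no list on this line
        }
        int pos = colon + 1;
        while (true) {
            int comma = line.indexOf(',', pos);
            int end = (comma < 0) ? line.length() : comma;
            String item = line.substring(pos, end).trim();

            // Skip leading letters such as "Gi" until the first digit.
            int firstDigit = 0;
            while (firstDigit < item.length() && !Character.isDigit(item.charAt(firstDigit))) {
                firstDigit++;
            }
            if (firstDigit < item.length()) {
                ports.add(item.substring(firstDigit));
            }

            if (comma < 0) {
                break; // last item processed
            }
            pos = comma + 1;
        }
        return ports;
    }

    public static void main(String[] args) {
        System.out.println(parsePorts("RX Only : Gi1/0/15,Gi1/0/20,Gi1/0/17"));
        // prints: [1/0/15, 1/0/20, 1/0/17]
    }
}
```

The loop handles any number of items, including one. For input that spans several lines, each line would have to be passed in separately.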
