regex expression to match but exclude from result - java

Hi, I have the following text in a log file:
projectId:1 name:John
projectId:63232 name:Sam
telno:0232453242323
The regex should return only:
1
63232
Currently I have the regex projectId:\d*, which also includes the unwanted 'projectId:' in the match. How do I omit that from the final matches?
Using the suggested solution in Java:
String term = "ProjectId:11414084 Title:Recherche partenariat";
String regex = "(?<=ProjectId:)\\d*";
Pattern r = Pattern.compile(regex);
Matcher m = r.matcher(term);
m.matches();
m.group();
This throws the following exception:
Exception in thread "main" java.lang.IllegalStateException: No match found
at java.util.regex.Matcher.group(Matcher.java:536)
at java.util.regex.Matcher.group(Matcher.java:496)

Use the global flag, so that matching does not stop at the first match.
How the global flag is set depends on the programming language. If you tell us more about your usage, I can give further details.
I see you updated your question.
To retrieve only the number, use a positive lookbehind like this:
(?<=projectId:)\d*
Here is a regex101 example.

Use lookbehind:
(?<=projectId:)\d+
Look-aheads and look-behinds let you require text before or after the match without that text becoming part of the match itself.
Demo.
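The IllegalStateException in the question comes from calling matches(), which succeeds only if the whole input matches the pattern, so group() has nothing to return. find() scans for the next occurrence instead. A minimal sketch, assuming the log text is already in a String (the class and method names are only illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // Lookbehind: "projectId:" must precede the digits but is not part of the match.
    private static final Pattern PROJECT_ID = Pattern.compile("(?<=projectId:)\\d+");

    // Collects every project id found in the given text.
    static List<String> projectIds(String text) {
        List<String> ids = new ArrayList<>();
        Matcher m = PROJECT_ID.matcher(text);
        while (m.find()) { // find() scans forward; matches() would need the whole input to match
            ids.add(m.group());
        }
        return ids;
    }

    public static void main(String[] args) {
        String log = "projectId:1 name:John\nprojectId:63232 name:Sam\ntelno:0232453242323";
        System.out.println(projectIds(log)); // [1, 63232]
    }
}
```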

Rather than excluding anything from a match, you may capture some specific part of it using a pair of unescaped parentheses:
\bprojectId:(\d*)
            ^   ^
or - if there must be at least one digit:
\bprojectId:(\d+)
               ^
Now, the value you need is in Group 1. See the regex demo.
Note that I added \b, which is a word boundary, so nonprojectId:234 won't get matched now.
See more about capturing groups here.
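In Java, the captured value is read with group(1). A short sketch along those lines, using sample text modeled on the question; the class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupDemo {
    // Group 1 holds the digits; \b keeps "projectId" from matching inside a longer word.
    private static final Pattern PROJECT_ID = Pattern.compile("\\bprojectId:(\\d+)");

    static List<String> projectIds(String text) {
        List<String> ids = new ArrayList<>();
        Matcher m = PROJECT_ID.matcher(text);
        while (m.find()) {
            ids.add(m.group(1)); // group() alone would still include the "projectId:" prefix
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(projectIds("projectId:1 name:John projectId:63232 name:Sam")); // [1, 63232]
        System.out.println(projectIds("nonprojectId:234"));                               // []
    }
}
```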

Related

remove part of matcher after the match in regex pattern

I need help writing a regex pattern that removes only part of the match from the original string.
Original String: 2017-02-15T12:00:00.268+00:00
Expected String: 2017-02-15T12:00:00+00:00
The expected string drops the milliseconds.
My regex pattern looks like this: (:[0-5][0-9])\.[0-9]{1,3}
I need this regex to make sure I remove only the milliseconds from a time field, not everything that comes after a dot. But with the regex above I also remove the part captured before the dot. Please suggest a fix.
You have defined a capturing group with (...) in your pattern, and you want to have that part of string to be present after the replacement is performed. All you need is to use a backreference to the value stored in this capture. It can be done with $1:
String s = "2017-02-15T12:00:00.268+00:00";
String res = s.replaceFirst("(:[0-5][0-9])\\.[0-9]{1,3}", "$1");
System.out.println(res); // => 2017-02-15T12:00:00+00:00
See the Java demo and a regex demo.
The $1 in the replacement pattern tells the regex engine it should look up the captured group with ID 1 in the match object data. Since you only have one pair of unescaped parentheses (1 capturing group) the ID of the group is 1.
Change your pattern to (?::[0-5][0-9])(\.[0-9]{1,3}), call find() on the matcher, and remove whatever it captures in group(1).
The backslash makes the pattern match a literal '.' character instead of any character, which is what an unescaped dot means in a regex.
The (?: starts a non-capturing group, so that part is not numbered and does not show up in group(...) on the matcher.
Wrapping the part you want in parentheses makes it available as a group in the matcher, in this case the first group.
A good reference is the Pattern javadoc: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
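One way to read "remove what group(1) finds" is to cut the span of that group out of the original string. A rough sketch under that assumption (the helper name stripMillis is hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StripMillisDemo {
    // (?:...) checks for the seconds without capturing them; group 1 is the ".268" part.
    private static final Pattern MILLIS = Pattern.compile("(?::[0-5][0-9])(\\.[0-9]{1,3})");

    static String stripMillis(String timestamp) {
        Matcher m = MILLIS.matcher(timestamp);
        if (!m.find()) {
            return timestamp; // no fractional seconds present
        }
        // Keep everything before and after the span covered by group 1.
        return timestamp.substring(0, m.start(1)) + timestamp.substring(m.end(1));
    }

    public static void main(String[] args) {
        System.out.println(stripMillis("2017-02-15T12:00:00.268+00:00")); // 2017-02-15T12:00:00+00:00
    }
}
```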
Use the $1 and $2 group references in the replacement:
string.replaceAll("(.*)\\.\\d{1,3}(.*)","$1$2");

Regex: Match group if present otherwise ignore and proceed with other matches

I have been trying to match a regex pattern within the following data:
String:
TestData to 1colon delimiter list has 1 rows.Set...value is: 1 Save Error: 267467374736437-TestInfo send Error
Words to match:
TestData
267467374736437-TestInfo
The regex pattern I'm using:
(.+?\s)?.*(\s\d+-.*?\s)?
The catch is that the 2nd match (267467374736437-TestInfo) can be absent from the string. So I want it captured if it exists, and otherwise the regex should proceed with the other matches. That is why I added the zero-or-one quantifier ? to the group above. But then the 2nd group is ignored altogether.
If I use the pattern below:
(.+?\s)?.*(\s\d+-.*?\s)
it matches just fine, but it fails when "267467374736437-TestInfo" is missing from the input string, because the group no longer has the "?" quantifier.
Please help me understand where is it going wrong.
I would rather not use a complex regex, which will be ugly and a maintenance nightmare. Instead, one simple way would be to just split the string and grab the first term, and then use a smart regex to pinpoint the second term.
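String input = "TestData to 1colon delimiter list has 1 rows.Set...value is: 1 Save Error: 267467374736437-TestInfo send Error";
// The first term is everything before the first space.
String first = input.split(" ")[0];
// Keep only the token after "Save Error: "; if that text is absent, replaceAll leaves the input unchanged.
String second = input.replaceAll(".*Save Error:\\s(\\S+).*", "$1");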
Explore the regex:
Regex101
An optional pattern at the end will almost never be matched if a more generic pattern precedes it. In your case, the greedy .* grabs the whole rest of the line, and since the last pattern is optional, the regex engine calls it a day and does not backtrack to make room for it.
If you had a lazy dot .*?, the only position where it would work is right after the preceding subpattern, which is rarely the case.
Thus, you can only rely on a tempered greedy token:
^(\S+)(?:(?!\d+-\S).)*(\d+-\S+)?
See the regex demo.
Or an unrolled version:
^(\S+)\D*(?:\d(?!\d*-\S)\D*)*(\d+-\S+)?
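For reference, here is roughly how the tempered greedy token pattern could be used from Java. This is only a sketch: the extract helper is hypothetical, and the second input is the question's string cut short to show the case where the optional part is missing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupDemo {
    // Tempered greedy token: consume a character only while no "digits-nonspace" token starts here.
    private static final Pattern P =
            Pattern.compile("^(\\S+)(?:(?!\\d+-\\S).)*(\\d+-\\S+)?");

    // Returns {firstWord, idToken}; idToken is null when the optional group did not participate.
    static String[] extract(String line) {
        Matcher m = P.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String with = "TestData to 1colon delimiter list has 1 rows.Set...value is: 1 Save Error: 267467374736437-TestInfo send Error";
        String without = "TestData to 1colon delimiter list has 1 rows.";
        System.out.println(String.join(" | ", extract(with))); // TestData | 267467374736437-TestInfo
        System.out.println(extract(without)[1]);               // null
    }
}
```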

Can't get a match for regular expression in Java

This is an example of the string I want to extract data from:
<span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#B82933;font-size:120%' href='/cartelera/pelicula/18312'>Español </a></span><br><span style='display:block;margin-bottom:3px;'><a style='margin:4px;color:#FBEBC4;font-size:120%' href='/cartelera/pelicula/18313'>Subtitulada </a></span><br> </div>
And this is the regular expression I'm using for it:
"pelicula/([0-9]*)'>([\\w\\s]*)</a>"
I tested this regular expression in RegexPlanet, and it turned out OK, it gave me the expected result:
group(1) = 18313
group(2) = Subtitulada
But when I try to implement that regular expression in Java, it won't match anything. Here's the code:
Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>");
Matcher matcher = pattern.matcher(inputLine);
while (matcher.find()) {
    version = matcher.group(2);
}
What's the problem? The regular expression has already been tested, and the same code searches for several other patterns; I'm only having trouble with two of them (I'm showing just one here). Thank you in advance!
EDIT
I discovered the problem. The page source in the browser shows everything, but when I fetch the page from Java I get different source. Why? Because this page asks for your city so that it can show information for it. I don't know whether there's a workaround for getting at the information I want, but that's the cause.
Your regex is correct but it seems \w does not match ñ.
I changed the regex to
"pelicula/([0-9]*)'>(.*?)</a>"
and it seems to match both the occurrences.
Here I've used the reluctant *? quantifier to prevent .* from matching everything between the first <a> and the last </a>.
See What is the difference between `Greedy` and `Reluctant` regular expression quantifiers? for explanation.
@Bohemian is correct in pointing out that you might also need to enable the Pattern.DOTALL flag if the text inside <a> contains line breaks.
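By default, Java's \w covers only [a-zA-Z_0-9], which is why the first link text fails with [\w\s]*. As an alternative not mentioned above, the original pattern can also be compiled with Pattern.UNICODE_CHARACTER_CLASS. A small sketch comparing the three variants on a shortened copy of the markup (the class and helper names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // Returns the trimmed link texts (group 2) found by the given pattern.
    static List<String> titles(Pattern p, String html) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            out.add(m.group(2).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Shortened markup from the question; \u00f1 is the n-with-tilde letter.
        String html = "<a href='/cartelera/pelicula/18312'>Espa\u00f1ol </a></span><br>"
                    + "<a href='/cartelera/pelicula/18313'>Subtitulada </a>";
        String original = "pelicula/([0-9]*)'>([\\w\\s]*)</a>";

        Pattern ascii = Pattern.compile(original);
        Pattern unicode = Pattern.compile(original, Pattern.UNICODE_CHARACTER_CLASS);
        Pattern reluctant = Pattern.compile("pelicula/([0-9]*)'>(.*?)</a>");

        System.out.println(titles(ascii, html));     // only Subtitulada
        System.out.println(titles(unicode, html));   // both titles
        System.out.println(titles(reluctant, html)); // both titles
    }
}
```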
If your input is over several lines (ie it contains newline characters) you'll need to turn on "dot matches newline".
There are two ways to do this:
Use the "dot matches newline" regex switch (?s) in your regex:
Pattern pattern = Pattern.compile("(?s)pelicula/([0-9]*)'>([\\w\\s]*)</a>");
or use the Pattern.DOTALL flag in the call to Pattern.compile():
Pattern pattern = Pattern.compile("pelicula/([0-9]*)'>([\\w\\s]*)</a>", Pattern.DOTALL);
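One caveat: the flag only changes what . matches, and [\w\s]* already accepts line breaks through \s, so DOTALL makes a difference for a dot-based pattern such as (.*?). A small sketch under that assumption, with a made-up input whose link text spans two lines:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Returns group 2 of the first match, or null if the pattern does not match.
    static String firstTitle(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input in which the link text is split across two lines.
        String html = "<a href='/cartelera/pelicula/18313'>Subtitulada\n3D</a>";
        String regex = "pelicula/([0-9]*)'>(.*?)</a>";

        Pattern plain = Pattern.compile(regex);
        Pattern dotAll = Pattern.compile(regex, Pattern.DOTALL);
        Pattern inline = Pattern.compile("(?s)" + regex);

        System.out.println(firstTitle(plain, html));  // null: '.' stops at the line break
        System.out.println(firstTitle(dotAll, html)); // the two-line link text
        System.out.println(firstTitle(inline, html)); // same result as with DOTALL
    }
}
```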

Reason for the behavior of the reluctant quantifier ?? in java regex

I know that ? is a greedy quantifier and ?? is its reluctant counterpart.
When I use it as follows, it always gives me empty matches. Is that because it always works from left to right (trying zero occurrences first, then one occurrence), or is there another reason?
Pattern pattern = Pattern.compile("a??");
Matcher matcher = pattern.matcher("aba");
while (matcher.find()) {
    System.out.println(matcher.start() + "[" + matcher.group() + "]" + matcher.end());
}
Output :
0[]0
1[]1
2[]2
3[]3
Your regex could be explained as follows: "try to match zero characters, and if that fails try to match one 'a' character".
Trying to match zero characters will always succeed, so there is really no purpose for a regex that only contains a single reluctant element.
I'm not sure about the Java implementation but regular-expressions.info states this for ?? :
Makes the preceding item optional. Lazy, so the optional item is excluded in the match if possible. This construct is often excluded from documentation because of its limited use.
Thus you get 4 matches (3 character positions + the empty string at the end), and the optional a is excluded from each of them.
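To see the difference, it helps to run the greedy and reluctant forms side by side, and to add something after a?? that forces it to give up the empty match. A small sketch (the matches helper is just for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantOptionalDemo {
    // Lists every match as start[text]end, the same format as in the question.
    static List<String> matches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "[" + m.group() + "]" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(matches("a?", "aba"));   // [0[a]1, 1[]1, 2[a]3, 3[]3]
        System.out.println(matches("a??", "aba"));  // [0[]0, 1[]1, 2[]2, 3[]3]
        System.out.println(matches("a??b", "aba")); // [0[ab]2]
    }
}
```

In the last case the empty attempt fails because b cannot match at position 0, so the engine backtracks and lets a?? consume the a.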

How can I make a Java regex all or nothing?

I'm trying to make a regex all or nothing in the sense that the given word must EXACTLY match the regular expression - if not, a match is not found.
For instance, if my regex is:
^[a-zA-Z][a-zA-Z|0-9|_]*
Then I would want to match:
cat9
cat9_
bob_____
But I would NOT want to match:
cat7-
cat******
rango78&&
I want my regex to be as strict as possible, going for an all or nothing approach. How can I go about doing that?
EDIT: To make my regex absolutely clear, a pattern must start with a letter, followed by any number of numbers, letters, or underscores. Other characters are not permitted. Below is the program in question I am using to test out my regex.
Pattern p = Pattern.compile("^[a-zA-Z][a-zA-Z|0-9|_]*");
Scanner in = new Scanner(System.in);
String result = "";
while (!result.equals("-1")) {
    result = in.nextLine();
    Matcher m = p.matcher(result);
    if (m.find()) {
        System.out.println(result);
    }
}
I think that if you use String.matches(regex), then you will get the effect you are looking for. The documentation says that matches() will return true only if the entire string matches the pattern.
The regex won't match the second example. It's already strict, since * and & are not in the allowed set of characters.
It may match a prefix, but you can avoid this by adding '$' to the end of the regex, which explicitly matches the end of input. So try:
^[a-zA-Z][a-zA-Z|0-9|_]*$
This will ensure the match is against the entire input string, and not just a prefix.
Note that \w is the same as [A-Za-z0-9_], so the character class can be shortened. You also need to anchor to the end of the string, like so:
Pattern p = Pattern.compile("^[a-zA-Z]\\w*$");
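A quick way to compare the approaches is to run the sample words through find() with the original pattern and through matches() with the \w version. The sketch below does that; the class and method names are illustrative. It also shows that | inside [...] is a literal character, so the original class accepts it.

```java
import java.util.regex.Pattern;

class WholeInputDemo {
    // The question's pattern: anchored at the start only.
    private static final Pattern PREFIX = Pattern.compile("^[a-zA-Z][a-zA-Z|0-9|_]*");
    // Used with matches(), so no explicit anchors are needed.
    private static final Pattern STRICT = Pattern.compile("[a-zA-Z]\\w*");

    static boolean prefixFound(String s) { return PREFIX.matcher(s).find(); }

    static boolean wholeInput(String s) { return STRICT.matcher(s).matches(); }

    public static void main(String[] args) {
        String[] samples = {"cat9", "cat9_", "bob_____", "cat7-", "cat******", "rango78&&"};
        for (String s : samples) {
            // find() is true for all six; matches() is true only for the first three.
            System.out.println(s + " find=" + prefixFound(s) + " matches=" + wholeInput(s));
        }
        System.out.println(PREFIX.matcher("a|b").matches()); // true: | is literal inside [...]
        System.out.println(wholeInput("a|b"));               // false
    }
}
```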
