Regex: match everything the other regex left - java

I am struggling with the following issue: say there's a regex 1 and there's regex 2 which should match everything the regex 1 does not.
Let's have the regex 1:
/\$\d+/ (i.e. the dollar sign followed by any amount of digits.
Having a string like foo$12___bar___$34wilma buzz it detects $12 and $34.
How does the regex 2 should look in order to match the remained parts of the aforementioned string, i.e. foo, ___bar___ and wilma buzz? In other words it should pick up all the "remained" chunks of the source string.

You may use String#split to split on given regex and get remaining substrings in an array:
String[] arr = str.split( "\\$\\d+" );
//=> ["foo", "___bar___", "wilma buzz"]
RegEx Demo

It was tricky to get this working, but this regex will match everything besides \$\d+ for you. EDIT: no longer erroneously matches $44$444 or similar.
(?!\$\d+)(.+?)\$\d+|\$\d+|(?!\$\d+)(.+)
Breakdown
(?!\$\d+)(.+?)\$\d+
(?! ) negative lookahead: assert the following string does not match
\$\d+ your pattern - can be replaced with another pattern
(.+?) match at least one symbol, as few as possible
\$\d+ non-capturing match your pattern
OR
\$\d+ non-capturing group: matches one instance of your pattern
OR
(?!\$\d+)(.+)
(?!\$\d+) negative lookahead to not match your pattern
(.+) match at least one symbol, as few as possible
GENERIC FORM
(?!<pattern>)(.+?)<pattern>|<pattern>|(?!<pattern>)(.+)
By replacing <pattern>, you can match anything that doesn't match your pattern. Here's one that matches your pattern, and here's an example of arbitrary pattern (un)matching.
Good luck!

Try this one
[a-zA-Z_]+
Or even better
[^\$\d]+ -> With the ^symbol you can negotiate the search like ! in the java -> not equal

Related

Regular expression to determine if the String consists of more than 4 numbers

I want to extract URL strings from a log which looks like below:
<13>Mar 27 11:22:38 144.0.116.31 AgentDevice=WindowsDNS AgentLogFile=DNS.log PluginVersion=X.X.X.X Date=3/27/2019 Time=11:22:34 AM Thread ID=11BC Context=PACKET Message= Internal packet identifier=0000007A4843E100 UDP/TCP indicator=UDP Send/Receive indicator=Snd Remote IP=X.X.X.X Xid (hex)=9b01 Query/Response=R Opcode=Q Flags (hex)=8081 Flags (char codes)=DR ResponseCode=NOERROR Question Type=A Question Name=outlook.office365.com
I am looking to extract Name text which contains more that 5 digits.
A possible way suggested is (\d.*?){5,} but does not seem to work, kindly suggest another way get the field.
Example of string match:
outlook12.office345.com
outlook.office12345.com
You can look for the following expression:
Name=([^ ]*\d{5,}[^ ]*)
Explanation:
Name= look for anything that starts with "Name=", than capture if:
[^ ]* any number of characters which is not a space
\d{5,} then 5 digits in a row
[^ ]* then again, all digits up to a white space
This regular expression:
(?<=Name=).*\d{5,}.*?(?=\s|$)
would extract strings like outlook.office365666.com (with 5 or more consecutive digits) from your example input.
Demo: https://regex101.com/r/YQ5l2w/1
Try this pattern: (?=\b.*(?:\d[^\d\s]*){5,})\S*
Explanation:
(?=...) - positive lookahead, assures that pattern inside it is matched somewhere ahead :)
\b - word boundary
(?:...) - non-capturing group
\d[^\d\s]* - match digit \d, then match zero or more of any characters other than whitespace \s or digit \d
{5,} - match preceeding pattern 5 or more times
\S* - match zero or more of any characters other than space to match the string if assertion is true, but I think you just need assertion :)
Demo
If you want only consecutive numbers use simplified pattern (?=\b.*\d{5,})\S*.
Another demo
Of course, you have to add positive lookbehind: (?<=Name=) to assert that you have Name= string preceeding
Try this regex
([a-z0-9]{5,}.[a-z0-9]{5,})+.com
https://regex101.com/r/OzsChv/3
It Groups,
outlook.office365.com
outlook12.office345.com
also all url strings

Regex with Whitespace

I am try to write a regex to match the following:
act=MATCHME
act=Match me too
I have the following regex to match either one but not both. Here is my effort:
matches MATCHME: act=(\w+)
matches Match me too: (\w+\s\w+\s\w+)
Is there anyway to can combine the two with OR, or may I be looking at this wrong?
I am using the JAVA regex engine.
You may use an optional non-capturing group:
act=(\w+(?:\s+\w+\s+\w+)?)
^^^^^^^^^^^^^^^^^
See the regex demo
The ? matches 1 or 0 occurrences of the quantified subpattern. When it is applied to a grouping construct, the quantification is applied to the whole pattern sequence, so (?:\s+\w+\s+\w+)? matches 1 or 0 sequences of 1+ whitespaces, 1+ word chars, 1+ whitespaces and again 1+ word chars.
You may further subsegment the pattern if you need to capture 2-word substrings after act=.
Surely you know how to compose regular expressions by alternation.
This regular expression may help you
^[a-zA-Z ]*$

remove part of matcher after the match in regex pattern

I need to help in writing regex pattern to remove only part of the matcher from original string.
Original String: 2017-02-15T12:00:00.268+00:00
Expected String: 2017-02-15T12:00:00+00:00
Expected String removes everything in milliseconds.
My regex pattern looks like this: (:[0-5][0-9])\.[0-9]{1,3}
i need this regex to make sure i am removing only the milliseconds from some time field, not everything that comes after dot. But using above regex, I am also removing the minute part. Please suggest and help.
You have defined a capturing group with (...) in your pattern, and you want to have that part of string to be present after the replacement is performed. All you need is to use a backreference to the value stored in this capture. It can be done with $1:
String s = "2017-02-15T12:00:00.268+00:00";
String res = s.replaceFirst("(:[0-5][0-9])\\.[0-9]{1,3}", "$1");
System.out.println(res); // => 2017-02-15T12:00:00+00:00
See the Java demo and a regex demo.
The $1 in the replacement pattern tells the regex engine it should look up the captured group with ID 1 in the match object data. Since you only have one pair of unescaped parentheses (1 capturing group) the ID of the group is 1.
Change your pattern to (?::[0-5][0-9])(\.[0-9]{1,3}), run the find in the matcher and remove all it finds in the group(1).
The backslash will force the match with the '.' char, instead of any char, which is what the dot represents in a regex.
The (?: defines a non-capturing group, so it will not be considered in the group(...) on the matcher.
And adding a parenthesis around what you want will make it show up as group in the matcher, and in this case, the first group.
A good reference is the Pattern javadoc: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
Use $1 and $2 variable for replace
string.replaceAll("(.*)\\.\\d{1,3}(.*)","$1$2");

Regex: Match group if present otherwise ignore and proceed with other matches

I have been trying to match a regex pattern within the following data:
String:
TestData to 1colon delimiter list has 1 rows.Set...value is: 1 Save Error: 267467374736437-TestInfo send Error
Words to match:
TestData
267467374736437-TestInfo
Regex pattern i m using:
(.+?\s)?.*(\s\d+-.*?\s)?
Scenario here is that 2nd match (267467374736437-TestInfo) can be absent in the string to be matched. So, i want it to be a match if it exists otherwise proceed with other matches. Due to this i added zero or one match quantifier ? to the group pattern above. But then it ignores the 2nd group all together.
If i use the below pattern:
`(.+?\s)?.*(\s\d+-.*?\s)`
It matches just fine but fails if string "267467374736437-TestInfo" from the matching string as it's not having the "?" quantifier.
Please help me understand where is it going wrong.
I would rather not use a complex regex, which will be ugly and a maintenance nightmare. Instead, one simple way would be to just split the string and grab the first term, and then use a smart regex to pinpoint the second term.
String input = "TestData to 1colon delimiter list has 1 rows.Set...value is: 1 Save Error: 267467374736437-TestInfo send Error";
String first = input.split(" ")[0];
String second = input.replaceAll(".*Save Error:\\s(.*)?\\s", "$1");
Explore the regex:
Regex101
The optional pattern at the end will almost never not be matched if a more generic pattern occurs. In your case, the greedy dot .* grabs the whole rest of the line up to the end, and since the last pattern is optional, the regex engine calls it a day and does not try to accommodate any text for it.
If you had a lazy dot .*?, the only position where it would work is right after the preceding subpattern, which is rarely the case.
Thus, you can only rely on a tempered greedy token:
^(\S+)(?:(?!\d+-\S).)*(\d+-\S+)?
See the regex demo.
Or an unrolled version:
^(\S+)\D*(?:\d(?!\d*-\S)\D*)*(\d+-\S+)?

Scala Regexp: find all matches in a string that contain predefined token or any word which consists of 2-4 characters

I have to find all matches in a string that contains predefined tokens (AB- or BCC- or CDD-) or [A-Z]{2,4}-. Predefined tokens have a highest priority.
I mean:
regex.findAllIn("XBCC-").toList must always return List(BCC-), not List(XBCC-)
but:
regex.findAllIn("XTEST-").toList must return List(TEST-)
I try something like that:
val regex = "((AB|BCC|CDD)|[A-Z]{2,4})-".r
But it doesn't work properly.
Don't believe the naysayers. This can quite easily be done with regex:
(?!\w+(?:AB|BCC|CDD)-)[A-Z]{2,4}-
See the demo.
The lookahead assertion here makes sure the pattern doesn't match if AB-, BCC- or CDD- is present later in the text.
Explanation:
(?! assert that there is no...
\w+ ...sequence of characters...
(?:AB|BCC|CDD) ...followed by AB, BCC, or CDD...
- ...and a dash
)
[A-Z]{2,4}- then simply match 2 to 4 characters before a dash

Categories

Resources