I might be thinking about this wrong, but I'm trying to match everything between "_" characters, and I also need the last item (the datetime stamp).
Here is the string:
StringText_62_590285_20200324082238.xml
Here is the regex I have started with (java):
\_(.*?)\_
but this only matches to: "_62_"
The result I'm trying to get to is to have 3 matches (62, 590285, 20200324082238)
Now that I'm thinking about this, am I approaching it wrong? The input string is going to be very consistent, so maybe I should just match all runs of digits?
For the example provided, you may use this regex:
(?<=_)[^_.]+
RegEx Demo
RegEx Details:
(?<=_): Lookbehind to assert that we have a _ before current position
[^_.]+: Match 1+ of any character that is not a _ and not a dot
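For reference, a minimal Java sketch of how the lookbehind pattern above could be applied. The class and method names (UnderscoreFields, extract) are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class UnderscoreFields {
    // The lookbehind requires a "_" right before the match; the character
    // class stops at the next "_" or ".", so the extension is left out.
    private static final Pattern FIELD = Pattern.compile("(?<=_)[^_.]+");

    static List<String> extract(String fileName) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(fileName);
        while (m.find()) {
            fields.add(m.group()); // whole match, no capture group needed
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(extract("StringText_62_590285_20200324082238.xml"));
        // prints [62, 590285, 20200324082238]
    }
}
```

Note that this takes anything after an underscore, not only digits, so it relies on the file names staying as consistent as the question says.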
You can use the word boundaries with _ excluded:
(?<![^\W_])\d+(?![^\W_])
See the regex demo. Details:
(?<![^\W_]) - immediately on the left, there can't be a letter or digit
\d+ - one or more digits
(?![^\W_]) - immediately on the right, there can't be a letter or digit.
See the Java demo:
String s = "StringText_62_590285_20200324082238.xml";
Pattern pattern = Pattern.compile("(?<![^\\W_])\\d+(?![^\\W_])");
Matcher matcher = pattern.matcher(s);
List<String> results = new ArrayList<>();
while (matcher.find()){
results.add(matcher.group(0));
}
System.out.println(results); // => [62, 590285, 20200324082238]
I'd actually suggest not using a regex in this case but two splits instead: first split on "_", which gives you 4 chunks, and keep the last three; then split the last chunk on "." to drop the extension.
This regex does the job anyway:
\d[0-9]*
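The two-split idea from this answer could look roughly like the sketch below. The names SplitFields and fields are placeholders of mine, and the code assumes the first chunk (the text before the first "_") is never wanted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for this example only.
class SplitFields {
    static List<String> fields(String fileName) {
        // First split: "StringText_62_..." -> [StringText, 62, 590285, 20200324082238.xml]
        String[] chunks = fileName.split("_");
        List<String> out = new ArrayList<>();
        for (int i = 1; i < chunks.length; i++) { // skip the leading text chunk
            out.add(chunks[i]);
        }
        if (!out.isEmpty()) {
            int last = out.size() - 1;
            // Second split: split() takes a regex, so the dot must be escaped.
            // The limit of 2 keeps index 0 valid even for odd inputs such as ".".
            out.set(last, out.get(last).split("\\.", 2)[0]);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("StringText_62_590285_20200324082238.xml"));
        // prints [62, 590285, 20200324082238]
    }
}
```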
Modifying your example, you can use something like this:
(?<=_)(.*?)(?=_|\.)
This will basically mean:
Match everything that is preceded by _
and followed by _ or .
I am working on a middleware tool in which we have a predefined option of using Java regular expressions with subStringRegEx(regex, string).
My requirement is to get the substring between the underscores (_) from a given filename (e.g. ABC_XYZ_123_adbc1234-ed98_1234.dat).
I have tried the 3 patterns below, and all of them work when tested with online tools with Java selected. In my tool, however, they do not work as expected: I get “ABC_XYZ_123_ adbc1234-ed98” instead of just “adbc1234-ed98”.
(?:[^_]+)_(?:[^_]+)_(?:[^_]+)_([^_]+)
.*?_.*?_.*?_([^_]+)
^[^_]*_[^_]*_[^_]*_([^_]*)_
Request your suggestions to achieve the solution.
Thanks,
Kumar
With the samples shown, please try the following regex. The value ends up in capture group 1, so replace with $1 when performing the substitution.
^(?:.*?_){3}([^_]*)_.*\.dat$
Online Demo for above regex
Or, in case the file extension could be anything other than .dat, try the following.
^(?:.*?_){3}([^_]*)_.*
Online demo for above regex
Explanation of the regex above:
^(?:.*?_){3} ##From the start of the value, a non-capturing group matches lazily up to a _, three times.
([^_]*) ##1st capturing group: everything up to the next _.
_.*\.dat$ ##Matches from that _ through .dat at the end of the value.
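A small Java sketch of the "replace with $1" step, in case it helps. FourthField and fourthField are illustrative names only; whether the middleware tool exposes a replace operation like this is an assumption.

```java
// Hypothetical helper class, named for this example only.
class FourthField {
    private static final String REGEX = "^(?:.*?_){3}([^_]*)_.*\\.dat$";

    static String fourthField(String fileName) {
        // The regex consumes the whole name, so replacing the match with $1
        // leaves only the contents of capture group 1.
        return fileName.replaceAll(REGEX, "$1");
    }

    public static void main(String[] args) {
        System.out.println(fourthField("ABC_XYZ_123_adbc1234-ed98_1234.dat"));
        // prints adbc1234-ed98
        System.out.println(fourthField("no_underscores.dat"));
        // prints no_underscores.dat (no match, so the input comes back unchanged)
    }
}
```

The second call shows the one thing to watch with the replace approach: a name that does not fit the pattern is returned as-is rather than signalling a failure.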
You can use
^(?:[^_]+_){3}([^_]+).*
and replace with $1. See the regex demo.
Details:
^ - start of string
(?:[^_]+_){3} - three occurrences of any one or more chars other than _ and then a _ char
([^_]+) - Group 1 (referred to with $1 from the replacement pattern): one or more chars other than _
.* - the rest of the string.
Another idea:
^.*_([^_]+)_[0-9]+\.[^._]*$
See this regex demo, and you will still need to replace with $1.
Details:
^ - start of string
.* - any text (not including line break chars, as many as possible)
_ - a _ char
([^_]+) - one or more chars other than _
_ - a _ char
[0-9]+ - one or more digits
\. - a . char (NOTE: \ might need doubling)
[^._]* - any zero or more chars other than . and _
$ - end of string.
Just for completeness, all 3 patterns work but you have to get the value from group 1.
Example
String patterns[] = {
"(?:[^_]+)_(?:[^_]+)_(?:[^_]+)_([^_]+)",
".*?_.*?_.*?_([^_]+)",
"^[^_]*_[^_]*_[^_]*_([^_]*)_"
};
String s = "ABC_XYZ_123_adbc1234-ed98_1234.dat";
for (String p : patterns) {
Pattern pattern = Pattern.compile(p);
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
System.out.println(matcher.group(1));
}
}
Output
adbc1234-ed98
adbc1234-ed98
adbc1234-ed98
See a Java demo.
You can simply use the String methods to achieve this:
const str = "ABC_XYZ_123_adbc1234-ed98_1234.dat"
const charSet = str.substr(0, str.length-4).split("_").join("")
console.log(charSet)
I'm not sure about the spec for subStringRegEx (regex, string), but if it returns a substring ($0) in string that matches regex, then it should be
String regex = "[^_]+(?=_[^_]*$)";
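A quick way to check this whole-match variant in plain Java, under the same assumption that subStringRegEx returns $0. The names WholeMatch and secondToLast are mine. Note that the pattern picks the field before the last underscore, which is the wanted one only because the sample name has exactly five fields.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class WholeMatch {
    // One or more non-underscore chars, followed (lookahead only) by the
    // last "_" of the string. Nothing outside the wanted field is consumed.
    private static final Pattern P = Pattern.compile("[^_]+(?=_[^_]*$)");

    static String secondToLast(String s) {
        Matcher m = P.matcher(s);
        return m.find() ? m.group() : null; // group() is the whole match ($0)
    }

    public static void main(String[] args) {
        System.out.println(secondToLast("ABC_XYZ_123_adbc1234-ed98_1234.dat"));
        // prints adbc1234-ed98
    }
}
```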
I'm dealing with regular expressions, but I’m not a big fan of them and I’m obliged to use them in my task :(
I have spent hours looking for a solution, but every time I fail to cover all scenarios.
I have to write a regular expression template that supports these patterns:
DYYU-tx-6.7.9.7_6.1.1.0
DYYU-tx-6.7.9.7_60.11.11.09
DYYU-tx-60.70.90.70_6.1.1.0
I feel that this is very simple to do.. So excuse me if it's a stupid question for someone :(
I tried this pattern but it didn’t work :
^.*_.*-.*-([0-9]*)\\..*\\..* $
Any help please.
I will be more than thankful.
There are many patterns in the samples that we can use to design expressions. For instance, we can start with this expression:
^[^-]+-[^-]+-[^_]+_([0-9]+\.){3}[0-9]+$
The expression is explained on the top right panel of this demo, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // Two dash-terminated parts, a third part up to "_", then a dotted four-part number
        final String regex = "^[^-]+-[^-]+-[^_]+_([0-9]+\\.){3}[0-9]+$";
        final String string = "DYYU-tx-6.7.9.7_6.1.1.0\n"
                + "DYYU-tx-6.7.9.7_60.11.11.09\n"
                + "DYYU-tx-60.70.90.70_6.1.1.0";
        // MULTILINE makes ^ and $ match at each line of the test string
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}
Try this one:
^\w+-\w+-(\d+)(\.\d+)+_(\d+\.)+\d+
Demo
In Java, most probably something like this:
"^\\w+-\\w+-(\\d+)(\\.\\d+)+_(\\d+\\.)+\\d+"
Explanation:
^\w+-\w+- first two parts, e.g. DYYU-tx-
(\d+)(\.\d+)+_ numbers separated with . ending with _, e.g. 6.7.9.7_
(\d+\.)+\d+ numbers separated with ., e.g. 60.11.11.09
Your pattern does not match because you use .*, which first matches until the end of the string. Then you match an _, so it backtracks to the last underscore and tries to match the rest of the pattern.
Since there is only 1 underscore, the pattern then needs a hyphen somewhere after it, but there is no hyphen after the underscore, so there is no match.
Another way to write it could be to use a negated character class [^-], matching any char other than a hyphen, instead of .*
^[^-]+-[^-]+-\d+(?:\.\d+){3}_\d+(?:\.\d+){3} $
Explanation
^ Start of string
[^-]+- Match 1+ times any char other than -, then a -
[^-]+- Same as above
\d+(?:\.\d+){3} Match 1+ digits, repeat 3 times matching a . and 1+ digits
_ Match underscore
\d+(?:\.\d+){3} Match 1+ digits, repeat 3 times matching a . and 1+ digits
[ ]$ Match a space (denoted between brackets for clarity) and assert end of string
In Java
String regex = "^[^-]+-[^-]+-\\d+(?:\\.\\d+){3}_\\d+(?:\\.\\d+){3} $";
Regex demo
Note that in your example data, the strings end with a space, and so there is a space before $
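To see the backtracking explanation above in action, here is a small sketch that runs the question's pattern and the negated-class pattern against one sample line. BacktrackCheck and found are illustrative names, and the trailing space in the input is an assumption taken from this answer's note about the sample data.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class BacktrackCheck {
    // Pattern from the question: after ".*_" it needs a hyphen, but none follows the "_".
    static final String ORIGINAL = "^.*_.*-.*-([0-9]*)\\..*\\..* $";
    // Pattern from this answer: negated classes walk left to right instead.
    static final String FIXED = "^[^-]+-[^-]+-\\d+(?:\\.\\d+){3}_\\d+(?:\\.\\d+){3} $";

    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "DYYU-tx-6.7.9.7_6.1.1.0 "; // note the trailing space
        System.out.println(found(ORIGINAL, input)); // prints false
        System.out.println(found(FIXED, input));    // prints true
    }
}
```

If the real data has no trailing space, the space before $ would have to go, otherwise the second pattern fails as well.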
DYYU-tx-(?>\d+[._]?){8}
Search for the literal DYYU-tx-
Look for 1 or more digits that may be followed by a . or an _ 8 times.
I assumed that it would always start with DYYU-tx- and that it would always be 4 numbers separated by periods followed by an underscore which would then have 4 more numbers separated by periods.
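A rough Java check of this atomic-group pattern against the three samples. AtomicGroupCheck and matches are names made up here, and using Matcher.matches() (whole-string match) is my assumption, since the answer does not say how the pattern is applied.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class AtomicGroupCheck {
    // Eight repetitions of "digits plus an optional . or _" after the literal prefix.
    private static final Pattern P = Pattern.compile("DYYU-tx-(?>\\d+[._]?){8}");

    static boolean matches(String s) {
        return P.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("DYYU-tx-6.7.9.7_6.1.1.0"));     // prints true
        System.out.println(matches("DYYU-tx-6.7.9.7_60.11.11.09")); // prints true
        System.out.println(matches("DYYU-tx-60.70.90.70_6.1.1.0")); // prints true
        // The separators are interchangeable, so this looser input passes too:
        System.out.println(matches("DYYU-tx-6_7_9_7_6_1_1_0"));     // prints true
    }
}
```

The last line is the trade-off of the shorter pattern: it counts eight numbers but does not check where the single underscore sits.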
Hi, I am trying to write a regex in Java: I need to capture the last {n} words. (There may be a variable number of whitespace characters between words.) The requirement is that it has to be done with a regex.
So e.g. in
The man is very tall.
For n = 2, I need to capture
very tall.
So I tried
(\S*\s*){2}$
But this does not match in java because the initial words have to be consumed first. So I tried
^(.*)(\S*\s*){2}$
But .* consumes everything, and the last 2 words are ignored.
I have also tried
^\S?\s?(\S*\s*){2}$
Anyone know a way around this please?
You had almost got it in your first attempt.
Just change + to *.
The plus sign means at least one character; since there was no space to match, the match failed.
The asterisk, on the other hand, means zero or more, so it will work.
See it live here: (?:\S*\s*){2}$
Using replaceAll method, you could try this regex: ((?:\\S*\\s*){2}$)|.
Your regex contains, as you already mention, a greedy subpattern that eats up the whole string, and since (\S*\s*){2} can match an empty string, it matches an empty location at the end of the input string.
Lazy dot matching (changing .* to .*?) won't do the whole job since the capturing group is quantified, and the Matcher.group(1) will be set to the last captured non-whitespaces with optional whitespaces. You need to set the capturing group around the quantified group.
Since you most likely are using Matcher#matches, you can use
String str = "The man is very tall.";
Pattern ptrn = Pattern.compile("(.*?)((?:\\S*\\s*){2})"); // no need for `^`/`$` with matches()
Matcher matcher = ptrn.matcher(str);
if (matcher.matches()) { // Group 2 contains the last 2 "words"
System.out.println(matcher.group(2)); // => very tall.
}
See IDEONE demo
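Since the question asks for a variable {n}, the matches() approach above could be wrapped like this. LastWords and lastWords are hypothetical names, and the sketch assumes n is at least 1 and the text is a single line (the dot does not match line breaks by default).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class LastWords {
    static String lastWords(String text, int n) {
        // Lazy prefix in group 1, then exactly n "word plus whitespace" chunks in group 2.
        // The capturing group wraps the quantified non-capturing group, as described above.
        Pattern p = Pattern.compile("(.*?)((?:\\S*\\s*){" + n + "})");
        Matcher m = p.matcher(text);
        return m.matches() ? m.group(2) : "";
    }

    public static void main(String[] args) {
        System.out.println(lastWords("The man is very tall.", 2));
        // prints very tall.
        System.out.println(lastWords("The man is very tall.", 3));
        // prints is very tall.
    }
}
```

Whitespace between the captured words is kept as it was in the input, so runs of several spaces come back unchanged.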
My pattern is [a-z][\\*\\+\\-_\\.\\,\\|\\s]?\\b
My Result:
a__
not matched
a_.
pattern matched = a_
a._
pattern matched = a.
a..
pattern matched = a
Why is my first input the only one that is not matched?
Thanks in advance.
[ PS: got the same result with [a-z][\\*\\+\\-\\_\\.\\,\\|\\s]?\\b ]
Because unlike the period ., the underscore _ is considered to be a word character; so a_ is one word, but a. is a word followed by punctuation.
So, a__ matches a, then matches _, then fails to match a word boundary (since the next _ is a part of the same word).
a.. matches a, skips the optional character class, then matches the word boundary between the word a and the punctuation ..
With the regex rewritten in a "proper way", that is:
"[a-z][*+\\-_.,|\\s]?\\b"
Or, in an "unquoted", canonical way:
[a-z][*+\-_.,|\s]?\b
that your first input does not match is expected; a character class will only ever match one character. After it matches the first underscore, it looks for a word boundary, but cannot find one: for the Java regex engine, _ is a character which can be part of a word. Hence the result.
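The four inputs from the question can be replayed with a short sketch like the one below. BoundaryDemo and firstMatch are names invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class BoundaryDemo {
    // Same pattern as above, written as a Java string literal.
    private static final Pattern P = Pattern.compile("[a-z][*+\\-_.,|\\s]?\\b");

    static String firstMatch(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group() : null; // null means "not matched"
    }

    public static void main(String[] args) {
        for (String input : new String[] {"a__", "a_.", "a._", "a.."}) {
            System.out.println(input + " -> " + firstMatch(input));
        }
        // prints:
        // a__ -> null
        // a_. -> a_
        // a._ -> a.
        // a.. -> a
    }
}
```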
I am trying to search this string:
,"tt" : "ABC","r" : "+725.00","a" : "55.30",
For:
"r" : "725.00"
And here is my current code:
Pattern p = Pattern.compile("([r]\".:.\"[+|-][0-9]+.[0-9][0-9]\")");
Matcher m = p.matcher(raw_string);
I've been trying multiple variations of the pattern, and a match is never found. A second set of eyes would be great!
Your regexp actually works, it's almost correct
Pattern p = Pattern.compile("\"[r]\".:.\"[+|-][0-9]+.[0-9][0-9]\"");
Matcher m = p.matcher(raw_string);
if (m.find()){
String res = m.toMatchResult().group(0);
}
The next line should read:
if ( m.find() ) {
Are you doing that?
A few other issues: You're using . to match the spaces surrounding the colon; if that's always supposed to be whitespace, you should use " +" (a space followed by +, i.e. one or more spaces) or \s+ (one or more whitespace characters). On the other hand, the dot between the digits is supposed to match a literal ., so you should escape it: \. Of course, since this is a Java String literal, you need to escape the backslashes: \\s+, \\..
You don't need the square brackets around the r, and if you don't want to match a | in front of the number you should change [+|-] to [+-].
While some of these issues I've mentioned could result in false positives, none of them would prevent it from matching valid input. That's why I suspect you aren't actually applying the regex by calling find(). It's a common mistake.
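Putting those suggestions together, a tightened version might look like the sketch below. SignedAmount and find are placeholder names, and keeping the sign mandatory is an assumption based on the sample input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class SignedAmount {
    // "r", whitespace around the colon, a mandatory sign, digits,
    // an escaped literal dot, and exactly two decimals.
    private static final Pattern P =
            Pattern.compile("\"r\"\\s+:\\s+\"[+-][0-9]+\\.[0-9][0-9]\"");

    static String find(String raw) {
        Matcher m = P.matcher(raw);
        return m.find() ? m.group() : null; // find() must be called before group()
    }

    public static void main(String[] args) {
        String raw = ",\"tt\" : \"ABC\",\"r\" : \"+725.00\",\"a\" : \"55.30\",";
        System.out.println(find(raw));
        // prints "r" : "+725.00"
    }
}
```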
First, try escaping your dot symbol: ...[0-9]+\.[0-9][0-9]...
because an unescaped dot matches any character.
Second, [+|-] defines a set of characters, but as written it is mandatory. If the sign is optional,
try [+|-]?
Alban.